Comparing US baby names

Here I analyse the US Social Security baby name catalogue, which records the names given to male and female newborns in every year since 1880.

First of all, I download the data sets from the US Social Security website and put them in a new folder called “Names”. Simply writing ls Names/ lists all the contents of that folder.
It is often easier to see the data when it’s laid out in a pandas data frame or table like this. I slice out all the baby names recorded in the year 2007 – the file name is ‘yob2007’. Then I create another column, for the year.
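As a minimal sketch of this step, the snippet below reads one year of data and adds the year column. It assumes the file has no header row and three comma-separated fields; the column names "name", "sex" and "number" and the sample counts are my own placeholders, and an in-memory string stands in for the real file so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the contents of the 2007 file (made-up counts, illustration only).
sample = io.StringIO("Emily,F,100\nJacob,M,120\nIsabella,F,90\n")

# The file has no header row, so the column names are supplied explicitly.
# These names are an assumption, not taken from the data itself.
names2007 = pd.read_csv(sample, names=["name", "sex", "number"])

# Add a column recording which year the rows belong to.
names2007["year"] = 2007

print(names2007)
```

With the real data, the StringIO object would be replaced by the path to the 2007 file inside the Names folder.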
Now I create a variable “allyears” so that I can manipulate all the data much more easily.

I need to load all the tables and concatenate them into a single data frame. To avoid confusing data from different years, we can prepare each individual data frame by adding a new column that specifies the year. To do this on the fly, we can chain the data frame's assign method directly onto the output of pd.read_csv.

We managed to load one file in a one-liner, so I can use a comprehension to load every year and concatenate all the data frames.

This piece of code does several things. We loop over all the years between 1880 and 2018. We build the file name with an f-string and feed it into pd.read_csv. We specify the column names, and we add a column that takes the correct year from the loop variable. Finally, we pass all the resulting data frames to pd.concat (pandas concat).
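The steps just described might look roughly like the sketch below. To keep it runnable without the download, it first writes two tiny stand-in files to a temporary folder; the ".txt" extension, the column names and the counts are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in for the Names folder (made-up counts, two years only).
folder = Path(tempfile.mkdtemp())
(folder / "yob1880.txt").write_text("Mary,F,7\nJohn,M,9\n")
(folder / "yob1881.txt").write_text("Mary,F,6\nJohn,M,8\n")

# Read each year's file, tag its rows with the year, and concatenate everything.
# With the full data set the loop would be range(1880, 2019).
allyears = pd.concat(
    [
        pd.read_csv(folder / f"yob{year}.txt", names=["name", "sex", "number"])
        .assign(year=year)
        for year in range(1880, 1882)
    ]
)

print(allyears)
```

Each frame keeps its own row labels, so the combined index repeats; passing ignore_index=True to pd.concat would renumber the rows if that matters.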

Here is what the data frame looks like.
In the top graph, we can see the popularity of the name (for boys) ‘Alex’ over time, from 1880 to 2020. Notice how Matplotlib automatically uses the index to set the x-axis. It probably also makes sense to consider the frequency of a name as a fraction of the number of babies born in a year. So in the bottom graph, I measure the proportion of babies born in any given year who are named Mary. To get that, we can apply groupby to the un-indexed data frame, grouping by year and taking the sum, which gives the total number of newborns per year. Then we can normalize the Mary counts by that total. As a percentage of all babies, Mary was actually more popular at the beginning of the 20th century. But there were altogether more Marys born in the 1920s and 1950s.
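The normalization can be sketched as follows. The small data frame is invented so that the example runs by itself, and the column names are the same placeholders as before; the numbers are chosen so that the count goes up while the share goes down.

```python
import pandas as pd

# Made-up counts for two years (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Mary", "Anna", "John", "Mary", "Anna", "John"],
        "sex": ["F", "F", "M", "F", "F", "M"],
        "number": [50, 30, 20, 80, 160, 160],
        "year": [1900, 1900, 1900, 1950, 1950, 1950],
    }
)

# Total number of babies recorded in each year.
totals = allyears.groupby("year")["number"].sum()

# Yearly counts for one name, indexed by year.
is_mary = (allyears["name"] == "Mary") & (allyears["sex"] == "F")
mary = allyears[is_mary].set_index("year")["number"]

# Both series are indexed by year, so the division lines up year by year.
mary_share = mary / totals

print(mary_share)
```

Calling mary_share.plot() would then draw the proportion over time, with the year index on the x-axis.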
Here I compare the popularity of multiple names. In the bottom graph, where I look at female names, my own name (Georgina) seems strikingly unpopular over time – the red line grazes the bottom!

What if I want to look at the variants of the same name, like Claire?

Here I look at the results for the names Claire, Clare, Clara, Chiara and Ciara. There are two spellings of Claire (Claire and Clare), an older version, Clara, and the Italian and Irish spellings, Chiara and Ciara. Here’s the plot. Notice how Matplotlib tries to put the legend out of the way. Claire is now dominant, but Clara is having a resurgence after having been the dominant variant at the beginning of the 20th century. We can also make a slightly different cumulative or stacked plot that adds up the counts on top of each other (see the next graph).
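One way to prepare the data for such a comparison is to reshape it into one column per variant, as in the sketch below. This is an assumption about the approach rather than the code behind the plots, and the counts are invented.

```python
import pandas as pd

# Made-up counts (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Claire", "Clara", "Claire", "Clara", "Ciara"],
        "sex": ["F", "F", "F", "F", "F"],
        "number": [10, 40, 60, 30, 5],
        "year": [1900, 1900, 2000, 2000, 2000],
    }
)
variants = ["Claire", "Clare", "Clara", "Chiara", "Ciara"]

# Keep only the female records for the spellings of interest.
girls = allyears[(allyears["sex"] == "F") & allyears["name"].isin(variants)]

# One row per year, one column per variant; missing combinations become 0.
wide = girls.pivot_table(
    index="year", columns="name", values="number", aggfunc="sum", fill_value=0
)

# Running total across the columns: each column is the height of its
# band's upper edge in a stacked plot.
stacked = wide.cumsum(axis=1)

print(stacked)
```

With Matplotlib installed, wide.plot() gives one line per variant and wide.plot.area() gives the stacked version directly.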
Here, I’m selecting all the boys’ names given to babies in the year 2018, because I want to find out which were the ten most popular boys’ names in that year.
This data frame is sorted by count, with Liam being the most popular boys’ name in 2018.
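A minimal sketch of that selection and sort, on invented counts and with a top three instead of a top ten to keep the sample small:

```python
import pandas as pd

# Made-up counts (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "Emma", "Oliver", "William"],
        "sex": ["M", "M", "F", "M", "M"],
        "number": [500, 450, 480, 300, 350],
        "year": [2018, 2018, 2018, 2018, 2018],
    }
)

# Boolean mask: boys born in 2018.
boys2018 = allyears[(allyears["sex"] == "M") & (allyears["year"] == 2018)]

# Sort by count, largest first, and keep the leading rows.
# On the full data set this would be head(10).
top_boys = boys2018.sort_values("number", ascending=False).head(3)

print(top_boys)
```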

Yearly top ten names: tracking the popularity of a name across years

Here I select the data for a given index value (male, year). To select all records for a given index value, we use .loc followed by brackets, not parentheses, with the index value. Because this is a multi-index, the value passed to .loc is a (sex, year) pair. Chaining pandas methods lets us take the top ten and get rid of the index. If we are to build a table of the top ten names over multiple years, we should drop the index with reset_index and select only the name column.
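The chained selection could be sketched like this. The data are invented, the index is assumed to be built from the sex and year columns, and only the top two names are kept so the result is easy to check.

```python
import pandas as pd

# Made-up counts (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "Emma", "Oliver", "James"],
        "sex": ["M", "M", "F", "M", "M"],
        "number": [500, 450, 480, 300, 90],
        "year": [2018, 2018, 2018, 2018, 1880],
    }
)

# Index by (sex, year); sorting the index keeps lookups fast and warning-free.
indexed = allyears.set_index(["sex", "year"]).sort_index()

top_m_2018 = (
    indexed.loc[("M", 2018), :]              # all rows for boys in 2018
    .sort_values("number", ascending=False)  # most popular first
    .head(2)                                 # head(10) on the full data set
    .reset_index()                           # replace the index with 0, 1, ...
    ["name"]                                 # keep only the name column
)

print(top_m_2018)
```

Because the result is a plain series numbered from 0, the series for several years can be placed side by side as the columns of one table.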
This is the equivalent for girls’ names.

Plotting a graph to analyse the change in popularity over time

2018 top ten.
The top graph looks at the popularity of the top ten girls’ names across the entire database of records. We can see that Evelyn peaked in the 1920s. As for the boys, Liam is at the top in recent years. William and James are classic favourites.

All-time favourite baby names

We select the females, group by name, sum the counts, sort the values, and then take the top ten. If we look at the popularity of these names over time, we see that they all earned their spots in the first half of the 20th century, except for Jennifer. Note that, given the structure of the all-time f data frame, I’m looping over the index rather than the values.
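A sketch of the all-time ranking follows. The variable name alltime_f is my guess at the name used in the original code, the counts are invented, and only the top two names are kept.

```python
import pandas as pd

# Made-up counts (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Mary", "Anna", "Mary", "Linda", "Jennifer", "John"],
        "sex": ["F", "F", "F", "F", "F", "M"],
        "number": [50, 30, 80, 120, 90, 200],
        "year": [1900, 1900, 1950, 1950, 1980, 1950],
    }
)

females = allyears[allyears["sex"] == "F"]

# Total count per name over all years, largest first.
# On the full data set this would be head(10).
alltime_f = (
    females.groupby("name")["number"].sum().sort_values(ascending=False).head(2)
)

# The names live in the index of the result and the totals in its values,
# so the loop runs over the index.
for name in alltime_f.index:
    yearly = females[females["name"] == name].set_index("year")["number"]
    print(name, yearly.to_dict())
```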

Top ten unisex baby names

We’ll load our data set as usual. We need to compute the total number of boys and girls for a given name. This seems a good place to use groupby, which lets us segment the data before applying an aggregation, in this case the sum of the number of babies. So we use groupby over sex and name, we select the number column and we take the sum. From this series with a multi-index, we can grab the males and females respectively, using .loc. As you see, the two indices are going to be different. Nevertheless, we can combine the two series and pandas will align the indices for us. The result will be NaN wherever either series doesn’t have an element. For instance, we check where the ratio between males and females is less than two. We can get rid of those NaNs with dropna. Now, remember the definition of unisex names as those with a ratio between 0.5 and 2. This is a good expression for fancy indexing, and after we apply it, we see that 1660 names pass the test. Here, I’ve taken the index, because we don’t actually need the ratio itself, just the names.
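The whole procedure can be sketched on a handful of invented names. The column names are the same placeholders used throughout, and the 0.5 to 2 window is the definition given above.

```python
import pandas as pd

# Made-up totals (illustration only).
allyears = pd.DataFrame(
    {
        "name": ["Riley", "Riley", "Jordan", "Jordan", "Casey", "Casey", "Mary", "John"],
        "sex": ["M", "F", "M", "F", "M", "F", "F", "M"],
        "number": [60, 50, 90, 30, 40, 70, 100, 100],
        "year": [2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000],
    }
)

# Total babies per (sex, name) pair; the result is a multi-indexed series.
totals = allyears.groupby(["sex", "name"])["number"].sum()

# Selecting on the first index level leaves a series indexed by name.
males = totals.loc["M"]
females = totals.loc["F"]

# Division aligns on name; names missing from either side give NaN.
ratio = (males / females).dropna()

# Boolean mask: keep names whose male-to-female ratio is between 0.5 and 2.
unisex = ratio[(ratio > 0.5) & (ratio < 2)].index

print(list(unisex))
```

Here Mary and John drop out at the dropna step because each appears for only one sex, and Jordan fails the ratio test with three boys for every girl.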
