A little bit of Monica in my life

Mar 4, 2018

One analysis in the R community that caught my attention is Hilary Parker’s analysis of the most poisoned baby name in US history. I was surprised that my own name didn’t show up in the analysis. If Hilary had a huge loss in 1993, what happened to Monica in 1999?

For a period when I was a kid, the name ‘Monica’ was what made the adults turn down the NPR broadcast. Its entrance into popular culture through the Clinton impeachment, a shoutout in Mambo No. 5, and being the name of the least likable roommate on Friends (#bossy)¹ wasn’t exactly the kind of material that was going to make me cool.

But now I am fond of my name again and I’m also looking back on that cultural moment with a new lens thanks to Monica Lewinsky’s awesome talk and essay. Still, with the babynames package on CRAN, I had to take a look at where ‘Monica’ falls on the poisoned names list.

🎙 One, two, three, four, five 🎙

The `babynames` package

Since Hilary conducted her analysis, it’s much easier to get the baby names data because it is now available as a package on CRAN. The data frame includes the year, sex, name, and frequency of the name. It also includes the proportion, prop, of people of that gender and name born in that year. One other difference is that we can now calculate the relative risk using the tidyverse.

library(babynames)
head(babynames)

## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

str(babynames)

## tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
##  $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr [1:1924665] "F" "F" "F" "F" ...
##  $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...

In Hilary’s analysis she looked at the top 1000 names in a given year, so I will follow that methodology.

Data wrangling

First I will limit the data set to the top 1000 names in a year and then calculate the relative risk and percentage loss.

As Hilary explains in her post, the relative risk is a measure used to compare proportions. In public health we use it to compare the proportion of exposed people who get a disease to the proportion of unexposed people who get the same disease. For example, if 10 out of 100 people (10%) who get a flu shot end up getting the flu in a given year, and 15 out of 100 people (15%) who don’t get a flu shot end up getting the flu in a given year, then we can divide .10 by .15 to get the relative risk of 0.67. This means that people who get the flu shot have 0.67 times the risk of getting the flu compared to those who don’t get the flu shot. In Hilary’s analysis, she expresses the relative risk as a loss percent to help us think of it as a decrease. In the flu example, the loss percent would be (1-0.67)*100 = 33%, meaning 33% less likely.
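The flu arithmetic can be checked in a few lines of base R. This is only a sketch of the calculation using the numbers from the example; the variable names are mine and are not part of the analysis.

```r
# Flu example: proportions taken from the text
risk_vaccinated   <- 10 / 100   # 10% of people with a flu shot got the flu
risk_unvaccinated <- 15 / 100   # 15% of people without a flu shot got the flu

# Relative risk: the ratio of the two proportions
relrisk <- risk_vaccinated / risk_unvaccinated

# Loss percent: how much smaller the first proportion is, in percent
loss_pct <- (1 - relrisk) * 100

round(relrisk, 2)
round(loss_pct)
```

The same two lines of arithmetic, `prop/lag(prop)` and `(1-relrisk)*100`, are what get applied to each name's proportion from one year to the next.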

Here’s how we can calculate this in the tidyverse.

library(tidyverse)

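top1000 <- babynames %>%
  # rank names within each sex and year by frequency
  group_by(sex, year) %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 1000) %>%
  # compare each name's proportion with the previous row for that name
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(relrisk = prop/lag(prop),
         loss_pct = (1-relrisk)*100)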

The only problem with doing it this way is that if there are gaps in the years a name made it into the top 1000, the calculation is off, because lag() picks up the previous row rather than the previous year. For example, look at the name Aarush. Aarush was in the top 1000 in 2010, dipped below the top 1000 from 2011 to 2014, and then was back in the top 1000 in 2015.

library(knitr)

top1000 %>%
  filter(name == "Aarush") %>%
  kable()

year	sex	name	n	prop	rank	relrisk	loss_pct
2010	M	Aarush	227	0.0001106	900	NA	NA
2015	M	Aarush	211	0.0001035	974	0.9358163	6.418369

Ok, let’s try this again.

top1000 <- babynames %>%
  group_by(sex, year) %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 1000) %>%
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(relrisk = ifelse(year == lag(year)+1, prop/lag(prop), NA),
         loss_pct = (1-relrisk)*100) %>%
  ungroup()

It works!

top1000 %>%
  group_by(sex, name) %>%
  filter(year != lag(year)+1) %>%
  arrange(name, year, sex)

## # A tibble: 13,573 x 8
## # Groups:   sex, name [4,213]
##     year sex   name       n      prop  rank relrisk loss_pct
##    <dbl> <chr> <chr>  <int>     <dbl> <int>   <dbl>    <dbl>
##  1  2017 M     Aaden    240 0.000122    888      NA       NA
##  2  2015 M     Aarush   211 0.000104    974      NA       NA
##  3  1882 M     Ab         5 0.0000410   943      NA       NA
##  4  1885 M     Ab         6 0.0000518   836      NA       NA
##  5  1887 M     Ab         5 0.0000457   921      NA       NA
##  6  1890 M     Abb        6 0.0000501   884      NA       NA
##  7  1891 M     Abbie      5 0.0000458   964      NA       NA
##  8  1937 F     Abbie     52 0.0000472   964      NA       NA
##  9  1942 F     Abbie     65 0.0000468   968      NA       NA
## 10  1957 F     Abbie    120 0.0000572   938      NA       NA
## # … with 13,563 more rows
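To see the gap problem in isolation, here is a small sketch in base R using only Aarush's two rows from the table above. The `lag1()` helper is a hypothetical stand-in for `dplyr::lag()` so the snippet runs without any packages; the proportions are the rounded values printed in the table, so the ratio will differ slightly from the one shown there.

```r
# Aarush's two top-1000 rows, copied from the kable output above
aarush <- data.frame(
  year = c(2010, 2015),
  prop = c(0.0001106, 0.0001035)
)

# Hypothetical helper: shift a vector down one position, like dplyr::lag()
lag1 <- function(x) c(NA, head(x, -1))

# Naive version: treats 2010 as if it were the year right before 2015
naive <- aarush$prop / lag1(aarush$prop)

# Guarded version: keep the ratio only when the previous row is the prior year
guarded <- ifelse(aarush$year == lag1(aarush$year) + 1,
                  aarush$prop / lag1(aarush$prop),
                  NA)

naive
guarded
```

The naive ratio for 2015 is a five-year comparison, while the guarded version returns NA for both rows, which is the behaviour the `ifelse()` in the pipeline is meant to produce.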

Biggest percent drops

OK, let’s see who has the biggest drops! Did I reproduce the list?

top1000 %>%
  arrange(desc(loss_pct)) %>%
  filter(sex == "F") %>%
  filter(row_number() <= 14) %>%
  select(name, loss_pct, year, n) %>%
  kable()

name	loss_pct	year	n
Farrah	78.08533	1978	332
Dewey	74.43853	1899	24
Catina	73.58773	1974	329
Khadijah	72.48679	1995	438
Deneen	71.88842	1965	421
Hilary	70.19101	1993	343
Katina	69.31745	1974	765
Renata	69.02648	1981	224
Iesha	68.91567	1992	581
Clementine	68.82256	1881	6
Minna	67.88056	1883	7
Ashanti	67.84702	2003	962
Infant	67.48330	1991	187
Tennille	66.79955	1978	141

These rankings are slightly different than Hilary’s but pretty darn close! I think the difference comes from the fact that I didn’t round the loss percent. Close enough!

🎺 Jump up and down and move it all around 🎺

So where does Monica stand? Let’s take a look at its loss percent compared to the other names.

Finding Monica in `babynames`

First I filtered the data set to see how the loss percent for Monica compares to the list of the top poisoned names.

top1000 %>%
  filter(name == "Monica") %>%
  select(name, year, loss_pct, n) %>%
  arrange(desc(loss_pct))

## # A tibble: 136 x 4
##    name    year loss_pct     n
##    <chr>  <dbl>    <dbl> <int>
##  1 Monica  1999     34.2  2133
##  2 Monica  1902     32.0    33
##  3 Monica  1998     24.7  3229
##  4 Monica  1914     21.8   156
##  5 Monica  1899     18.9    30
##  6 Monica  1919     18.3   178
##  7 Monica  1888     18.0    13
##  8 Monica  1918     17.3   223
##  9 Monica  1936     17.1   273
## 10 Monica  2013     16.8   597
## # … with 126 more rows

Only a 34% loss for Monica in 1999 compared to 70% for Hilary in 1993 and 78% for Farrah in 1978.

Graphing it

Let’s take a look at the names side by side using ggplot2.

top1000 %>%
  mutate(percent = prop*100) %>%
  filter(name == "Monica" | name == "Hilary" | name == "Hillary") %>%
  ggplot(aes(year, percent, colour = name)) +
  geom_line() +
  labs(title = "Percent of girls given the name Hilary/Hillary or Monica over time",
       y = "Percent",
       x = "Year")

Wow, that really changes my perspective. Hilary may have the largest loss percent in a given year, but was the impact of the drop in Monicas greater because there were more Monicas to begin with? It’s hard to tell. It looks like there was already a downward trend after the 1970s, but that the trend started to level off in the 1990s (maybe Monica from Friends wasn’t so unpopular after all?!), before plummeting in ’98 and ’99.

🎵 And if it looks like this then you’re doing it right 🎵

The risk difference

In public health we use both the relative risk² and the risk difference. They provide two different perspectives on the same information, and the usefulness of each measure depends on what question you are hoping to answer³. When the question is about the population-level impact of a factor on an outcome, then the risk difference is a more useful measure. When the question is about the strength of an association, the relative risk is the best option. The question about which baby name was the most poisoned–that is, which name had the strongest drop in popularity–is a question about the strength of an association.

However, when I looked at the plot above, a new question came to mind: did the drop in the baby name Monica have a greater impact in terms of the overall number of babies being named in the 90s?

In more detail, the risk difference (RD) is the difference in proportions. It tells us what the excess risk of disease is among those who have been exposed to something vs. those who have not been exposed. In a study on alcohol use and breast cancer⁴, a 40-year-old woman has an absolute risk of 1.45 percent of developing breast cancer in the next 10 years. If she is a light drinker, this risk becomes 1.51 percent. The risk ratio is 1.51/1.45 = 1.04, or a 4% increase in risk, which seems like a non-negligible amount. However, the risk difference is 1.51-1.45 = 0.06, or an excess risk of .06 per 100. That’s 6 people for every 10,000 light drinkers, which is a pretty low population impact. When you put it in absolute terms, the RD brings a different perspective to certain types of risk.
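The two measures in the alcohol example come from the same pair of numbers, which a short base R sketch makes concrete. The variable names here are illustrative only.

```r
# Absolute 10-year risks from the alcohol example, in percent
risk_baseline      <- 1.45
risk_light_drinker <- 1.51

risk_ratio <- risk_light_drinker / risk_baseline   # relative measure
risk_diff  <- risk_light_drinker - risk_baseline   # absolute measure, per 100

# Rescale the risk difference from "per 100" to "per 10,000"
excess_per_10000 <- risk_diff * 100

round(risk_ratio, 2)
round(risk_diff, 2)
round(excess_per_10000)
```

The ratio answers "how much stronger is the risk?", while the difference answers "how many extra cases is that?", which is why the same data can sound alarming or reassuring depending on which one is reported.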

The risk difference in `babynames`

What would the risk difference say about these poisoned names? I’ll calculate the risk differences in the babynames data set. In this analysis, we can think of the year as the “exposure” and the name as the “outcome.” We can then calculate the “risk” or probability of being named Monica in any given year compared to the prior year. To make this more interpretable, I also calculated the excess risk, that is, how many more (or fewer) babies per 10,000 were named Monica compared to the previous year, by multiplying the risk difference by 10,000.

rds <- top1000 %>%
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(riskdiff= ifelse(year == lag(year)+1, prop - lag(prop), NA),
         excessrisk_per10000 = round(riskdiff*10000,1)) %>%
  ungroup()

rds %>%
  filter(sex == "F") %>%
  arrange(riskdiff) %>%
  filter(row_number() <= 14) %>%
  select(year, name, n, relrisk, loss_pct, riskdiff, excessrisk_per10000) %>%
  kable()

year	name	n	relrisk	loss_pct	riskdiff	excessrisk_per10000
1937	Shirley	26816	0.7459047	25.409533	-0.0082912	-82.9
1936	Shirley	35159	0.8372019	16.279809	-0.0063452	-63.5
1950	Linda	80432	0.8821419	11.785807	-0.0061105	-61.1
1951	Linda	73972	0.8755442	12.445582	-0.0056921	-56.9
1985	Jennifer	42650	0.8238794	17.612060	-0.0049392	-49.4
1952	Linda	67088	0.8807161	11.928385	-0.0047766	-47.8
1970	Lisa	38964	0.8326375	16.736249	-0.0042753	-42.8
1957	Deborah	40070	0.8224526	17.754744	-0.0041239	-41.2
1954	Linda	55381	0.8757818	12.421825	-0.0039456	-39.5
1958	Cynthia	31003	0.8008784	19.912163	-0.0037328	-37.3
1883	Mary	8012	0.9475668	5.243321	-0.0036927	-36.9
1977	Amy	26731	0.8150250	18.497499	-0.0036882	-36.9
1938	Shirley	23769	0.8556230	14.437703	-0.0035140	-35.1
1953	Linda	61275	0.9006528	9.934718	-0.0035037	-35.0

So in absolute terms, the biggest drops were among names that have higher frequencies. The overall impact is greater. There is a decrease of 83 Shirleys per 10,000 babies born in the year 1937 compared to the prior year. Sorry Shirley!

Hilary/Hillary and Monica are nowhere near the top losses if we look at the risk difference.
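A back-of-envelope check shows why the two rankings disagree. Because the loss percent and the risk difference are both built from the same two proportions, the prior-year proportion can be recovered from them. The sketch below does this in base R for Hilary (1993) and Shirley (1937), using values copied from the tables in this post; the column names `prev_prop` and `prev_per10000` are my own.

```r
# Values copied from the kable tables in this post
drops <- data.frame(
  name     = c("Hilary", "Shirley"),
  year     = c(1993, 1937),
  loss_pct = c(70.19101, 25.409533),
  riskdiff = c(-0.0004097, -0.0082912)
)

# Since riskdiff = prop - prev and loss_pct = (1 - prop / prev) * 100,
# the prior-year proportion is prev = -riskdiff / (loss_pct / 100)
drops$prev_prop <- -drops$riskdiff / (drops$loss_pct / 100)
drops$prev_per10000 <- round(drops$prev_prop * 10000, 1)

drops[, c("name", "year", "loss_pct", "prev_per10000")]
```

On these numbers, roughly 6 of every 10,000 girls were named Hilary the year before its drop, compared with roughly 326 Shirleys, so even a much smaller percent loss for Shirley translates into far more babies.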

But what if we compare Hilary to Monica? Who has the biggest drop in absolute terms?

rds %>%
  filter(sex == "F") %>%
  filter(name == "Monica" | name == "Hilary" | name == "Hillary",
         year > 1990) %>%
  arrange(riskdiff) %>%
  filter(row_number() <= 10) %>%
  select(year, name, n, riskdiff, excessrisk_per10000) %>%
  kable()

year	name	n	riskdiff	excessrisk_per10000
1993	Hillary	1064	-0.0007180	-7.2
1999	Monica	2133	-0.0005702	-5.7
1998	Monica	3229	-0.0005462	-5.5
1993	Hilary	343	-0.0004097	-4.1
1994	Hillary	408	-0.0003304	-3.3
1993	Monica	3900	-0.0003101	-3.1
1991	Monica	4156	-0.0001239	-1.2
2000	Monica	1990	-0.0000984	-1.0
2003	Monica	1613	-0.0000964	-1.0
2001	Monica	1793	-0.0000920	-0.9

And Hillary wins with an excess risk of -7.2 per 10,000 babies in 1993! In 1993, there were about 7 fewer babies named Hillary than in 1992 for every 10,000 babies born. It’s even larger if you combine Hilary and Hillary. Interestingly, Hillary had a larger risk difference compared to Hilary in 1993. Using the risk ratio and risk difference, Hilary is the more poisoned name.

P.S. Seriously, watch this talk or read this story by Monica Lewinsky. It’s inspiring and adds a lot to current conversations about harassment and bullying.

¹ I’ll admit that I’ve never actually watched Friends, but this is the impression I got from random clips!
² The relative risk is also referred to as the risk ratio or the cumulative incidence ratio in epidemiology. I know, why don’t we just stick to calling it one thing?!
³ The usefulness of these measures is often the source of a lot of confusion in the media when there are stories about the benefit of certain treatments and screenings, for example, the benefits of mammography or the harms of alcohol use and birth control. One reason I love The Upshot is because they do an excellent job of explaining this to a non-epidemiologist audience.
⁴ https://www.nytimes.com/2017/11/10/upshot/health-alcohol-cancer-research.html


Monica Gerber

Data Scientist in Public Health
