Welcome to this training session on the programming language/software environment R! This document will have all of the code we are going to use in these sessions, which can be copied and pasted from here directly into your RStudio window to try them out.
Most of your work will take place in the code editor window. This is where you can write R scripts (a script is just a set of programming instructions) and save them to be run or edited later. When you run a script or part of a script, RStudio sends it to the console, which is where R actually processes code. You can also run code by entering it directly into the console, but this is mostly used for checking that a command does what you want, or for quickly examining some of your data, since code entered here isn't saved into a document the way code in your code editor is.
You can see what I mean with a simple R program. Enter the command below into your console.
print("Hello World")
## [1] "Hello World"
Congrats! You’ve written your first R program. Note that R does exactly what you told it: it prints "Hello World" in the console. You can also run this same line of code by writing it into your code editor, highlighting it, and hitting Ctrl + Enter if you are on a Windows or Linux machine, or Cmd + Enter if you are on a Mac. When you run code from the code editor like this, note that the output still shows up down in the console.
Most of what you do in R will not be analysis in the sense of running regressions or other statistical tests, or even making graphs. Most of the work is what makes those things possible. This work is called data carpentry or data cleaning. Data carpentry involves everything from subsetting data to recoding variables to transforming data to merging datasets. It is the fundamental work that makes data analysis possible. It is also what will probably create the most headaches for you.
To learn data carpentry, we’re going to focus on two datasets: the Correlates of State Policy Project (CSPP) dataset that we looked at above, and the General Social Survey (GSS). You already know how to load the CSPP dataset. To load the GSS, load the “foreign” library and use its read.dta() function to read the dataset and store it.
library(foreign)
read.dta("~/Stats/Data/GSS/GSS7218_R1.dta") -> df.gss
There are multiple ways to do anything you want to do in R. Sometimes one way is genuinely better, and sometimes it’s purely a matter of aesthetics which one you choose. In this section, we’re going to look at some basic data carpentry using “vanilla” R, and then we will move into data carpentry with the tidyverse, a set of packages that work together extremely well to allow you to write flexible and readable R code.
So far, we’ve seen how to subset a dataset using square brackets. But the techniques of subsetting we’ve looked at haven’t been very useful. Often, you want to subset based on a variable value. For example, you only want to look at black respondents in a survey. Or you’re only interested in years after 1989 in some panel data. This can be done via square brackets fairly simply.
Let’s look at the CSPP data (since we deleted all of our objects, you’ll need to rerun the code to save the data). If you browse the data frame, you can see it begins in the year 1900. Let’s say we’re only interested in data after 1970. In that case, we’ll want to drop all observations (i.e. rows) in which the year is before 1970. To do that, we need to use logical elements.
Logical elements are comparable to numeric or string elements, but they can only take two values: TRUE and FALSE. There are lots of expressions you can enter into R and have it evaluate as true or false. Three of the most common operators are >, <, and ==.
5>1
## [1] TRUE
1>5
## [1] FALSE
1==5
## [1] FALSE
5==5
## [1] TRUE
To subset our dataset to only include 1970 and later, we want to produce an expression that will be TRUE for those dates, and FALSE for any others. We’re also going to drop all the columns except year, st, and unemployment.
df.cspp[df.cspp$year>1969, c("year", "st", "unemployment")]
## year st unemployment
## 3571 1970 AL NA
## 3572 1970 AK NA
## 3573 1970 AZ NA
## 3574 1970 AR NA
## 3575 1970 CA NA
## 3576 1970 CO NA
## 3577 1970 CT NA
## 3578 1970 DE NA
## 3579 1970 DC NA
## 3580 1970 FL NA
## 3581 1970 GA NA
## 3582 1970 HI NA
## 3583 1970 ID NA
## 3584 1970 IL NA
## 3585 1970 IN NA
## 3586 1970 IA NA
## 3587 1970 KS NA
## 3588 1970 KY NA
## 3589 1970 LA NA
## 3590 1970 ME NA
## 3591 1970 MD NA
## 3592 1970 MA NA
## 3593 1970 MI NA
## 3594 1970 MN NA
## 3595 1970 MS NA
## 3596 1970 MO NA
## 3597 1970 MT NA
## 3598 1970 NE NA
## 3599 1970 NV NA
## 3600 1970 NH NA
## 3601 1970 NJ NA
## 3602 1970 NM NA
## 3603 1970 NY NA
## 3604 1970 NC NA
## 3605 1970 ND NA
## 3606 1970 OH NA
## 3607 1970 OK NA
## 3608 1970 OR NA
## 3609 1970 PA NA
## 3610 1970 RI NA
## 3611 1970 SC NA
## 3612 1970 SD NA
## 3613 1970 TN NA
## 3614 1970 TX NA
## 3615 1970 UT NA
## 3616 1970 VT NA
## 3617 1970 VA NA
## 3618 1970 WA NA
## 3619 1970 WV NA
## 3620 1970 WI NA
Try running the expression we used to subset.
df.cspp$year>1969
The result is a vector of logical elements the same length as the vector df.cspp$year. For each element in that vector, R evaluates whether it is greater than 1969, and outputs either TRUE or FALSE depending on the answer. When this vector is placed inside the brackets, R goes through the rows of the dataset, and checks whether the vector’s corresponding row (1st row of vector compared with 1st row of dataset, 2nd with 2nd, etc) is either TRUE or FALSE. If it’s TRUE, R keeps it. If it’s FALSE, R drops it.
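You can see the same mechanism on a toy vector (a made-up example, not from the CSPP):

```r
# A small vector of years
x <- c(1900, 1950, 1970, 1990)

# Comparing the whole vector to a number gives a logical vector
x > 1969
## [1] FALSE FALSE  TRUE  TRUE

# Putting that logical vector inside brackets keeps only the TRUE positions
x[x > 1969]
## [1] 1970 1990
```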
You can also use ==. Note that == is different from =. You will forget this a lot, and use = when you should use ==. If we wanted to drop all observations except the year 1999, we would do the following.
df.cspp[df.cspp$year==1999, c("year", "st", "unemployment")]
## year st unemployment
## 5050 1999 AL 4.8
## 5051 1999 AK 6.4
## 5052 1999 AZ 4.4
## 5053 1999 AR 4.5
## 5054 1999 CA 5.2
## 5055 1999 CO 2.9
## 5056 1999 CT 3.2
## 5057 1999 DE 3.5
## 5058 1999 DC NA
## 5059 1999 FL 3.9
## 5060 1999 GA 4.0
## 5061 1999 HI 5.6
## 5062 1999 ID 5.2
## 5063 1999 IL 4.3
## 5064 1999 IN 3.0
## 5065 1999 IA 2.5
## 5066 1999 KS 3.0
## 5067 1999 KY 4.5
## 5068 1999 LA 5.1
## 5069 1999 ME 4.1
## 5070 1999 MD 3.5
## 5071 1999 MA 3.2
## 5072 1999 MI 3.8
## 5073 1999 MN 2.8
## 5074 1999 MS 5.1
## 5075 1999 MO 3.4
## 5076 1999 MT 5.2
## 5077 1999 NE 2.9
## 5078 1999 NV 4.4
## 5079 1999 NH 2.7
## 5080 1999 NJ 4.6
## 5081 1999 NM 5.6
## 5082 1999 NY 5.2
## 5083 1999 NC 3.2
## 5084 1999 ND 3.4
## 5085 1999 OH 4.3
## 5086 1999 OK 3.4
## 5087 1999 OR 5.7
## 5088 1999 PA 4.4
## 5089 1999 RI 4.1
## 5090 1999 SC 4.5
## 5091 1999 SD 2.9
## 5092 1999 TN 4.0
## 5093 1999 TX 4.6
## 5094 1999 UT 3.7
## 5095 1999 VT 3.0
## 5096 1999 VA 2.8
## 5097 1999 WA 4.7
## 5098 1999 WV 6.6
## 5099 1999 WI 3.0
## 5100 1999 WY 4.9
The opposite of == is != (does not equal). If we wanted to dump Illinois from our dataset, to spite the flatlanders, we would use this code.
df.cspp[df.cspp$st != "IL", c("year", "st", "unemployment")]
We can also use the & operator to combine two conditions. R will then evaluate both of them, and return TRUE only if both of them are true. Let’s say we wanted to look at Wisconsin after 1980.
df.cspp[df.cspp$st=="WI" & df.cspp$year>1979, c("year", "st", "unemployment")]
## year st unemployment
## 4130 1980 WI 7.0
## 4181 1981 WI 7.8
## 4232 1982 WI 10.7
## 4283 1983 WI 10.4
## 4334 1984 WI 7.3
## 4385 1985 WI 7.2
## 4436 1986 WI 7.0
## 4487 1987 WI 6.1
## 4538 1988 WI 4.2
## 4589 1989 WI 4.4
## 4640 1990 WI 4.4
## 4691 1991 WI 5.4
## 4742 1992 WI 5.1
## 4793 1993 WI 4.7
## 4844 1994 WI 4.7
## 4895 1995 WI NA
## 4946 1996 WI 3.5
## 4997 1997 WI 3.7
## 5048 1998 WI 3.4
## 5099 1999 WI 3.0
## 5150 2000 WI 3.5
## 5201 2001 WI 4.6
## 5252 2002 WI 5.5
## 5303 2003 WI 5.6
## 5354 2004 WI 5.0
## 5405 2005 WI 4.7
## 5456 2006 WI 4.7
## 5507 2007 WI 4.9
## 5558 2008 WI 4.9
## 5609 2009 WI 8.6
## 5660 2010 WI 8.7
## 5711 2011 WI 7.8
## 5762 2012 WI 7.0
## 5813 2013 WI 6.7
## 5864 2014 WI 5.4
## 5915 2015 WI 4.6
## 5966 2016 WI 4.1
## 6017 2017 WI 3.3
## 6068 2018 WI NA
## 6119 2019 WI NA
Finally, there’s the “or” operator, | (that’s shift + the backslash key). If we wanted to look just at the years 1990 and 2000, we could use this operator. Note that | is used in combination with two other expressions that can be evaluated as TRUE or FALSE.
df.cspp[df.cspp$year==1990 | df.cspp$year==2000, c("year", "st", "unemployment")]
## year st unemployment
## 4591 1990 AL 6.8
## 4592 1990 AK 6.9
## 4593 1990 AZ 5.3
## 4594 1990 AR 6.9
## 4595 1990 CA 5.6
## 4596 1990 CO 4.9
## 4597 1990 CT 5.1
## 4598 1990 DE 5.1
## 4599 1990 DC NA
## 4600 1990 FL 5.9
## 4601 1990 GA 5.4
## 4602 1990 HI 2.8
## 4603 1990 ID 5.8
## 4604 1990 IL 6.2
## 4605 1990 IN 5.3
## 4606 1990 IA 4.2
## 4607 1990 KS 4.0
## 4608 1990 KY 5.8
## 4609 1990 LA 6.2
## 4610 1990 ME 5.1
## 4611 1990 MD 4.6
## 4612 1990 MA 6.0
## 4613 1990 MI 7.5
## 4614 1990 MN 4.8
## 4615 1990 MS 7.5
## 4616 1990 MO 5.7
## 4617 1990 MT 5.8
## 4618 1990 NE 2.2
## 4619 1990 NV 4.9
## 4620 1990 NH 5.6
## 4621 1990 NJ 5.0
## 4622 1990 NM 6.3
## 4623 1990 NY 5.2
## 4624 1990 NC 4.1
## 4625 1990 ND 3.9
## 4626 1990 OH 5.7
## 4627 1990 OK 5.6
## 4628 1990 OR 5.5
## 4629 1990 PA 5.4
## 4630 1990 RI 6.7
## 4631 1990 SC 4.7
## 4632 1990 SD 3.7
## 4633 1990 TN 5.2
## 4634 1990 TX 6.2
## 4635 1990 UT 4.3
## 4636 1990 VT 5.0
## 4637 1990 VA 4.3
## 4638 1990 WA 4.9
## 4639 1990 WV 8.3
## 4640 1990 WI 4.4
## 4641 1990 WY 5.4
## 5101 2000 AL 4.6
## 5102 2000 AK 6.6
## 5103 2000 AZ 3.9
## 5104 2000 AR 4.4
## 5105 2000 CA 4.9
## 5106 2000 CO 2.7
## 5107 2000 CT 2.3
## 5108 2000 DE 4.0
## 5109 2000 DC NA
## 5110 2000 FL 3.6
## 5111 2000 GA 3.7
## 5112 2000 HI 4.3
## 5113 2000 ID 4.9
## 5114 2000 IL 4.4
## 5115 2000 IN 3.2
## 5116 2000 IA 2.6
## 5117 2000 KS 3.7
## 5118 2000 KY 4.1
## 5119 2000 LA 5.5
## 5120 2000 ME 3.5
## 5121 2000 MD 3.9
## 5122 2000 MA 2.6
## 5123 2000 MI 3.6
## 5124 2000 MN 3.3
## 5125 2000 MS 5.7
## 5126 2000 MO 3.5
## 5127 2000 MT 4.9
## 5128 2000 NE 3.0
## 5129 2000 NV 4.1
## 5130 2000 NH 2.8
## 5131 2000 NJ 3.8
## 5132 2000 NM 4.9
## 5133 2000 NY 4.6
## 5134 2000 NC 3.6
## 5135 2000 ND 3.0
## 5136 2000 OH 4.1
## 5137 2000 OK 3.0
## 5138 2000 OR 4.9
## 5139 2000 PA 4.2
## 5140 2000 RI 4.1
## 5141 2000 SC 3.9
## 5142 2000 SD 2.3
## 5143 2000 TN 3.9
## 5144 2000 TX 4.2
## 5145 2000 UT 3.2
## 5146 2000 VT 2.9
## 5147 2000 VA 2.2
## 5148 2000 WA 5.2
## 5149 2000 WV 5.5
## 5150 2000 WI 3.5
## 5151 2000 WY 3.9
Time for some exercises!
What if we wanted every year in our dataset except the 1980s (Reaganism, hair metal, the rise of Jay Leno — there are lots of reasons to want to forget the 80s!). What code would do it?
Let’s say we just wanted to see state-years in which unemployment was above 9%. What code would do it?
What if we wanted to see the 1990s, but not 1997?
Let’s compare two states. Say, North Carolina and South Carolina. What code would subset the data to only those two states?
As you can see, the code to subset based on more than one condition gets pretty ugly pretty fast. When you combine that with other operations (for example, say you wanted a vector of unemployment rates converted to decimal notation by dividing them by 100), things can get downright unreadable.
The tidyverse is a set of libraries designed to fix this problem. To install it, use:
install.packages("tidyverse")
Then load it using the library() function.
The most fundamental change the tidyverse introduces is the pipe operator. Pipes look like this: %>%. They allow you to place functions in a line, one after another, instead of nesting them inside one another.
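In its simplest form, x %>% f() is just another way of writing f(x), and x %>% f() %>% g() is another way of writing g(f(x)). A quick made-up example:

```r
library(tidyverse)

# Nested: you have to read from the inside out
sqrt(sum(c(1, 4, 9, 11)))
## [1] 5

# Piped: you read left to right, one step at a time
c(1, 4, 9, 11) %>% sum() %>% sqrt()
## [1] 5
```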
To make this concrete, let’s say we wanted to look at unemployment in Wisconsin, and specifically at the change in the unemployment rate each year. We can use the diff() function for that, which takes a numeric vector as an argument and calculates the difference between each element and the one before it. Finally, let’s say that we’re pretending to be economists, and so we always convert our variables to logarithmic scales to make ourselves feel smart. Since some of the changes are negative, we’ll need to make them all positive, which we can do by adding 10 to all of them.
In base R, that looks like this.
log(diff(df.cspp[df.cspp$st=="WI", "unemployment"])+10)
## [1] NA NA NA NA NA NA NA NA
## [9] NA NA NA NA NA NA NA NA
## [17] NA NA NA NA NA NA NA NA
## [25] NA NA NA NA NA NA NA NA
## [33] NA NA NA NA NA NA NA NA
## [41] NA NA NA NA NA NA NA NA
## [49] NA NA NA NA NA NA NA NA
## [57] NA NA NA NA NA NA NA NA
## [65] NA NA NA NA NA NA NA NA
## [73] NA NA NA 2.163323 2.230014 2.322388 2.240710 2.525729
## [81] 2.379546 2.557227 2.272126 1.931521 2.292535 2.282382 2.208274 2.091864
## [89] 2.322388 2.302585 2.397895 2.272126 2.261763 2.302585 NA NA
## [97] 2.322388 2.272126 2.261763 2.351375 2.406945 2.388763 2.312535 2.240710
## [105] 2.272126 2.302585 2.322388 2.302585 2.617396 2.312535 2.208274 2.219203
## [113] 2.272126 2.163323 2.219203 2.251292 2.219203 NA NA
It’s pretty hideous to read. What object are you actually starting with here? It takes a few seconds to even figure it out. This is where pipes come to the rescue.
df.cspp[df.cspp$st=="WI", "unemployment"] %>% #note: you can put lots of pipes on the same line, but it's often more readable not to
diff() %>%
+ 10 %>%
log()
## [1] NA NA NA NA NA NA NA NA
## [9] NA NA NA NA NA NA NA NA
## [17] NA NA NA NA NA NA NA NA
## [25] NA NA NA NA NA NA NA NA
## [33] NA NA NA NA NA NA NA NA
## [41] NA NA NA NA NA NA NA NA
## [49] NA NA NA NA NA NA NA NA
## [57] NA NA NA NA NA NA NA NA
## [65] NA NA NA NA NA NA NA NA
## [73] NA NA NA 2.163323 2.230014 2.322388 2.240710 2.525729
## [81] 2.379546 2.557227 2.272126 1.931521 2.292535 2.282382 2.208274 2.091864
## [89] 2.322388 2.302585 2.397895 2.272126 2.261763 2.302585 NA NA
## [97] 2.322388 2.272126 2.261763 2.351375 2.406945 2.388763 2.312535 2.240710
## [105] 2.272126 2.302585 2.322388 2.302585 2.617396 2.312535 2.208274 2.219203
## [113] 2.272126 2.163323 2.219203 2.251292 2.219203 NA NA
Note that these two chunks of code produce the exact same output. But the second is much easier to read, because you can easily separate out every step of what you’re doing, instead of trying to find the innermost function and working backwards from there. From here on out, I’ll be using pipes for basically everything.
The tidyverse doesn’t just give us a nicer way to write our code here. It also gives us functions that allow us to subset a lot more effectively. The two key functions are select() and filter().
select() is the simpler of the two. It allows you to choose which columns from a dataframe you’d like to keep. We’ve been using some rather clunky base R syntax to keep just the state, year, and unemployment columns from df.cspp. Using select(), it’s a lot easier to read.
df.cspp %>% select(year,
st,
unemployment)
## year st unemployment
## 1 1900 AL NA
## 2 1900 AK NA
## 3 1900 AZ NA
## 4 1900 AR NA
## 5 1900 CA NA
## 6 1900 CO NA
## 7 1900 CT NA
## 8 1900 DE NA
## 9 1900 DC NA
## 10 1900 FL NA
## 11 1900 GA NA
## 12 1900 HI NA
## 13 1900 ID NA
## 14 1900 IL NA
## 15 1900 IN NA
## 16 1900 IA NA
## 17 1900 KS NA
## 18 1900 KY NA
## 19 1900 LA NA
## 20 1900 ME NA
## 21 1900 MD NA
## 22 1900 MA NA
## 23 1900 MI NA
## 24 1900 MN NA
## 25 1900 MS NA
## 26 1900 MO NA
## 27 1900 MT NA
## 28 1900 NE NA
## 29 1900 NV NA
## 30 1900 NH NA
## 31 1900 NJ NA
## 32 1900 NM NA
## 33 1900 NY NA
## 34 1900 NC NA
## 35 1900 ND NA
## 36 1900 OH NA
## 37 1900 OK NA
## 38 1900 OR NA
## 39 1900 PA NA
## 40 1900 RI NA
## 41 1900 SC NA
## 42 1900 SD NA
## 43 1900 TN NA
## 44 1900 TX NA
## 45 1900 UT NA
## 46 1900 VT NA
## 47 1900 VA NA
## 48 1900 WA NA
## 49 1900 WV NA
## 50 1900 WI NA
You can also use select to rename variables. This is often necessary, because a lot of datasets use totally non-descriptive names like VCF01094 for their variables. In this case, let’s say we wanted to change “st” to “state.”
df.cspp %>% select(year,
state=st,
unemployment)
When renaming, the new name goes before the equal sign, and the old name after. Note that in select(), unlike when using square brackets, you don’t need to tell R that it’s using strings. Since select() is only used for selecting column names, R just assumes that whatever you’re entering into it is a string.
The filter() function then allows us to also subset based on variable values. Its logic is fundamentally the same as square brackets. It evaluates statements as TRUE or FALSE and then drops rows for which the condition(s) are false.
df.cspp %>% select(year,
state=st,
unemployment) %>%
filter(year==1999)
## year state unemployment
## 1 1999 AL 4.8
## 2 1999 AK 6.4
## 3 1999 AZ 4.4
## 4 1999 AR 4.5
## 5 1999 CA 5.2
## 6 1999 CO 2.9
## 7 1999 CT 3.2
## 8 1999 DE 3.5
## 9 1999 DC NA
## 10 1999 FL 3.9
## 11 1999 GA 4.0
## 12 1999 HI 5.6
## 13 1999 ID 5.2
## 14 1999 IL 4.3
## 15 1999 IN 3.0
## 16 1999 IA 2.5
## 17 1999 KS 3.0
## 18 1999 KY 4.5
## 19 1999 LA 5.1
## 20 1999 ME 4.1
## 21 1999 MD 3.5
## 22 1999 MA 3.2
## 23 1999 MI 3.8
## 24 1999 MN 2.8
## 25 1999 MS 5.1
## 26 1999 MO 3.4
## 27 1999 MT 5.2
## 28 1999 NE 2.9
## 29 1999 NV 4.4
## 30 1999 NH 2.7
## 31 1999 NJ 4.6
## 32 1999 NM 5.6
## 33 1999 NY 5.2
## 34 1999 NC 3.2
## 35 1999 ND 3.4
## 36 1999 OH 4.3
## 37 1999 OK 3.4
## 38 1999 OR 5.7
## 39 1999 PA 4.4
## 40 1999 RI 4.1
## 41 1999 SC 4.5
## 42 1999 SD 2.9
## 43 1999 TN 4.0
## 44 1999 TX 4.6
## 45 1999 UT 3.7
## 46 1999 VT 3.0
## 47 1999 VA 2.8
## 48 1999 WA 4.7
## 49 1999 WV 6.6
## 50 1999 WI 3.0
## 51 1999 WY 4.9
If you want to subset based on multiple conditions with filter(), just separate them with a comma. Let’s say we want to see state-years after 1980 where unemployment was below 5%.
df.cspp %>% select(year,
state=st,
unemployment) %>%
filter(year>1980,
unemployment<5)
Now let’s do the same exercises we did in base R using the tidyverse.
What if we wanted every year in our dataset except the 1980s (Reaganism, hair metal, the rise of Jay Leno — there are lots of reasons to want to forget the 80s!). What code would do it?
Let’s say we just wanted to see state-years in which unemployment was above 9%. What code would do it?
What if we wanted to see the 1990s, but not 1997?
Let’s compare two states. Say, North Carolina and South Carolina. What code would subset the data to only those two states?
Subsetting data is a very important skill, and on every dataset you ever analyze you’ll probably either filter on a variable value or select only certain variables to work with. Most of what you’ll need to know is contained above.
You’ll spend much more time transforming data. Data can be transformed in lots of ways. Variables can be recoded: you can take a numerical variable, like income, and recode it into buckets (low income, middle income, high income), making it categorical. You can generate new variables from the values of several existing variables: is a survey respondent white and making an income in the bottom third of the income distribution? Make a new variable, white working class, that gives those people a 1 and everyone else a 0. You can pivot data around a variable value: for example, if you have individual-level survey data with a geographic residence variable like state, you can transform your data into state-level summaries of other variables. You can also make crosstabs, which display the distribution of two variables’ values across each other. We’re only going to be able to skim the surface of data transformation here, but we’ll cover the fundamentals.
One place to begin is with the tibble. A tibble is a kind of dataframe specific to the tidyverse. Pretty much any other dataframe can be made into a tibble. Here is mtcars as a tibble, via the function as_tibble().
mtcars %>%
as_tibble()
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
As you can see, tibbles have two very nice properties. First, when you print them in the console, they only show the first ten rows, and only enough columns to fill your console screen horizontally. That way, you’re not looking at 500 rows of data where you can only see some column names some of the time. Second, they have little labels in angle brackets telling you what class each variable is.
There are a good number of variable classes in R. The four basic ones are numeric (which can be doubles, having decimals, or integers, which don’t), character (which are strings), logical (TRUE or FALSE), and factors. Factors are the most complicated. A factor is a variable that R treats as a categorical variable, with defined levels. For example, in the GSS, the race variable is coded as “black,” “white,” and “other.” If that were imported into R as a factor, you couldn’t just add “Asian American” or something like that to it, which you could if it were a string. Instead, you’d have to tell R that you’re adding a new level to this factor. A lot of very minor headaches in R come from forgetting whether a variable is currently coded as a character or a factor. This is one reason tibbles are so nice. They remind you!
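A toy example (made up, not the actual GSS variable) shows the difference:

```r
race.chr <- c("black", "white", "other")
race.fct <- factor(race.chr)

# Factors store their allowed categories as levels (sorted alphabetically)
levels(race.fct)
## [1] "black" "other" "white"

# Assigning a value that isn't a defined level gives you NA (with a warning)
race.fct[2] <- "asian"

# To add it properly, declare the new level first, then assign
race.fct <- factor(race.fct, levels = c(levels(race.fct), "asian"))
race.fct[2] <- "asian"
```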
From here on out, I’m going to convert all dataframes that aren’t tibbles into tibbles. Again, the nice thing about the tidyverse is that it’s very easy to just throw this in a pipe!
Very often, we want to summarize a lot of data in a table. The simplest form of this is summarizing a vector. The table() function does this. To play with this function, let’s switch over to the GSS.
Examine it in the console by entering the object’s name (I saved mine as df.gss). As you can see from the summary in the Environment pane in the upper right, df.gss has about 64,000 observations and over 6,000 variables. We are definitely going to want to select only the variables we are interested in when we analyze GSS data!
Let’s start by turning our GSS object into a tibble.
df.gss %>%
as_tibble() -> df.gss
Picking a variable more or less at random, let’s say we wanted to look at a summary of people’s answer to the religion question. The basic GSS religion variable is relig. The table() function will summarize it for us.
df.gss$relig %>% table()
## .
## protestant catholic jewish
## 37117 15674 1285
## none other buddhism
## 7797 1086 198
## hinduism OTHER EASTERN MOSLEM/ISLAM
## 100 39 153
## ORTHODOX-CHRISTIAN christian NATIVE AMERICAN
## 118 791 31
## INTER-NONDENOMINATIONAL DK IAP
## 136 0 0
## NA
## 0
Lots of Protestants and Catholics, and a sprinkling of others. Since these are survey data, raw counts like this aren’t terribly informative. Percentages are what we really want (we are going to ignore issues of survey weighting in this tutorial, since it’s a bit more of a substantive statistical topic). This is what the prop.table() function is for. It takes a table, and renders it as percentages.
df.gss$relig %>% table() %>% prop.table()
## .
## protestant catholic jewish
## 0.5752344053 0.2429135994 0.0199147617
## none other buddhism
## 0.1208368849 0.0168306858 0.0030685781
## hinduism OTHER EASTERN MOSLEM/ISLAM
## 0.0015497869 0.0006044169 0.0023711740
## ORTHODOX-CHRISTIAN christian NATIVE AMERICAN
## 0.0018287485 0.0122588144 0.0004804339
## INTER-NONDENOMINATIONAL DK IAP
## 0.0021077102 0.0000000000 0.0000000000
## NA
## 0.0000000000
The GSS as a whole runs from 1972 to 2018, so knowing the distribution of religions across that entire period isn’t terribly informative. It’s probably more useful to get a table for a subset of years, or even a single wave of the survey.
df.gss %>%
filter(year==2000) %>%
select(relig) %>%
table() %>%
prop.table()
## .
## protestant catholic jewish
## 0.5407038749 0.2413793103 0.0223960185
## none other buddhism
## 0.1414859581 0.0149306790 0.0060433701
## hinduism OTHER EASTERN MOSLEM/ISLAM
## 0.0028439389 0.0003554924 0.0042659083
## ORTHODOX-CHRISTIAN christian NATIVE AMERICAN
## 0.0042659083 0.0138642019 0.0014219694
## INTER-NONDENOMINATIONAL DK IAP
## 0.0060433701 0.0000000000 0.0000000000
## NA
## 0.0000000000
Note that in this example I started with the dataset as a whole, rather than starting with the single vector as I did in the previous code chunk. This is because I need to have the year variable in order to filter based on it.
What about a table of two variables? This is called a crosstab. The table() function can also be used to make one. In this case, let’s look at the relationship between marital status and race. Because table() needs to take two arguments here, it might be easiest to do things without a pipe.
table(df.gss$race, df.gss$marital)
##
## married widowed divorced separated NEVER MARRIED NA
## white 29313 5088 6682 1297 9636 0
## black 3097 953 1285 771 3073 0
## other 1719 159 412 174 1128 0
## IAP 0 0 0 0 0 0
The disadvantage of this simple line of code is that it’s not terribly easy to do it for only a single year. To do that, it’s simplest to store your subsetted dataset as a new object. Let’s also look at proportions here, since again, raw counts aren’t terribly useful in a survey.
df.gss %>%
filter(year==2000) -> df.gss.2000
table(df.gss.2000$race, df.gss.2000$marital) %>% prop.table()
##
## married widowed divorced separated NEVER MARRIED
## white 0.383167614 0.077059659 0.128196023 0.021661932 0.175781250
## black 0.042968750 0.017045455 0.021306818 0.015269886 0.055752841
## other 0.027698864 0.002840909 0.007102273 0.002840909 0.021306818
## IAP 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##
## NA
## white 0.000000000
## black 0.000000000
## other 0.000000000
## IAP 0.000000000
Note the output from this isn’t immediately easy to interpret. To read a table like this, you need to know whether the first row, first column cell tells us the percentage of whites who are married, or the percentage of married people who are white. The default in prop.table() is to tell us neither; instead, it tells us the total share of observations represented by each cell. For example, 38% of respondents are white and married, while about 5% are black and never married.
To display marginal shares instead, we need to tell prop.table() we want to calculate based on row margins. Take a look at the help file for prop.table().
?prop.table()
We can see that prop.table() takes two arguments: the table generated by table(), and margin. To calculate row percentages (i.e. the percent of whites who are married), we enter a 1. Note: like a lot of functions, prop.table() has essential arguments (the table) and optional arguments (margin). In a pipe, R will always try to send the pipe output into the essential argument. If we want to add an optional argument, we just add it without specifying the essential argument.
table(df.gss.2000$race, df.gss.2000$marital) %>% prop.table(1)
##
## married widowed divorced separated NEVER MARRIED NA
## white 0.48757343 0.09805694 0.16312698 0.02756439 0.22367826 0.00000000
## black 0.28205128 0.11188811 0.13986014 0.10023310 0.36596737 0.00000000
## other 0.44827586 0.04597701 0.11494253 0.04597701 0.34482759 0.00000000
## IAP
If we wanted column marginals (percentage of married people who are white), we would simply enter 2 instead of 1.
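That would look like this (assuming df.gss.2000 from the chunk above is still in your environment):

```r
# margin = 2 gives column percentages: within each marital status,
# the share of respondents belonging to each race
table(df.gss.2000$race, df.gss.2000$marital) %>% prop.table(2)
```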
How has the distribution of educational attainment changed in the US from the year 1990 to the year 2010? Use the GSS variable “degree” to answer this question. (Hint: make two tables and compare them)
How are religious fundamentalism and confidence in American business related since 2010? Use the GSS variables “conbus” and “fund” to answer.
Do people who were born outside of the US place a different value on hard work from people who weren’t? Answer using data from 2018, and the variables “born” and “workhard”.
So far, we’ve been looking at data transformations that summarize and compare chunks of existing data. We can also make pivot tables, which are tables that pivot data around certain variables. For example, in our df.cspp dataset, we have unemployment, state, and year data. We can pivot that data on the year variable to find average unemployment for the country as a whole. In the tidyverse, the group_by() and summarize() functions are used for this.
df.cspp %>%
as_tibble() %>%
group_by(year) %>%
filter(is.na(unemployment)==FALSE) %>%
summarise(avg_unemployment = weighted.mean(unemployment,
poptotal,
na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 42 x 2
## year avg_unemployment
## <int> <dbl>
## 1 1975 8.50
## 2 1976 7.66
## 3 1977 6.96
## 4 1978 6.02
## 5 1979 5.80
## 6 1980 7.07
## 7 1981 7.66
## 8 1982 9.70
## 9 1983 9.66
## 10 1984 7.63
## # … with 32 more rows
group_by() is an incredibly useful function, and one we’ll be returning to shortly when we talk about recoding variables. It splits your data up into groups, and then does whatever operation you’re telling R to do on each group. In this case, we’ve told R that each year is a distinct group. Then, the summarise() function tells R to collapse each group into a single observation using a function. Here, we’ve told R to find the weighted mean (we need to use a weighted mean here because the unemployment rate in New York will have a bigger impact on the average than the unemployment rate in Wisconsin) of unemployment in each group. The na.rm=TRUE tells R to throw out observations with missing data. Some functions do this automatically, others need to be told how to handle it. As always, use the help function ( ?weighted.mean() ) to figure out what’s necessary.
You can also group based on more than one variable. In that case, groups are formed by the distinct combinations of the variables’ values. For example, if you grouped on a race variable coded black, white, and other, and on an education variable coded college-educated and non-college-educated, you would have six groups (3 x 2).
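To see that group structure directly, here is a made-up six-row tibble (not real survey data) grouped on both variables:

```r
library(tidyverse)

# Toy data: every combination of three races and two education levels
df.toy <- tibble(race    = rep(c("black", "white", "other"), each = 2),
                 college = rep(c("college", "no college"), times = 3))

# Six groups: one row per race-education combination
df.toy %>%
  group_by(race, college) %>%
  tally()
```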
We can use this to look at race and gender in voting in 2016.
df.gss %>%
filter(year==2018) %>%
group_by(race, sex, PRES16) %>%
tally()
## # A tibble: 40 x 4
## # Groups: race, sex [6]
## race sex PRES16 n
## <fct> <fct> <fct> <int>
## 1 white male iap 217
## 2 white male Clinton 188
## 3 white male Trump 290
## 4 white male Other candidate (specify) 42
## 5 white male Didn't vote for president 9
## 6 white male Don't know 8
## 7 white male No answer 15
## 8 white female iap 314
## 9 white female Clinton 278
## 10 white female Trump 257
## # … with 30 more rows
The tally() function is similar to summarise(), except simpler. It just counts how many observations are in each group that you’ve indicated. Once again, since this is a survey, raw counts aren’t very useful. But using the mutate() function, which we will cover in more depth soon, you can easily turn this into percentages.
df.gss %>%
filter(year==2018) %>%
group_by(race, sex, PRES16) %>%
tally() %>%
mutate(vote.share = prop.table(n))
## # A tibble: 40 x 5
## # Groups: race, sex [6]
## race sex PRES16 n vote.share
## <fct> <fct> <fct> <int> <dbl>
## 1 white male iap 217 0.282
## 2 white male Clinton 188 0.244
## 3 white male Trump 290 0.377
## 4 white male Other candidate (specify) 42 0.0546
## 5 white male Didn't vote for president 9 0.0117
## 6 white male Don't know 8 0.0104
## 7 white male No answer 15 0.0195
## 8 white female iap 314 0.340
## 9 white female Clinton 278 0.301
## 10 white female Trump 257 0.278
## # … with 30 more rows
As you can see, these answers include people who didn’t vote and people for whom the question is inapplicable. How would you alter this code so that your data only included people who answered Clinton or Trump?
In the CSPP, the variable “right2work” tracks the passage of right to work laws. Make a pivot table that shows us the number of states that do and don’t have right to work laws in each year. (Hint: this is similar to counting how many people voted for each presidential option).
What are the average personal incomes (“realrinc”) by race (“race”) for each year in the GSS?
Our data often fail us. The thing we want is measured wrong, or needs to be constructed from other measures. This is where recoding data comes in. Of all the things you can do to data that we’ve talked about today, you’ll probably spend the most time recoding.
In the tidyverse, the fundamental recoding function is mutate(). mutate() can be used either to add a new variable or transform an existing variable. For example, in the GSS, if we wanted to take the log of household income (“realinc”), we would use mutate as follows.
df.gss %>%
mutate(loginc = log(realinc))
## # A tibble: 64,814 x 6,109
## year id wrkstat HRS1 HRS2 evwork occ prestige wrkslf wrkgovt commute
## <int> <int> <fct> <int> <int> <fct> <int> <int> <fct> <fct> <int>
## 1 1972 1 WORKIN… NA NA <NA> 205 50 SOMEO… <NA> NA
## 2 1972 2 retired NA NA yes 441 45 SOMEO… <NA> NA
## 3 1972 3 WORKIN… NA NA <NA> 270 44 SOMEO… <NA> NA
## 4 1972 4 WORKIN… NA NA <NA> 1 57 SOMEO… <NA> NA
## 5 1972 5 KEEPIN… NA NA yes 385 40 SOMEO… <NA> NA
## 6 1972 6 WORKIN… NA NA <NA> 281 49 SOMEO… <NA> NA
## 7 1972 7 WORKIN… NA NA <NA> 522 41 SOMEO… <NA> NA
## 8 1972 8 WORKIN… NA NA <NA> 314 36 SOMEO… <NA> NA
## 9 1972 9 WORKIN… NA NA <NA> 912 26 SOMEO… <NA> NA
## 10 1972 10 WORKIN… NA NA <NA> 984 18 SOMEO… <NA> NA
## # … with 64,804 more rows, and 6,098 more variables: industry <int>,
## # OCC80 <fct>, PRESTG80 <int>, INDUS80 <int>, INDUS07 <int>, occonet <dbl>,
## # found <fct>, OCC10 <int>, occindv <fct>, occstatus <fct>, occtag <fct>,
## # PRESTG10 <int>, PRESTG105PLUS <int>, INDUS10 <int>, indstatus <fct>,
## # indtag <fct>, marital <fct>, martype <fct>, agewed <int>, divorce <fct>,
## # widowed <fct>, spwrksta <fct>, SPHRS1 <int>, SPHRS2 <int>, spevwork <fct>,
## # cowrksta <fct>, cowrkslf <fct>, coevwork <fct>, COHRS1 <int>, COHRS2 <int>,
## # spocc <int>, sppres <int>, spwrkslf <fct>, spind <int>, SPOCC80 <int>,
## # SPPRES80 <int>, SPIND80 <int>, SPOCC10 <int>, spoccindv <fct>,
## # spoccstatus <fct>, spocctag <fct>, SPPRES10 <int>, SPPRES105PLUS <int>,
## # SPIND10 <fct>, spindstatus <fct>, spindtag <fct>, COOCC10 <int>,
## # COIND10 <fct>, PAOCC16 <int>, PAPRES16 <int>, pawrkslf <fct>,
## # PAIND16 <int>, PAOCC80 <fct>, PAPRES80 <int>, PAIND80 <fct>, PAOCC10 <int>,
## # paoccindv <fct>, paoccstatus <fct>, paocctag <fct>, PAPRES10 <int>,
## # PAPRES105PLUS <int>, PAIND10 <int>, paindstatus <fct>, paindtag <fct>,
## # MAOCC80 <fct>, MAPRES80 <int>, mawrkslf <fct>, MAIND80 <fct>,
## # MAOCC10 <int>, maoccindv <fct>, maoccstatus <fct>, maocctag <fct>,
## # MAPRES10 <int>, MAPRES105PLUS <int>, MAIND10 <fct>, maindstatus <fct>,
## # maindtag <fct>, sibs <int>, childs <int>, age <int>, agekdbrn <int>,
## # educ <int>, paeduc <int>, maeduc <int>, speduc <int>, coeduc <int>,
## # codeg <fct>, degree <fct>, padeg <fct>, madeg <fct>, spdeg <fct>,
## # MAJOR1 <fct>, MAJOR2 <fct>, dipged <fct>, spdipged <fct>, codipged <fct>,
## # cosector <fct>, whenhs <int>, whencol <int>, sector <fct>, …
We can compare our new variable with our old using select to only view those columns.
df.gss %>%
mutate(loginc = log(realinc)) %>%
select(realinc, loginc)
## # A tibble: 64,814 x 2
## realinc loginc
## <dbl> <dbl>
## 1 18951 9.85
## 2 24366 10.1
## 3 24366 10.1
## 4 30458 10.3
## 5 50763 10.8
## 6 43994 10.7
## 7 37226 10.5
## 8 13537 9.51
## 9 2707 7.90
## 10 18951 9.85
## # … with 64,804 more rows
Here, we tell R to create a variable called loginc, and define it as the log of the variable realinc. Note that in doing so, we haven’t actually changed the object df.gss. To do that, we need to store the result back into df.gss with an assignment.
df.gss %>%
mutate(loginc = log(realinc)) -> df.gss
We can also use mutate to combine two variables. For example, we can divide household income by the square root of the number of household members (“hompop”), which is the Census Bureau’s recommended method of adjusting household income for household size.
df.gss %>%
mutate(adj.realinc = realinc/sqrt(hompop)) -> df.gss
These are some pretty simple mathematical transformations of variables. R can do things that are much more interesting. For example, we can turn a continuous variable, like income, into a categorical variable that is easier to work with in some cases, like tables.
Essentially, this means taking something continuous, like income, and chopping it up into a few buckets. In the tidyverse, the ntile() function does this extremely nicely. Let’s say we want a variable of income quintiles.
df.gss %>%
group_by(year) %>% #quintiles should be calculated with reference to the income distribution from the year of the survey
mutate(realrinc.quintiles = ntile(realrinc, 5)) -> df.gss
Since we stored this transformation, we can now make some nice tables using income, which we couldn’t do before.
table(df.gss$degree, df.gss$realrinc.quintiles) %>% prop.table(1)
##
## 1 2 3 4 5
## LT HIGH SCHOOL 0.33639083 0.27258225 0.18683948 0.12801595 0.07617149
## HIGH SCHOOL 0.22293596 0.23374166 0.22268698 0.18548949 0.13514590
## JUNIOR COLLEGE 0.14003795 0.20417457 0.23377609 0.24060721 0.18140417
## bachelor 0.12097137 0.11317644 0.17928347 0.25828212 0.32828661
## graduate 0.06883532 0.06476910 0.10078420 0.24339239 0.52221900
## dk
## iap
## na
A lot of recoding that you’ll want to do is based on “if then” type statements. For example, if someone is white and their income is in the bottom third of the distribution, we’d like to label them as white working class. Or if someone has said they are “very conservative” or “somewhat conservative,” we’d like to simplify that into a variable where either answer marks someone as “conservative.” For these situations, we want case_when().
case_when() typically gets inserted inside of a mutate() command. It has a basic two-part structure. The first part is an expression that can be evaluated as TRUE or FALSE. If the expression is TRUE for a given row, then R assigns the second part as the new value.
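Here is the structure in miniature, on a made-up vector, before we apply it to real data (just a sketch of the syntax):

```r
library(dplyr)

sizes <- c(1, 5, 10)

case_when(sizes > 4 ~ "big",   # condition ~ value to assign when TRUE
          TRUE ~ "small")      # catch-all for everything else
## [1] "small" "big" "big"
```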
For example, here we do a very simple recode that just gives us a 1 if someone makes above the median income, and a 0 if they don’t.
df.gss %>%
group_by(year) %>%
filter(is.na(realrinc)==FALSE) %>% #This line drops all observations where realrinc is NA
mutate(abov.med = case_when(realrinc > median(realrinc) ~ 1,
realrinc <= median(realrinc) ~ 0)) %>%
select(realrinc, abov.med)
## Adding missing grouping variables: `year`
## # A tibble: 37,887 x 3
## # Groups: year [30]
## year realrinc abov.med
## <int> <dbl> <dbl>
## 1 1974 4935 0
## 2 1974 43178 1
## 3 1974 18505 0
## 4 1974 22206 1
## 5 1974 55515 1
## 6 1974 4935 0
## 7 1974 4935 0
## 8 1974 18505 0
## 9 1974 11103 0
## 10 1974 30841 1
## # … with 37,877 more rows
The tilde (“~”) separates the two parts here. realrinc > median(realrinc) can be evaluated as TRUE or FALSE. If it’s TRUE, R puts a 1 in the new variable abov.med. If it’s FALSE, R goes down to the next condition in your case_when() call, and checks whether that one is TRUE or FALSE. Since our conditions here are collectively exhaustive, we only need two. Because of this, when you’re writing a case_when() call, you never actually need to spell out your last condition. If everything not covered by one of your preceding conditions falls into the same category, you can just throw a TRUE in for your last one, and everything that’s left will get the value you associate with it.
df.gss %>%
group_by(year) %>%
filter(is.na(realrinc)==FALSE) %>%
mutate(abov.med = case_when(realrinc > median(realrinc) ~ 1,
TRUE ~ 0)) %>%
select(realrinc, abov.med)
## Adding missing grouping variables: `year`
## # A tibble: 37,887 x 3
## # Groups: year [30]
## year realrinc abov.med
## <int> <dbl> <dbl>
## 1 1974 4935 0
## 2 1974 43178 1
## 3 1974 18505 0
## 4 1974 22206 1
## 5 1974 55515 1
## 6 1974 4935 0
## 7 1974 4935 0
## 8 1974 18505 0
## 9 1974 11103 0
## 10 1974 30841 1
## # … with 37,877 more rows
Here’s how you’d use case_when() to code people as white working class or not.
df.gss %>%
group_by(year) %>%
filter(is.na(realrinc)==FALSE) %>%
mutate(inctile = ntile(realinc,3)) %>% #income into terciles by year
mutate(wwc = case_when(race=="white" & inctile==1 ~ 1, #White and lowest tercile
TRUE ~ 0)) %>%
select(year, inctile, race, wwc)
## # A tibble: 37,887 x 4
## # Groups: year [30]
## year inctile race wwc
## <int> <int> <fct> <dbl>
## 1 1974 1 white 1
## 2 1974 2 white 0
## 3 1974 1 white 1
## 4 1974 2 white 0
## 5 1974 3 white 0
## 6 1974 1 white 1
## 7 1974 3 black 0
## 8 1974 2 white 0
## 9 1974 1 black 0
## 10 1974 NA white 0
## # … with 37,877 more rows
Use case_when() to create a new variable that tells us whether someone’s height (“height”) is greater than average. (Bonus: use group_by() to make this comparison year and sex specific.)
The GSS has an unnecessarily complicated political party identification scheme. Take a look at it:
df.gss$partyid %>% table()
## .
## STRONG DEMOCRAT NOT STR DEMOCRAT IND,NEAR DEM independent
## 10378 13294 7792 9888
## IND,NEAR REP NOT STR REPUBLICAN STRONG REPUBLICAN OTHER PARTY
## 5721 9933 6318 1072
## DK NA
## 0 0
Why not simplify it into three categories: Democrat, Republican, and independent?
The GSS asks a number of questions about national spending priorities. Let’s look at two, “natheal”, which looks at healthcare, and “nateduc”, which looks at education. Let’s say we wanted to group respondents into three categories: cutters, who want to reduce spending on both; spenders, who want to increase spending on both; and midroaders, who want something in between. How would you use case_when() to create a new variable, spending.attitude, with these categories?
Sometimes we want variables from multiple datasets. This is one area where R has a huge advantage over programs like Stata or SPSS, because it’s very simple in R to work with multiple datasets at once.
The function in the tidyverse for merging datasets is inner_join(). Inner join takes three arguments: your first dataset, your second dataset, and then the columns that together identify a unique observation. For example, in our CSPP dataset, the combination of state and year is sufficient to identify a unique row. So if we were merging another dataset with the full CSPP dataset, it would also need to have a variable called state and a variable called year. Merging those two datasets would look something like this.
inner_join(df.cspp, second.df, by=c("state", "year"))
To illustrate how this works, I’ve prepared two very simple, pre-processed datasets that can be easily merged. One comes from the CSPP, and has states, years, and right to work status. The other comes from the American National Election Studies, and has states, years, and respondents’ “feeling thermometer” for big business. You can download them both from my github here and here. Just use ctrl + s (or cmd + s if you’re on a Mac) to save the files, and then load them in R and assign them names using read.csv.
Once you’ve done that, merge them, and save the resulting dataset.
inner_join(name1, name2, by=c("state", "year")) -> df.merged
Examine the resulting dataset. As you can see, it has about 34,000 rows. This is because effectively what we have now are the ANES data, but with an extra variable added that isn’t in that survey: whether the respondent lived in a state that was right to work in the year of the study.
From here, it’s very simple to turn this into a pivot table that shows us the average feeling thermometer for respondents in right to work and non-right to work states by year.
df.merged %>%
group_by(year, right2work) %>%
summarise(avg.temp = mean(big.bus))
Note: all the real work of making this table consisted of processing the datasets to make them easy to merge. You can see the R script I used to do that, and download it and run it on your computer, here.
Merging datasets will always be a pain. Two variables that measure the same thing will be named different things, or will have a slight difference in their coding that you have to identify and fix before you can merge them. You should never expect it to work the first time you try to merge two datasets.
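As a sketch of what that cleanup can look like, suppose one dataset abbreviates state names and calls its year column “yr”, while the other spells states out. The toy tibbles and variables below are invented purely for illustration:

```r
library(dplyr)

df.a <- tibble(state = c("MI", "OH"), yr = c(2010, 2010), turnout = c(0.52, 0.49))
df.b <- tibble(state = c("Michigan", "Ohio"), year = c(2010, 2010), right2work = c(0, 0))

# Harmonize the coding of state in df.b...
df.b %>%
  mutate(state = case_when(state == "Michigan" ~ "MI",
                           state == "Ohio" ~ "OH")) -> df.b

# ...and the name of the year column in df.a, then merge
df.a %>%
  rename(year = yr) %>%
  inner_join(df.b, by = c("state", "year"))
```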
Words divide, pictures unite. - Otto Neurath
Data visualization is the most effective way to present data. A table of numbers will always take your reader more effort to decipher than a graphical plot will. A lot of the tables that we made in the previous section would be much better as plots. So let’s plot some data.
The best way to make plots in R is with the library ggplot2, which is part of the tidyverse. Since you already installed the tidyverse, you already have ggplot2.
ggplot2 is based around two fundamental concepts: aesthetics and geometries. Aesthetics are variables in your dataset mapped to concepts in your plot. For example, if you place a variable like education on the x axis, and a variable like income on the y axis, each of those is an aesthetic. There are lots of other kinds of aesthetics besides the x and y axes that can be employed, but for now, let’s stick with those two.
Geometries are how your data are represented. Line graph, bar graph, heat map, box plot, histogram — all of these are geometries in ggplot2.
Here’s a simple example of what it looks like, using mtcars.
mtcars %>%
ggplot(aes(x=wt, y=mpg)) +
geom_point()
As you can see, the two parts of ggplot are contained in two different functions. First, we have the ggplot() function. Inside this function, we place the aes() function, whose arguments are the different aesthetics we want to map. Then, our geometry function, geom_point, is added with a “+” sign. Note that it is NOT a pipe. Once you feed your data into ggplot, subsequent ggplot functions to change your graph are added with “+”, not “%>%”.
We can add another aesthetic to this plot easily. Let’s say we want to also be able to see how the number of cylinders fits in to this relationship. One way to do that is by having the points be different colors based on the number of cylinders.
mtcars %>%
ggplot(aes(x=wt, y=mpg, color=as.factor(cyl))) + #Turn a continuous variable into a categorical one.
geom_point()
Note that I transformed cyl, a numeric variable, into a categorical one using the function as.factor(). This didn’t change the dataset at all. Instead, it told R to transform the data in the process of plotting it. Color can also be used to represent numeric variables, but in this case, since cyl is a variable with three levels, it made more sense to tell R to treat it as categorical, rather than as numeric.
Another aesthetic that can be used is size. This time, let’s use it to look at a continuous variable — horsepower.
mtcars %>%
ggplot(aes(x=wt, y=mpg, size=hp)) +
geom_point()
These plots aren’t very pretty. The default grey ggplot2 backdrop is ugly (IMO), and the axis labels aren’t very informative to anyone who doesn’t know the dataset. There are two functions I use a lot to clean up plots: labs() and theme_bw(). labs() adds labels, and theme_bw() offers a more modern looking backdrop.
mtcars %>%
ggplot(aes(x=wt, y=mpg, size=hp)) +
geom_point() +
labs(title="Weight, Mileage, and Horsepower",
subtitle = "You can put a subtitle here",
x = "Weight",
y = "Miles Per Gallon",
size = "Horsepower",
caption = "You can put a caption here") +
theme_bw()
Take a look at the dataset diamonds, which is included with ggplot2. Make a plot with carat on the x axis, price on the y axis, and then choose one other variable to map to either color or size. Is the variable numeric or categorical? Is it, like cyl, coded as numeric but actually functioning more like a categorical variable, and therefore in need of transformation? Make your plot and show it off!
There are lots of different kinds of graphs. In this tutorial, we’re going to focus on a few of the most common, while highlighting some of the tricks you can use to represent data in different ways. In these examples, we will mostly work with variables that aren’t going to require a lot of recoding or other carpentry on our part. But keep in mind: most of the time you make graphs, most of the work will be the data carpentry that gets the data into the shape you need, not making the graph itself (though sometimes that can be tricky too!).
One of the simplest kind of graphs is a histogram. Histograms represent the distribution of a numeric variable. Here’s a histogram of height in the GSS, for all years. (Note: you may be wondering how I know what variables are included in the 6,000 GSS variables and 2,000 CSPP variables. The answer is that I went through the codebooks, which you have to do with basically any data set you use. The codebook for the GSS is here, and the codebook for the CSPP is here.)
df.gss %>%
select(height) %>%
ggplot(aes(x=height)) + geom_histogram()
## Adding missing grouping variables: `year`
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 62182 rows containing non-finite values (stat_bin).
What’s up with that big spike just over 70 inches? Seems statistically improbable!
As you can see in the code, histograms only require one aesthetic (i.e. one variable). That variable is mapped onto the x axis, and the y axis then automatically becomes the count, or frequency, of observations at each value of the x axis variable.
There are two primary ways you’d change histograms. First, you can alter how fine-grained the binning of the X axis is. This affects how many bars are in your plot. You alter this with the “bins” argument in the geom_histogram() function. The default is 30 bins, which in this case actually lines up with our observed range (about 45 to 85) pretty well.
df.gss %>%
select(height) %>%
ggplot(aes(x=height)) + geom_histogram(bins=60)
## Adding missing grouping variables: `year`
## Warning: Removed 62182 rows containing non-finite values (stat_bin).
The other primary way to alter a histogram is by adding another aesthetic (i.e. another variable). Here, the obvious one to add is sex.
df.gss %>%
select(height, sex) %>%
ggplot(aes(x=height, fill=sex)) + geom_histogram(position = "dodge")
## Adding missing grouping variables: `year`
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 62182 rows containing non-finite values (stat_bin).
Here, we need to add sex to our select() call, since we need it in the data we’re sending into ggplot(). Then, we add the “fill” aesthetic in our aes() function. Note that this is different from the “color” aesthetic we used above. With points and lines, you use color. With bar graphs and other polygon plots, you use “fill”. For polygon plots, “color” only affects the outline of the polygons, not their internal colors. We also need to add the ‘position = “dodge”’ argument to geom_histogram(), which tells R to put the bars side by side for each height, rather than on top of one another.
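You can see the fill/color distinction for yourself on a built-in dataset (a quick sketch with mtcars): fill colors the bars’ interiors, while color only changes their outlines.

```r
library(ggplot2)

# fill colors the inside of the bars...
ggplot(mtcars, aes(x = as.factor(cyl), fill = as.factor(am))) +
  geom_bar(position = "dodge")

# ...while color only affects their outlines
ggplot(mtcars, aes(x = as.factor(cyl), color = as.factor(am))) +
  geom_bar(position = "dodge")
```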
Choose a numeric variable from the GSS (you can use the data explorer to search for variables, or just ask me for suggestions). Create a histogram of it, using a second variable to split the data (I’d suggest something simple and dichotomous or trichotomous, like race (“race”), or sex (“sex”), or divorce status (“divorce”), or ever unemployed in last ten years (“unemp”)).
Use the filter() function to subset the GSS based on some variable, and then create a histogram of a variable from the subsetted data.
Scatter plots are good for dealing with data with lots of observations, like surveys or panel data. They work best when both the x and y variables are continuous. For example, in the CSPP, we can look at the evangelical share of state population and a state’s per capita corrections spending (we have corrections expenditures and population, and so can easily construct that measure).
df.cspp %>%
as_tibble() %>%
select(st,
year,
poptotal,
exp_correction,
evangelical_pop) %>%
mutate(pcec = (exp_correction*1000)/poptotal) %>% #Creating the per capita corrections spending variable
ggplot(aes(x=evangelical_pop, y=pcec)) + geom_point()
## Warning: Removed 6020 rows containing missing values (geom_point).
We could add information to this plot by mapping a variable to the color aesthetic. The CSPP uses a four region coding scheme, in which the south is 1, the west is 2, the midwest is 3, and the northeast is 4.
df.cspp %>%
as_tibble() %>%
select(st,
year,
poptotal,
exp_correction,
evangelical_pop,
region) %>% #adding region to our chosen variables
mutate(pcec = (exp_correction*1000)/poptotal) %>%
ggplot(aes(x=evangelical_pop, y=pcec, color=as.factor(region))) + #Once again, converting a numeric variable to categorical
geom_point()
## Warning: Removed 6020 rows containing missing values (geom_point).
There are a number of other aesthetics you can use with geom_point(). Shape and size are the two most useful. Here’s the same plot, but using shape instead of color to represent region.
df.cspp %>%
as_tibble() %>%
select(st,
year,
poptotal,
exp_correction,
evangelical_pop,
region) %>% #adding region to our chosen variables
mutate(pcec = (exp_correction*1000)/poptotal) %>%
ggplot(aes(x=evangelical_pop, y=pcec, shape=as.factor(region))) + #Once again, converting a numeric variable to categorical
geom_point()
## Warning: Removed 6020 rows containing missing values (geom_point).
If you wanted, you could combine all three of them, to represent five different variables’ relationships on one plot. That would make for a very difficult to read plot. In practice, you probably only want to map 3 variables (maybe 4 if you have a very good reason).
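Just to show the syntax, here is what stacking three extra aesthetics looks like with mtcars — color for cylinders, shape for transmission, and size for horsepower (a sketch; as noted, a plot this busy is rarely a good idea):

```r
library(ggplot2)

# Five variables at once: wt, mpg, plus cyl (color), am (shape), hp (size)
ggplot(mtcars, aes(x = wt, y = mpg,
                   color = as.factor(cyl),
                   shape = as.factor(am),
                   size = hp)) +
  geom_point()
```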
For data like this, our x and y scales work pretty well: the x variable takes many distinct values, so points don’t pile up on top of one another. In some datasets, that won’t be the case. For example, the GSS has about 2,000 observations for each wave, but a variable like education takes only a couple dozen distinct values, so plotting it with geom_point() ends up looking pretty ugly and uninformative.
df.gss %>%
filter(year==2018) %>%
ggplot(aes(x=educ, y=realrinc)) + geom_point()
## Warning: Removed 986 rows containing missing values (geom_point).
For variables like this, geom_jitter() is a very helpful function. It spreads out the points a little, making the density of observations at each value of the x axis a lot clearer.
df.gss %>%
filter(year==2018) %>%
ggplot(aes(x=educ, y=realrinc)) + geom_jitter()
## Warning: Removed 986 rows containing missing values (geom_point).
For plots like this, it can be helpful to draw lines of best fit. The function geom_smooth() lets us do this.
df.gss %>%
filter(year==2018) %>%
ggplot(aes(x=educ, y=realrinc)) + geom_jitter() +
geom_smooth(se=FALSE) # se=FALSE tells it not to draw a confidence interval
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 986 rows containing non-finite values (stat_smooth).
## Warning: Removed 986 rows containing missing values (geom_point).
By default, geom_smooth() fits a smoother that calculates locally weighted estimates of the y variable across the range of the x variable — loess for smaller datasets, and a generalized additive model for larger ones like this (hence the ‘gam’ in the message above). But you can also have it draw a linear regression line, equivalent to the line \(y=a + bx\) you would get from fitting a bivariate regression.
df.gss %>%
filter(year==2018) %>%
ggplot(aes(x=educ, y=realrinc)) + geom_jitter() +
geom_smooth(method = "lm", #Draw a linear regression line
se=FALSE) # se=FALSE tells it not to draw a confidence interval
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 986 rows containing non-finite values (stat_smooth).
## Warning: Removed 986 rows containing missing values (geom_point).
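The straight line geom_smooth(method = "lm") draws is exactly the one you’d get by fitting the regression yourself with lm(). Here is the idea as a self-contained sketch on mtcars:

```r
# Fit the bivariate regression y = a + b*x directly;
# geom_smooth(method = "lm") fits this same model behind the scenes
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept a is about 37.29, slope b about -5.34
```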
Sometimes, it’s useful to plot words rather than dots. For example, if we are making a plot where our observations are the 50 states, we might want state abbreviations mapped to the variable values, rather than colored dots. This is accomplished with geom_text(). This function needs the aesthetic “label” to be specified in your ggplot() call.
df.cspp %>%
select(st, year, healthfree, infantmortality) %>%
filter(year==2001) %>%
ggplot(aes(x=healthfree, y=infantmortality, label=st)) + geom_text() +
geom_smooth(method = "lm",
se=FALSE) +
theme_bw() +
labs(title = "Health Insurance Freedom and Infant Mortality in the States, 2001",
x = "Health Insurance Freedom Index",
y = "Infant Mortality Rate (deaths per 1,000 live births)",
caption = "Data: Correlates of State Policy Project, Michigan State University")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_text).
Use the CSPP for the year 2001 to make a scatter plot of state education spending per student (“edinstruct_expend_pstud”) and state high school dropout rate (“eddropoutrate”), with total enrolled students (“enrollstudents”) mapped to the size aesthetic. Use theme_bw() and labs() to make it look nice!
Line plots work for data that are grouped such that there is a single y value for each value on the x axis. The simplest way to do this is with a group of 1. For example, we could look at how average years of education has changed over time in the GSS.
df.gss %>%
group_by(year) %>%
summarise(avg_ed = mean(educ, na.rm=TRUE)) %>%
ggplot(aes(x=year,y=avg_ed)) + geom_line()
## `summarise()` ungrouping output (override with `.groups` argument)
We can also break this down by gender by mapping it to color.
df.gss %>%
group_by(year, sex) %>%
summarise(avg_ed = mean(educ, na.rm=TRUE)) %>%
ggplot(aes(x=year,y=avg_ed, color=sex)) + geom_line()
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
As you can see, line graphs are often good for representing pivot tables.
One alternative to color as an aesthetic is linetype (useful when submitting plots for journals that only publish in greyscale).
df.gss %>%
group_by(year, sex) %>%
summarise(avg_ed = mean(educ, na.rm=TRUE)) %>%
ggplot(aes(x=year,y=avg_ed, linetype=sex)) + geom_line()
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
Finally, you can also add points to a line plot.
df.gss %>%
group_by(year, sex) %>%
summarise(avg_ed = mean(educ, na.rm=TRUE)) %>%
ggplot(aes(x=year,y=avg_ed, linetype=sex)) + geom_line() + geom_point()
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
Use mutate() and case_when() to make a new, numeric version of variable “abany” (support abortion being legal for any reason). Then, use group_by(), summarise(), and mean() to make a pivot table of the average answer to this question for each year in the GSS. Plot it using geom_line(). Bonus: break it down by gender!
Bar plots can be used in some of the same situations that line plots can. For example, here are the data from the previous plot, but with geom_bar().
df.gss %>%
group_by(year, sex) %>%
summarise(avg_ed = mean(educ, na.rm=TRUE)) %>%
ggplot(aes(x=year,y=avg_ed, fill=sex)) +
geom_bar(stat="identity",
position="dodge")
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
In this case, a line plot is clearly the superior way to present the data. Notice a few changes: first, I had to change linetype to fill, since linetype isn’t an aesthetic available to bar plots. Second, geom_bar() needs two arguments. stat=“identity” is necessary to tell R that I’m giving it a ready-made table, and it doesn’t have to do any counting. And position=“dodge” tells R to put the two values of sex side by side, instead of stacking them.
Without stat=“identity”, geom_bar()’s default behavior is to count observations, and plot the counts. For example, here is a bar plot of “relig” for the year 2010.
df.gss %>%
filter(year==2010) %>%
ggplot(aes(x=relig)) + geom_bar()
The data here are identical to using table() on “relig” for the year 2010. Here, we don’t need position=“dodge”, because we don’t have any fill aesthetic to decide whether to stack or place side by side.
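You can check this counting behavior on a built-in dataset: geom_bar()’s bar heights match table() exactly (a quick sketch with mtcars):

```r
library(ggplot2)

table(mtcars$cyl)
##  4  6  8
## 11  7 14

p <- ggplot(mtcars, aes(x = as.factor(cyl))) + geom_bar()
ggplot_build(p)$data[[1]]$count  # the bar heights ggplot computed: 11 7 14
```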
Like line plots, bar plots are good for pivot tables. Here are average educational expenditures per student by region for the year 2008.
df.cspp %>%
filter(year==2008) %>%
group_by(region) %>%
summarise(avg_spending = mean(edinstruct_expend_pstud, na.rm=TRUE)) %>%
ggplot(aes(x=region, y= avg_spending)) + geom_bar(stat="identity")
## `summarise()` ungrouping output (override with `.groups` argument)
For a full menu of geometries and aesthetics, see the ggplot2 cheatsheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Faceting draws a series of graphs, based on the values of an additional variable. In that sense, you can think of faceting like an additional aesthetic, except unlike other aesthetics, it’s not specified in the ggplot() call. There are two ways to facet: facet_wrap() and facet_grid(). We’ll focus on facet_wrap() here.
facet_wrap() takes a one-sided formula — a tilde (“~”) followed by a variable — and draws one panel for each value of that variable, wrapping the panels into a grid. (Its sibling, facet_grid(), uses a two-sided formula to lay panels out explicitly in rows and columns.) For example, here is the education/income scatterplot we looked at above, made for every year in the GSS.
df.gss %>%
ggplot(aes(x=educ, y=realrinc)) + geom_jitter() +
geom_smooth(method = "lm",
se=FALSE) + # se=FALSE tells it not to draw a confidence interval
facet_wrap(~year)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 26969 rows containing non-finite values (stat_smooth).
## Warning: Removed 26969 rows containing missing values (geom_point).
In this case, the number of graphs makes faceting not terribly useful for an image of this size. But the basic principle is always the same. You could, for example, make a graph of the relationship between income and vote choice, and draw a different panel for voters of each race.
One cool use for ggplot2 is maps. Since ggplot2 can make polygons, it can draw maps. Making maps can be a little tricky, because it always involves merging data, and merging data almost always involves some data cleaning. But once you get the hang of it, it can be a really nice way to represent data.
In this tutorial, I’m going to focus on US maps, though you can also download packages that let you draw world maps for international data. My preferred map package is “usmap”. Install it and load it now.
To draw a map, we need a data frame that tells R where to draw lines. We get that dataframe with the following code.
us_map(regions="states") %>% as_tibble()
## # A tibble: 12,999 x 9
## x y order hole piece group fips abbr full
## <dbl> <dbl> <int> <lgl> <int> <chr> <chr> <chr> <chr>
## 1 1091779. -1380695. 1 FALSE 1 01.1 01 AL Alabama
## 2 1091268. -1376372. 2 FALSE 1 01.1 01 AL Alabama
## 3 1091140. -1362998. 3 FALSE 1 01.1 01 AL Alabama
## 4 1090940. -1343517. 4 FALSE 1 01.1 01 AL Alabama
## 5 1090913. -1341006. 5 FALSE 1 01.1 01 AL Alabama
## 6 1090796. -1334480. 6 FALSE 1 01.1 01 AL Alabama
## 7 1090235. -1304438. 7 FALSE 1 01.1 01 AL Alabama
## 8 1089944. -1289528. 8 FALSE 1 01.1 01 AL Alabama
## 9 1089451. -1265304. 9 FALSE 1 01.1 01 AL Alabama
## 10 1089305. -1258363. 10 FALSE 1 01.1 01 AL Alabama
## # … with 12,989 more rows
As you can see, this data frame has, for each state, a series of x and y values and a number of other variables. To use it, we should save it to a named object.
us_map(regions="states") %>% as_tibble() -> df.map
Now we need some data to map. Let’s use the CSPP unemployment data we’ve been using. To do this, we need to merge the data, and to do that, we need a variable with the same name (easy to fix) that’s coded the exact same way (can be a major pain) in both datasets.
There’s a shortcut to checking this. It involves the unique() function, which takes a vector or dataframe as an argument, and outputs the unique elements (or, in a data frame, rows) in it. This lets us check whether the coding in the two variables is the same. In this case, let’s look at the abbreviated state names, since those will be easier to change if they need to be.
unique(df.cspp$st) == unique(df.map$abbr)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE
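One caveat: `==` compares the two vectors element by element, so this check only works if both vectors are the same length and sorted the same way. An order-independent check uses setequal(), which asks whether the two vectors contain the same set of values:

```r
# Order-independent comparison of the state abbreviations.
# setequal() returns TRUE if both vectors contain the same values,
# regardless of order or duplicates.
setequal(unique(df.cspp$st), unique(df.map$abbr))

# To see which values (if any) appear in one dataset but not the other:
setdiff(unique(df.cspp$st), unique(df.map$abbr))  # in the CSPP but not the map
setdiff(unique(df.map$abbr), unique(df.cspp$st))  # in the map but not the CSPP
```

When the set differences come back empty, you know the merge won't silently drop any states.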
In this case, the two are identical, so we only need to change the name of one of them to make them mergeable. Let’s take only the variables we need from our measured data to reduce clutter. I’ll also filter by year to give us only one year of data, though the procedure would be identical if I were merging with all years.
df.cspp %>%
filter(year==2010) %>%
select(abbr = st, #changing the name of this variable to match the one in df.map
unemployment) -> df.donor
Now it’s simply a matter of merging the data sets, and sending the resulting data set into ggplot().
inner_join(df.map, df.donor, by="abbr") %>%
ggplot(aes(x=x,y=y,fill=unemployment, group=group)) + #Note you need group=group here.
geom_polygon(color="black") + #color="black" draws state lines in black
coord_fixed() #This tells R to preserve the aspect ratio, no matter the size of the image
In the above example, we used a continuous variable (unemployment) for the fill aesthetic. We can also use categorical variables. In the next example, we’ll use a categorical variable, “fsspr”, which rates the strength of a state’s development planning role as weak, significant, or substantial. Since the data frame stores these ratings as 1, 2, and 3, I have to recode them with labels, which I do below using case_when().
df.cspp %>%
filter(year==2009) %>% #Data stop in 2009 for this variable
select(abbr = st, #changing the name of this variable to match the one in df.map
fsspr) %>% #selecting fsspr rather than unemployment
mutate(fsspr2 = case_when(fsspr==1 ~ "weak",
fsspr==2 ~ "significant",
fsspr==3 ~ "substantial")) -> df.donor
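The same recoding can also be done in a single step with factor(), which attaches labels directly to the numeric codes (a sketch, assuming fsspr is stored as the numbers 1 through 3):

```r
df.cspp %>%
  filter(year == 2009) %>%            # data stop in 2009 for this variable
  select(abbr = st, fsspr) %>%
  # factor() maps each level (1, 2, 3) to the corresponding label
  mutate(fsspr2 = factor(fsspr,
                         levels = c(1, 2, 3),
                         labels = c("weak", "significant", "substantial"))) -> df.donor
```

One advantage of factor() is that the legend in your plot will follow the order of the labels you supply (weak, significant, substantial), whereas a character variable created with case_when() is ordered alphabetically.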
inner_join(df.map, df.donor, by="abbr") %>%
ggplot(aes(x=x,y=y,fill=fsspr2, group=group)) + #Note you need group=group here.
geom_polygon(color="black") + #color="black" draws state lines in black
coord_fixed() #This tells R to preserve the aspect ratio, no matter the size of the image
An extension of ggplot2, the library gganimate, has the ability to create animated graphics. While obviously these animations can’t be used in papers, they can be quite nice for presentations and other forms of public scholarship.
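Like usmap, gganimate needs to be installed and loaded before we can use it. Depending on your setup, you may also need the gifski package, which gganimate uses to render GIFs:

```r
# install.packages("gganimate")  # run once per machine
# install.packages("gifski")     # renderer gganimate uses for GIF output
library(gganimate)
library(ggplot2)
```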
Fundamentally, animation does the same thing that faceting does: it takes a variable and splits your plot into multiple plots based on the values of that variable. Instead of using facet_wrap() or facet_grid(), however, you use one of gganimate’s transition functions. To see what I mean, let’s look at diamonds, first faceted by color and then animated.
diamonds %>% ggplot(aes(x=carat,y=price, color=clarity)) + geom_point() + facet_wrap(~color)
diamonds %>% ggplot(aes(x=carat,y=price, color=clarity)) + geom_point() + transition_states(color)
As you can see, the only difference in the code is substituting transition_states() for facet_wrap(). As you saw when running this code, animation takes time. Because of this, it’s often helpful to test your code using facet_wrap(), and only move it to an animation when you’re sure all the other parts of it are working correctly.
Let’s do one with GSS data. First, we’ll make a summary table of average income by year, highest educational degree, and sex.
df.gss %>%
group_by(year, degree, sex) %>%
summarise(avg_income = mean(realrinc, na.rm=TRUE))%>%
ungroup() %>%
filter(complete.cases(.)) -> df.edsex
## `summarise()` regrouping output by 'year', 'degree' (override with `.groups` argument)
Then, we can make an animation that shows the relationship by year.
df.edsex %>%
ggplot(aes(x=degree,y=avg_income, color=sex, group=sex)) + geom_line() + #we need group here because R doesn't know whether to group by year or sex at this point.
transition_states(year)
This doesn’t look great. The animation moves too quickly to really be interpretable. By default, gganimate renders a GIF of 100 frames that plays at 10 frames per second. Since we have about 30 years of data, that’s roughly a third of a second per year. We can change this with the function animate(), which can take the arguments nframes, fps, and duration (supplying any two of those determines the third). To use animate(), we save our animation to an object and then pass that object to animate().
df.edsex %>%
ggplot(aes(x=degree,y=avg_income, color=sex, group=sex)) + geom_line() + #we need group here because R doesn't know whether to group by year or sex at this point.
transition_states(year) -> p1
animate(p1,
nframes=300,
fps=10)
Looking better! Now let’s add some decent labeling. One really nice thing that transition_states() does is give us the placeholder {closest_state}, which we can insert in any string; gganimate will substitute the current value of the variable mapped to transition_states() (in our case, year) at that point in the animation.
df.edsex %>%
ggplot(aes(x=degree,y=avg_income, color=sex, group=sex)) + geom_line() + #we need group here because R doesn't know whether to group by year or sex at this point.
transition_states(year) +
theme_bw() +
labs(title = "Education, Income, and Gender, 1974-2018",
subtitle = "Year: {closest_state}",
x="Educational Degree",
y="Average Income",
caption = "Data: General Social Survey")-> p1
animate(p1,
nframes=300,
fps=10)
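If you want to use an animation in a slide deck or on a web page, you’ll need to save it to a file. gganimate provides anim_save() for this; by default it saves the most recently rendered animation (the filename below is just an example):

```r
# Save the last animation rendered by animate() as a GIF.
# "income_by_degree.gif" is a hypothetical filename.
anim_save("income_by_degree.gif")

# You can also pass the animation object explicitly:
anim_save("income_by_degree.gif",
          animation = animate(p1, nframes = 300, fps = 10))
```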
Make an animation of a bar plot that shows the changing distribution of the variable “partyid” by year (don’t worry about futzing with fps and the like; just try to get an animation made!).
Take the variables “st”, “year”, and “right2work” from the CSPP, and merge them with your df.map dataframe. Then, make an animation that shows the passage of right to work laws by year.