Although many fundamental data manipulation functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together. This leads to difficult-to-read nested functions and/or choppy code. R Studio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:
dplyr
is one such package which was built for the sole purpose of simplifying the process of manipulating, sorting, summarizing, and joining data frames. This tutorial serves to introduce you to the basic functions offered by the dplyr
package. These fundamental functions of data transformation that the dplyr package offers includes:
select()
selects variablesfilter()
provides basic filtering capabilitiesgroup_by()
groups data by categorical levelssummarise()
summarizes data by functions of choicearrange()
orders datajoin()
joins separate dataframesmutate()
creates new variablesinstall.packages("dplyr")
library(dplyr)
For the examples that follow, we’ll use the following census data which includes the K-12 public school expenditures by state. This data frame currently is 50x16 and includes expenditure data for 14 unique years.
## Division State X1980 X1990 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011
## 1 6 Alabama 1146713 2275233 4176082 4354794 4444390 4657643 4812479 5164406 5699076 6245031 6832439 6683843 6670517 6592925
## 2 9 Alaska 377947 828051 1183499 1229036 1284854 1326226 1354846 1442269 1529645 1634316 1918375 2007319 2084019 2201270
## 3 8 Arizona 949753 2258660 4288739 4846105 5395814 5892227 6071785 6579957 7130341 7815720 8403221 8726755 8482552 8340211
## 4 7 Arkansas 666949 1404545 2380331 2505179 2822877 2923401 3109644 3546999 3808011 3997701 4156368 4240839 4459910 4578136
## 5 9 California 9172158 21485782 38129479 42908787 46265544 47983402 49215866 50918654 53436103 57352599 61570555 60080929 58248662 57526835
## 6 8 Colorado 1243049 2451833 4401010 4758173 5151003 5551506 5666191 5994440 6368289 6579053 7338766 7187267 7429302 7409462
Although not required, the tidyr and dplyr packages make use of the pipe operator %>%
developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%
.
This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:
Both functions complete the same task and the benefit of using %>%
is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For more info check out the %>%
tutorial.
Objective: Reduce dataframe size to only desired variables for current task
Description: When working with a sizable dataframe, often we desire to only assess specific variables. The select()
function allows you to select and/or rename variables.
Function: select(data, ...)
Same as: data %>% select(...)
Arguments:
data: data frame
...: call variables by name or by function
Special functions:
starts_with(x, ignore.case = TRUE): names starts with x
ends_with(x, ignore.case = TRUE): names ends in x
contains(x, ignore.case = TRUE): selects all variables whose name contains x
matches(x, ignore.case = TRUE): selects all variables whose name matches the regular expression x
Example
Let’s say our goal is to only assess the 5 most recent years worth of expenditure data. Applying the select()
function we can select only the variables of concern.
sub.exp <- expenditures %>% select(Division, State, X2007:X2011)
head(sub.exp) # for brevity only display first 6 rows
## Division State X2007 X2008 X2009 X2010 X2011
## 1 6 Alabama 6245031 6832439 6683843 6670517 6592925
## 2 9 Alaska 1634316 1918375 2007319 2084019 2201270
## 3 8 Arizona 7815720 8403221 8726755 8482552 8340211
## 4 7 Arkansas 3997701 4156368 4240839 4459910 4578136
## 5 9 California 57352599 61570555 60080929 58248662 57526835
## 6 8 Colorado 6579053 7338766 7187267 7429302 7409462
We can also apply some of the special functions within select()
. For instance we can select all variables that start with ‘X’:
head(expenditures %>% select(starts_with("X")))
## X1980 X1990 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011
## 1 1146713 2275233 4176082 4354794 4444390 4657643 4812479 5164406 5699076 6245031 6832439 6683843 6670517 6592925
## 2 377947 828051 1183499 1229036 1284854 1326226 1354846 1442269 1529645 1634316 1918375 2007319 2084019 2201270
## 3 949753 2258660 4288739 4846105 5395814 5892227 6071785 6579957 7130341 7815720 8403221 8726755 8482552 8340211
## 4 666949 1404545 2380331 2505179 2822877 2923401 3109644 3546999 3808011 3997701 4156368 4240839 4459910 4578136
## 5 9172158 21485782 38129479 42908787 46265544 47983402 49215866 50918654 53436103 57352599 61570555 60080929 58248662 57526835
## 6 1243049 2451833 4401010 4758173 5151003 5551506 5666191 5994440 6368289 6579053 7338766 7187267 7429302 7409462
You can also de-select variables by using “-“ prior to name or function. The following produces the inverse of functions above
expenditures %>% select(-X1980:-X2006)
expenditures %>% select(-starts_with("X"))
Objective: Reduce rows/observations with matching conditions
Description: Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter()
function provides this capability.
Function: filter(data, ...)
Same as: data %>% filter(...)
Arguments:
data: data frame
...: conditions to be met
Examples
Continuing with our sub.exp dataframe which includes only the recent 5 years worth of expenditures, we can filter by Division:
sub.exp %>% filter(Division == 3)
## Division State X2007 X2008 X2009 X2010 X2011
## 1 3 Illinois 20326591 21874484 23495271 24695773 24554467
## 2 3 Indiana 9497077 9281709 9680895 9921243 9687949
## 3 3 Michigan 17013259 17053521 17217584 17227515 16786444
## 4 3 Ohio 18251361 18892374 19387318 19801670 19988921
## 5 3 Wisconsin 9029660 9366134 9696228 9966244 10333016
We can apply multiple logic rules in the filter()
function such as:
< Less than != Not equal to
> Greater than %in% Group membership
== Equal to is.na is NA
<= Less than or equal to !is.na is not NA
>= Greater than or equal to &,|,! Boolean operators
For instance, we can filter for Division 3 and expenditures in 2011 that were greater than $10B. This results in Indiana, which is in Division 3, being excluded since its expenditures were < $10B (FYI - the raw census data are reported in units of $1,000).
# Raw census data are in units of $1,000
sub.exp %>% filter(Division == 3, X2011 > 10000000)
## Division State X2007 X2008 X2009 X2010 X2011
## 1 3 Illinois 20326591 21874484 23495271 24695773 24554467
## 2 3 Michigan 17013259 17053521 17217584 17227515 16786444
## 3 3 Ohio 18251361 18892374 19387318 19801670 19988921
## 4 3 Wisconsin 9029660 9366134 9696228 9966244 10333016
Objective: Group data by categorical variables
Description: Often, observations are nested within groups or categories and our goals is to perform statistical analysis both at the observation level and also at the group level. The group_by()
function allows us to create these categorical groupings.
Function: group_by(data, ...)
Same as: data %>% group_by(...)
Arguments:
data: data frame
...: variables to group_by
*Use ungroup(x) to remove groups
Example
The group_by()
function is a silent function in which no observable manipulation of the data is performed as a result of applying the function. Rather, the only change you’ll notice is, if you print the dataframe you will notice underneath the Source information and prior to the actual dataframe, an indicator of what variable the data is grouped by will be provided. The real magic of the group_by()
function comes when we perform summary statistics which we will cover shortly.
group.exp <- sub.exp %>% group_by(Division)
head(group.exp)
## Source: local data frame [6 x 7]
## Groups: Division
##
## Division State X2007 X2008 X2009 X2010 X2011
## 1 6 Alabama 6245031 6832439 6683843 6670517 6592925
## 2 9 Alaska 1634316 1918375 2007319 2084019 2201270
## 3 8 Arizona 7815720 8403221 8726755 8482552 8340211
## 4 7 Arkansas 3997701 4156368 4240839 4459910 4578136
## 5 9 California 57352599 61570555 60080929 58248662 57526835
## 6 8 Colorado 6579053 7338766 7187267 7429302 7409462
Objective: Perform summary statistics on variables
Description: Obviously the goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise()
function allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.
Function: summarise(data, ...)
Same as: data %>% summarise(...)
Arguments:
data: data frame
...: Name-value pairs of summary functions like min(), mean(), max() etc.
*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"
Examples
Lets get the mean expenditure value across all states in 2011
sub.exp %>% summarise(Mean_2011 = mean(X2011))
## Mean_2011
## 1 10513678
Not too bad, lets get some more summary stats
sub.exp %>% summarise(Min = min(X2011, na.rm=TRUE),
Median = median(X2011, na.rm=TRUE),
Mean = mean(X2011, na.rm=TRUE),
Var = var(X2011, na.rm=TRUE),
SD = sd(X2011, na.rm=TRUE),
Max = max(X2011, na.rm=TRUE),
N = n())
## Min Median Mean Var SD Max N
## 1 1049772 6527404 10513678 1.48619e+14 12190938 57526835 50
This information is useful, but being able to compare summary statistics at multiple levels is when you really start to gather some insights. This is where the group_by()
function comes in. First, let’s group by Division and see how the different regions compared in by 2010 and 2011.
sub.exp %>%
group_by(Division)%>%
summarise(Mean_2010 = mean(X2010, na.rm=TRUE),
Mean_2011 = mean(X2011, na.rm=TRUE))
## Source: local data frame [9 x 3]
##
## Division Mean_2010 Mean_2011
## 1 1 5121003 5222277
## 2 2 32415457 32877923
## 3 3 16322489 16270159
## 4 4 4672332 4672687
## 5 5 10975194 11023526
## 6 6 6161967 6267490
## 7 7 14916843 15000139
## 8 8 3894003 3882159
## 9 9 15540681 15468173
Now we’re starting to see some differences pop out. How about we compare states within a Division? We can start to apply multiple functions we’ve learned so far to get the 5 year average for each state within Division 3.
sub.exp %>%
gather(Year, Expenditure, X2007:X2011) %>% # this turns our wide data to a long format
filter(Division == 3) %>% # we only want to compare states within Division 3
group_by(State) %>% # we want to summarize data at the state level
summarise(Mean = mean(Expenditure),
SD = sd(Expenditure))
## Source: local data frame [5 x 3]
##
## State Mean SD
## 1 Illinois 22989317 1867527.7
## 2 Indiana 9613775 238971.6
## 3 Michigan 17059665 180245.0
## 4 Ohio 19264329 705930.2
## 5 Wisconsin 9678256 507461.2
Objective: Order variable values
Description: Often, we desire to view observations in rank order for a particular variable(s). The arrange()
function allows us to order data by variables in accending or descending order.
Function: arrange(data, ...)
Same as: data %>% arrange(...)
Arguments:
data: data frame
...: Variable(s) to order
*use desc(x) to sort variable in descending order
Examples
For instance, in the summarise example we compared the the mean expenditures for each division. We could apply the arrange()
function at the end to order the divisions from lowest to highest expenditure for 2011. This makes it easier to see the significant differences between Divisions 8,4,1 & 6 as compared to Divisions 5,7,9,3 & 2.
sub.exp %>%
group_by(Division)%>%
summarise(Mean_2010 = mean(X2010, na.rm=TRUE),
Mean_2011 = mean(X2011, na.rm=TRUE)) %>%
arrange(Mean_2011)
## Source: local data frame [9 x 3]
##
## Division Mean_2010 Mean_2011
## 1 8 3894003 3882159
## 2 4 4672332 4672687
## 3 1 5121003 5222277
## 4 6 6161967 6267490
## 5 5 10975194 11023526
## 6 7 14916843 15000139
## 7 9 15540681 15468173
## 8 3 16322489 16270159
## 9 2 32415457 32877923
We can also apply an descending argument to rank-order from highest to lowest. The following shows the same data but in descending order by applying desc()
within the arrange()
function.
sub.exp %>%
group_by(Division)%>%
summarise(Mean_2010 = mean(X2010, na.rm=TRUE),
Mean_2011 = mean(X2011, na.rm=TRUE)) %>%
arrange(desc(Mean_2011))
## Source: local data frame [9 x 3]
##
## Division Mean_2010 Mean_2011
## 1 2 32415457 32877923
## 2 3 16322489 16270159
## 3 9 15540681 15468173
## 4 7 14916843 15000139
## 5 5 10975194 11023526
## 6 6 6161967 6267490
## 7 1 5121003 5222277
## 8 4 4672332 4672687
## 9 8 3894003 3882159
Objective: Join two datasets together
Description: Often we have separate dataframes that can have common and differing variables for similar observations and we wish to join these dataframes together. The multiple xxx_join()
functions provide multiple ways to join dataframes.
Description: Join two datasets
Function:
inner_join(x, y, by = NULL)
left_join(x, y, by = NULL)
right_join(x, y, by = NULL)
full_join(x, y, by = NULL)
semi_join(x, y, by = NULL)
anti_join(x, y, by = NULL)
Arguments:
x,y: data frames to join
by: a character vector of variables to join by. If NULL, the default, join will do a natural join, using all
variables with common names across the two tables.
Example
Our public education expenditure data represents then-year dollars. To make any accurate assessments of longitudinal trends and comparison we need to adjust for inflation. I have the following dataframe which provides inflation adjustment factors for base-year 2012 dollars (obviously I should use 2014 values but I had these easily accessable and it only serves for illustrative purposes).
## Year Annual Inflation
## 28 2007 207.342 0.9030811
## 29 2008 215.303 0.9377553
## 30 2009 214.537 0.9344190
## 31 2010 218.056 0.9497461
## 32 2011 224.939 0.9797251
## 33 2012 229.594 1.0000000
To join to my expenditure data I obviously need to get my expenditure data in the proper form that allows my to join these two dataframes. I can apply the following functions to accomplish this:
long.exp <- sub.exp %>%
gather(Year, Expenditure, X2007:X2011) %>% # turn to long format
separate(Year, into=c("x", "Year"), sep="X") %>% # separate "X" from year value
select(-x) # remove "x" column
long.exp$Year <- as.numeric(long.exp$ Year) # convert from character to numeric
head(long.exp)
## Division State Year Expenditure
## 1 6 Alabama 2007 6245031
## 2 9 Alaska 2007 1634316
## 3 8 Arizona 2007 7815720
## 4 7 Arkansas 2007 3997701
## 5 9 California 2007 57352599
## 6 8 Colorado 2007 6579053
I can now apply the left_join()
function to join the inflation data to the expenditure data. This aligns the data in both dataframes by the Year variable and then joins the remaining inflation data to the expenditure dataframe as new variables.
join.exp <- long.exp %>% left_join(inflation)
head(join.exp)
## Division State Year Expenditure Annual Inflation
## 1 6 Alabama 2007 6245031 207.342 0.9030811
## 2 9 Alaska 2007 1634316 207.342 0.9030811
## 3 8 Arizona 2007 7815720 207.342 0.9030811
## 4 7 Arkansas 2007 3997701 207.342 0.9030811
## 5 9 California 2007 57352599 207.342 0.9030811
## 6 8 Colorado 2007 6579053 207.342 0.9030811
To illustrate the other joining methods we can use these two simple dateframes:
Dataframe “x”:
## name instrument
## 1 John guitar
## 2 Paul bass
## 3 George guitar
## 4 Ringo drums
## 5 Stuart bass
## 6 Pete drums
Dataframe “y”:
## name band
## 1 John TRUE
## 2 Paul TRUE
## 3 George TRUE
## 4 Ringo TRUE
## 5 Brian FALSE
inner_join()
: Include only rows in both x and y that have a matching value
inner_join(x,y)
## name instrument band
## 1 John guitar TRUE
## 2 Paul bass TRUE
## 3 George guitar TRUE
## 4 Ringo drums TRUE
left_join()
: Include all of x, and matching rows of y
left_join(x,y)
## name instrument band
## 1 John guitar TRUE
## 2 Paul bass TRUE
## 3 George guitar TRUE
## 4 Ringo drums TRUE
## 5 Stuart bass <NA>
## 6 Pete drums <NA>
semi_join()
: Include rows of x that match y but only keep the columns from x
semi_join(x,y)
## name instrument
## 1 John guitar
## 2 Paul bass
## 3 George guitar
## 4 Ringo drums
anti_join()
: Opposite of semi_join
anti_join(x,y)
## name instrument
## 1 Pete drums
## 2 Stuart bass
Objective: Creates new variables
Description: Often we want to create a new variable that is a function of the current variables in our dataframe or even just add a new variable. The mutate()
function allows us to add new variables while preserving the existing variables.
Function: mutate(data, ...)
Same as: data %>% mutate(...)
Arguments:
data: data frame
...: Expression(s)
Examples
If we go back to our previous join.exp dataframe, remember that we joined inflation rates to our non-inflation adjusted expenditures for public schools. The dataframe looks like:
## Division State Year Expenditure Annual Inflation
## 1 6 Alabama 2007 6245031 207.342 0.9030811
## 2 9 Alaska 2007 1634316 207.342 0.9030811
## 3 8 Arizona 2007 7815720 207.342 0.9030811
## 4 7 Arkansas 2007 3997701 207.342 0.9030811
## 5 9 California 2007 57352599 207.342 0.9030811
## 6 8 Colorado 2007 6579053 207.342 0.9030811
If we wanted to adjust our annual expenditures for inflation we can use mutate()
to create a new inflation adjusted cost variable which we’ll name inflation_adj:
inflation_adj <- join.exp %>% mutate(Adj_Exp = Expenditure/Inflation)
head(inflation_adj)
## Division State Year Expenditure Annual Inflation Adj_Exp
## 1 6 Alabama 2007 6245031 207.342 0.9030811 6915249
## 2 9 Alaska 2007 1634316 207.342 0.9030811 1809711
## 3 8 Arizona 2007 7815720 207.342 0.9030811 8654505
## 4 7 Arkansas 2007 3997701 207.342 0.9030811 4426735
## 5 9 California 2007 57352599 207.342 0.9030811 63507696
## 6 8 Colorado 2007 6579053 207.342 0.9030811 7285119
Lets say we wanted to create a variable that rank-orders state-level expenditures (inflation adjusted) for the year 2010 from the highest level of expenditures to the lowest.
rank_exp <- inflation_adj %>%
filter(Year == 2010) %>%
arrange(desc(Adj_Exp)) %>%
mutate(Rank = 1:length(Adj_Exp))
head(rank_exp)
## Division State Year Expenditure Annual Inflation Adj_Exp Rank
## 1 9 California 2010 58248662 218.056 0.9497461 61330774 1
## 2 2 New York 2010 50251461 218.056 0.9497461 52910417 2
## 3 7 Texas 2010 42621886 218.056 0.9497461 44877138 3
## 4 3 Illinois 2010 24695773 218.056 0.9497461 26002501 4
## 5 2 New Jersey 2010 24261392 218.056 0.9497461 25545135 5
## 6 5 Florida 2010 23349314 218.056 0.9497461 24584797 6
If you wanted to assess the percent change in cost for a particular state you can use the lag()
function within the mutate()
function:
inflation_adj %>%
filter(State == "Ohio") %>%
mutate(Perc_Chg = (Adj_Exp-lag(Adj_Exp))/lag(Adj_Exp))
## Division State Year Expenditure Annual Inflation Adj_Exp Perc_Chg
## 1 3 Ohio 2007 18251361 207.342 0.9030811 20210102 NA
## 2 3 Ohio 2008 18892374 215.303 0.9377553 20146378 -0.003153057
## 3 3 Ohio 2009 19387318 214.537 0.9344190 20747992 0.029862103
## 4 3 Ohio 2010 19801670 218.056 0.9497461 20849436 0.004889357
## 5 3 Ohio 2011 19988921 224.939 0.9797251 20402582 -0.021432441
You could also look at what percent of all US expenditures each state made up in 2011. In this case we use mutate()
to take each state’s inflation adjusted expenditure and divide by the sum of the entire inflation adjusted expenditure column. We also apply a second function within mutate()
that provides the cummalative percent in rank-order. This shows that in 2011, the top 8 states with the highest expenditures represented over 50% of the total U.S. expenditures in K-12 public schools. (I remove the non-inflation adjusted Expenditure, Annual & Inflation columns so that the columns don’t wrap on the screen view)
perc.of.whole <- inflation_adj %>%
filter(Year == 2011) %>%
arrange(desc(Adj_Exp)) %>%
mutate(Perc_of_Total = Adj_Exp/sum(Adj_Exp),
Cum_Perc = cumsum(Perc_of_Total)) %>%
select(-Expenditure, -Annual, -Inflation)
head(perc.of.whole, 8)
## Division State Year Adj_Exp Perc_of_Total Cum_Perc
## 1 9 California 2011 58717324 0.10943237 0.1094324
## 2 2 New York 2011 52575244 0.09798528 0.2074177
## 3 7 Texas 2011 43751346 0.08154005 0.2889577
## 4 3 Illinois 2011 25062609 0.04670957 0.3356673
## 5 5 Florida 2011 24364070 0.04540769 0.3810750
## 6 2 New Jersey 2011 24128484 0.04496862 0.4260436
## 7 2 Pennsylvania 2011 23971218 0.04467552 0.4707191
## 8 3 Ohio 2011 20402582 0.03802460 0.5087437