An application programming interface (API), in a nutshell, is a method of communication between software programs. APIs allow programs to interact and use each other’s functions by acting as a middleman. Why is this useful? Let’s say you want to pull weather data from the NOAA. You have a few options: you could manually download the data from the NOAA website every time you need it, or you could leverage the rnoaa package, which allows you to send specific instructions to the NOAA API via R; the API will then perform the requested action and return the desired information. The benefit of this strategy is that if the NOAA changes its website structure it won’t affect the API data retrieval process, which means no impact to your code. Standing ovation! Consequently, APIs provide consistency in data retrieval, which can be essential for recurring analyses. Luckily, the use of APIs by organizations that collect data is growing rapidly. This is great for you and me, as more and more data continues to be at our fingertips. So what do you need to get started?
Each API is unique; however, there are a few fundamental pieces of information you’ll need to work with an API. First, the reason you’re using an API is to request specific types of data, from a specific data set, from a specific organization. You at least need to know a little something about each one of these: the organization and the URL where its API lives, the data set you are pulling from, and the specific data content you are requesting. A prebuilt R package handles much of this for you; when you access an API directly with httr, these are the components of the request you’ll need to specify.
In addition to these key components you will also, typically, need to provide a form of identification and/or authorization. This is done via an API key (a unique string of characters that identifies you as the user) and/or OAuth (an authorization framework that provides credentials as proof for access). Luckily, the httr package has simplified this greatly, which we demonstrate at the end of this section. Rather than dwell on these components, they’ll likely become clearer as we progress through examples. So, let’s move on to the fun stuff.
Like everything else you do in R, when looking to work with an API your first question should be “Is there a package for that?” R has an extensive collection of packages that hook API data feeds directly into R. You can find a slew of them scattered throughout the CRAN Task View: Web Technologies and Services web page, on the rOpenSci web page, and some more here.
To give you a taste for how these packages typically work, I’ll quickly cover three packages:

- blsAPI for pulling U.S. Bureau of Labor Statistics data
- rnoaa for pulling NOAA climate data
- rtimes for pulling data from multiple APIs offered by the New York Times

The blsAPI package allows users to request data for one or multiple series through the U.S. Bureau of Labor Statistics API. To use blsAPI you only need knowledge of the data; no key or OAuth is required. I illustrate by pulling Mass Layoff Statistics data, but you will find all the available data sets and their series code information here.
The key information you will be concerned with is contained in the series identifier. For the Mass Layoff data, the series ID code is MLUMS00NN0001003. Each component of this series code has meaning and can be adjusted to get specific Mass Layoff data. The BLS provides a breakdown of what each component means along with the available list of codes for this data set. For instance, the S00 (MLUMS00NN0001003) component represents the division/state. S00 will pull data for all states, but I could change it to D30 to pull data for the Midwest or S39 to pull data for Ohio. The N0001 (MLUMS00NN0001003) component represents the industry/demographics. N0001 pulls data for all industries, but I could change it to N0008 to pull data for the food industry or C00A2 for all persons age 30-44.
I simply call the series identifier in the blsAPI() function, which pulls the JSON data object. We can then use the fromJSON() function from the rjson package to convert it to an R data object (a list in this case). You can see that the raw data pull provides a list of 4 items. The first three provide some metadata (status, response time, and message if applicable). The data we are concerned with is in the 4th list item (Results$series$data), which contains 31 observations.
library(rjson)
library(blsAPI)
# supply series identifier to pull data (initial pull is in JSON data)
layoffs_json <- blsAPI('MLUMS00NN0001003')
# convert from JSON into R object
layoffs <- fromJSON(layoffs_json)
List of 4
$ status : chr "REQUEST_SUCCEEDED"
$ responseTime: num 38
$ message : list()
$ Results :List of 1
..$ series:List of 1
.. ..$ :List of 2
.. .. ..$ seriesID: chr "MLUMS00NN0001003"
.. .. ..$ data :List of 31
.. .. .. ..$ :List of 5
.. .. .. .. ..$ year : chr "2013"
.. .. .. .. ..$ period : chr "M05"
.. .. .. .. ..$ periodName: chr "May"
.. .. .. .. ..$ value : chr "1383"
One of the inconveniences of an API is that we do not get to specify how the data we receive is formatted. This is a minor price to pay considering all the other benefits APIs provide. Once we understand the format of the received data, we can typically re-format it using a little list subsetting and for looping.
# create empty data frame to fill
layoff_df <- data.frame(NULL)
# extract data of interest from each nested year-month list
for(i in seq_along(layoffs$Results$series[[1]]$data)) {
df <- data.frame(layoffs$Results$series[[1]]$data[i][[1]][1:4])
layoff_df <- rbind(layoff_df, df)
}
head(layoff_df)
## year period periodName value
## 1 2013 M05 May 1383
## 2 2013 M04 April 1174
## 3 2013 M03 March 1132
## 4 2013 M02 February 960
## 5 2013 M01 January 1528
## 6 2012 M13 Annual 17080
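As a side note, the same reshaping can be done without growing the data frame inside an explicit loop. The sketch below is an equivalent lapply()/do.call() version and should produce the same data frame as above.
# an equivalent, loop-free way to build the same data frame: convert each
# year-month list to a one-row data frame, then bind the rows together
layoff_df2 <- do.call(rbind,
                      lapply(layoffs$Results$series[[1]]$data,
                             function(x) data.frame(x[1:4])))
head(layoff_df2)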
blsAPI also allows you to pull multiple data series and has optional arguments (e.g., start year, end year, etc.). You can see other options at help(package = blsAPI).
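For instance, a multi-series request over a range of years might look like the sketch below. The payload element names ('seriesid', 'startyear', 'endyear') come from the BLS API documentation rather than this section, and the Ohio series ID is simply the S00-to-S39 substitution described earlier, so treat both as assumptions and verify them against help(package = blsAPI) and the BLS series catalog.
library(rjson)
library(blsAPI)
# sketch of a multi-series pull; payload element names assumed from the BLS API docs
payload <- list(
  'seriesid'  = c('MLUMS00NN0001003',   # all states, all industries
                  'MLUMS39NN0001003'),  # S00 -> S39: Ohio (unverified substitution)
  'startyear' = '2010',
  'endyear'   = '2013'
)
multi_json <- blsAPI(payload)
multi      <- fromJSON(multi_json)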
The rnoaa package allows users to request climate data from multiple data sets through the National Climatic Data Center API. Unlike blsAPI, rnoaa requires you to have an API key. To request a key, go here and provide your email; a key will immediately be emailed to you.
key <- "vXTdwNoAVx..." # truncated
With the key in hand, we can begin pulling data. The NOAA provides a comprehensive metadata library to familiarize yourself with the data available. Let’s start by pulling all the available NOAA climate stations near my residence. I live in Montgomery County, Ohio, so we can find all the stations in this county by inserting the FIPS code. Furthermore, I’m interested in stations that provide data for the GHCND data set, which contains records on numerous daily variables such as “maximum and minimum temperature, total daily precipitation, snowfall, and snow depth; however, about two thirds of the stations report precipitation only.” See ?ncdc_stations for the other data sets available via rnoaa.
library(rnoaa)
stations <- ncdc_stations(datasetid='GHCND',
locationid='FIPS:39113',
token = key)
stations$data
## Source: local data frame [23 x 9]
##
## elevation mindate maxdate latitude
## (dbl) (chr) (chr) (dbl)
## 1 294.1 2009-02-09 2014-06-25 39.6314
## 2 251.8 2009-03-01 2016-01-16 39.6807
## 3 295.7 2009-03-25 2012-09-08 39.6252
## 4 298.1 2009-08-24 2012-07-20 39.8070
## 5 304.5 2010-04-02 2016-01-12 39.6949
## 6 283.5 2012-07-01 2016-01-16 39.7373
## 7 301.4 2012-07-29 2016-01-16 39.8795
## 8 317.3 2012-09-08 2016-01-12 39.8329
## 9 298.1 2012-09-07 2016-01-15 39.6247
## 10 250.5 2012-09-11 2016-01-08 39.7180
## .. ... ... ... ...
## Variables not shown: name (chr), datacoverage (dbl), id (chr),
## elevationUnit (chr), longitude (dbl)
So we see that several stations are available from which to pull data. To actually pull data from one of these stations we need the station ID. The station I want to pull data from is the Dayton International Airport station. We can see that this station provides data from 1948 to the present, and I can get the station ID as illustrated. Note that I use some dplyr functions for data manipulation here, which you will learn about in a later section; this just illustrates that we received the data via the API.
library(dplyr)
stations$data %>%
filter(name == "DAYTON INTERNATIONAL AIRPORT, OH US") %>%
select(mindate, maxdate, id)
## Source: local data frame [1 x 3]
##
## mindate maxdate id
## (chr) (chr) (chr)
## 1 1948-01-01 2016-01-15 GHCND:USW00093815
To pull all available GHCND data from this station we’ll use ncdc(). We simply supply the data to pull, the start and end dates (ncdc restricts you to a one-year limit), the station ID, and your key. We can see that this station provides a full range of data types.
climate <- ncdc(datasetid='GHCND',
startdate = '2015-01-01',
enddate = '2016-01-01',
stationid='GHCND:USW00093815',
token = key)
climate$data
## Source: local data frame [25 x 8]
##
## date datatype station value fl_m fl_q
## (chr) (chr) (chr) (int) (chr) (chr)
## 1 2015-01-01T00:00:00 AWND GHCND:USW00093815 72
## 2 2015-01-01T00:00:00 PRCP GHCND:USW00093815 0
## 3 2015-01-01T00:00:00 SNOW GHCND:USW00093815 0
## 4 2015-01-01T00:00:00 SNWD GHCND:USW00093815 0
## 5 2015-01-01T00:00:00 TAVG GHCND:USW00093815 -38 H
## 6 2015-01-01T00:00:00 TMAX GHCND:USW00093815 28
## 7 2015-01-01T00:00:00 TMIN GHCND:USW00093815 -71
## 8 2015-01-01T00:00:00 WDF2 GHCND:USW00093815 240
## 9 2015-01-01T00:00:00 WDF5 GHCND:USW00093815 240
## 10 2015-01-01T00:00:00 WSF2 GHCND:USW00093815 130
## .. ... ... ... ... ... ...
## Variables not shown: fl_so (chr), fl_t (chr)
Since we recently had some snow here, let’s pull data on snowfall for 2015. We adjust the limit argument (by default ncdc limits results to 25) and identify the data type we want. By sorting, we see which days experienced the greatest snowfall (don’t worry, the results are reported in mm!).
snow <- ncdc(datasetid='GHCND',
startdate = '2015-01-01',
enddate = '2015-12-31',
limit = 365,
stationid='GHCND:USW00093815',
datatypeid = 'SNOW',
token = key)
snow$data %>%
arrange(desc(value))
## Source: local data frame [365 x 8]
##
## date datatype station value fl_m fl_q
## (chr) (chr) (chr) (int) (chr) (chr)
## 1 2015-03-01T00:00:00 SNOW GHCND:USW00093815 114
## 2 2015-02-21T00:00:00 SNOW GHCND:USW00093815 109
## 3 2015-01-25T00:00:00 SNOW GHCND:USW00093815 71
## 4 2015-01-06T00:00:00 SNOW GHCND:USW00093815 66
## 5 2015-02-16T00:00:00 SNOW GHCND:USW00093815 30
## 6 2015-02-18T00:00:00 SNOW GHCND:USW00093815 25
## 7 2015-02-14T00:00:00 SNOW GHCND:USW00093815 23
## 8 2015-01-26T00:00:00 SNOW GHCND:USW00093815 20
## 9 2015-02-04T00:00:00 SNOW GHCND:USW00093815 20
## 10 2015-02-12T00:00:00 SNOW GHCND:USW00093815 20
## .. ... ... ... ... ... ...
## Variables not shown: fl_so (chr), fl_t (chr)
This is just an intro to rnoaa, as the package offers a slew of data sets to pull from and functions to apply. It even offers built-in plotting functions. Use help(package = "rnoaa") to see all that rnoaa has to offer.
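For example, the built-in plotting helper could be applied to the snow pull above; the argument names in this sketch follow the rnoaa documentation, so confirm them with ?ncdc_plot before relying on them.
# sketch: plot the 2015 snowfall pull with rnoaa's built-in helper
# (breaks/dateformat arguments assumed from the package docs; see ?ncdc_plot)
ncdc_plot(snow, breaks = "1 month", dateformat = "%d/%m")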
The rtimes package provides an interface to the Congress, Campaign Finance, Article Search, and Geographic APIs offered by the New York Times. The data libraries and documentation for the several APIs available can be found here. To use the Times’ API you’ll need to get an API key here.
article_key <- "4f23572d8..." # truncated
cfinance_key <- "ee0b7cef..." # truncated
congress_key <- "57b3e8a3..." # truncated
Let’s start by searching NY Times articles. With the presidential election upon us, we can illustrate by searching for the least controversial candidate…Donald Trump. We can see that there are 4,565 article hits for the term “Trump”. We can get more information on a particular article by subsetting.
library(rtimes)
# article search for the term 'Trump'
articles <- as_search(q = "Trump",
begin_date = "20150101",
end_date = '20160101',
key = article_key)
# summary
articles$meta
## hits time offset
## 1 4565 28 0
# pull info on 3rd article
articles$data[3]
## [[1]]
## <NYTimes article>Donald Trump’s Strongest Supporters: A Certain Kind of Democrat
## Type: News
## Published: 2015-12-31T00:00:00Z
## Word count: 1469
## URL: http://www.nytimes.com/2015/12/31/upshot/donald-trumps-strongest-supporters-a-certain-kind-of-democrat.html
## Snippet: In a survey, he also excels among low-turnout voters and among the less affluent and the less educated, so the question is: Will they show up to vote?
We can use the campaign finance API and functions to gain some insight into Trump’s campaign income and expenditures. The only special data you need is the FEC ID for the candidate of interest.
trump <- cf_candidate_details(campaign_cycle = 2016,
fec_id = 'P80001571',
key = cfinance_key)
# pull summary data
trump$meta
## id name party
## 1 P80001571 TRUMP, DONALD J REP
## fec_uri
## 1 http://docquery.fec.gov/cgi-bin/fecimg/?P80001571
## committee mailing_address mailing_city
## 1 /committees/C00580100.json 725 FIFTH AVENUE NEW YORK
## mailing_state mailing_zip status total_receipts
## 1 NY 10022 O 1902410.45
## total_from_individuals total_from_pacs total_contributions
## 1 92249.33 0 96298.97
## candidate_loans total_disbursements begin_cash end_cash
## 1 1804747.23 1414674.29 0 487736.16
## total_refunds debts_owed date_coverage_from date_coverage_to
## 1 0 1804747.23 2015-04-02 2015-06-30
## independent_expenditures coordinated_expenditures
## 1 1644396.8 0
rtimes also allows us to gain some insight into what our locally elected officials are up to with the Congress API. First, I can get some information on my senator and then use that information to see if he’s supporting my interests. For instance, I can pull the most recent bills that he is co-sponsoring.
# pull info on OH senator
senator <- cg_memberbystatedistrict(chamber = "senate",
state = "OH",
key = congress_key)
senator$meta
## id name role gender party
## 1 B000944 Sherrod Brown Senator, 1st Class M D
## times_topics_url twitter_id youtube_id seniority
## 1 SenSherrodBrown SherrodBrownOhio 9
## next_election
## 1 2018
## api_url
## 1 http://api.nytimes.com/svc/politics/v3/us/legislative/congress/members/B000944.json
# use member ID to pull recent bill sponsorship
bills <- cg_billscosponsor(memberid = "B000944",
type = "cosponsored",
key = congress_key)
head(bills$data)
## Source: local data frame [6 x 11]
##
## congress number
## (chr) (chr)
## 1 114 S.2098
## 2 114 S.2096
## 3 114 S.2100
## 4 114 S.2090
## 5 114 S.RES.267
## 6 114 S.RES.269
## Variables not shown: bill_uri (chr), title (chr), cosponsored_date
## (chr), sponsor_id (chr), introduced_date (chr), cosponsors (chr),
## committees (chr), latest_major_action_date (chr),
## latest_major_action (chr)
It looks like the most recent bill Sherrod is co-sponsoring is S.2098 - Student Right to Know Before You Go Act. Maybe I’ll do a NY Times article search with as_search() to find out more about this bill…an exercise for another time.
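If you did want to chase that thread, the follow-up might look something like the sketch below; whether this particular query surfaces useful coverage of the bill is untested.
# sketch of the follow-up article search mentioned above; the query string is
# just the bill title and may or may not return relevant hits
bill_articles <- as_search(q = "Student Right to Know Before You Go Act",
                           begin_date = "20150101",
                           end_date   = "20160101",
                           key = article_key)
bill_articles$meta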
So this gives you some flavor of how these API packages work. You typically need to know the data sets and variables you are requesting, along with an API key. But once you have these basics, it’s pretty straightforward to request the data. Your next question may be: what if the API that I want to get data from does not yet have an R package developed for it?
Although numerous R API packages are available, and they cover a wide range of data, you may eventually run into a situation where you want to leverage an organization’s API but an R package does not exist. Enter httr. httr was developed by Hadley Wickham to make working with web APIs easy. It offers multiple functions (e.g., HEAD(), POST(), PATCH(), PUT(), and DELETE()); however, the function we are most concerned with today is GET(). We use the GET() function to access an API, provide it some request parameters, and receive an output.
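Before turning to a real API, here is the bare GET() pattern in isolation. The example uses httpbin.org, a public request-echo service that is not part of this section’s examples; it is used here only because it requires no key.
library(httr)
# a base URL plus a named list of query parameters; httpbin echoes them back
resp <- GET("https://httpbin.org/get",
            query = list(q = "weather", units = "metric"))
status_code(resp)    # 200 indicates success
content(resp)$args   # the query parameters we sent, parsed from the JSON reply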
To give you a taste for how the httr package works, I’ll quickly cover how to use it for a basic key-only API and an OAuth-required API:

- Key-only API: illustrated by pulling U.S. Department of Education data available on data.gov
- OAuth-required API: illustrated by pulling tweets from my personal Twitter feed

To demonstrate how to use the httr package for accessing a key-only API, I’ll illustrate with the College Scorecard API provided by the Department of Education. First, you’ll need to request your API key.
edu_key <- "fd783wmS3Z..." # truncated
We can now proceed to use httr to request data from the API with the GET() function. I went to North Dakota State University (NDSU) for my undergrad so I’m interested in pulling some data for this school. I can use the provided data library and query explanation to determine the parameters required. In this example, the URL includes the primary path (“https://api.data.gov/ed/collegescorecard/”), the API version (“v1”), and the endpoint (“schools”). The question mark (“?”) at the end of the URL is included to begin the list of query parameters, which only includes my API key and the school of interest.
library(httr)
URL <- "https://api.data.gov/ed/collegescorecard/v1/schools?"
# import all available data for NDSU
ndsu_req <- GET(URL, query = list(api_key = edu_key,
school.name = "North Dakota State University"))
This request provides me with every piece of information collected by the U.S. Department of Education for NDSU. To retrieve the contents of this request I use the content() function, which will output the data as an R object (a list in this case). The data is segmented into two main components: metadata and results. I’m primarily interested in the results.
The results branch of this list provides information on lat-long location, school identifier codes, some basic info on the school (city, number of branches, school website, accreditor, etc.), and then student data for the years 1997-2013.
ndsu_data <- content(ndsu_req)
names(ndsu_data)
## [1] "metadata" "results"
names(ndsu_data$results[[1]])
## [1] "2008" "2009" "2006" "ope6_id" "2007" "2004"
## [7] "2013" "2005" "location" "2002" "2003" "id"
## [13] "1996" "1997" "school" "1998" "2012" "2011"
## [19] "2010" "ope8_id" "1999" "2001" "2000"
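For example, since “school” and “location” appear among the names above, you can peek inside those branches for the basic descriptive fields before drilling into the yearly data. The specific fields they contain are not listed here, so inspect the names before relying on any of them.
# inspect the non-year branches of the results; check the field names returned
# before subsetting on any of them
names(ndsu_data$results[[1]]$school)
names(ndsu_data$results[[1]]$location)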
To see what kind of student data categories are offered we can assess a single year. You can see that available data includes earnings, academics, student info/demographics, admissions, costs, etc. With such a large data set, which includes many embedded lists, sometimes the easiest way to learn the data structure is to peruse names at different levels.
# student data categories available by year
names(ndsu_data$results[[1]]$`2013`)
## [1] "earnings" "academics" "student" "admissions" "repayment"
## [6] "aid" "cost" "completion"
# cost categories available by year
names(ndsu_data$results[[1]]$`2013`$cost)
## [1] "title_iv" "avg_net_price" "attendance" "tuition"
## [5] "net_price"
# Avg net price cost categories available by year
names(ndsu_data$results[[1]]$`2013`$cost$avg_net_price)
## [1] "other_academic_year" "overall" "program_year"
## [4] "public" "private"
So if I’m interested in comparing the rise in cost versus the rise in student debt I can simply subset for this data once I’ve identified its location and naming structure. Note that for this subsetting we use the magrittr package and the sapply function; both are covered in more detail in their relevant sections. This is just meant to illustrate the types of data available through this API.
library(magrittr)
# subset list for annual student data only
ndsu_yr <- ndsu_data$results[[1]][c(as.character(1996:2013))]
# extract median debt data for each year
ndsu_yr %>%
sapply(function(x) x$aid$median_debt$completers$overall) %>%
unlist()
## 1997 1998 1999 2000 2001 2002 2003 2004
## 13388.0 13856.0 14500.0 15125.0 15507.0 15639.0 16251.0 16642.5
## 2005 2006 2007 2008 2009 2010 2011 2012
## 17125.0 17125.0 17125.0 17250.0 19125.0 21500.0 23000.0 24954.5
## 2013
## 25050.0
# extract net price for each year
ndsu_yr %>%
sapply(function(x) x$cost$avg_net_price$overall) %>%
unlist()
## 2009 2010 2011 2012 2013
## 13474 12989 13808 15113 14404
Quite simple, isn’t it…at least once you’ve learned how the query requests are formatted for a particular API.
At the outset I mentioned how OAuth is an authorization framework that provides credentials as proof for access. Many APIs are open to the public and only require an API key; however, some APIs require authorization to access account data (think personal Facebook & Twitter accounts). To access these accounts we must provide proper credentials, and OAuth authentication allows us to do this. This section is not meant to explain the details of OAuth (for that see this, this, and this) but, rather, how to use httr when OAuth is required.
I’ll demonstrate by accessing the Twitter API using my Twitter account. The first thing we need to do is identify the OAuth endpoints used to request access and authorization. To do this we can use oauth_endpoint(), which typically requires a request URL, authorization URL, and access URL. httr also includes some baked-in endpoints, including LinkedIn, Twitter, Vimeo, Google, Facebook, and GitHub. We can see the Twitter endpoints using the following:
twitter_endpts <- oauth_endpoints("twitter")
twitter_endpts
## <oauth_endpoint>
## request: https://api.twitter.com/oauth/request_token
## authorize: https://api.twitter.com/oauth/authenticate
## access: https://api.twitter.com/oauth/access_token
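If you ever need to work with an API that does not have a baked-in endpoint, you can construct the same kind of object yourself with oauth_endpoint(); the sketch below simply recreates the Twitter endpoint printed above to show the pattern.
# manually constructing an endpoint object; this mirrors the baked-in Twitter
# endpoint above and generalizes to APIs httr doesn't already know about
twitter_endpts_manual <- oauth_endpoint(
  request   = "https://api.twitter.com/oauth/request_token",
  authorize = "https://api.twitter.com/oauth/authenticate",
  access    = "https://api.twitter.com/oauth/access_token"
)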
Next, I register my application at https://apps.twitter.com/. One thing to note: during the registration process it will ask you for the callback URL; be sure to use “http://127.0.0.1:1410”. Once registered, Twitter will provide you with keys and access tokens. The two we are concerned with are the API key and API secret.
twitter_key <- "BZgukbCol..." # truncated
twitter_secret <- "YpB8Xy..." # truncated
We can then bundle the consumer key and secret into one object with oauth_app(). The first argument, appname, is simply used as a local identifier; it does not need to match the name you gave the Twitter app you developed at https://apps.twitter.com/.
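A minimal sketch of that bundling step follows; the appname value (“r_api_demo”) is an arbitrary local label chosen here, not something tied to the registered app.
# bundle the consumer key and secret into an app object; the appname is just a
# local label and does not need to match the registered app name
twitter_app <- oauth_app(appname = "r_api_demo",
                         key     = twitter_key,
                         secret  = twitter_secret)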
We are now ready to ask for access credentials. Since Twitter uses OAuth 1.0 we use the oauth1.0_token() function and incorporate the endpoints identified and the oauth_app object we previously named twitter_app.
twitter_token <- oauth1.0_token(endpoint = twitter_endpts, twitter_app)
Waiting for authentication in browser...
Press Esc/Ctrl + C to abort
Authentication complete.
Once authentication is complete we can now use the API. I can pull all the tweets that show up on my personal timeline using the GET() function and the access credentials I stored in twitter_token. I then use content() to convert the response to a list and I can start to analyze the data.
In this case each tweet is saved as an individual list item and a full range of data is provided for each tweet (e.g., id, text, user, geo location, favorite count). For instance, we can see that the first tweet was by FiveThirtyEight concerning American politics and, at the time of this analysis, had been favorited by 3 people.
# request Twitter data
req <- GET("https://api.twitter.com/1.1/statuses/home_timeline.json",
config(token = twitter_token))
# convert to R object
tweets <- content(req)
# available data for first tweet on my timeline
names(tweets[[1]])
[1] "created_at" "id"
[3] "id_str" "text"
[5] "source" "truncated"
[7] "in_reply_to_status_id" "in_reply_to_status_id_str"
[9] "in_reply_to_user_id" "in_reply_to_user_id_str"
[11] "in_reply_to_screen_name" "user"
[13] "geo" "coordinates"
[15] "place" "contributors"
[17] "is_quote_status" "retweet_count"
[19] "favorite_count" "entities"
[21] "extended_entities" "favorited"
[23] "retweeted" "possibly_sensitive"
[25] "possibly_sensitive_appealable" "lang"
# further analysis of first tweet on my timeline
tweets[[1]]$user$name
[1] "FiveThirtyEight"
tweets[[1]]$text
[1] "\U0001f3a7 A History Of Data In American Politics (Part 1): William Jennings Bryan to Barack Obama https://t.co/oCKzrXuRHf https://t.co/6CvKKToxoH"
tweets[[1]]$favorite_count
[1] 3
This provides a fairly simple example of incorporating OAuth authorization. The httr package provides several demos of accessing common social network APIs that require OAuth. I recommend you go through several of these examples to get familiar with using OAuth authorization; see them at demo(package = "httr"). The most difficult aspect of creating your own connections with APIs is gaining an understanding of the API and the arguments it leverages. This obviously requires time and energy devoted to digging into the API documentation and data library. After that, it’s just a matter of trial and error (likely more of the latter than the former) to learn how to translate these arguments into httr function calls to pull the data of interest.
Also, note that httr provides several other useful functions not covered here for communicating with APIs (e.g., POST(), BROWSE()). For more on these other httr capabilities see this quickstart vignette.
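As a final illustration of the same request pattern with a different verb, here is a minimal POST() sketch against httpbin.org, the public request-echo service used earlier; it is chosen only because it needs no key.
library(httr)
# POST() mirrors GET(): a URL plus request components, here a form-encoded body
resp <- POST("https://httpbin.org/post",
             body = list(x = "1", y = "2"),
             encode = "form")
status_code(resp)      # 200 if the request succeeded
content(resp)$form     # httpbin echoes the form fields back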