Scraping Data

Rapid growth of the World Wide Web has significantly changed the way we share, collect, and publish data. Vast amounts of information are now stored online, in both structured and unstructured forms. For many questions and research topics this has created a new problem - the concern is no longer data scarcity and inaccessibility but, rather, overcoming the tangled masses of online data.

Collecting data from the web is not an easy process, as there are many technologies used to distribute web content (e.g. HTML, XML, JSON). Therefore, dealing with more advanced web scraping requires familiarity with accessing data stored in these formats via R. In this section I provide an introduction to some of the fundamental tools required to perform basic web scraping: importing spreadsheet data files stored online, scraping HTML text, scraping HTML table data, and leveraging APIs to scrape data.

My purpose in the following sections is to discuss these topics at a level meant to get you started in web scraping; however, this area is vast and complex and this chapter is far from providing expert-level insight. To advance your knowledge I highly recommend getting copies of XML and Web Technologies for Data Sciences with R and Automated Data Collection with R.

Note: the examples provided below were performed in 2015. Consequently, if you run the code provided throughout these examples your outputs may differ because webpages and their content change over time.


Importing Spreadsheet Data Files Stored Online

The most basic form of getting data from online is to import tabular (e.g. .txt, .csv) or Excel files that are being hosted online. This is often not considered web scraping; however, I think it’s a good place to start introducing the user to interacting with the web to obtain data. Importing tabular data is especially common for the many types of government data available online. A quick perusal of Data.gov illustrates over 190,000 examples. In fact, we can provide our first example of importing online tabular data by downloading the Data.gov .csv file that lists all the federal agencies that supply data to Data.gov.

# the url for the online CSV
url <- "https://s3.amazonaws.com/bsp-ocsit-prod-east-appdata/datagov/wordpress/agency-participation.csv"

# use read.csv to import
data_gov <- read.csv(url, stringsAsFactors = FALSE)

# for brevity I only display first 6 rows
head(data_gov)
##                                      Agency.Name                 Sub.Agency.Publisher Organization.Type Datasets Last.Entry
## 1                Broadcasting Board of Governors                                          Federal-Other        6 01/12/2014
## 2           Commodity Futures Trading Commission                                          Federal-Other        3 01/12/2014
## 3           Consumer Financial Protection Bureau                                          Federal-Other        2 09/26/2015
## 4           Consumer Financial Protection Bureau Consumer Financial Protection Bureau     Federal-Other        2 09/26/2015
## 5 Corporation for National and Community Service                                          Federal-Other        3 01/12/2014
## 6 Court Services and Offender Supervision Agency                                          Federal-Other        1 01/12/2014

Downloading Excel spreadsheets hosted online can be performed just as easily. Recall that there is no base R function for importing Excel data; however, several packages exist to handle this capability. One package that works smoothly for pulling Excel data from URLs is gdata. With gdata we can use read.xls() to download this Fair Market Rents for Section 8 Housing Excel file from the given URL.

library(gdata)

# the url for the online Excel file
url <- "http://www.huduser.org/portal/datasets/fmr/fmr2015f/FY2015F_4050_Final.xls"

# use read.xls to import
rents <- read.xls(url)

rents[1:6, 1:10]
##    fips2000  fips2010 fmr2 fmr0 fmr1 fmr3 fmr4 county State CouSub
## 1 100199999 100199999  788  628  663 1084 1288      1     1  99999
## 2 100399999 100399999  762  494  643 1123 1318      3     1  99999
## 3 100599999 100599999  670  492  495  834  895      5     1  99999
## 4 100799999 100799999  773  545  652 1015 1142      7     1  99999
## 5 100999999 100999999  773  545  652 1015 1142      9     1  99999
## 6 101199999 101199999  599  481  505  791 1061     11     1  99999

Note that many of the arguments covered in the Importing Data section (e.g. specifying sheets to read from, skipping lines) also apply to read.xls(). In addition, gdata provides some useful functions (sheetCount() and sheetNames()) for identifying whether multiple sheets exist prior to importing. Check out ?gdata for more help.
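For example, to check how many worksheets the Fair Market Rents workbook contains before importing it, a minimal sketch (downloading to a temporary file first so the gdata helpers work reliably; the results will depend on the file):

# download the workbook to a temporary file and inspect its sheets
tmp_xls <- tempfile(fileext = ".xls")
download.file(url, tmp_xls, mode = "wb")

sheetCount(tmp_xls)   # number of worksheets in the workbook
sheetNames(tmp_xls)   # names of those worksheets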


Special note when using gdata on Windows: When downloading Excel spreadsheets from the internet, Mac users will be able to install the gdata package, attach the library, and be highly functional right away. Windows users, on the other hand, do not have a Perl interpreter installed by default.

install.packages("gdata")

If you are a Windows user and attempt to attach the gdata library immediately after installation, you will likely receive a warning message like the one shown below. You will not be able to download Excel spreadsheets from the internet without installing some additional software.

Figure: the warning message produced when attaching gdata without Perl installed

Unfortunately, it’s not as straightforward to fix as the error message would indicate. Running installXLSXsupport() won’t completely solve the problem without Perl installed. In order for gdata to function properly, you must install ActiveState Perl from the following link: http://www.activestate.com/activeperl/. The download could take up to 10 minutes or so, and when finished, you will need to find where the software was stored on your machine (likely directly on the C:/ drive).

Once Perl software has been installed, you will need to direct R to find it each time you call the function.

# use read.xls to import
rents <- read.xls(url, perl = "C:/Perl64/bin/perl.exe")

Another common form of file storage is the zip file. For instance, the Bureau of Labor Statistics (BLS) stores its public-use microdata for the Consumer Expenditure Survey in .zip files. We can use download.file() to download the file to our working directory and then work with the data as desired.

url <- "http://www.bls.gov/cex/pumd/data/comma/diary14.zip"

# download .zip file and unzip contents
download.file(url, destfile = "dataset.zip", mode = "wb")
unzip("dataset.zip", exdir = "./")

# assess the files contained in the .zip file which
# unzips as a folder named "diary14"
list.files("diary14")
##  [1] "dtbd141.csv" "dtbd142.csv" "dtbd143.csv" "dtbd144.csv" "dtid141.csv"
##  [6] "dtid142.csv" "dtid143.csv" "dtid144.csv" "expd141.csv" "expd142.csv"
## [11] "expd143.csv" "expd144.csv" "fmld141.csv" "fmld142.csv" "fmld143.csv"
## [16] "fmld144.csv" "memd141.csv" "memd142.csv" "memd143.csv" "memd144.csv"

# alternatively, if we know the file we want prior to unzipping
# we can extract the file without unzipping using unz():
zip_data <- read.csv(unz("dataset.zip", "diary14/expd141.csv"))
zip_data[1:5, 1:10]
##     NEWID ALLOC COST GIFT PUB_FLAG    UCC EXPNSQDY EXPN_QDY EXPNWKDY   EXPN_KDY
## 1 2825371     0 6.26    2        2 190112        1        D        3        D
## 2 2825371     0 1.20    2        2 190322        1        D        3        D
## 3 2825381     0 0.98    2        2  20510        3        D        2        D
## 4 2825381     0 0.98    2        2  20510        3        D        2        D
## 5 2825381     0 2.50    2        2  20510        3        D        2        D

The .zip archive file format is meant to compress files and is typically used on files of significant size. For instance, the Consumer Expenditure Survey data we downloaded in the previous example is over 10MB. Obviously there may be times in which we want to pull specific data from the .zip file to analyze but not permanently store the entire .zip file contents. In these instances we can use the following process, proposed by Dirk Eddelbuettel, to temporarily download the .zip file, extract the desired data, and then discard the .zip file.

# Create a temp. file name
temp <- tempfile()

# Use download.file() to fetch the file into the temp. file
download.file("http://www.bls.gov/cex/pumd/data/comma/diary14.zip",temp)

# Use unz() to extract the target file from temp. file
zip_data2 <- read.csv(unz(temp, "diary14/expd141.csv"))

# Remove the temp file via unlink()
unlink(temp)

zip_data2[1:5, 1:10]
##     NEWID ALLOC COST GIFT PUB_FLAG    UCC EXPNSQDY EXPN_QDY EXPNWKDY   EXPN_KDY
## 1 2825371     0 6.26    2        2 190112        1        D        3        D
## 2 2825371     0 1.20    2        2 190322        1        D        3        D
## 3 2825381     0 0.98    2        2  20510        3        D        2        D
## 4 2825381     0 0.98    2        2  20510        3        D        2        D
## 5 2825381     0 2.50    2        2  20510        3        D        2        D

One last common scenario I’ll cover when importing spreadsheet data from online is when we identify multiple data sets that we’d like to download but that are not centrally stored in a single .zip file or the like. For example, there are multiple data files scattered throughout the Maryland State Board of Elections website that we may want to download. The first objective here is to identify the relevant links. The XML package provides the useful getHTMLLinks() function to identify these links.

library(XML)

url <- "http://www.elections.state.md.us/elections/2012/election_data/index.html"
links <- getHTMLLinks(url)
head(links)
## [1] "http://www.maryland.gov/"                              
## [2] "../../../index.html"                                   
## [3] "../../../index.html"                                   
## [4] "https://www.facebook.com/MarylandStateBoardofElections"
## [5] "https://twitter.com/md_sbe"                            
## [6] "http://www.maryland.gov/pages/social_media.aspx"

However, this provides all the links on the page. For this example we will only concentrate on the links that contain .csv files. Here, we see that there are 307 .csv files.

library(stringr)

# extract only the links that contain ".csv"
links_data <- links[str_detect(links, ".csv")]

length(links_data)
## [1] 307

If we wanted to we could loop through and download all 307 .csv files. However, for simplicity we’ll just download the first six files and save each one as a separate data frame.

links_data <- head(links_data)

for(i in seq_along(links_data)) {
  
  # step 1: paste .csv portion of link to the base URL
  url <- paste0("http://www.elections.state.md.us/elections/2012/election_data/",
                links_data[i])
  
  # step 2: download .csv file and save as df
  df <- read.csv(url)
  
  # step 3: rename df
  assign(paste0("df", i), df)
}

I now have the six downloaded data frames in my global environment saved as “df1”, “df2”, “df3”, “df4”, “df5”, and “df6”.

sapply(paste0("df", 1:6), exists)
##  df1  df2  df3  df4  df5  df6 
## TRUE TRUE TRUE TRUE TRUE TRUE

head(df1)
##   County                            Candidate.Name Party             Office.Name Office.District Winner CONG.08 CONG.04 CONG.07 CONG.01 CONG.03 CONG.02 CONG.05 CONG.06
## 1      0                              Barack Obama   DEM   President - Vice Pres                      Y   30138   37385   38838   15734   26963   22444   34224   26702
## 2      0 Uncommitted To Any Presidential Candidate   DEM   President - Vice Pres                           3186    1815    2797    6007    4959    4408    3640    4991
## 3      0                      Raymond Levi Blagmon   DEM U.S. Senator                                       443     662     672     568     646     594     589     786
## 4      0                                Ben Cardin   DEM U.S. Senator                                 Y   28493   22818   31357   16589   25800   20669   23218   24589
## 5      0                              J. P. Cusick   DEM U.S. Senator                                       350     229     387     686     526     562     612     713
## 6      0                              Chris Garner   DEM U.S. Senator                                       976     694     733    1232     829     838     961    1577
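As an aside, assign() works but scatters the objects across the global environment. A tidier pattern (a sketch under the same assumptions as the loop above) is to collect the data frames in a single named list:

# download all six files into one named list of data frames
base_url <- "http://www.elections.state.md.us/elections/2012/election_data/"

md_data <- lapply(paste0(base_url, links_data), read.csv)
names(md_data) <- paste0("df", seq_along(links_data))

# individual data frames are then available as md_data$df1, md_data[["df2"]], etc.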

These examples provide the basics required for downloading most tabular and Excel files from online. However, this is just the beginning of importing/scraping data from the web. Next, we’ll start exploring the more conventional forms of scraping text and data stored in HTML webpages.

Exercises

  1. Import the following csv file directly from this url: https://bradleyboehmke.github.io/public/data/reddit.csv.
  2. Import the .xlsx file directly from this url: http://www.huduser.gov/portal/datasets/fmr/fmr2017/FY2017_4050_FMR.xlsx.
  3. Go to this University of Dayton webpage http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm. Download the weather data for a selected city. Can you download data for all the Alabama cities?


Scraping HTML Text

Vast amounts of information exist across the interminable webpages found online. Much of this information is “unstructured” text that may be useful in our analyses. This section covers the basics of scraping such text from online sources. Throughout this section I will illustrate how to extract different text components of webpages by dissecting the Wikipedia page on web scraping. However, it’s important to first cover one of the basic components of HTML elements, as we will leverage this information to pull the desired text. I offer only enough insight required to begin scraping; I highly recommend XML and Web Technologies for Data Sciences with R and Automated Data Collection with R to learn more about HTML and XML element structures.

HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. The tags that typically contain the textual content we wish to scrape, and the tags we will leverage in the next two sections, include:

  • <h1>, <h2>,…,<h6>: Largest heading, second largest heading, etc.
  • <p>: Paragraph elements
  • <ul>: Unordered bulleted list
  • <ol>: Ordered list
  • <li>: Individual list item
  • <div>: Division or section
  • <table>: Table

For example, text in paragraph form that you see online is wrapped with the HTML paragraph tag <p> as in:

<p>
This paragraph represents
a typical text paragraph
in HTML form
</p>

It is through these tags that we can start to extract textual components (also referred to as nodes) of HTML webpages.

Scraping HTML Nodes

To scrape online text we’ll make use of the relatively new rvest package. rvest was created by the RStudio team, inspired by libraries such as Beautiful Soup, and it has greatly simplified web scraping. rvest provides multiple functionalities; however, in this section we will focus only on extracting HTML text with rvest. It’s important to note that rvest makes use of the pipe operator (%>%) developed through the magrittr package. If you are not familiar with the functionality of %>% I recommend you jump to the section on Simplifying Your Code with %>% so that you have a better understanding of what’s going on with the code.
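As a quick refresher (a minimal illustration, not specific to rvest): the pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is equivalent to f(x, y).

library(magrittr)

# these two calls are equivalent
head(letters, 3)
## [1] "a" "b" "c"

letters %>% head(3)
## [1] "a" "b" "c"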

To extract text from a webpage of interest, we specify what HTML elements we want to select by using html_nodes(). For instance, if we want to scrape the primary heading for the Web Scraping Wikipedia webpage we simply identify the <h1> node as the node we want to select. html_nodes() will identify all <h1> nodes on the webpage and return the HTML element. In our example we see there is only one <h1> node on this webpage.

library(rvest)

scraping_wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

scraping_wiki %>%
        html_nodes("h1")
## {xml_nodeset (1)}
## [1] <h1 id="firstHeading" class="firstHeading" lang="en">Web scraping</h1>

To extract only the heading text for this <h1> node, and not include all the HTML syntax we use html_text() which returns the heading text we see at the top of the Web Scraping Wikipedia page.

scraping_wiki %>%
        html_nodes("h1") %>%
        html_text()
## [1] "Web scraping"

If we want to identify all the second level headings on the webpage we follow the same process but instead select the <h2> nodes. In this example we see there are 10 second level headings on the Web Scraping Wikipedia page.

scraping_wiki %>%
        html_nodes("h2") %>%
        html_text()
##  [1] "Contents"                             
##  [2] "Techniques[edit]"                     
##  [3] "Legal issues[edit]"                   
##  [4] "Notable tools[edit]"                  
##  [5] "See also[edit]"                       
##  [6] "Technical measures to stop bots[edit]"
##  [7] "Articles[edit]"                       
##  [8] "References[edit]"                     
##  [9] "See also[edit]"                       
## [10] "Navigation menu"

Next, we can move on to extracting much of the text on this webpage, which is in paragraph form. We can follow the same process illustrated above, but instead we’ll select all <p> nodes. This selects the 17 paragraph elements from the webpage, which we can examine by subsetting the list p_nodes to see the first line of each paragraph along with the HTML syntax. Just as before, to extract the text from these nodes and coerce them to a character string we simply apply html_text().

p_nodes <- scraping_wiki %>% 
        html_nodes("p")

length(p_nodes)
## [1] 17

p_nodes[1:6]
## {xml_nodeset (6)}
## [1] <p><b>Web scraping</b> (<b>web harvesting</b> or <b>web data extract ...
## [2] <p>Web scraping is closely related to <a href="/wiki/Web_indexing" t ...
## [3] <p/>
## [4] <p/>
## [5] <p>Web scraping is the process of automatically collecting informati ...
## [6] <p>Web scraping may be against the <a href="/wiki/Terms_of_use" titl ...


p_text <- scraping_wiki %>%
        html_nodes("p") %>%
        html_text()

p_text[1]
## [1] "Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Mozilla Firefox."

Not too bad; however, we may not have captured all the text that we were hoping for. Since we extracted text for all <p> nodes, we collected all identified paragraph text; however, this does not capture the text in the bulleted lists. For example, when you look at the Web Scraping Wikipedia page you will notice a significant amount of text in bulleted list format following the third paragraph under the Techniques heading. If we look at our data we’ll see that the text in this list format is not captured between the two paragraphs:

p_text[5]
## [1] "Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire web sites into structured information, with limitations."

p_text[6]
## [1] "Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.[4] While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of \"scrapers\" or \"robots\" may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.[7]"

This is because the text in this list format is contained in <ul> nodes. To capture the text in lists, we can use the same steps as above but select the specific nodes that represent HTML list components. We can approach extracting list text two ways.

First, we can pull all list elements (<ul>). When scraping all <ul> text, the resulting data structure will be a character string vector with each element representing a single list consisting of all list items in that list. In our running example there are 21 list elements as shown in the example that follows. You can see the first list scraped is the table of contents and the second list scraped is the list in the Techniques section.

ul_text <- scraping_wiki %>%
        html_nodes("ul") %>%
        html_text()

length(ul_text)
## [1] 21

ul_text[1]
## [1] "\n1 Techniques\n2 Legal issues\n3 Notable tools\n4 See also\n5 Technical measures to stop bots\n6 Articles\n7 References\n8 See also\n"

# read the first 200 characters of the second list
substr(ul_text[2], start = 1, stop = 200)
## [1] "\nHuman copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the web"

An alternative approach is to pull all <li> nodes. This will pull the text contained in each list item for all the lists. In our running example there are 147 list items that we can extract from this Wikipedia page. The first eight list items are the list of contents we see towards the top of the page. List items 9-17 are the list elements contained in the “Techniques” section, list items 18-44 are the items listed under the “Notable Tools” section, and so on.

li_text <- scraping_wiki %>%
        html_nodes("li") %>%
        html_text()

length(li_text)
## [1] 147

li_text[1:8]
## [1] "1 Techniques"                      "2 Legal issues"                   
## [3] "3 Notable tools"                   "4 See also"                       
## [5] "5 Technical measures to stop bots" "6 Articles"                       
## [7] "7 References"                      "8 See also"

At this point we may believe we have all the text desired and proceed with joining the paragraph (p_text) and list (ul_text or li_text) character strings and then performing the desired textual analysis. However, we may now have captured more text than we were hoping for. For example, by scraping all lists we are also capturing the listed links in the left margin of the webpage. If we look at list items 104-136 that we scraped, we’ll see that these texts correspond to the left margin text.

li_text[104:136]
##  [1] "Main page"           "Contents"            "Featured content"   
##  [4] "Current events"      "Random article"      "Donate to Wikipedia"
##  [7] "Wikipedia store"     "Help"                "About Wikipedia"    
## [10] "Community portal"    "Recent changes"      "Contact page"       
## [13] "What links here"     "Related changes"     "Upload file"        
## [16] "Special pages"       "Permanent link"      "Page information"   
## [19] "Wikidata item"       "Cite this page"      "Create a book"      
## [22] "Download as PDF"     "Printable version"   "Català"             
## [25] "Deutsch"             "Español"             "Français"           
## [28] "Íslenska"            "Italiano"            "Latviešu"           
## [31] "Nederlands"          "日本語"              "Српски / srpski"
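If our interest is only in the article’s content, one option is to drop the navigation items before combining the paragraph and list text. The index cutoff below is an assumption tied to this 2015 snapshot of the page and would need to be re-checked against the current page structure:

# combine paragraph text with only the content-related list items;
# items 104 onward were identified above as left-margin/navigation text
main_text <- c(p_text, li_text[1:103])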

If we desire to scrape every piece of text on the webpage then this won’t be of concern. In fact, if we want to scrape all the text regardless of the content it represents, there is an easier approach. We can capture all the content, including text in paragraphs (<p>), lists (<ul>, <ol>, and <li>), and even data in tables (<table>), by using <div>. This is because these other elements are usually subsidiaries of an HTML division or section, so pulling all <div> nodes will extract all text contained in that division or section regardless of whether it is also contained in a paragraph or list.

all_text <- scraping_wiki %>%
        html_nodes("div") %>% 
        html_text()

Scraping Specific HTML Nodes

However, if we are concerned only with specific content on the webpage then we need to make our HTML node selection process a little more focused. To do this, we can use our browser’s developer tools to examine the webpage we are scraping and get more details on specific nodes of interest. If you are using Chrome or Firefox you can open the developer tools by pressing F12 (Cmd + Opt + I on a Mac), or for Safari you would use Command-Option-I. An additional option, recommended by Hadley Wickham, is to use SelectorGadget, a Chrome extension that helps you identify the webpage elements you need.

Once the developer tools are opened your primary concern is with the element selector, which is located in the top left-hand corner of the developer tools window.

Developer Tools: Element Selector

Once you’ve activated the element selector you can scroll over the elements of the webpage, which will cause each element you scroll over to be highlighted. Once you’ve identified the element you want to focus on, select it. This will cause the element to be identified in the developer tools window. For example, if I am only interested in the main body of the Web Scraping content on the Wikipedia page then I would select the element that highlights the entire center component of the webpage. This highlights the corresponding element <div id="bodyContent" class="mw-body-content"> in the developer tools window as the following illustrates.

Selecting Content of Interest

I can now use this information to select and scrape all the text from this specific <div> node by calling the ID name (“#mw-content-text”) in html_nodes(). As you can see below, the text that is scraped begins with the first line in the main body of the Web Scraping content and ends with the text in the See Also section, which is the last bit of text directly pertaining to Web Scraping on the webpage. Explicitly, we have pulled the specific text associated with the web content we desire.

body_text <- scraping_wiki %>%
        html_nodes("#mw-content-text") %>% 
        html_text()

# read the first 207 characters
substr(body_text, start = 1, stop = 207)
## [1] "Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web"

# read the last 73 characters
substr(body_text, start = nchar(body_text)-73, stop = nchar(body_text))
## [1] "See also[edit]\n\nData scraping\nData wrangling\nKnowledge extraction\n\n\n\n\n\n\n\n\n"

Using the developer tools approach allows us to be as specific as we desire. We can identify the class name for a specific HTML element and scrape the text for only that node rather than all the other elements with similar tags. This allows us to scrape the main body of content as we just illustrated or we can also identify specific headings, paragraphs, lists, and list components if we desire to scrape only these specific pieces of text:

# Scraping a specific heading
scraping_wiki %>%
        html_nodes("#Techniques") %>% 
        html_text()
## [1] "Techniques"

# Scraping a specific paragraph
scraping_wiki %>%
        html_nodes("#mw-content-text > p:nth-child(20)") %>% 
        html_text()
## [1] "In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.[20][21]"

# Scraping a specific list
scraping_wiki %>%
        html_nodes("#mw-content-text > div:nth-child(22)") %>% 
        html_text()
## [1] "\n\nApache Camel\nArchive.is\nAutomation Anywhere\nConvertigo\ncURL\nData Toolbar\nDiffbot\nFirebug\nGreasemonkey\nHeritrix\nHtmlUnit\nHTTrack\niMacros\nImport.io\nJaxer\nNode.js\nnokogiri\nPhantomJS\nScraperWiki\nScrapy\nSelenium\nSimpleTest\nwatir\nWget\nWireshark\nWSO2 Mashup Server\nYahoo! Query Language (YQL)\n\n"

# Scraping a specific reference list item
scraping_wiki %>%
        html_nodes("#cite_note-22") %>% 
        html_text()
## [1] "^ \"Web Scraping: Everything You Wanted to Know (but were afraid to ask)\". Distil Networks. 2015-07-22. Retrieved 2015-11-04. "

Cleaning up

With any web scraping activity, especially involving text, there is likely to be some clean-up involved. For example, in the previous example we saw that we can specifically pull the list of Notable Tools; however, you can see that in between each list item, rather than a space, there are one or more \n, which is the newline character. We can clean this up quickly with a little character string manipulation.

library(magrittr)

scraping_wiki %>%
        html_nodes("#mw-content-text > div:nth-child(22)") %>% 
        html_text()
## [1] "\n\nApache Camel\nArchive.is\nAutomation Anywhere\nConvertigo\ncURL\nData Toolbar\nDiffbot\nFirebug\nGreasemonkey\nHeritrix\nHtmlUnit\nHTTrack\niMacros\nImport.io\nJaxer\nNode.js\nnokogiri\nPhantomJS\nScraperWiki\nScrapy\nSelenium\nSimpleTest\nwatir\nWget\nWireshark\nWSO2 Mashup Server\nYahoo! Query Language (YQL)\n\n"

scraping_wiki %>%
        html_nodes("#mw-content-text > div:nth-child(22)") %>% 
        html_text() %>% 
        strsplit(split = "\n") %>%
        unlist() %>%
        .[. != ""]
##  [1] "Apache Camel"                "Archive.is"                 
##  [3] "Automation Anywhere"         "Convertigo"                 
##  [5] "cURL"                        "Data Toolbar"               
##  [7] "Diffbot"                     "Firebug"                    
##  [9] "Greasemonkey"                "Heritrix"                   
## [11] "HtmlUnit"                    "HTTrack"                    
## [13] "iMacros"                     "Import.io"                  
## [15] "Jaxer"                       "Node.js"                    
## [17] "nokogiri"                    "PhantomJS"                  
## [19] "ScraperWiki"                 "Scrapy"                     
## [21] "Selenium"                    "SimpleTest"                 
## [23] "watir"                       "Wget"                       
## [25] "Wireshark"                   "WSO2 Mashup Server"         
## [27] "Yahoo! Query Language (YQL)"

Similarly, as we saw in our example above with scraping the main body content (body_text), there are extra characters (e.g. \n, \, ^) in the text that we may not want. Using a little regex we can clean this up so that our character string consists of only the text that we see on the screen and no additional HTML code embedded throughout the text.

library(stringr)

# read the last 700 characters
substr(body_text, start = nchar(body_text)-700, stop = nchar(body_text))
## [1] " 2010). \"Intellectual Property: Website Terms of Use\". Issue 26: June 2010. LK Shields Solicitors Update. p. 03. Retrieved 2012-04-19. \n^ National Office for the Information Economy (February 2004). \"Spam Act 2003: An overview for business\" (PDF). Australian Communications Authority. p. 6. Retrieved 2009-03-09. \n^ National Office for the Information Economy (February 2004). \"Spam Act 2003: A practical guide for business\" (PDF). Australian Communications Authority. p. 20. Retrieved 2009-03-09. \n^ \"Web Scraping: Everything You Wanted to Know (but were afraid to ask)\". Distil Networks. 2015-07-22. Retrieved 2015-11-04. \n\n\nSee also[edit]\n\nData scraping\nData wrangling\nKnowledge extraction\n\n\n\n\n\n\n\n\n"

# clean up text
body_text %>%
        str_replace_all(pattern = "\n", replacement = " ") %>%
        str_replace_all(pattern = "[\\^]", replacement = " ") %>%
        str_replace_all(pattern = "\"", replacement = " ") %>%
        str_replace_all(pattern = "\\s+", replacement = " ") %>%
        str_trim(side = "both") %>%
        substr(start = nchar(body_text)-700, stop = nchar(body_text))
## [1] "012-04-19. National Office for the Information Economy (February 2004). Spam Act 2003: An overview for business (PDF). Australian Communications Authority. p. 6. Retrieved 2009-03-09. National Office for the Information Economy (February 2004). Spam Act 2003: A practical guide for business (PDF). Australian Communications Authority. p. 20. Retrieved 2009-03-09. Web Scraping: Everything You Wanted to Know (but were afraid to ask) . Distil Networks. 2015-07-22. Retrieved 2015-11-04. See also[edit] Data scraping Data wrangling Knowledge extraction"

So there we have it, text scraping in a nutshell. Although not all-encompassing, this section covered the basics of scraping text from HTML documents. Whether you want to scrape text from all common text-containing nodes such as <div>, <p>, <ul> and the like, or you want to scrape from a specific node using its ID, this section provides you with the basic fundamentals of using rvest to scrape the text you need. In the next section we move on to scraping data from HTML tables.

Exercises

  1. See if you can scrape chapter 1 of Philosopher’s Stone located here. To do so follow these steps:
    1. Import the HTML/XML content.
    2. The book text is contained in the <td> nodes so extract these nodes out of the HTML content.
    3. Extract the text out of the <td> nodes (the text of interest is in the 5th element so you can index for this element).
    4. Bonus: Minor cleaning can be done by removing the “\r\n” throughout the text (these are just carriage return and newline characters embedded within the text).
  2. Extract all the text from the main body content for the Wikipedia entry on the War in Afghanistan. Can you scrape just the references on this page?


Scraping HTML Table Data

Another common structure of information storage on the Web is in the form of HTML tables. This section reiterates some of the information from the previous section; however, we focus solely on scraping data from HTML tables. The simplest approach to scraping HTML table data directly into R is by using either the rvest package or the XML package. To illustrate, I will focus on the BLS employment statistics webpage which contains multiple HTML tables from which we can scrape data.

Scraping HTML tables with rvest

Recall that HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. HTML tables are contained within <table> tags; therefore, to extract the tables from the BLS employment statistics webpage we first use the html_nodes() function to select the <table> nodes. In this case we are interested in all table nodes that exist on the webpage. In this example, html_nodes() captures 15 HTML tables. This includes data from the 10 data tables seen on the webpage but also data from a few additional tables used to format parts of the page (e.g. table of contents, table of figures, advertisements).

library(rvest)

webpage <- read_html("http://www.bls.gov/web/empsit/cesbmart.htm")

tbls <- html_nodes(webpage, "table")

head(tbls)
## {xml_nodeset (6)}
## [1] <table id="main-content-table">&#13;\n\t<tr>&#13;\n\t\t<td id="secon ...
## [2] <table id="Table1" class="regular" cellspacing="0" cellpadding="0" x ...
## [3] <table id="Table2" class="regular" cellspacing="0" cellpadding="0" x ...
## [4] <table id="Table3" class="regular" cellspacing="0" cellpadding="0" x ...
## [5] <table id="Table4" class="regular" cellspacing="0" cellpadding="0" x ...
## [6] <table id="Exhibit1" class="regular" cellspacing="0" cellpadding="0" ...

Remember that html_nodes() does not parse the data; rather, it acts as a CSS selector. To parse the HTML table data we use html_table(), which would create a list containing 15 data frames. However, rarely do we need to scrape every HTML table from a page, especially since some HTML tables don’t contain any information we are likely interested in (e.g. table of contents, table of figures, footers).
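For completeness, a quick sketch of what parsing everything would look like (not shown in full here since we only need two of the tables):

# parse every <table> node into a list of 15 data frames
all_tbls <- webpage %>%
        html_nodes("table") %>%
        html_table(fill = TRUE)

length(all_tbls)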

More often than not we want to parse specific tables. Let’s assume we want to parse the second and third tables on the webpage:

  • Table 2. Nonfarm employment benchmarks by industry, March 2014 (in thousands) and
  • Table 3. Net birth/death estimates by industry supersector, April – December 2014 (in thousands)

This can be accomplished two ways. First, we can assess the previous tbls list and try to identify the table(s) of interest. In this example it appears that tbls list items 3 and 4 correspond with Table 2 and Table 3, respectively. We can then subset the list of table nodes prior to parsing the data with html_table(). This results in a list of two data frames containing the data of interest.

# subset list of table nodes for items 3 & 4
tbls_ls <- webpage %>%
        html_nodes("table") %>%
        .[3:4] %>%
        html_table(fill = TRUE)

str(tbls_ls)
## List of 2
##  $ :'data.frame':	147 obs. of  6 variables:
##   ..$ CES Industry Code : chr [1:147] "Amount" "00-000000" "05-000000" ...
##   ..$ CES Industry Title: chr [1:147] "Percent" "Total nonfarm" ...
##   ..$ Benchmark         : chr [1:147] NA "137,214" "114,989" "18,675" ...
##   ..$ Estimate          : chr [1:147] NA "137,147" "114,884" "18,558" ...
##   ..$ Differences       : num [1:147] NA 67 105 117 -50 -12 -16 -2.8 ...
##   ..$ NA                : chr [1:147] NA "(1)" "0.1" "0.6" ...
##  $ :'data.frame':	11 obs. of  12 variables:
##   ..$ CES Industry Code : chr [1:11] "10-000000" "20-000000" "30-000000" ...
##   ..$ CES Industry Title: chr [1:11] "Mining and logging" "Construction" ...
##   ..$ Apr               : int [1:11] 2 35 0 21 0 8 81 22 82 12 ...
##   ..$ May               : int [1:11] 2 37 6 24 5 8 22 13 81 6 ...
##   ..$ Jun               : int [1:11] 2 24 4 12 0 4 5 -14 86 6 ...
##   ..$ Jul               : int [1:11] 2 12 -3 7 -1 3 35 7 62 -2 ...
##   ..$ Aug               : int [1:11] 1 12 4 14 3 4 19 21 23 3 ...
##   ..$ Sep               : int [1:11] 1 7 1 9 -1 -1 -12 12 -33 -2 ...
##   ..$ Oct               : int [1:11] 1 12 3 28 6 16 76 35 -17 4 ...
##   ..$ Nov               : int [1:11] 1 -10 2 10 3 3 14 14 -22 1 ...
##   ..$ Dec               : int [1:11] 0 -21 0 4 0 10 -10 -3 4 1 ...
##   ..$ CumulativeTotal   : int [1:11] 12 108 17 129 15 55 230 107 266 29 ...

An alternative approach, which is more explicit, is to use the element selector process described in the previous section to call the table ID name.

# empty list to add table data to
tbls2_ls <- list()

# scrape Table 2. Nonfarm employment...
tbls2_ls$Table1 <- webpage %>%
        html_nodes("#Table2") %>% 
        html_table(fill = TRUE) %>%
        .[[1]]

# Table 3. Net birth/death...
tbls2_ls$Table2 <- webpage %>%
        html_nodes("#Table3") %>% 
        html_table() %>%
        .[[1]]

str(tbls2_ls)
## List of 2
##  $ Table1:'data.frame':	147 obs. of  6 variables:
##   ..$ CES Industry Code : chr [1:147] "Amount" "00-000000" "05-000000" ...
##   ..$ CES Industry Title: chr [1:147] "Percent" "Total nonfarm" ...
##   ..$ Benchmark         : chr [1:147] NA "137,214" "114,989" "18,675" ...
##   ..$ Estimate          : chr [1:147] NA "137,147" "114,884" "18,558" ...
##   ..$ Differences       : num [1:147] NA 67 105 117 -50 -12 -16 -2.8 ...
##   ..$ NA                : chr [1:147] NA "(1)" "0.1" "0.6" ...
##  $ Table2:'data.frame':	11 obs. of  12 variables:
##   ..$ CES Industry Code : chr [1:11] "10-000000" "20-000000" "30-000000" ...
##   ..$ CES Industry Title: chr [1:11] "Mining and logging" "Construction" ...
##   ..$ Apr               : int [1:11] 2 35 0 21 0 8 81 22 82 12 ...
##   ..$ May               : int [1:11] 2 37 6 24 5 8 22 13 81 6 ...
##   ..$ Jun               : int [1:11] 2 24 4 12 0 4 5 -14 86 6 ...
##   ..$ Jul               : int [1:11] 2 12 -3 7 -1 3 35 7 62 -2 ...
##   ..$ Aug               : int [1:11] 1 12 4 14 3 4 19 21 23 3 ...
##   ..$ Sep               : int [1:11] 1 7 1 9 -1 -1 -12 12 -33 -2 ...
##   ..$ Oct               : int [1:11] 1 12 3 28 6 16 76 35 -17 4 ...
##   ..$ Nov               : int [1:11] 1 -10 2 10 3 3 14 14 -22 1 ...
##   ..$ Dec               : int [1:11] 0 -21 0 4 0 10 -10 -3 4 1 ...
##   ..$ CumulativeTotal   : int [1:11] 12 108 17 129 15 55 230 107 266 29 ...

One issue to note is that when using rvest’s html_table() to read a table with split column headings, as in Table 2. Nonfarm employment…, the split headings are included and can cause the first row of the parsed data frame to contain parts of the headings. We can see this with Table 2, and it requires a little clean-up.


head(tbls2_ls[[1]], 4)
##   CES Industry Code CES Industry Title Benchmark Estimate Differences   NA
## 1            Amount            Percent      <NA>     <NA>          NA <NA>
## 2         00-000000      Total nonfarm   137,214  137,147          67  (1)
## 3         05-000000      Total private   114,989  114,884         105  0.1
## 4         06-000000    Goods-producing    18,675   18,558         117  0.6

# remove row 1 that includes part of the headings
tbls2_ls[[1]] <- tbls2_ls[[1]][-1,]

# rename table headings
colnames(tbls2_ls[[1]]) <- c("CES_Code", "Ind_Title", "Benchmark",
                            "Estimate", "Amt_Diff", "Pct_Diff")

head(tbls2_ls[[1]], 4)
##    CES_Code         Ind_Title Benchmark Estimate Amt_Diff Pct_Diff
## 2 00-000000     Total nonfarm   137,214  137,147       67      (1)
## 3 05-000000     Total private   114,989  114,884      105      0.1
## 4 06-000000   Goods-producing    18,675   18,558      117      0.6
## 5 07-000000 Service-providing   118,539  118,589      -50      (1)

Scraping HTML tables with XML

An alternative to rvest for table scraping is to use the XML package. The XML package provides a convenient readHTMLTable() function to extract data from HTML tables in HTML documents. By passing the URL to readHTMLTable(), the data in each table is read and stored as a data frame. In a situation like our running example where multiple tables exist, the data frames will be stored in a list similar to rvest’s html_table().

library(XML)

url <- "http://www.bls.gov/web/empsit/cesbmart.htm"

# read in HTML data
tbls_xml <- readHTMLTable(url)

typeof(tbls_xml)
## [1] "list"

length(tbls_xml)
## [1] 15

You can see that tbls_xml captures the same 15 <table> nodes that html_nodes() captured. To capture the same tables of interest we previously discussed (Table 2. Nonfarm employment… and Table 3. Net birth/death…) we can use a couple of approaches. First, we can assess str(tbls_xml) to identify the tables of interest and perform normal list subsetting. In our example, list items 3 and 4 correspond with our tables of interest.

head(tbls_xml[[3]])
##          V1                        V2      V3      V4  V5   V6
## 1 00-000000             Total nonfarm 137,214 137,147  67  (1)
## 2 05-000000             Total private 114,989 114,884 105  0.1
## 3 06-000000           Goods-producing  18,675  18,558 117  0.6
## 4 07-000000         Service-providing 118,539 118,589 -50  (1)
## 5 08-000000 Private service-providing  96,314  96,326 -12  (1)
## 6 10-000000        Mining and logging     868     884 -16 -1.8

head(tbls_xml[[4]], 3)
##   CES Industry Code CES Industry Title Apr May Jun Jul Aug Sep Oct Nov Dec
## 1         10-000000 Mining and logging   2   2   2   2   1   1   1   1   0
## 2         20-000000       Construction  35  37  24  12  12   7  12 -10 -21
## 3         30-000000      Manufacturing   0   6   4  -3   4   1   3   2   0
##   CumulativeTotal
## 1              12
## 2             108
## 3              17

Second, we can use the which argument in readHTMLTable(), which restricts the data import to only those tables specified numerically.

# only parse the 3rd and 4th tables
emp_ls <- readHTMLTable(url, which = c(3, 4))

str(emp_ls)
## List of 2
##  $ Table2:'data.frame':	145 obs. of  6 variables:
##   ..$ V1: Factor w/ 145 levels "00-000000","05-000000",..: 1 2 3 4 5 6 7 8 ...
##   ..$ V2: Factor w/ 143 levels "Accommodation",..: 130 131 52 116 102 74 ...
##   ..$ V3: Factor w/ 145 levels "1,010.3","1,048.3",..: 40 35 48 37 145 140 ...
##   ..$ V4: Factor w/ 145 levels "1,008.4","1,052.3",..: 41 34 48 36 144 142 ...
##   ..$ V5: Factor w/ 123 levels "-0.3","-0.4",..: 113 68 71 48 9 19 29 11 ...
##   ..$ V6: Factor w/ 56 levels "-0.1","-0.2",..: 30 31 36 30 30 16 28 14 29 ...
##  $ Table3:'data.frame':	11 obs. of  12 variables:
##   ..$ CES Industry Code : Factor w/ 11 levels "10-000000","20-000000",..:1 ...
##   ..$ CES Industry Title: Factor w/ 11 levels "263","Construction",..: 8 2 ...
##   ..$ Apr               : Factor w/ 10 levels "0","12","2","204",..: 3 7 1 ...
##   ..$ May               : Factor w/ 10 levels "129","13","2",..: 3 6 8 5 7 ...
##   ..$ Jun               : Factor w/ 10 levels "-14","0","12",..: 5 6 7 3 2 ...
##   ..$ Jul               : Factor w/ 10 levels "-1","-2","-3",..: 6 5 3 10 ...
##   ..$ Aug               : Factor w/ 9 levels "-19","1","12",..: 2 3 9 4 8 ...
##   ..$ Sep               : Factor w/ 9 levels "-1","-12","-2",..: 5 8 5 9 1 ...
##   ..$ Oct               : Factor w/ 10 levels "-17","1","12",..: 2 3 6 5 9 ...
##   ..$ Nov               : Factor w/ 8 levels "-10","-15","-22",..: 4 1 7 5 ...
##   ..$ Dec               : Factor w/ 8 levels "-10","-21","-3",..: 4 2 4 7  ...
##   ..$ CumulativeTotal   : Factor w/ 10 levels "107","108","12",..: 3 2 6 4 ...

The third option involves explicitly naming the tables to parse. This process uses the element selector approach described in the previous section to call the table by name. We use getNodeSet() to select the specified tables of interest. However, a key difference here is that rather than copying the table ID names, you want to copy the XPath. You can do this as follows: after you’ve highlighted the table element of interest with the element selector, right-click the highlighted element in the developer tools window and select Copy XPath. From here we just use readHTMLTable() to convert to data frames and we have our desired tables.

library(RCurl)

# parse url
url_parsed <- htmlParse(getURL(url), asText = TRUE)

# select table nodes of interest
tableNodes <- getNodeSet(url_parsed, c('//*[@id="Table2"]', '//*[@id="Table3"]'))

# convert HTML tables to data frames
bls_table2 <- readHTMLTable(tableNodes[[1]])
bls_table3 <- readHTMLTable(tableNodes[[2]])

head(bls_table2)
##          V1                        V2      V3      V4  V5   V6
## 1 00-000000             Total nonfarm 137,214 137,147  67  (1)
## 2 05-000000             Total private 114,989 114,884 105  0.1
## 3 06-000000           Goods-producing  18,675  18,558 117  0.6
## 4 07-000000         Service-providing 118,539 118,589 -50  (1)
## 5 08-000000 Private service-providing  96,314  96,326 -12  (1)
## 6 10-000000        Mining and logging     868     884 -16 -1.8

head(bls_table3, 3)
##   CES Industry Code CES Industry Title Apr May Jun Jul Aug Sep Oct Nov Dec
## 1         10-000000 Mining and logging   2   2   2   2   1   1   1   1   0
## 2         20-000000       Construction  35  37  24  12  12   7  12 -10 -21
## 3         30-000000      Manufacturing   0   6   4  -3   4   1   3   2   0
##   CumulativeTotal
## 1              12
## 2             108
## 3              17

A few benefits of XML’s readHTMLTable that are routinely handy include:

  • We can specify names for the column headings
  • We can specify the classes for each column
  • We can specify rows to skip

For instance, if you look at bls_table2 above, notice that because of the split column headings on Table 2. Nonfarm employment…, readHTMLTable() stripped and replaced the headings with generic names because R does not know which variable names should align with each column. We can correct for this with the following:

bls_table2 <- readHTMLTable(tableNodes[[1]], 
                            header = c("CES_Code", "Ind_Title", "Benchmark",
                            "Estimate", "Amt_Diff", "Pct_Diff"))

head(bls_table2)
##    CES_Code                 Ind_Title Benchmark Estimate Amt_Diff Pct_Diff
## 1 00-000000             Total nonfarm   137,214  137,147       67      (1)
## 2 05-000000             Total private   114,989  114,884      105      0.1
## 3 06-000000           Goods-producing    18,675   18,558      117      0.6
## 4 07-000000         Service-providing   118,539  118,589      -50      (1)
## 5 08-000000 Private service-providing    96,314   96,326      -12      (1)
## 6 10-000000        Mining and logging       868      884      -16     -1.8

Also, for bls_table3 note that the net birth/death values parsed have been converted to factor levels. We can use the colClasses argument to correct this.

str(bls_table3)
## 'data.frame':	11 obs. of  12 variables:
##  $ CES Industry Code : Factor w/ 11 levels "10-000000","20-000000",..: 1 2 ...
##  $ CES Industry Title: Factor w/ 11 levels "263","Construction",..: 8 2 7 ...
##  $ Apr               : Factor w/ 10 levels "0","12","2","204",..: 3 7 1 5 ...
##  $ May               : Factor w/ 10 levels "129","13","2",..: 3 6 8 5 7 9 ...
##  $ Jun               : Factor w/ 10 levels "-14","0","12",..: 5 6 7 3 2 7 ...
##  $ Jul               : Factor w/ 10 levels "-1","-2","-3",..: 6 5 3 10 1 7 ...
##  $ Aug               : Factor w/ 9 levels "-19","1","12",..: 2 3 9 4 8 9 5 ...
##  $ Sep               : Factor w/ 9 levels "-1","-12","-2",..: 5 8 5 9 1 1 ...
##  $ Oct               : Factor w/ 10 levels "-17","1","12",..: 2 3 6 5 9 4 ...
##  $ Nov               : Factor w/ 8 levels "-10","-15","-22",..: 4 1 7 5 8 ...
##  $ Dec               : Factor w/ 8 levels "-10","-21","-3",..: 4 2 4 7 4 6 ...
##  $ CumulativeTotal   : Factor w/ 10 levels "107","108","12",..: 3 2 6 4 5 ...

bls_table3 <- readHTMLTable(tableNodes[[2]], 
                            colClasses = c("character","character", 
                                           rep("integer", 10)))

str(bls_table3)
## 'data.frame':	11 obs. of  12 variables:
##  $ CES Industry Code : Factor w/ 11 levels "10-000000","20-000000",..: 1 2 ...
##  $ CES Industry Title: Factor w/ 11 levels "263","Construction",..: 8 2 7 ...
##  $ Apr               : int  2 35 0 21 0 8 81 22 82 12 ...
##  $ May               : int  2 37 6 24 5 8 22 13 81 6 ...
##  $ Jun               : int  2 24 4 12 0 4 5 -14 86 6 ...
##  $ Jul               : int  2 12 -3 7 -1 3 35 7 62 -2 ...
##  $ Aug               : int  1 12 4 14 3 4 19 21 23 3 ...
##  $ Sep               : int  1 7 1 9 -1 -1 -12 12 -33 -2 ...
##  $ Oct               : int  1 12 3 28 6 16 76 35 -17 4 ...
##  $ Nov               : int  1 -10 2 10 3 3 14 14 -22 1 ...
##  $ Dec               : int  0 -21 0 4 0 10 -10 -3 4 1 ...
##  $ CumulativeTotal   : int  12 108 17 129 15 55 230 107 266 29 ...
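The third convenience, skipping rows, works in a similar way. A brief sketch (assuming, purely for illustration, that we wanted to drop the first row of Table 2 while parsing):

# skip the first table row while parsing and supply clean headings
bls_table2_skip <- readHTMLTable(tableNodes[[1]], skip.rows = 1,
                                 header = c("CES_Code", "Ind_Title", "Benchmark",
                                            "Estimate", "Amt_Diff", "Pct_Diff"))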

Between rvest and XML, scraping HTML tables is relatively easy once you get fluent with the syntax and the available options. This section covers just the basics of both these packages to get you moving forward with scraping tables. In the next section we move on to working with application programming interfaces (APIs) to get data from the web.

Exercises

  1. Scrape the 2016 GDP nominal rankings table provided by Statistics Times.
  2. Scrape any of the tables here that summarize world economic measures.


Leveraging APIs to Scrape Data

An application programming interface (API) is, in a nutshell, a method of communication between software programs. APIs allow programs to interact with and use each other’s functions by acting as a middleman. Why is this useful? Let’s say you want to pull weather data from the NOAA. You have a few options:

  • You could query the data and download the spreadsheet or manually cut-and-paste the desired data and then import it into R. Doesn’t get you any coolness points.
  • You could use some of the web scraping techniques previously covered to parse the desired data. Golf clap. The downfall of this strategy is that if NOAA changes their website structure down the road your code will need to be adjusted.
  • Or, you can use the rnoaa package, which allows you to send specific instructions to the NOAA API via R; the API will then perform the requested action and return the desired information. The benefit of this strategy is that if the NOAA changes its website structure it won’t impact the API data retrieval structure, which means no impact to your code. Standing ovation!

Consequently, APIs provide consistency in data retrieval processes, which can be essential for recurring analyses. Luckily, the use of APIs by organizations that collect data is growing exponentially. This is great for you and me as more and more data continues to be at our fingertips. So what do you need to get started?

Prerequisites?

Each API is unique; however, there are a few fundamental pieces of information you’ll need to work with an API. First, the reason you’re using an API is to request specific types of data from a specific data set from a specific organization. You at least need to know a little something about each one of these:

  1. The URL for the organization and data you are pulling. Most pre-built API packages already have this connection established but when using httr you’ll need to specify.
  2. The data set you are trying to pull from. Most organizations have numerous data sets to peruse so you need to make yourself familiar with the names of the available data sets.
  3. The data content. You’ll need to specify the specific data variables you want the API to retrieve so you’ll need to be familiar with, or have access to, the data library.

In addition to these key components you will also, typically, need to provide a form of identification and/or authorization. This is done via:

  1. API key (aka token). A key is used to identify the user and to track and control how the API is being used (to guard against malicious use). A key is often obtained by supplying basic information (e.g. name, email) to the organization, and in return they give you a multi-digit key.
  2. OAuth. OAuth is an authorization framework that provides credentials as proof for access to certain information. Multiple forms of credentials exist and OAuth can actually be a fairly confusing topic; however, the httr package has simplified this greatly which we demonstrate at the end of this section.

Rather than dwell on these components, they’ll likely become clearer as we progress through examples. So, let’s move on to the fun stuff.

Existing API Packages

Like everything else you do in R, when looking to work with an API your first question should be “Is there a package for that?” R has an extensive list of packages in which API data feeds have been hooked into R. You can find a slew of them scattered throughout the CRAN Task View: Web Technologies and Services web page, on the rOpenSci web page, and some more here.

To give you a taste for how these packages typically work, I’ll quickly cover three packages:

  • blsAPI for pulling U.S. Bureau of Labor Statistics data
  • rnoaa for pulling NOAA climate data
  • rtimes for pulling data from multiple APIs offered by the New York Times

blsAPI

The blsAPI package allows users to request data for one or multiple series through the U.S. Bureau of Labor Statistics API. To use blsAPI you only need knowledge of the data; no key or OAuth is required. I illustrate by pulling Mass Layoff Statistics data, but you will find all the available data sets and their series code information here.

The key information you will be concerned with is contained in the series identifier. For the Mass Layoff data the series ID code is MLUMS00NN0001003. Each component of this series code has meaning and can be adjusted to get specific Mass Layoff data. The BLS provides this breakdown of what each component means along with the available list of codes for this data set. For instance, the S00 (MLUMS00NN0001003) component represents the division/state. S00 will pull for all states but I could change it to D30 to pull data for the Midwest or S39 to pull for Ohio. The N0001 (MLUMS00NN0001003) component represents the industry/demographics. N0001 pulls data for all industries but I could change it to N0008 to pull data for the food industry or C00A2 for all persons age 30-44.

I simply pass the series identifier to the blsAPI() function, which returns the data as a JSON object. We can then use the fromJSON() function from the rjson package to convert it to an R data object (a list in this case). You can see that the raw data pull provides a list of 4 items. The first three provide some metadata info (status, response time, and message if applicable). The data we are concerned about is in the 4th (Results$series$data) list item, which contains 31 observations.

library(rjson)
library(blsAPI)

# supply series identifier to pull data (initial pull is in JSON data)
layoffs_json <- blsAPI('MLUMS00NN0001003') 

# convert from JSON into R object
layoffs <- fromJSON(layoffs_json)

# inspect the structure of the returned list (output truncated)
str(layoffs)
List of 4
 $ status      : chr "REQUEST_SUCCEEDED"
 $ responseTime: num 38
 $ message     : list()
 $ Results     :List of 1
  ..$ series:List of 1
  .. ..$ :List of 2
  .. .. ..$ seriesID: chr "MLUMS00NN0001003"
  .. .. ..$ data    :List of 31
  .. .. .. ..$ :List of 5
  .. .. .. .. ..$ year      : chr "2013"
  .. .. .. .. ..$ period    : chr "M05"
  .. .. .. .. ..$ periodName: chr "May"
  .. .. .. .. ..$ value     : chr "1383"

One of the inconveniences of an API is that we do not get to specify how the data we receive is formatted. This is a minor price to pay considering all the other benefits APIs provide. Once we understand the format of the received data, we can typically re-format it using a little list subsetting and for looping.

# create empty data frame to fill  
layoff_df <- data.frame(NULL)

# extract data of interest from each nested year-month list  
for(i in seq_along(layoffs$Results$series[[1]]$data)) {
        df <- data.frame(layoffs$Results$series[[1]]$data[i][[1]][1:4])
        layoff_df <- rbind(layoff_df, df)
}

head(layoff_df)
##   year period periodName value
## 1 2013    M05        May  1383
## 2 2013    M04      April  1174
## 3 2013    M03      March  1132
## 4 2013    M02   February   960
## 5 2013    M01    January  1528
## 6 2012    M13     Annual 17080
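
As an aside, the same re-formatting can be done without growing the data frame inside a loop. A minimal, equivalent sketch (layoff_df2 is just an illustrative name):

# bind all year-month lists at once instead of rbind-ing inside a for loop
layoff_df2 <- do.call(rbind, lapply(layoffs$Results$series[[1]]$data, function(x) {
        data.frame(x[1:4], stringsAsFactors = FALSE)
}))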

blsAPI also allows you to pull multiple data series and has optional arguments (i.e. start year, end year, etc.). You can see other options at help(package = blsAPI).
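
For example, a request for multiple series is made by passing a list payload rather than a single series ID string. The sketch below reuses the Mass Layoff series along with its Ohio variant (S00 swapped for S39, as described earlier) and adds the optional year bounds; verify the exact payload elements against help(package = blsAPI) before relying on them:

# pull two series at once, restricted to 2010-2013
payload <- list('seriesid'  = c('MLUMS00NN0001003', 'MLUMS39NN0001003'),
                'startyear' = '2010',
                'endyear'   = '2013')

layoffs_multi <- fromJSON(blsAPI(payload))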

rnoaa

The rnoaa package allows users to request climate data from multiple data sets through the National Climatic Data Center API. Unlike blsAPI, rnoaa requires you to have an API key. To request a key go here and provide your email; a key will immediately be emailed to you.

key <- "vXTdwNoAVx..." # truncated 

With the key in hand, we can begin pulling data. The NOAA provides a comprehensive metadata library to familiarize yourself with the data available. Let’s start by pulling all the available NOAA climate stations near my residence. I live in Montgomery County, Ohio, so we can find all the stations in this county by inserting the FIPS code. Furthermore, I’m interested in stations that provide data for the GHCND data set, which contains records on numerous daily variables such as “maximum and minimum temperature, total daily precipitation, snowfall, and snow depth; however, about two thirds of the stations report precipitation only.” See ?ncdc_stations for other data sets available via rnoaa.

library(rnoaa)

stations <- ncdc_stations(datasetid='GHCND', 
              locationid='FIPS:39113',
              token = key)

stations$data
## Source: local data frame [23 x 9]
## 
##    elevation    mindate    maxdate latitude
##        (dbl)      (chr)      (chr)    (dbl)
## 1      294.1 2009-02-09 2014-06-25  39.6314
## 2      251.8 2009-03-01 2016-01-16  39.6807
## 3      295.7 2009-03-25 2012-09-08  39.6252
## 4      298.1 2009-08-24 2012-07-20  39.8070
## 5      304.5 2010-04-02 2016-01-12  39.6949
## 6      283.5 2012-07-01 2016-01-16  39.7373
## 7      301.4 2012-07-29 2016-01-16  39.8795
## 8      317.3 2012-09-08 2016-01-12  39.8329
## 9      298.1 2012-09-07 2016-01-15  39.6247
## 10     250.5 2012-09-11 2016-01-08  39.7180
## ..       ...        ...        ...      ...
## Variables not shown: name (chr), datacoverage (dbl), id (chr),
##   elevationUnit (chr), longitude (dbl)

So we see that several stations are available from which to pull data. To actually pull data from one of these stations we need the station ID. The station I want to pull data from is the Dayton International Airport station. We can see that this station provides data from 1948 to present, and I can get the station ID as illustrated below. Note that I use some dplyr for data manipulation here, which you will learn about in a later section, but this just illustrates that we received the data via the API.

library(dplyr)

stations$data %>% 
        filter(name == "DAYTON INTERNATIONAL AIRPORT, OH US") %>% 
        select(mindate, maxdate, id)
## Source: local data frame [1 x 3]
## 
##      mindate    maxdate                id
##        (chr)      (chr)             (chr)
## 1 1948-01-01 2016-01-15 GHCND:USW00093815

To pull all available GHCND data from this station we’ll use ncdc(). We simply supply the data set to pull, the start and end dates (ncdc restricts each request to a one-year range), the station ID, and your key. We can see that this station provides a full range of data types.

climate <- ncdc(datasetid='GHCND', 
            startdate = '2015-01-01', 
            enddate = '2016-01-01', 
            stationid='GHCND:USW00093815',
            token = key)

climate$data
## Source: local data frame [25 x 8]
## 
##                   date datatype           station value  fl_m  fl_q
##                  (chr)    (chr)             (chr) (int) (chr) (chr)
## 1  2015-01-01T00:00:00     AWND GHCND:USW00093815    72            
## 2  2015-01-01T00:00:00     PRCP GHCND:USW00093815     0            
## 3  2015-01-01T00:00:00     SNOW GHCND:USW00093815     0            
## 4  2015-01-01T00:00:00     SNWD GHCND:USW00093815     0            
## 5  2015-01-01T00:00:00     TAVG GHCND:USW00093815   -38     H      
## 6  2015-01-01T00:00:00     TMAX GHCND:USW00093815    28            
## 7  2015-01-01T00:00:00     TMIN GHCND:USW00093815   -71            
## 8  2015-01-01T00:00:00     WDF2 GHCND:USW00093815   240            
## 9  2015-01-01T00:00:00     WDF5 GHCND:USW00093815   240            
## 10 2015-01-01T00:00:00     WSF2 GHCND:USW00093815   130            
## ..                 ...      ...               ...   ...   ...   ...
## Variables not shown: fl_so (chr), fl_t (chr)

At the time of this post we had recently had some snow here, so let’s pull data on snowfall for 2015. We adjust the limit argument (by default ncdc limits results to 25) and identify the data type we want. By sorting we see which days experienced the greatest snowfall (don’t worry, the results are reported in mm!).

snow <- ncdc(datasetid='GHCND', 
            startdate = '2015-01-01', 
            enddate = '2015-12-31', 
            limit = 365,
            stationid='GHCND:USW00093815',
            datatypeid = 'SNOW',
            token = key)

snow$data %>% 
        arrange(desc(value))
## Source: local data frame [365 x 8]
## 
##                   date datatype           station value  fl_m  fl_q
##                  (chr)    (chr)             (chr) (int) (chr) (chr)
## 1  2015-03-01T00:00:00     SNOW GHCND:USW00093815   114            
## 2  2015-02-21T00:00:00     SNOW GHCND:USW00093815   109            
## 3  2015-01-25T00:00:00     SNOW GHCND:USW00093815    71            
## 4  2015-01-06T00:00:00     SNOW GHCND:USW00093815    66            
## 5  2015-02-16T00:00:00     SNOW GHCND:USW00093815    30            
## 6  2015-02-18T00:00:00     SNOW GHCND:USW00093815    25            
## 7  2015-02-14T00:00:00     SNOW GHCND:USW00093815    23            
## 8  2015-01-26T00:00:00     SNOW GHCND:USW00093815    20            
## 9  2015-02-04T00:00:00     SNOW GHCND:USW00093815    20            
## 10 2015-02-12T00:00:00     SNOW GHCND:USW00093815    20            
## ..                 ...      ...               ...   ...   ...   ...
## Variables not shown: fl_so (chr), fl_t (chr)

This is just an intro to rnoaa, as the package offers a slew of data sets to pull from and functions to apply. It even offers built-in plotting functions. Use help(package = "rnoaa") to see all that rnoaa has to offer.
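
For instance, a quick time-series plot of the snowfall data pulled above can be made with the package’s built-in plotting helper; a minimal sketch (see ?ncdc_plot for the available arguments):

# plot the daily snowfall values pulled above
ncdc_plot(snow, breaks = "1 month", dateformat = "%b")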

rtimes

The rtimes package provides an interface to Congress, Campaign Finance, Article Search, and Geographic APIs offered by the New York Times. The data libraries and documentation for the several APIs available can be found here. To use the Times’ API you’ll need to get an API key here.

article_key <- "4f23572d8..."     # truncated
cfinance_key <- "ee0b7cef..."     # truncated
congress_key <- "57b3e8a3..."     # truncated

Let’s start by searching NY Times articles. With the presidential elections upon us, we can illustrate by searching for the least controversial candidate…Donald Trump. We can see that there are 4,565 article hits for the term “Trump”. We can get more information on a particular article by subsetting.

library(rtimes)

# article search for the term 'Trump'
articles <- as_search(q = "Trump", 
                 begin_date = "20150101", 
                 end_date = '20160101',
                 key = article_key)

# summary
articles$meta
##   hits time offset
## 1 4565   28      0

# pull info on 3rd article
articles$data[3]
## [[1]]
## <NYTimes article>Donald Trump’s Strongest Supporters: A Certain Kind of Democrat
##   Type: News
##   Published: 2015-12-31T00:00:00Z
##   Word count: 1469
##   URL: http://www.nytimes.com/2015/12/31/upshot/donald-trumps-strongest-supporters-a-certain-kind-of-democrat.html
##   Snippet: In a survey, he also excels among low-turnout voters and among the less affluent and the less educated, so the question is: Will they show up to vote?

We can use the campaign finance API and functions to gain some insight into Trump’s campaign income and expenditures. The only special data you need is the FEC ID for the candidate of interest.

trump <- cf_candidate_details(campaign_cycle = 2016, 
                     fec_id = 'P80001571',
                     key = cfinance_key)

# pull summary data
trump$meta
##          id            name party
## 1 P80001571 TRUMP, DONALD J   REP
##                                             fec_uri
## 1 http://docquery.fec.gov/cgi-bin/fecimg/?P80001571
##                    committee  mailing_address mailing_city
## 1 /committees/C00580100.json 725 FIFTH AVENUE     NEW YORK
##   mailing_state mailing_zip status total_receipts
## 1            NY       10022      O     1902410.45
##   total_from_individuals total_from_pacs total_contributions
## 1               92249.33               0            96298.97
##   candidate_loans total_disbursements begin_cash  end_cash
## 1      1804747.23          1414674.29          0 487736.16
##   total_refunds debts_owed date_coverage_from date_coverage_to
## 1             0 1804747.23         2015-04-02       2015-06-30
##   independent_expenditures coordinated_expenditures
## 1                1644396.8                        0

rtimes also allows us to gain some insight into what our locally elected officials are up to with the Congress API. First, I can get some information on my Senator and then use that information to see if he’s supporting my interests. For instance, I can pull the most recent bills that he is co-sponsoring.

# pull info on OH senator
senator <- cg_memberbystatedistrict(chamber = "senate", 
                                    state = "OH", 
                                    key = congress_key)
senator$meta
##        id           name               role gender party
## 1 B000944 Sherrod  Brown Senator, 1st Class      M     D
##   times_topics_url      twitter_id       youtube_id seniority
## 1                  SenSherrodBrown SherrodBrownOhio         9
##   next_election
## 1          2018
##                                                                               api_url
## 1 http://api.nytimes.com/svc/politics/v3/us/legislative/congress/members/B000944.json

# use member ID to pull recent bill sponsorship
bills <- cg_billscosponsor(memberid = "B000944", 
                           type = "cosponsored", 
                           key = congress_key)
head(bills$data)
## Source: local data frame [6 x 11]
## 
##   congress    number
##      (chr)     (chr)
## 1      114    S.2098
## 2      114    S.2096
## 3      114    S.2100
## 4      114    S.2090
## 5      114 S.RES.267
## 6      114 S.RES.269
## Variables not shown: bill_uri (chr), title (chr), cosponsored_date
##   (chr), sponsor_id (chr), introduced_date (chr), cosponsors (chr),
##   committees (chr), latest_major_action_date (chr),
##   latest_major_action (chr)

It looks like the most recent bill Sherrod is co-sponsoring is S.2098 - Student Right to Know Before You Go Act. Maybe I’ll do a NY Times article search with as_search() to find out more about this bill…an exercise for another time.
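
If you did want to follow up, such a search would look much like the earlier article search; a minimal sketch (the query string is purely illustrative):

# search NY Times articles mentioning the bill Sherrod is co-sponsoring
bill_articles <- as_search(q = "Student Right to Know Before You Go Act",
                           begin_date = "20150101",
                           end_date = "20160101",
                           key = article_key)

bill_articles$meta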

So this gives you some flavor of how these API packages work. You typically need to know the data sets and variables you are requesting, along with an API key. But once you have these basics, requesting the data is pretty straightforward. Your next question may be: what if the API I want to get data from does not yet have an R package developed for it?

httr for Everything Else

Although numerous R API packages are available, and cover a wide range of data, you may eventually run into a situation where you want to leverage an organization’s API but an R package does not exist. Enter httr. httr was developed by Hadley Wickham to easily work with web APIs. It offers multiple functions (i.e. HEAD(), POST(), PATCH(), PUT() and DELETE()); however, the function we are most concerned with today is GET(). We use the GET() function to access an API, provide it some request parameters, and receive an output.
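
Before diving into a full example, it helps to see the basic pattern: GET() a URL, check that the request succeeded, and parse the response. A minimal sketch against a public, key-free endpoint (the URL is purely illustrative):

library(httr)

# send a simple GET request
resp <- GET("https://api.github.com")

# 200 indicates success; http_error() returns TRUE if the request failed
status_code(resp)
http_error(resp)

# parse the response body into an R object (a list for JSON responses)
gh <- content(resp)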

To give you a taste for how the httr package works, I’ll quickly cover how to use it for a basic key-only API and an OAuth-required API:

Key-only API

To demonstrate how to use the httr package for accessing a key-only API, I’ll illustrate with the College Scorecard API provided by the Department of Education. First, you’ll need to request your API key.

edu_key <- "fd783wmS3Z..."     # truncated

We can now proceed to use httr to request data from the API with the GET() function. I went to North Dakota State University (NDSU) for my undergrad so I’m interested in pulling some data for this school. I can use the provided data library and query explanation to determine the parameters required. In this example, the URL includes the primary path (“https://api.data.gov/ed/collegescorecard/”), the API version (“v1”), and the endpoint (“schools”). The question mark (“?”) at the end of the URL is included to begin the list of query parameters, which only includes my API key and the school of interest.

library(httr)

URL <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# import all available data for NDSU
ndsu_req <- GET(URL, query = list(api_key = edu_key,
                                  school.name = "North Dakota State University"))

This request provides me with every piece of information collected by the U.S. Department of Education for NDSU. To retrieve the contents of this request I use the content() function which will output the data as an R object (a list in this case). The data is segmented into two main components: metadata and results. I’m primarily interested in the results.

The results branch of this list provides information on lat-long location, school identifier codes, some basic info on the school (city, number of branches, school website, accreditor, etc.), and then student data for the years 1996-2013.

ndsu_data <- content(ndsu_req)

names(ndsu_data)
## [1] "metadata" "results"

names(ndsu_data$results[[1]])
##  [1] "2008"     "2009"     "2006"     "ope6_id"  "2007"     "2004"    
##  [7] "2013"     "2005"     "location" "2002"     "2003"     "id"      
## [13] "1996"     "1997"     "school"   "1998"     "2012"     "2011"    
## [19] "2010"     "ope8_id"  "1999"     "2001"     "2000"

To see what kind of student data categories are offered, we can examine a single year. You can see that available data includes earnings, academics, student info/demographics, admissions, costs, etc. With such a large data set, which includes many embedded lists, sometimes the easiest way to learn the data structure is to peruse names at different levels.

# student data categories available by year
names(ndsu_data$results[[1]]$`2013`)
## [1] "earnings"   "academics"  "student"    "admissions" "repayment" 
## [6] "aid"        "cost"       "completion"

# cost categories available by year
names(ndsu_data$results[[1]]$`2013`$cost)
## [1] "title_iv"      "avg_net_price" "attendance"    "tuition"      
## [5] "net_price"

# Avg net price cost categories available by year
names(ndsu_data$results[[1]]$`2013`$cost$avg_net_price)
## [1] "other_academic_year" "overall"             "program_year"       
## [4] "public"              "private"

So if I’m interested in comparing the rise in cost versus the rise in student debt I can simply subset for this data once I’ve identified its location and naming structure. Note that for this subsetting we use the magrittr package and the sapply function; both are covered in more detail in their relevant sections. This is just meant to illustrate the types of data available through this API.

library(magrittr)

# subset list for annual student data only
ndsu_yr <- ndsu_data$results[[1]][c(as.character(1996:2013))]

# extract median debt data for each year
ndsu_yr %>%
        sapply(function(x) x$aid$median_debt$completers$overall) %>% 
        unlist()
##    1997    1998    1999    2000    2001    2002    2003    2004 
## 13388.0 13856.0 14500.0 15125.0 15507.0 15639.0 16251.0 16642.5 
##    2005    2006    2007    2008    2009    2010    2011    2012 
## 17125.0 17125.0 17125.0 17250.0 19125.0 21500.0 23000.0 24954.5 
##    2013 
## 25050.0

# extract net price for each year
ndsu_yr %>% 
        sapply(function(x) x$cost$avg_net_price$overall) %>% 
        unlist()
##  2009  2010  2011  2012  2013 
## 13474 12989 13808 15113 14404

Quite simple, isn’t it…at least once you’ve learned how the query requests are formatted for a particular API.

OAuth-required API

At the outset I mentioned how OAuth is an authorization framework that provides credentials as proof of access. Many APIs are open to the public and only require an API key; however, some APIs require authorization to access account data (think personal Facebook & Twitter accounts). To access these accounts we must provide proper credentials, and OAuth authentication allows us to do this. This section is not meant to explain the details of OAuth (for that see this, this, and this) but, rather, how to use httr in times when OAuth is required.

I’ll demonstrate by accessing the Twitter API using my Twitter account. The first thing we need to do is identify the OAuth endpoints used to request access and authorization. To do this we can use oauth_endpoint(), which typically requires a request URL, authorization URL, and access URL. httr also includes some baked-in endpoints for LinkedIn, Twitter, Vimeo, Google, Facebook, and GitHub. We can see the Twitter endpoints using the following:

twitter_endpts <- oauth_endpoints("twitter")
twitter_endpts
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token

Next, I register my application at https://apps.twitter.com/. One thing to note: during the registration process it will ask you for the callback URL; be sure to use “http://127.0.0.1:1410”. Once registered, Twitter will provide you with keys and access tokens. The two we are concerned about are the API key and API Secret.

twitter_key <- "BZgukbCol..."   # truncated
twitter_secret <- "YpB8Xy..."   # truncated

We can then bundle the consumer key and secret into one object with oauth_app(). The first argument, appname, is simply used as a local identifier; it does not need to match the name you gave the Twitter app you developed at https://apps.twitter.com/.
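
A minimal sketch of that bundling step (the appname below is just a placeholder):

# bundle the API key and secret into an application object
twitter_app <- oauth_app(appname = "twitter_example",
                         key = twitter_key,
                         secret = twitter_secret)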

We are now ready to ask for access credentials. Since Twitter uses OAuth 1.0, we use the oauth1.0_token() function and incorporate the endpoints identified along with the oauth_app object we previously named twitter_app.

twitter_token <- oauth1.0_token(endpoint = twitter_endpts, twitter_app)

Waiting for authentication in browser...
Press Esc/Ctrl + C to abort
Authentication complete.

Once authentication is complete, we can use the API. I can pull all the tweets that show up on my personal timeline using the GET() function and the access credentials I stored in twitter_token. I then use content() to convert the response to a list and I can start to analyze the data.

In this case each tweet is saved as an individual list item and a full range of data are provided for each tweet (i.e. id, text, user, geo location, favorite count, etc). For instance, we can see that the first tweet was by FiveThirtyEight concerning American politics and, at the time of this analysis, has been favorited by 3 people.

# request Twitter data
req <- GET("https://api.twitter.com/1.1/statuses/home_timeline.json",
           config(token = twitter_token))

# convert to R object
tweets <- content(req)

# available data for first tweet on my timeline
names(tweets[[1]])
 [1] "created_at"                    "id"                           
 [3] "id_str"                        "text"                         
 [5] "source"                        "truncated"                    
 [7] "in_reply_to_status_id"         "in_reply_to_status_id_str"    
 [9] "in_reply_to_user_id"           "in_reply_to_user_id_str"      
[11] "in_reply_to_screen_name"       "user"                         
[13] "geo"                           "coordinates"                  
[15] "place"                         "contributors"                 
[17] "is_quote_status"               "retweet_count"                
[19] "favorite_count"                "entities"                     
[21] "extended_entities"             "favorited"                    
[23] "retweeted"                     "possibly_sensitive"           
[25] "possibly_sensitive_appealable" "lang" 

# further analysis of first tweet on my timeline
tweets[[1]]$user$name
[1] "FiveThirtyEight"

tweets[[1]]$text
[1] "\U0001f3a7 A History Of Data In American Politics (Part 1): William Jennings Bryan to Barack Obama https://t.co/oCKzrXuRHf  https://t.co/6CvKKToxoH"

tweets[[1]]$favorite_count
[1] 3

This provides a fairly simple example of incorporating OAuth authorization. The httr package provides several examples of accessing common social network APIs that require OAuth. I recommend you go through several of these examples to get familiar with using OAuth authorization; see them at demo(package = "httr"). The most difficult aspect of creating your own connections with APIs is gaining an understanding of the API and the arguments it leverages. This obviously requires time and energy devoted to digging into the API documentation and data library. Next, it’s just a matter of trial and error (likely more the latter than the former) to learn how to translate these arguments into httr function calls to pull the data of interest.

Also, note that httr provides several other useful functions not covered here for communicating with APIs (i.e. POST(), BROWSE()). For more on these other httr capabilities see this quickstart vignette.
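
As a quick illustration, POST() follows the same pattern as GET() but sends a request body. The sketch below posts a small JSON payload to the httpbin.org test service, used here purely for demonstration:

# send a JSON body to a test endpoint and inspect the echoed response
resp <- POST("http://httpbin.org/post",
             body = list(example = "value"),
             encode = "json")

content(resp)$json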

Exercises

  1. Go to opendatanetwork.com, search for open source data sets in your metropolitan area.
  2. Now use the RSocrata package to connect with one of these data set APIs and import the data.
  3. Use the rscorecard package to download Department of Education College Scorecard data for your undergrad school.

Footnotes

  1. In Automated Data Collection with R, Munzert et al. state that “[t]he first way to get data from the web is almost too banal to be considered here and actually not a case of web scraping in the narrower sense.” 

  2. You can learn more about selectors at flukeout.github.io 

  3. You can simply assess the name of the ID in the highlighted element or you can right click the highlighted element in the developer tools window and select Copy selector. You can then paste directly into html_nodes() as it will paste the exact ID name that you need for that element.