Key Things to Know

What to Remember from this Section



  1. R provides many built-in data sets
  2. Text files use delimiters, and the delimiter determines which function reads in the data
  3. Base R has no function for reading Excel data; you must use a package
  4. Web scraping is a dense subject, but reading online text/Excel files can be as easy as reading the same files from your hard drive

Built-in Data

Built-in Data Sets

  • R has many built-in data sets
  • type data() into your console (roughly 100 data sets should appear; the exact count depends on your R version and loaded packages)
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
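
Printing a whole data set can flood the console. As a minimal sketch, the base R functions below give a quicker look at the same mtcars data; the choice of three rows is arbitrary.

```r
# mtcars ships with R, so it can be used without loading anything
mtcars_dim <- dim(mtcars)            # number of rows and columns
mtcars_top <- head(mtcars, n = 3)    # first three rows only

print(mtcars_dim)
print(mtcars_top)
str(mtcars)                          # compact summary of each column's type
```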

Built-in Data Sets

  • R also has many convenient built-in character vectors and factors to be aware of
  • Try these:
letters
LETTERS
month.abb
month.name
state.abb
state.division
state.name
state.region
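
These objects behave like any other vector, so they can be indexed and summarized. A small sketch, with the particular indices chosen only for illustration:

```r
first_letters <- letters[1:3]           # first three lowercase letters
edge_months   <- month.abb[c(1, 12)]    # abbreviations for the first and last month
n_states      <- length(state.name)     # number of state names stored

print(first_letters)
print(edge_months)
print(n_states)

# state.region is a factor, so table() counts the states in each region
print(table(state.region))
```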


Note

  • to load a built-in data set into your global environment: data("name of data set"), e.g. data("mtcars") (constants such as letters are always available and need no loading)
  • type ?name to get more information about the built-in data

Your Turn



1. Load the iris data set


2. What is this data measuring?

Solution



1. Load the iris data set

data(iris)


2. What is this data measuring?

?iris

Description from the help screen: "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."

Importing Text Files

.csv .txt .tsv

  • Text files are a popular way to hold and exchange tabular data
  • Text file formats use delimiters to separate the values in each row
    • .csv: comma-separated
    • .txt: often tab-delimited, though the delimiter varies
    • .tsv: tab-separated
  • The delimiter tells us which function to use to read in the data
  • I'll illustrate with the following files in the R-Bootcamp download file
    • mydata.csv
    • mydata.txt
    • mydata.tsv
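
If you are unsure which delimiter a file uses, you can look at its first raw lines before choosing a function. The sketch below writes a tiny made-up tab-delimited file to a temporary location (the file and its contents are illustrative, not one of the bootcamp files) and inspects it:

```r
# Write a small tab-delimited file to a temporary path (illustrative data)
path <- tempfile(fileext = ".tsv")
writeLines(c("id\titem", "1\tbeer", "2\twine"), path)

# Read the first two raw lines; tabs are shown as \t when printed
first_lines <- readLines(path, n = 2)
print(first_lines)

# Check whether the header line contains a tab character
has_tab <- grepl("\t", first_lines[1], fixed = TRUE)
print(has_tab)
```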

Importing Text Files

Base R

Use read.csv for comma-separated files (it automatically sets sep = ",")

read.csv("data/mydata.csv")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE


Use read.delim for tab-delimited files (it automatically sets sep = "\t")

read.delim("data/mydata.txt")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE
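
Both functions are convenience wrappers around read.table with different defaults. As a sketch, the code below writes a small made-up CSV to a temporary file and reads it both ways; for simple files like this one the two results should match, though read.csv also changes a few other defaults (such as quoting and comment handling).

```r
# Create a small comma-separated file (illustrative data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,item,liked", "10,beer,TRUE", "25,wine,TRUE", "8,cheese,FALSE"),
  csv_path
)

# read.csv fills in header = TRUE and sep = "," for us
via_read_csv <- read.csv(csv_path)

# The same import spelled out with read.table
via_read_table <- read.table(csv_path, header = TRUE, sep = ",")

# Compare the two data frames
print(all.equal(via_read_csv, via_read_table))
```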

Importing Text Files

Base R

When importing data, save it to an object by using the assignment operator:

mydata <- read.delim("data/mydata.tsv")

mydata
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE


  • You now have a data object in your global environment named mydata
  • View your data in spreadsheet form by:
    • clicking on the object in the global environment space, or
    • typing View(mydata) in your console
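
View() needs a graphical session, so it helps to know a few console-only alternatives. The sketch below rebuilds a data frame like mydata from an inline string (an assumption made so the example runs without the bootcamp files) and inspects it with base R:

```r
# Recreate a small tab-delimited data set inline via the text argument
mydata <- read.delim(
  text = "variable 1\tvariable 2\tvariable 3\n10\tbeer\tTRUE\n25\twine\tTRUE\n8\tcheese\tFALSE"
)

col_names <- names(mydata)   # column names after import
print(col_names)
print(head(mydata, n = 2))   # first two rows
str(mydata)                  # type of each column
```

Note that the spaces in the header were converted to periods on import, which is why the output earlier shows variable.1 rather than variable 1.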

Your Turn



1. Read in the facebook.tsv file


2. Save it as an object titled facebook


3. Take a peek at what this data looks like

Solution



1 & 2: Read in the facebook.tsv file and save as facebook

facebook <- read.delim("data/facebook.tsv")


3: Take a peek at what this data looks like

View(facebook)

Importing Excel Files

  • Excel is still the spreadsheet software of choice
  • Base R does not have an Excel importing function but we can use the readxl package
# if you haven't installed the readxl package run the following line (minus hashtag)
# install.packages("readxl")

library(readxl)

read_excel("data/mydata.xlsx", sheet = "Sheet5")
##   variable 1 variable 2 variable 3 variable 4          variable 5
## 1         10       beer          1      42328 2015-11-20 13:30:00
## 2         25       wine          1         NA 2015-11-21 16:30:00
## 3          8       <NA>          0      42330 2015-11-22 14:45:00
  • Benefits of readxl
    • recognizes date and time variables
    • keeps text variables as characters rather than factors (more to come on this)
    • does not collapse variable names
    • read more about readxl capabilities in the readxl package documentation
  • xlsx is an alternative package for reading in Excel files
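
When you do not know a workbook's sheet names, readxl can list them before you import. This is a minimal sketch that uses an example workbook shipped with readxl rather than the bootcamp's mydata.xlsx, and it only runs if readxl is installed:

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with the readxl package
  path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names in the workbook
  sheets <- readxl::excel_sheets(path)
  print(sheets)

  # Import the first sheet by name
  first_sheet <- readxl::read_excel(path, sheet = sheets[1])
  print(head(first_sheet, n = 3))
} else {
  message("Install readxl first: install.packages(\"readxl\")")
}
```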

Your Turn


1. Read in the spreadsheet titled "3. Median HH income, metro" in the PEW Middle Class Data.xlsx file



2. Save it as an object titled pew


3. Take a peek at what this data looks like



Hint: You will need to skip 5 lines ☛ check out the help file on read_excel to do this

Solution



1 & 2: Read in the .xlsx file and save as pew

pew <- read_excel("data/PEW Middle Class Data.xlsx", sheet = "3. Median HH income, metro", skip = 5)


3: Take a peek at what this data looks like

View(pew)

Scraping Online Files

Tabular files

  • Dense subject with LOTS to learn
  • We'll only focus on reading in tabular and Excel files stored online
  • For more info regarding "real web scraping" see the tutorial at: uc-r.github.io/scraping


Let's download some data from https://www.data.gov/metrics:

# the url for the online CSV
url <- "https://www.data.gov/media/federal-agency-participation.csv" 

# use read.csv to import
data_gov <- read.csv(url, stringsAsFactors = FALSE)

View(data_gov)
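
Online files can move or be temporarily unreachable, so it can help to wrap the import in error handling. The helper below, read_csv_safely, is a hypothetical name introduced for this sketch, not a base R function. It is demonstrated on a local temporary file with made-up data so that it runs offline; a URL string can be passed the same way.

```r
# Hypothetical helper: returns a data frame, or NULL if the file cannot be read
read_csv_safely <- function(location) {
  tryCatch(
    read.csv(location, stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", location, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Illustrative local file standing in for an online CSV
local_copy <- tempfile(fileext = ".csv")
write.csv(
  data.frame(agency = c("A", "B"), datasets = c(3, 5)),
  local_copy,
  row.names = FALSE
)

ok_result      <- read_csv_safely(local_copy)
missing_result <- read_csv_safely(file.path(tempdir(), "does-not-exist.csv"))

print(ok_result)
print(is.null(missing_result))
```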

Scraping Online Files

Excel files

  • Scraping online Excel files follows a similar process
  • The gdata package is particularly easy to use


Let's download some data from Fair Market Rents for Section 8 Housing:

library(gdata)

url <- "http://www.huduser.org/portal/datasets/fmr/fmr2015f/FY2015F_4050_Final.xls"

# use read.xls to import
rents <- read.xls(url)

View(rents)
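
As an alternative to gdata, you can download the file first and then read it with readxl, which reads from local paths rather than URLs. The helper name read_online_excel below is hypothetical and introduced only for this sketch; it assumes the URL ends in the file's extension.

```r
# Hypothetical helper: download an Excel file to a temporary path, then read it
read_online_excel <- function(url, ...) {
  if (!requireNamespace("readxl", quietly = TRUE)) {
    stop("the readxl package is required for this sketch")
  }
  # Keep the original extension (e.g. xls or xlsx) on the temporary file
  ext <- tools::file_ext(url)
  local_path <- tempfile(fileext = paste0(".", ext))

  # mode = "wb" keeps the binary file intact, which matters on Windows
  download.file(url, destfile = local_path, mode = "wb", quiet = TRUE)

  # Extra arguments such as sheet or skip are passed on to read_excel
  readxl::read_excel(local_path, ...)
}

# Usage (requires an internet connection), with the URL from this section:
# rents <- read_online_excel(
#   "http://www.huduser.org/portal/datasets/fmr/fmr2015f/FY2015F_4050_Final.xls"
# )
```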

Your Turn



1. Download the file stored at:

https://bradleyboehmke.github.io/public/data/reddit.csv


2. Save it as an object titled reddit


3. Take a peek at what this data looks like

Solution



1. Download the file stored at: bradleyboehmke.github.io/public/data/reddit.csv

2. Save it as an object titled reddit

url <- "https://bradleyboehmke.github.io/public/data/reddit.csv"

reddit <- read.csv(url)


3. Take a peek at what this data looks like

View(reddit)

Key Things to Remember

Remember These Functions!

Operator/Function    Description
data()               access built-in data sets
?                    provides information about built-in data (e.g. ?mtcars)
read.csv()           base R function for reading in .csv files (can also read a .csv file stored online)
read.delim()         base R function for reading in .txt and .tsv files
read_excel()         imports Excel data (provided by the readxl package)
read.xls()           imports Excel data, including files stored online (provided by the gdata package)
View()               opens a spreadsheet-style data viewer
