## What to Remember from this Section

• R has the ability to work with a variety of data types
• As an analyst, you should become familiar with dealing with the following types of data:

| Data Type | Description |
|---|---|
| Numbers | integer (e.g. 1, 2, 3…), double (e.g. 1.5, 3.66) |
| Character strings | "r", "I attend UC", etc. |
| Regular expressions | patterns within text strings |
| Factors | nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75) |
| Dates | calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc. |
| Logical | TRUE, FALSE, any, all |

## Numbers

## Numbers: two types of numbers

Numeric data primarily comes in two forms: integer & double (double-precision floating point)

# create a vector of double-precision values
dbl_var <- c(1, 2.5, 4.5)

class(dbl_var)
## [1] "numeric"

# placing an L after the values creates a vector of integers
int_var <- c(1L, 6L, 10L)

class(int_var)
## [1] "integer"

We can coerce integers to doubles and vice versa with as.double() and as.integer()

# as.integer() truncates toward zero rather than rounding
as.integer(dbl_var)
## [1] 1 2 4

int_to_dbl <- as.double(int_var)

class(int_to_dbl)
## [1] "numeric"

# combining double and integer automatically coerces to the more general type (double)
c(dbl_var, int_var)
## [1]  1.0  2.5  4.5  1.0  6.0 10.0

## Numbers: comparing numbers

You've already seen the comparison operators ==, !=, <, <=, >, >=

x <- c(4, 4, 9, 12)
y <- c(4, 4, 9, 12.00000008)

x == y
## [1]  TRUE  TRUE  TRUE FALSE

We can also test for exact equality with identical() and near equality with all.equal()

z <- c(4, 4, 9, 12)

identical(x, y)
## [1] FALSE

identical(x, z)
## [1] TRUE

all.equal(x, y)
## [1] TRUE

## Numbers: rounding

We can also round numbers in multiple ways:

x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)

# round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3

# round up
ceiling(x)
## [1] 1 2 2 3 3 3

# round down
floor(x)
## [1] 1 1 1 2 2 2

# round to a specified decimal
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8

## Your Turn

Import the numbers-your-turn.csv file in the data folder

1. Are the vectors x & y equal? Exactly or approximately equal?

2. Are the vectors y & z equal? Exactly or approximately equal?

3. Round x & y numbers to the 4th digit

4. Are these vectors equal now?

## Solution

# import the numbers-your-turn.csv file in the data folder
df <- read.csv("data/numbers-your-turn.csv")

# 1. Are the vectors x & y equal? Exactly or approximately equal?
identical(df$x, df$y)
## [1] FALSE

all.equal(df$x, df$y)
## [1] "Mean relative difference: 1.041407e-07"

# 2. Are the vectors y & z equal? Exactly or approximately equal?
identical(df$y, df$z)
## [1] FALSE

all.equal(df$y, df$z)
## [1] TRUE

# 3. Round x & y numbers to the 4th digit
x <- round(df$x, digits = 4)
y <- round(df$y, digits = 4)

# 4. Are these vectors equal now?
identical(x, y)
## [1] TRUE

all.equal(x, y)
## [1] TRUE

## Characters

### "Hello world!"

## Characters: creating simple character strings

Create character strings using quotation marks ""

a <- "learning to create"
b <- "character strings"

Combine character strings with c(), paste(), or paste0()

# create a vector containing two elements - a and b
c(a, b)
## [1] "learning to create" "character strings"

# create a vector containing one element - a and b combined
paste(a, b)
## [1] "learning to create character strings"

# paste multiple strings
paste("I", "love", "R")
## [1] "I love R"

# change the separator
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"

# paste0() joins the strings with no separator
paste0("I", "love", "R")
## [1] "IloveR"

## Characters: test, conversion & coercion

Use class(), mode() and/or is.character() to assess the data type

a <- "Life of"
b <- pi

class(a)
## [1] "character"

mode(a)
## [1] "character"

is.character(pi)
## [1] FALSE

Use as.character() to convert a non-character value to a character

as.character(pi)
## [1] "3.14159265358979"

Combining characters and non-characters will coerce all inputs to character

c(a, b)
## [1] "Life of"          "3.14159265358979"

## Characters: summarizing

Use length() to count the number of elements (individual character strings) in a vector

length("How many elements are in this string?")
## [1] 1

length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7

Use nchar() to count the number of characters in each element

nchar("How many characters are in this string?")
## [1] 39

nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1]  3  4 10  3  2  4  7

## Characters: manipulation

## Factors

### aka categorical variables

## Factors: different from characters

Key words: finite options and levels

• nominal variables
  • male, female
  • brunette, blonde, red, black
  • Hispanic, Caucasian, Asian, African
• ordinal variables
  • slow, medium, fast
  • freshman, sophomore, junior, senior
• interval variables
  • $1-100, $101-200, $201-300
  • 0-10, 11-20, 21-30
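
To make the "finite options and levels" idea concrete, here is a minimal sketch of how a factor differs from the character vector it was built from. The `speed_chr` vector is made up for illustration and is not part of the course data.

```r
# Hypothetical example vector (not from the course data)
speed_chr <- c("slow", "fast", "medium", "fast")
speed_fct <- factor(speed_chr)

# A factor records the finite set of allowed values as levels
# (sorted alphabetically by default)
speed_levels <- levels(speed_fct)
speed_levels

# Internally, each element is stored as an integer index into those levels
speed_codes <- as.integer(speed_fct)
speed_codes

# A plain character vector has no levels at all
levels(speed_chr)
```

The alphabetical default is rarely the order you want for ordinal data, which is why the level order usually has to be set explicitly.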

## Factors: creating nominal factors

Create nominal factors with factor()

gender <- c("male", "female", "female")

class(gender)
## [1] "character"

gender2 <- factor(gender)

class(gender2)
## [1] "factor"

gender2
## [1] male   female female
## Levels: female male

Set the order of the levels with the levels argument

factor(gender, levels = c("male", "female"))
## [1] male   female female
## Levels: male female
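
One consequence of supplying levels yourself is worth knowing: any value not listed becomes NA. The sketch below uses a made-up `gender_raw` vector to show this, along with the effect of level order on table().

```r
# Hypothetical responses; "unknown" is deliberately not listed as a level
gender_raw <- c("male", "female", "unknown", "female")

gender_fct <- factor(gender_raw, levels = c("male", "female"))
gender_fct

# Count how many values fell outside the declared levels
n_missing <- sum(is.na(gender_fct))
n_missing

# table() reports levels in the order given, not alphabetically
gender_counts <- table(gender_fct)
gender_counts
```

Checking `sum(is.na(...))` right after creating a factor is a cheap way to catch misspelled or unexpected categories.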

## Factors: creating ordered factors

Create ordinal/interval factors with ordered(); set the order of the levels with the levels argument

age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")

class(age.range)
## [1] "character"

# turn age.range into an ordered factor - levels default to alphabetical (sorted) order
age.range2 <- ordered(age.range)

class(age.range2)
## [1] "ordered" "factor"

age.range2
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18
## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18

Set the order of the levels with the levels argument

ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18
## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above
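
Once the levels are in a meaningful order, comparisons respect that order. The following is a small sketch with made-up responses (`resp` is hypothetical) showing what an ordered factor allows that a nominal factor does not.

```r
age_levels <- c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above")

# Hypothetical survey responses
resp <- ordered(c("25-34", "Under 18", "45-54", "18-24"), levels = age_levels)

# Comparison operators follow the level order, not alphabetical order
under_35 <- resp < "35-44"
under_35

# min() and max() are defined for ordered factors
oldest <- max(resp)
as.character(oldest)
```

Note that "Under 18" compares as the smallest value here only because it was listed first in `age_levels`.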

## Factors: summarizing

If you want to know the levels that exist in your factor variable use levels()

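# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer read as factors by default
facebook <- read.delim("data/facebook.tsv", stringsAsFactors = TRUE)

levels(facebook$gender)
## [1] "female" "male"

We can use the table() function to quickly assess the counts of each level

table(facebook$gender)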
##
## female   male
##  40254  58574
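
Counts are often easier to read as proportions or in sorted order. This is a minimal sketch using a small made-up `gender` factor in place of the facebook data, which is not reproduced here.

```r
# Hypothetical vector standing in for facebook$gender
gender <- factor(c("male", "female", "male", "male", "female"))

counts <- table(gender)
counts

# Convert counts to proportions of the total
props <- prop.table(counts)
props

# Sort levels from most to least frequent
sorted_counts <- sort(counts, decreasing = TRUE)
names(sorted_counts)
```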

## Your Turn

Import the reddit.csv file in the data folder

1. What are the levels for the income.range variable?

2. Properly order the levels for income.range.

3. What are the counts for each level?

## Solution

# import the reddit.csv file in the data folder
# (stringsAsFactors = TRUE is needed in R >= 4.0.0 so that levels() works below)
reddit <- read.csv("data/reddit.csv", stringsAsFactors = TRUE)

# 1. What are the levels for the income.range variable?
levels(reddit$income.range)
## [1] "$100,000 - $149,999" "$150,000 or more"    "$20,000 - $29,999"
## [4] "$30,000 - $39,999"   "$40,000 - $49,999"   "$50,000 - $69,999"
## [7] "$70,000 - $99,999"   "Under $20,000"

# 2. Properly order the levels for income.range.
reddit$income.range <- ordered(reddit$income.range,
                               levels = c("Under $20,000", "$20,000 - $29,999", "$30,000 - $39,999",
                                          "$40,000 - $49,999", "$50,000 - $69,999", "$70,000 - $99,999",
                                          "$100,000 - $149,999", "$150,000 or more"))

# 3. What are the counts for each level?
table(reddit$income.range)
##
##       Under $20,000   $20,000 - $29,999   $30,000 - $39,999
##                7892                3206                2904
##   $40,000 - $49,999   $50,000 - $69,999   $70,000 - $99,999
##                2686                4133                4101
## $100,000 - $149,999    $150,000 or more
##                3522                2695

## Dates: creating

• The lubridate package makes working with dates extremely easy
• To create a date variable we simply need to know the order in which the year, month, and day appear

| Function | Order of elements in date-time |
|---|---|
| ymd() | year, month, day |
| ydm() | year, day, month |
| mdy() | month, day, year |
| dmy() | day, month, year |
| hm() | hour, minute |
| hms() | hour, minute, second |
| ymd_hms() | year, month, day, hour, minute, second |

## Dates: creating

dates <- c("2015-07-01", "2015-08-01", "2015-09-01")

class(dates)
## [1] "character"

Convert this character vector to Date format with lubridate's ymd() function

# install.packages("lubridate") # run this line if you have not yet installed lubridate
library(lubridate)

dates2 <- ymd(dates)

class(dates2)
## [1] "Date"

dates2
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
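
For comparison, base R can parse the same element orders with as.Date(), but the order has to be spelled out as a format string. The sketch below assumes made-up input strings and uses only base R, so it runs without lubridate.

```r
# Base R needs the element order spelled out as a format string
d_ymd <- as.Date("2016-08-06")                        # ISO order is the default
d_mdy <- as.Date("08/06/2016", format = "%m/%d/%Y")   # month, day, year
d_dmy <- as.Date("06.08.2016", format = "%d.%m.%Y")   # day, month, year

# All three strings describe the same calendar date
c(d_ymd, d_mdy, d_dmy)

# A string that does not match the format yields NA
no_match <- as.Date("2016-08-06", format = "%m/%d/%Y")
no_match
```

The lubridate functions save you from writing these format strings, which is the convenience the slides refer to.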

## Dates: create by merging

• Sometimes your date data are collected in separate elements
• To combine these separate elements into one date object, use the ISOdate() function:
yr <- c("2012", "2013", "2014", "2015")
mo <- c("1", "5", "7", "2")
day <- c("02", "22", "15", "28")

# ISOdate converts to a POSIXct object
full_date <- ISOdate(year = yr, month = mo, day = day)
full_date
## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"

We can truncate the unused time data by converting with as.Date()

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
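
When merging components from real data, some combinations may not form a valid date. As a minimal sketch with made-up components, ISOdate() returns NA for those, which makes them easy to flag:

```r
# Hypothetical components; February 30 does not exist
yr  <- c(2015, 2015)
mo  <- c(2, 2)
day <- c(28, 30)

merged <- as.Date(ISOdate(year = yr, month = mo, day = day))
merged

# Flag rows whose components did not form a real date
bad_rows <- which(is.na(merged))
bad_rows
```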

## Dates: extract & manipulate

We can also easily extract components of dates using lubridate

| Function | Date-time element to extract |
|---|---|
| year() | Year |
| month() | Month |
| week() | Week |
| yday() | Day of year |
| mday() | Day of month |
| wday() | Day of week |
| hour() | Hour |
| minute() | Minute |
| second() | Second |
| tz() | Time zone |

## Dates: extract & manipulate

Extract time components:

year(full_date)
## [1] 2012 2013 2014 2015

week(full_date)
## [1]  1 21 28  9

wday(full_date, label = TRUE)
## [1] Mon  Wed  Tues Sat
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

Change a date-time component by assigning a new value to the same extractor function

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

year(full_date) <- c(2014, 2015, 2015, 2016)

as.Date(full_date)
## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"

## Dates: summarizing

• We can also do regular statistical summaries of date objects
• Illustrate with the lakers data set that comes with the lubridate package
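
As a minimal sketch of what statistical summaries of dates look like, the example below uses a small made-up vector of dates (`game_dates` is hypothetical, not the lakers data) and only base R functions.

```r
# Hypothetical game dates (not the lakers data)
game_dates <- as.Date(c("2008-10-28", "2008-11-01", "2008-11-05", "2008-11-12"))

range(game_dates)                 # earliest and latest date
mean(game_dates)                  # average date
mid_date <- median(game_dates)    # middle date
mid_date

# Number of days between the earliest and latest date
span_days <- as.numeric(diff(range(game_dates)))
span_days

# Days elapsed between consecutive dates
gaps <- as.numeric(diff(game_dates))
gaps
```

Dates are stored as day counts under the hood, so the usual numeric summaries carry over directly.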

## Solution

# Import the facebook.tsv file in the data folder
facebook <- read.delim("data/facebook.tsv")

# 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
facebook$dob <- as.Date(ISOdate(year = facebook$dob_year,
                                month = facebook$dob_month,
                                day = facebook$dob_day))

# 2. What is the min, max, mean, and median date of birth in this data frame?
summary(facebook$dob)
##         Min.      1st Qu.       Median         Mean      3rd Qu.
## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01"
##         Max.
## "2000-10-27"

## Logical: the basics

We already saw how we can get TRUE/FALSE responses from comparing elements

x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)

x == y
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This is just a vector containing logical elements

z <- x == y

class(z)
## [1] "logical"

We can assess if any or all the elements are TRUE

any(z)
## [1] TRUE

all(z)
## [1] FALSE
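
Beyond any() and all(), logical vectors can be counted and located because TRUE is treated as 1 and FALSE as 0 in arithmetic. A short sketch reusing the x and y vectors from above:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

n_equal   <- sum(z)      # number of TRUE elements
pct_equal <- mean(z)     # proportion of TRUE elements
mismatch  <- which(!z)   # positions where x and y differ

n_equal
pct_equal
mismatch
```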

## Remember These Functions!

| Operator/Function | Description |
|---|---|
| as.double(), as.integer() | coerce to double floating point or integer numbers |
| identical(), all.equal() | test for exact and near equality |
| round(), ceiling(), floor() | round numbers |
| c(), paste(), paste0() | combine character strings |
| as.character() | coerce non-character to a character |
| nchar() | count the number of characters in each element |
| factor(), ordered() | create or coerce to factor variables |
| levels() | assess the levels of a factor |
| table() | get the counts of each level |

## Remember These Functions!

| Operator/Function | Description |
|---|---|
| ymd(), mdy(), hm(), etc. | lubridate: create or convert to date-time variable |
| ISOdate() | create date variable by merging separate date components |
| as.Date() | truncate date-time variable to just date variable |
| year(), week(), etc. | lubridate: extract individual date components |
| any(), all() | assess if any or all elements are TRUE |
