Exploratory data analysis plotting should be quick and simple, and base R excels at this.

Visualization | Function |
---|---|
Strip chart | `stripchart()` |
Histogram | `hist()` |
Density plot | `plot(density())` |
Box plot | `boxplot()` |
Bar chart | `barplot()` |
Dot plot | `dotchart()` |
Scatter plot | `plot()`, `pairs()` |
Line chart | `plot()` |

In R, graphs are typically created interactively:

    attach(mtcars)
    plot(wt, mpg)
    abline(lm(mpg ~ wt))
    title("Regression of MPG on Weight")

You can specify fonts, colors, line styles, axes, reference lines, etc. by setting graphical parameters, either as arguments to the plotting call or globally with `par()`.
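As a minimal sketch of what setting graphical parameters can look like, the block below restyles the `mtcars` regression plot. The specific colors, line types, and axis labels are illustrative choices, not settings taken from the original material.

```r
# Set global parameters; par() returns the previous values so they can be restored
old_par <- par(las = 1, font.main = 2)

# Per-plot parameters: point symbol, color, axis labels, title
plot(mtcars$wt, mtcars$mpg,
     pch = 16, col = "steelblue",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Regression of MPG on Weight")

# Regression line with a custom color, line type, and width
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lty = 2, lwd = 2)

# Horizontal reference line at the mean MPG
abline(h = mean(mtcars$mpg), col = "grey50", lty = 3)

# Restore the previous global settings
par(old_par)
```

Using `data = mtcars` instead of `attach()` keeps the variables out of the search path, which tends to avoid name clashes in longer sessions.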

This allows a wide degree of customization; however, I have found that `ggplot` is an easier syntax for customization needs.

Import the following data sets from the data folder: `facebook.tsv`, `reddit.csv`, `race-comparison.csv`, `Supermarket Transactions.xlsx`
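One possible way to import files with these extensions is sketched below. The helper `read_data()` is hypothetical (it is not part of the original material), the `data` directory is assumed to sit in the working directory, and reading `.xlsx` assumes the `readxl` package is installed. The demo writes a throwaway file so the sketch runs without the course data.

```r
# Hypothetical helper: choose a reader from the file extension
read_data <- function(file, dir = "data") {
  path <- file.path(dir, file)
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    tsv  = read.delim(path),           # tab-separated, header row assumed
    csv  = read.csv(path),             # comma-separated, header row assumed
    xlsx = readxl::read_excel(path),   # assumes readxl is installed; reads the first sheet
    stop("Unsupported file type: ", ext)
  )
}

# Demo on a temporary file
tmp <- tempdir()
write.table(head(mtcars), file.path(tmp, "demo.tsv"),
            sep = "\t", row.names = FALSE)
demo <- read_data("demo.tsv", dir = tmp)

# With the course data in place, the calls would look like:
# facebook <- read_data("facebook.tsv")
# reddit   <- read_data("reddit.csv")
```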

Strip charts are useful when sample sizes are small but not when sample sizes are large:

    stripchart(mtcars$mpg, pch = 16)
    stripchart(facebook$tenure, pch = 16)
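To illustrate why strip charts degrade with large samples, the sketch below compares a small sample with a large one and then applies jittering and transparency, which can soften (but not remove) the overplotting. The large sample is simulated here as a stand-in, not the `facebook` data.

```r
set.seed(42)                          # reproducible simulated data
small <- mtcars$mpg                   # 32 observations
large <- rexp(5000, rate = 1 / 500)   # simulated stand-in for a large, skewed variable

old_par <- par(mfrow = c(3, 1))       # stack three charts vertically

stripchart(small, pch = 16, main = "Small sample")
stripchart(large, pch = 16, main = "Large sample: points overplot")

# Jitter spreads points vertically; a semi-transparent color shows density
stripchart(large, method = "jitter", jitter = 0.3, pch = 16,
           col = rgb(0, 0, 0, 0.1),
           main = "Large sample: jitter and transparency")

par(old_par)
```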

    hist(facebook$tenure)
    hist(facebook$tenure, breaks = 100, col = "grey",
         main = "Facebook User Tenure", xlab = "Tenure (Days)")

Adding a normal curve to the histogram is a perfect example of why customization with base R is not always enjoyable; in ggplot this is far simpler:

    x <- na.omit(facebook$tenure)

    # histogram
    h <- hist(x, breaks = 100, col = "grey",
              main = "Facebook User Tenure", xlab = "Tenure (Days)")

    # add a normal curve
    xfit <- seq(min(x), max(x), length = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    yfit <- yfit * diff(h$mids[1:2]) * length(x)
    lines(xfit, yfit, col = "red", lwd = 2)
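In the code above, the normal density is multiplied by the bin width and the number of observations so that it matches the histogram's count scale. A shorter base R alternative, sketched here on simulated data rather than the `facebook` data, is to draw the histogram on the density scale with `freq = FALSE`, so the curve needs no rescaling.

```r
set.seed(1)                                  # reproducible simulated data
vals <- rnorm(1000, mean = 50, sd = 10)      # simulated stand-in variable

# Compute the parameters up front: inside curve(), `x` is the plotting grid,
# not the data
m <- mean(vals)
s <- sd(vals)

# Density-scale histogram: bar areas sum to 1
h <- hist(vals, breaks = 30, freq = FALSE, col = "grey",
          main = "Histogram on the Density Scale", xlab = "Value")

# Overlay the normal density directly, no rescaling required
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)
```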

Enclose `density(x)` within `plot()`:

    # basic density plot
    d <- density(facebook$tenure, na.rm = TRUE)
    plot(d, main = "Kernel Density of Tenure")

    # fill density plot by adding polygon()
    polygon(d, col = "red", border = "blue")
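The shape of a kernel density estimate depends on its bandwidth. As a rough sketch on simulated data (not the `facebook` data), the `adjust` argument of `density()` scales the default bandwidth, and `lines()` overlays the alternatives on one plot.

```r
set.seed(7)                                   # reproducible simulated data
vals <- c(rnorm(300, mean = 200, sd = 40),    # simulated bimodal variable
          rnorm(200, mean = 400, sd = 60))

d_default <- density(vals)                    # default bandwidth
d_narrow  <- density(vals, adjust = 0.5)      # half the default: more detail
d_wide    <- density(vals, adjust = 2)        # twice the default: smoother

plot(d_narrow, col = "blue", main = "Effect of Bandwidth on Density")
lines(d_default, lwd = 2)
lines(d_wide, col = "red", lty = 2)
legend("topright",
       legend = c("adjust = 0.5", "default", "adjust = 2"),
       col = c("blue", "black", "red"),
       lty = c(1, 1, 2), lwd = c(1, 2, 1))
```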

The previous methods provide good insights into the shape of the distribution but don't necessarily tell us about specific summary statistics such as:

    summary(facebook$tenure)

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    ##     0.0   226.0   412.0   537.9   675.0  3139.0       2

However, boxplots provide a concise way to illustrate these standard statistics, the shape, and outliers of data:

    boxplot(facebook$tenure, horizontal = TRUE)
    boxplot(facebook$tenure, horizontal = TRUE, notch = TRUE, col = "grey40")
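To see the numbers behind a box plot, `boxplot.stats()` returns the values that `boxplot()` draws. The sketch below uses `mtcars$hp` as a stand-in variable so it runs without the course data.

```r
hp <- mtcars$hp                    # stand-in variable from a built-in data set

bs <- boxplot.stats(hp)

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers, drawn individually as outliers
bs$n       # number of non-missing observations

# boxplot() also returns these statistics invisibly
bp <- boxplot(hp, horizontal = TRUE, col = "grey40",
              main = "Horsepower (mtcars)")
```

By default the whiskers extend to the most extreme points within 1.5 times the interquartile range of the hinges; the `coef` argument of `boxplot.stats()` (and `range` in `boxplot()`) changes that multiplier.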

Using the `facebook.tsv` data…

Visually assess the continuous variables. What do you find?

    reddit <- read.csv("data/reddit.csv")
    table(reddit$dog.cat)

    ##
    ##    I like cats.    I like dogs. I like turtles.
    ##           11156           17151            4442

    barplot(table(reddit$dog.cat))

    pets <- table(reddit$dog.cat)
    barplot(pets, main = "Reddit User Animal Preferences", col = "cyan")

    par(las = 1)
    barplot(pets, main = "Reddit User Animal Preferences", horiz = TRUE,
            names.arg = c("Cats", "Dogs", "Turtles"))
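Bar charts are often easier to read when the bars are ordered and shown as proportions. The sketch below rebuilds the counts from the `table()` output shown above as a named vector, so it runs without `reddit.csv`; the short labels and the ordering are illustrative choices.

```r
# Counts copied from the table() output above
pets <- c(Cats = 11156, Dogs = 17151, Turtles = 4442)

# Convert counts to proportions and sort from largest to smallest
pet_props <- sort(pets / sum(pets), decreasing = TRUE)

old_par <- par(las = 1)            # horizontal axis labels
barplot(pet_props,
        main = "Reddit User Animal Preferences",
        ylab = "Proportion of respondents",
        col = "cyan")
par(old_par)                       # restore the previous setting
```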