Numerical Data Descriptive Statistics

Descriptive statistics are the first pieces of information used to understand and represent a dataset. There goal, in essence, is to describe the main features of numerical and categorical information with simple summaries. These summaries can be presented with a single numeric measure, using summary tables, or via graphical representation. Here, I illustrate the most common forms of descriptive statistics for numerical data but keep in mind there are numerous ways to describe and illustrate key features of data.


tl;dr

This tutorial covers the key features we are initially interested in understanding for numerical data, to include:

  • Central Tendency: What are the most typical values?
  • Variability: How do the values vary?
  • Shape: Are the values symmetrically or assymetrically distributed?
  • Outliers: Are there values that represent abnormalities in the data?
  • Visualization: We should understand these features of the data through statistics and visualization


Replication Requirements

To illustrate ways to compute different summary statistics, and to visualize the data to provide understanding of these key features, I’ll demonstrate using this data which contains data on 843 MLB players in the 2011 season:

##            Player                Team Position   Salary
## 1    A.J. Burnett    New York Yankees  Pitcher 16500000
## 2      A.J. Ellis Los Angeles Dodgers  Catcher   421000
## 3 A.J. Pierzynski   Chicago White Sox  Catcher  2000000
## 4      Aaron Cook    Colorado Rockies  Pitcher  9875000
## 5      Aaron Crow  Kansas City Royals  Pitcher  1400000
## 6    Aaron Harang    San Diego Padres  Pitcher  3500000

In addition, the packages we will leverage include the following:

library(moments)        # for calculating the skew and kurtosis
library(outliers)       # identifying and extracting outliers
library(ggplot2)        # for generating visualizations

See Working with packages for more information on installing, loading, and getting help with packages.


Central Tendency

There are three common measures of central tendency, all of which try to answer the basic question of which value is the most “typical.” These are the mean (average of all observations), median (middle observation), and mode (appears most often). Each of these measures can be calculated for an individual variable or across all variables in a particular data frame.

mean(salaries$Salary, na.rm = TRUE)
## [1] 3305055
median(salaries$Salary, na.rm = TRUE)
## [1] 1175000

Unfortunately, there is not a built in function to compute the mode of a variable1. However, we can create a function that takes the vector as an input and gives the mode value as an output:

get_mode <- function(v) {
        unique_value <- unique(v)
        unique_value[which.max(tabulate(match(v, unique_value)))]
}

get_mode(salaries$Salary)
## [1] 414000

Go to top


Variability

The central tendencies give you a sense of the most typical values (salaries in this case) but do not provide you with information on the variability of the values. Variability can be summarized in different ways, each providing you unique understanding of how the values are spread out.

Range

The range is a fairly crude measure of variability, defining the maximum and minimum values and the difference thereof. We can compute range summaries with the following:

# get the minimum value
min(salaries$Salary, na.rm = FALSE)
## [1] 414000

# get the maximum value
max(salaries$Salary, na.rm = FALSE)
## [1] 3.2e+07

# get both the min and max values
range(salaries$Salary, na.rm = FALSE)
## [1]   414000 32000000

# compute the spread between min & max values
max(salaries$Salary, na.rm = FALSE) - min(salaries$Salary, na.rm = FALSE)
## [1] 31586000

Percentiles

Given a certain percentage such as 25%, what is the salary value such that this percentage of salaries is below it? This type of question leads to percentiles and quartiles. Specifically, for any percentage p, the pth percentile is the value such that a percentage p of all values are less than it. Similarly, the first, second, and third quartiles are the percentiles corresponding to p=25%, p=50%, and p=75%. These three values divide the data into four groups, each with (approximately) a quarter of all observations. Note that the second quartile is equal to the median by definition. These measures are easily computed in R:

# fivenum() function provides min, 25%, 50% (median), 75%, and max
fivenum(salaries$Salary)
## [1]   414000   430325  1175000  4306250 32000000

# default quantile() percentiles are 0%, 25%, 50%, 75%, and 100% 
# provides same output as fivenum()
quantile(salaries$Salary, na.rm = TRUE)
##       0%      25%      50%      75%     100% 
##   414000   430325  1175000  4306250 32000000

# we can customize quantile() for specific percentiles
quantile(salaries$Salary, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
##       0%      10%      20%      30%      40%      50%      60%      70% 
##   414000   416520   424460   441300   672000  1175000  2004000  3320000 
##      80%      90%     100% 
##  5500000  9800000 32000000

# we can quickly compute the difference between the 1st and 3rd quantile
IQR(salaries$Salary)
## [1] 3875925

An alternative approach to is to use the summary() function with is a generic R function used to produce min, 1st quantile, median, mean, 3rd quantile, and max summary measures. However, note that the 1st and 3rd quantiles produced by summary() differ from the 1st and 3rd quantiles produced by fivenum() and the default quantile(). The reason for this is due to the lack of universal agreement on how the 1st and 3rd quartiles should be calculated.2 Eric Cai provided a good blog post that discusses this difference in the R functions.

summary(salaries$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   414000   430300  1175000  3305000  4306000 32000000

Variance

Although the range provides a crude measure of variability and percentiles/quartiles provide an understanding of divisions of the data, the most common measures to summarize variability are variance and its derivatives (standard deviation and mean/median absolute deviation). We can compute each of these as follows:

# variance
var(salaries$Salary)
## [1] 2.056389e+13

# standard deviation
sd(salaries$Salary)
## [1] 4534742

# mean absolute deviation
mad(salaries$Salary, center = mean(salaries$Salary))
## [1] 4229672

# median absolute deviation - note that the center argument defaults to median
# so it does not need to be specified, although I do just be clear
mad(salaries$Salary, center = median(salaries$Salary))
## [1] 1126776

Go to top


Shape

Two additional measures of a distribution that you will hear occasionally include skewness and kurtosis. Skewness is a measure of symmetry for a distribution. Negative values represent a left-skewed distribution where there are more extreme values to the left causing the mean to be less than the median. Positive values represent a right-skewed distribution where there are more extreme values to the right causing the mean to be more than the median.

Kurtosis is a measure of peakedness for a distribution. Negative values indicate a flat (platykurtic) distribution, positive values indicate a peaked (leptokurtic) distribution, and a near-zero value indicates a normal (mesokurtic) distribution.

We can get both skewness and kurtosis values using the moments package:

library(moments)

skewness(salaries$Salary, na.rm = TRUE)
## [1] 2.252809
kurtosis(salaries$Salary, na.rm = TRUE)
## [1] 8.682261

Go to top


Outliers

Outliers in data can distort predictions and affect their accuracy. Consequently, its important to understand if outliers are present and, if so, which observations are considered outliers. The outliers package provides a number of useful functions to systematically extract outliers. The functions of most use are outlier() and scores(). The outlier() function gets the most extreme observation from the mean. The scores() function computes the normalized (z, t, chisq, etc.) score which you can use to find observation(s) that lie beyond a given value.

library(outliers)

# gets most extreme right-tail observation
outlier(salaries$Salary)
## [1] 3.2e+07

# gets most extreme left-tail observation
outlier(salaries$Salary, opposite = TRUE)
## [1] 414000

# observations that are outliers based on z-scores
z_scores <- scores(salaries$Salary, type = "z")
which(abs(z_scores) > 1.96)
##  [1]   1  11  22  24  33  34  38  54  64 132 134 138 139 146 151 161 164
## [18] 168 211 236 242 248 249 348 355 370 384 441 452 460 484 496 517 520
## [35] 535 538 576 577 585 601 627 629 649 692 695 728 729 733 737 790 794
## [52] 801 803 816 843

# outliers based on values less than or greater than the "whiskers" on a 
# boxplot (1.5 x IQR or more below 1st quartile or above 3rd quartile)
which(scores(salaries$Salary, type = "iqr", lim = 1.5))
##  [1]   1  11  12  22  24  33  34  38  54  64  83  97 132 134 138 139 146
## [18] 151 161 164 168 189 208 211 236 242 248 249 293 298 300 329 336 348
## [35] 355 370 384 408 440 441 442 452 460 475 484 493 496 517 520 535 538
## [52] 547 576 577 585 601 620 627 629 649 664 679 692 695 707 728 729 733
## [69] 737 760 790 794 801 803 816 818 843

How you deal with outliers is a topic worthy of its own tutorial; however, if you want to simply remove an outlier or replace it with the sample mean or median then I recommend the rm.outlier() function provided also by the outliers package.

Go to top


Visualization

There are many graphical representations to illustrate these summary measures for numerical variables, but there are a couple fundamental ones that nearly everyone agrees needs to be assessed - histograms and boxplots.

Histograms are the most common type of chart for showing the distribution of a numerical variable. Histograms display a 1D distribution by dividing into bins and counting the number of observations in each bin. Whereas the previously discussed summary measures - mean, median, standard deviation, skewness - describes only one aspect of a numerical variable, a histogram provides the complete picture by illustrating the center of the distribution, the variability, skewness, and other aspects in one convenient chart. We can quickly visualize the histogram for our MLB salaries using base R graphics:

hist(salaries$Salary)

However, we can use ggplot to customize our graphic and create a more presentable product:

library(ggplot2)

ggplot(salaries, aes(Salary)) + 
        geom_histogram(colour = "black", fill = "white") +
        scale_x_log10(labels = scales::dollar) +          # x axis log & $$ labels
        geom_vline(aes(xintercept = mean(Salary)), 
                   color = "red", linetype = "dashed") +  # add line for mean
        annotate("text", x = mean(salaries$Salary) * 2, y = 155,
                 label = paste0("Avg: $", round(mean(salaries$Salary)/1000000, 1),"M"))

An alternative, and highly effective way to visualize data is with dotplots. In a dotplot a dot is placed at the appropriate value on the x axis for each case with multiples cases of a particular value resulting in dots stacking up. The dotplot below helps to illustrate that there is one unique player with a salary greater than $30M (Alex Rodriguez) and only six players with salaries greater than $20M.

ggplot(salaries, aes(x = Salary)) + 
        geom_dotplot() +
        scale_x_continuous(labels = scales::dollar)

Boxplots are an alternative way to illustrate the distribution of a variable and is a concise way to illustrate the standard quantiles, shape, and outliers of data. As the generic diagram indicates, the box itself extends, left to right, from the 1st quartile to the 3rd quartile. This means that it contains the middle half of the data. The line inside the box is positioned at the median. The lines (whiskers) coming out either side of the box extend to 1.5 interquartlie ranges (IQRs) from the quartlies. These generally include most of the data outside the box. More distant values, called outliers, are denoted separately by individual points.

Generic Box Plot

For a quick univariate assessment we can use the boxplot() function in base R graphics. This single boxplot illustrates the highly skewed nature of the distribution with many outliers on the right side.

# I use a log scale to spread out the data
boxplot(salaries$Salary, horizontal = TRUE, log = "x")

As before, we can use ggplot to refine the boxplot and add additional features such as a point illustrating the mean and also show the actual distribution of observations:

ggplot(salaries, aes(x = factor(0), y = Salary)) +
        geom_boxplot() +
        xlab("") +
        scale_x_discrete(breaks = NULL) +
        scale_y_log10(labels = scales::dollar) +
        coord_flip() +
        geom_jitter(shape = 16, position = position_jitter(0.4), alpha = .3) +
        stat_summary(fun.y = mean, geom = "point", shape = 23, size = 4, fill = "blue")

However, boxplots are more useful when comparing distributions. For instance, if you wanted to compare the distributions of salaries across the different positions boxplots provide a quick comparative assessment:

ggplot(salaries, aes(x = Position, y = Salary)) +
        geom_boxplot() +
        scale_y_continuous(labels = scales::dollar) +
        coord_flip()

Go to top

  1. There is a mode() function in R; however, it is used to get or set the type or storage mode of an object rather than to compute the statistical mode of a variable. 

  2. See Hyndman, R. & Fan, Y. (1996). Sample Quantiles in Statistical Packages, The American Statistician, 50(4), 361-365 for more discussion.