21 Apr 2017
It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response variable. Including such irrelevant variables leads to unnecessary complexity in the resulting model. Unfortunately, manually filtering through and comparing regression models can be tedious. Luckily, several approaches exist for automatically performing feature selection or variable selection — that is, for identifying those variables that result in superior regression results. This latest tutorial covers a traditional approach known as linear model selection.
07 Apr 2017
Although R provides built-in plotting functions,
ggplot2 has become the preeminent visualization package in R.
ggplot2 implements the Grammar of Graphics theory making it particularly effective for constructing visual representations of data and learning this library will allow you to make nearly any kind of (static) data visualization, customized to your exact specifications. This intro tutorial will get you started in making effective visualizations with R.
17 Mar 2017
See the new tutorial on resampling methods, which are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
03 Mar 2017
Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Clustering allows us to identify which observations (i.e. customers, students, states) are alike, and potentially categorize them therein. Check out our new tutorials covering k-means and hierarchical clustering.
17 Feb 2017
Principal components analysis (PCA) reduces the dimensionality of our data, allowing most of the variability to be explained using fewer variables than the original data set. This allows us to understand the primary features that can best represent our data. Check out the latest tutorial which covers PCA