Introduction to R

1. Getting Started with R
R can only be used after installation, which fortunately is just as simple as installing any other program. In this lesson, you learn about where to download R, how to decide on the best version, how to install it, and you get familiar with its environment, using RStudio as a front end. We also take a look at the package system.

2. The Basic Building Blocks in R
R is a flexible and robust programming language, and using it requires understanding how it handles data. We learn about performing basic math in R, storing various types of data in variables—such as numeric, integer, character, and time-based—and calling functions on the data.
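
As a small taste of what this lesson covers, the sketch below uses only base R. The variable names and values are illustrative assumptions, not taken from the course materials.

```r
# Basic arithmetic follows the usual order of operations
result <- 1 + 2 * 3                 # multiplication happens before addition

# Variables holding different types of data
x    <- 2.5                         # numeric
n    <- 5L                          # integer (the L suffix requests an integer)
name <- "data"                      # character
d    <- as.Date("2012-06-28")       # time-based (a Date)

# class() reports how R stores each value
types <- c(class(x), class(n), class(name), class(d))

# Calling a function on data
avg <- mean(c(1, 2, 3, 4))
```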

3. Advanced Data Structures in R
Like many other languages, R offers more complex storage mechanisms such as vectors, arrays, matrices, and lists. We take a look at those and the data frame, a special storage type that strongly resembles a spreadsheet and is part of what makes working with data in R such a pleasure.
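
The structures named above can be sketched in a few lines of base R. The object names and values here are made up for illustration.

```r
# A vector: an ordered collection of values of one type
v <- c(10, 20, 30)

# A matrix: a two-dimensional structure, filled column by column
m <- matrix(1:6, nrow = 2)

# A list: can mix elements of different types and lengths
l <- list(numbers = v, label = "a")

# A data frame: columns of equal length, much like a spreadsheet
df <- data.frame(id = 1:3, score = v)

# Columns are accessed by name, cells by position
second_score <- df$score[2]
```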

4. Reading Data into R
Data is abundant in the world, so analyzing it is just a matter of getting the data into R. There are many ways of doing so, the most common being reading from a CSV file or database. We cover these techniques, and also importing from other statistical tools, scraping websites, and reading Excel files.
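
For the most common case, a CSV file, a minimal round trip in base R might look like the following. The file is a temporary one created just for this sketch, and the column names are assumptions.

```r
# Write a small data frame to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, score = c(90, 85)), path, row.names = FALSE)

# Read it back; read.csv() returns a data frame
scores <- read.csv(path)

# Clean up the temporary file
unlink(path)
```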

5. Making Statistical Graphs
Visualizing data is a crucial part of data science, both in the discovery phase and when reporting results. R has long been known for its capability to produce compelling plots, and Hadley Wickham’s ggplot2 package makes it even easier to produce better-looking graphics. We cover histograms, scatterplots, boxplots, line charts, and more, in both base graphics and ggplot2, and then explore the newer packages ggvis and rCharts.

6. Basics of Programming
R has all the standard components of a programming language, such as functions, if statements, and loops, all with their own caveats and quirks. We start with the requisite “Hello, World!” function and learn about arguments to functions, the regular if statement and the vectorized version, how to build loops, and why they should be avoided.
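
The pieces mentioned above fit in a short base R sketch. The function name say_hello and the sample vector are hypothetical, chosen only for illustration.

```r
# A "Hello, World!" function with a default argument
say_hello <- function(name = "World") {
  paste0("Hello, ", name, "!")
}
greeting <- say_hello()

x <- c(-1, 4, 9)

# The regular if statement tests a single condition
sign_msg <- if (x[1] < 0) "negative" else "non-negative"

# ifelse() is the vectorized version: it tests every element at once
labels <- ifelse(x < 0, "negative", "non-negative")

# A for loop works, but the vectorized form is shorter and usually faster
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}
squares_vec <- x^2
```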

7. Data Munging
Data scientists often bemoan that 80% of their work is manipulating data. As such, R has many tools for this, which are, contrary to what Python users may say, easy to use. We see how R excels at group operations using apply, lapply, and the plyr package. We also take a look at its facilities for joining, combining, and rearranging data. Then we speed that up with tidyr, data.table, and dplyr.
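
To keep the sketch free of extra packages, the example below sticks to base R: apply and lapply for group operations, aggregate for a grouped summary, and merge for a join. The data frames are invented for illustration.

```r
# apply() runs a function over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)

# lapply() runs a function over each element of a list and returns a list
lens <- lapply(list(a = 1:3, b = 1:5), length)

# A grouped summary: total amount per region
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(10, 20, 30, 40)
)
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# A join: attach each region's manager to its total
managers <- data.frame(region = c("east", "west"), manager = c("Ann", "Bob"))
joined <- merge(totals, managers, by = "region")
```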

8. In-Depth with dplyr
dplyr has become such an indispensable tool, nearly superseding plyr, that it is worth devoting extra attention to. So, we examine its select, filter, mutate, group_by, and summarize functions, among others.

9. Manipulating Strings
Text data is becoming more pervasive in the world, and fortunately, R provides ways for both combining text and ripping it apart, which we walk through. We also examine R’s extensive regular expression capabilities.
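
A brief base R sketch of combining text, splitting it, and matching it with regular expressions follows. The sample strings are assumptions made for the example.

```r
# Combining text
full <- paste("Hello", "World", sep = ", ")

# Ripping text apart: strsplit() returns a list, one element per input string
parts <- strsplit("a-b-c", split = "-")[[1]]

# Regular expressions: find the strings that end in digits
ids <- c("item12", "item7", "other")
has_digits <- grepl("[0-9]+$", ids)

# Strip the leading letters to keep only the numeric suffix
digits <- sub("^[a-z]+", "", ids[has_digits])
```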

10. Reports and Slideshows with knitr
Successfully delivering the results of an analysis can be just as important as the analysis itself, so it is important to communicate them in an effective way. In this lesson, we learn how to use knitr and rmarkdown to write both static and interactive results in the form of PDF documents, websites, HTML5 slideshows, and even Word documents.

11. Include HTML Widgets in HTML Documents
Recent years have seen the advance of JavaScript-powered displays of information, and the htmlwidgets package enables R to take advantage of arbitrary JavaScript libraries. In particular, we look at datatable for a tabular display of data, bokeh for rich web plots, and leaflet for powerful mapping.

12. Shiny
Built by RStudio, Shiny is a tool for building interactive data displays and dashboards all within R. This allows the R programmer to convey results in a compelling, user-rich experience in a language he or she is familiar with.

13. Package Building
Building packages is a great way to contribute back to the R community, and doing so has never been easier thanks to Hadley Wickham’s devtools package. This lesson covers all the requirements for a package and how to go about authoring, testing, and distributing it.

14. Rcpp for Faster Code
Sometimes pure R code is not fast enough, and extra speed is required. Rcpp enables R programmers to seamlessly integrate C++ code into their R code. We go over the basics of getting the two languages working together, write some speedy functions in C++, and even integrate C++ into R packages.

15. Basic Statistics
Naturally, R has all the basics when it comes to statistics such as means, variance, correlation, t-tests, and ANOVAs. We look at all the different ways those can be computed.
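
A minimal sketch of these basics, using only functions that ship with R, is shown below. The vectors are made up, and mtcars is one of R's built-in example data sets.

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)

avg  <- mean(x)    # arithmetic mean
s2   <- var(x)     # sample variance (divides by n - 1)
rho  <- cor(x, y)  # Pearson correlation

# One-sample t-test of whether the mean of x differs from 5
tt <- t.test(x, mu = 5)

# One-way ANOVA: does mpg differ across cylinder counts?
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
aov_table <- summary(fit_aov)
```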

16. Linear Models
The workhorse of statistics is regression and its extensions. These include linear models, generalized linear models (including logistic and Poisson regression), and survival models. We look at how to fit these models in R and how to evaluate them using measures such as mean squared error, deviance, and AIC.
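
As a rough sketch of fitting and evaluating such models, the example below uses lm and glm on R's built-in mtcars data. The choice of variables is an assumption made purely for illustration.

```r
# Linear model: fuel economy as a function of weight
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Mean squared error of the fitted values
mse <- mean(residuals(fit_lm)^2)

# Logistic regression: transmission type (0/1) as a function of weight
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Deviance and AIC are common measures for comparing such models
dev_logit <- deviance(fit_logit)
aic_logit <- AIC(fit_logit)
```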

17. Other Models
Beyond regression there are many other types of models that can be fit to data. Models covered include regularization with the elastic net, Bayesian shrinkage, nonlinear models such as nonlinear least squares, splines and generalized additive models, decision trees, and random forests.

18. Time Series
Special care must be taken with data where there is time-based correlation, otherwise known as autocorrelation. We look at some common methods for dealing with time series such as ARIMA, VAR, and GARCH.

19. Clustering
A focal point of modern machine learning is clustering, the partitioning of data into groups. We explore three popular methods: K-means, K-medoids, and hierarchical clustering.
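
Two of these methods, K-means and hierarchical clustering, are available in base R and can be sketched as follows. K-medoids is left out because it requires an add-on package, and the simulated points are an assumption for the example.

```r
set.seed(42)

# Two well-separated groups of ten points each in two dimensions
pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# K-means with two clusters; nstart tries several random starts
km <- kmeans(pts, centers = 2, nstart = 10)

# Hierarchical clustering on pairwise distances, cut into two groups
hc <- hclust(dist(pts))
groups <- cutree(hc, k = 2)
```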

20. More Machine Learning
Two areas seeing increasing interest are recommendation engines and text mining, which we illustrate with RecommenderLab, RTextTools, and the irlba package for fast matrix factorization.

21. Network Analysis
The world is rich with network data that are nicely studied with graphical models. We show you how to analyze and visualize networks using the igraph package.

22. Automatic Parameter Tuning with Caret
Machine learning models often have parameters that need tuning, which can significantly affect the quality of the model. The Caret package, by Max Kuhn, makes finding optimal parameter values easy.

23. Fit a Bayesian Model with RStan
Bayesian data analysis uses simulations to fit both simple and complex models. Andrew Gelman’s new language, Stan, makes this faster and easier than ever before. We explore this by fitting a simple linear regression and a varying-intercept multilevel model.