Time Series Analysis in R Part 3: Getting data from Quandl

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here:

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use, including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:
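The installation code isn’t shown in this copy of the post; it would have looked something like:

```r
# Install the Quandl package from CRAN (one-time), then load it
install.packages("Quandl")
library(Quandl)
```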

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.
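The original code block is missing here. Based on the description, the call likely resembled the sketch below; the series code "YALE/NHPI" (Shiller’s nominal home price index) is my assumption, not taken from the post:

```r
library(Quandl)

# "YALE/NHPI" is an assumed code for Shiller's home price index
hpi <- Quandl("YALE/NHPI",
              type = "ts",                 # import as a base R ts object
              start_date = "1990-01-01")   # drop everything before Jan 1990
# end_date omitted so we get the most recent data available
```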

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.
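The exact series codes aren’t shown in this copy; the codes below are illustrative guesses for the three series described, not the post’s originals:

```r
library(Quandl)

# All three codes are assumptions based on the series descriptions
gdp  <- Quandl("FRED/GDP", type = "ts")      # US GDP, quarterly
pinc <- Quandl("FRED/PI", type = "ts")       # US personal income
umcs <- Quandl("UMICH/SOC43", type = "ts")   # UMich survey: selling conditions for houses
```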

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.
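A sketch of collapsing on import (the "FRED/GDP" code is an assumption):

```r
library(Quandl)

# Collapse the quarterly GDP series to annual frequency as it is imported
gdp_annual <- Quandl("FRED/GDP", type = "ts", collapse = "annual")
```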

We can also transform our data on the fly as it’s imported. The Quandl function has an argument, transform, that allows us to specify the type of data transformation we want to perform. There are five options – “diff”, “rdiff”, “normalize”, “cumul”, and “rdiff_from”. Specifying the transform argument as “diff” returns the simple difference, “rdiff” yields the percentage change, “normalize” gives an index where each value is that value divided by the first period value and multiplied by 100, “cumul” gives the cumulative sum, and “rdiff_from” gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:
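A sketch, again assuming the "FRED/GDP" code:

```r
library(Quandl)

# Import GDP as the percent change from the prior period
gdp_pch <- Quandl("FRED/GDP", type = "ts", transform = "rdiff")
```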

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.


Time Series Analysis in R Part 2: Time Series Transformation Functions

In Part 1 of this series, we got started by looking at the ts object in R and how it represents time series data. In Part 2, I’ll discuss some of the many time series transformation functions that are available in R. This is by no means an exhaustive catalog. If you feel I left out anything important, please let me know. I plan on keeping this updated on a regular basis.

Often in time series analysis and modeling, we will want to transform data. There are a number of different functions that can be used to transform time series data such as the difference, log, moving average, percent change, lag, or cumulative sum. These types of functions are useful for both visualizing time series data and for modeling time series. For example, the moving average function can be used to more easily visualize a high-variance time series and is also a critical part of the ARIMA family of models. Functions such as the difference, percent change, and log difference are helpful for making non-stationary data stationary.

Let’s start with a very common operation. To take a lag we use the lag() function from the stats package. It takes a ts object and a value for the lag argument.

Notice that when we do this, tseries_lag1 is equivalent to tseries offset by 1, such that the value for 2000q1 in tseries become the value for 1999q4 in tseries_lag1 and so on. We can do this for any lag of our choosing. Let’s try 3 instead:
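The code isn’t shown in this copy; a self-contained sketch, recreating a quarterly series like the one from Part 1:

```r
set.seed(42)
# A quarterly series like the one built in Part 1
tseries <- ts(seq(1, 100) + rnorm(100, sd = 5), start = c(2000, 1), frequency = 4)

tseries_lag1 <- stats::lag(tseries, k = 1)  # value for 2000 Q1 now appears at 1999 Q4
tseries_lag3 <- stats::lag(tseries, k = 3)  # shift back three quarters instead
```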

We can also do this in the opposite direction using a negative value for the lag argument, effectively creating a lead series instead:

Notice that tseries_lead1 now leads the original tseries by one time period.

We can take differences of a time series as well. This is equivalent to taking the difference between each value and its previous value:

The difference can be calculated over more than one lag. To do so, simply alter the value of the lag argument.
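A sketch of both differencing calls:

```r
set.seed(42)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

tseries_diff1 <- diff(tseries)           # first difference: x_t - x_{t-1}
tseries_diff4 <- diff(tseries, lag = 4)  # difference over 4 periods: x_t - x_{t-4}
```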

Some time series transformation functions are useful for series that get larger over time or series whose variance gets larger over time. These range from the basic logarithm function to the Box-Cox group of transformations (of which the natural logarithm is a special case). We can take the log of a time series using the log function in the same way that we would take the log of a vector. Let’s generate a time series that has increasing variance:

As we’ll discuss in more detail, the prevalence of increasing, or more generally non-constant, variance is called heteroskedasticity and can cause problems in linear regression. Often, it will need to be corrected before modeling. One way to do this is taking a log:
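The generated series isn’t shown in this copy; the sketch below builds one whose variance grows with its level, then logs it:

```r
set.seed(42)
# A series whose swings grow as its level grows
tseries_h <- ts(seq(1, 100) * (1 + rnorm(100, sd = 0.2)),
                start = c(2000, 1), frequency = 4)

tseries_log <- log(tseries_h)  # the log compresses the growing variance
plot(tseries_log)
```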

There are other more advanced ways of transforming data, one of which is the Box-Cox transformation, which allows us a bit more control over the transformation. The Box-Cox transformation takes the form (Hyndman and Athanasopoulos, 2013):

w_t = \begin{cases} \log{ y_t }, \text{ if } \lambda = 0; \\ \frac{ ({y_t}^{\lambda} - 1) }{ \lambda },\text{ otherwise } \\ \end{cases}

In order to demonstrate the Box-Cox transformation, I’ll introduce Rob Hyndman’s forecast package:

It has two functions that are of use here. The primary function is BoxCox(), which will return a transformed time series given a time series and a value for the parameter lambda:

Notice that this value of lambda here does not entirely take care of the heteroskedasticity problem. We can experiment with different values of lambda, or we can use the BoxCox.lambda() function, which will provide us an optimal value for parameter lambda:

The BoxCox.lambda() function has chosen the value 0.055. If we then use this value in our BoxCox() function, it returns a time series that appears to have constant variance.
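A sketch of both functions; note that on the simulated series below, BoxCox.lambda() will choose a different value than the 0.055 found on the post’s original data:

```r
library(forecast)

set.seed(42)
tseries_h <- ts(seq(1, 100) * (1 + rnorm(100, sd = 0.2)),
                start = c(2000, 1), frequency = 4)

tseries_bc  <- BoxCox(tseries_h, lambda = 0.5)  # hand-picked lambda
lambda_opt  <- BoxCox.lambda(tseries_h)         # data-driven choice of lambda
tseries_bc2 <- BoxCox(tseries_h, lambda = lambda_opt)
```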

Another common calculation that we may want to perform on time series is the percent change from one period to another. Because it seems that R does not have a function for performing this calculation, we can do it ourselves.

We can turn this into a function so that we can easily calculate percent change in the future. We’ll add an argument to specify the number of periods over which we want to calculate the change and set it to 1 by default. Additionally, we’ll add some argument verification.

Now if we wished to calculate the percentage change over more than one period we can do so. Let’s say we wanted to calculate the year over year percent change. We can call our pch function with a lag of 4.
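The original function definition is missing from this copy; a sketch of what the pch() helper described above might look like, including basic argument verification:

```r
# Percent change over `lag` periods, defaulting to 1
pch <- function(x, lag = 1) {
  stopifnot(is.ts(x), lag >= 1, lag == round(lag))  # argument verification
  # stats::lag(x, k = -lag) aligns x_{t-lag} with time t
  (x - stats::lag(x, k = -lag)) / stats::lag(x, k = -lag)
}

set.seed(42)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

tseries_pch1 <- pch(tseries)           # quarter-over-quarter change
tseries_pch4 <- pch(tseries, lag = 4)  # year-over-year change
```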

Two of the functions that we have discussed so far, the difference and the log, are often combined in time series analysis. The log difference function is useful for making non-stationary data stationary and has some other useful properties. We can calculate the log difference in R by simply combining the log() and diff() functions.
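For example:

```r
set.seed(42)
tseries <- ts(seq(1, 100) * (1 + rnorm(100, sd = 0.2)),
              start = c(2000, 1), frequency = 4)

# Log difference (log return): log(x_t) - log(x_{t-1})
tseries_logdiff <- diff(log(tseries))
```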

Notice that after taking the log return, tseries appears to be stationary. We see some higher than normal volatility in the beginning of the series. This is due largely to the fact that the series levels start off so small. This leads into a nice property of the log return function, which is that it is a close approximation to the percent change:

This similarity is only approximate. The relationship does break down somewhat when the percent change from one period to the next is particularly large. You can read a good discussion of this topic here.

A moving average is another essential function for working with time series. For series with particularly high volatility, a moving average can help us to more clearly visualize its trend. Unfortunately, base R does not (to my knowledge) have a convenient function for calculating the moving average of a time series directly. However, we can use base R’s filter() function, which allows us to perform general linear filtering. We can set up the parameters of this function to be a moving average (Shumway and Stoffer, 2011). Here we apply the filter() function to tseries to create a 5 period moving average. The filter argument lets us specify the filter, which in this case is just a weighted average of 5 observations. The sides argument allows us to specify whether we want to apply the filter over past values (sides = 1), or to both past and future values (sides = 2). Here I set sides to 1 indicating that I want a trailing moving average:
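A sketch of the trailing moving average via a linear filter:

```r
set.seed(42)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

# 5-period trailing moving average: equal weights, past values only
tseries_ma5 <- stats::filter(tseries, filter = rep(1/5, 5), sides = 1)
```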

The fact that we have to define the linear filter each time we use this function makes it a little cumbersome to use. If we don’t mind introducing a dependency to our code, we could use the SMA() function from the TTR package or the ma() function from the forecast package. The SMA() function takes a ts object and a value for n – the window over which we want to calculate the moving average.

The ma() function from the forecast package also performs moving average calculations. We supply the time series and a value for the order argument.
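A sketch of both package-based alternatives:

```r
library(TTR)
library(forecast)

set.seed(42)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

tseries_sma5   <- SMA(tseries, n = 5)     # trailing 5-period average (like filter above)
tseries_ma5for <- ma(tseries, order = 5)  # centered 5-period average
```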

Let’s compare the results. The SMA() function returns a trailing moving average where each value is the mean of the n most recent trailing values. This is equivalent to the results we get from using the filter() function. The ma() function from the forecast package returns a centered moving average. In this case tseries_ma5for is equal to the average of the current observation, the previous two observations, and the next two observations. Which one you use would depend on your application.

That’s it for Part 2. I don’t know about you, but I am sick of working with generated data. I want to work with some real world data. In Part 3 we’ll do just that using Quandl. See you then.



Hyndman, Rob J. and George Athanasopoulos (2013) Forecasting: Principles and Practice. https://www.otexts.org/fpp

Shumway, Robert H. and David S. Stoffer (2011) Time Series Analysis and Its Applications With R Examples. New York, NY: Springer.

Time Series Analysis in R Part 1: The Time Series Object

Many of the methods used in time series analysis and forecasting have been around for quite some time but have taken a back seat to machine learning techniques in recent years. Nevertheless, time series analysis and forecasting are useful tools in any data scientist’s toolkit. Some time series-based competitions have recently appeared on Kaggle, such as one hosted by Wikipedia where competitors are asked to forecast web traffic to various pages of the site. As an economist, I have been working with time series data for many years; however, I was largely unfamiliar with (and a bit overwhelmed by) R’s functions and packages for working with them. From the base ts objects to a whole host of other packages like xts, zoo, TTR, forecast, quantmod and tidyquant, R has a large infrastructure supporting time series analysis. I decided to put together a guide for myself in Rmarkdown. I plan on sharing this as I go in a series of blog posts. In part 1, I’ll discuss the fundamental object in R – the ts object.

The Time Series Object

In order to begin working with time series data and forecasting in R, you must first acquaint yourself with R’s ts object. The ts object is a part of base R. Other packages such as xts and zoo provide other APIs for manipulating time series objects. I’ll cover those in a later part of this guide.

Here we create a vector of simulated data that could potentially represent some real-world time-based data generation process. It is simply a sequence from 1 to 100 with some random normal noise added to it. We can use R’s base plot() function to see what it looks like:
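The generation code isn’t shown in this copy; a sketch:

```r
set.seed(42)
# A sequence from 1 to 100 with random normal noise added
t <- seq(1, 100) + rnorm(100, sd = 5)
plot(t)
```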

This could potentially represent some time series, with time represented along the x-axis. However, it’s hard to tell. The x-axis is simply an index from 1 to 100 in this case.

A vector object such as t above can easily be converted to a time series object using the ts() function. The ts() function takes several arguments, the first of which, x, is the data itself. We can look at all of the arguments of ts() using the args() function:

To begin, we will focus on the first four arguments – data, start, end and frequency. The data argument is the data itself (a vector or matrix). The start and end arguments allow us to provide a start date and end date for the series. Finally the frequency argument lets us specify the number of observations per unit of time. For example, if we had monthly data, we would use 12 for the frequency argument, indicating that there are 12 months in the year.

Let’s assume our generated data is quarterly data that starts in the first quarter of 2000. We would turn it into a ts object as below. We specify the start argument as a two element vector. The first element is the year and the second element is the observation of that year in which the data start. Because our data is quarterly, we use 4 for the frequency argument.
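A self-contained sketch of the conversion:

```r
set.seed(42)
t <- seq(1, 100) + rnorm(100, sd = 5)

# Quarterly data starting in 2000 Q1: start = c(year, quarter), frequency = 4
tseries <- ts(t, start = c(2000, 1), frequency = 4)
plot(tseries)
```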

Notice that now when we plot the data, R recognizes that it is a ts object and plots the data as a line with dates along the x-axis.

Aside from creating ts objects containing a single series of data, we can also create ts objects that contain multiple series. We can do this by passing a matrix rather than a vector to the x argument of ts().

Now when we plot the ts object, R automatically facets the plot.

At this point, I should mention what really happens when we call the plot() function on a ts object. R recognizes when the x argument is a ts object and actually calls the plot.ts() function under the hood. We can verify this by using it directly. Notice that it produces an identical graph.

The plot.ts() function has different arguments geared towards time series objects. We can look at these again using the args() function.

Notice that it has an argument called plot.type that lets us indicate whether we want our plot to be faceted (multiple) or single-panel (single). Although in the case of our data above we would not want to plot all three series on the same panel given the difference in scale for ts3, it can be done quite easily.

I am not going to go in-depth into using R’s base plotting capability. Although it is perfectly fine, I strongly prefer to use ggplot2 as well as the ggplot-based graphing functions available in Rob Hyndman’s forecast package. We will discuss these in parts 2 and 3.

Convenience Functions for Time Series

There are several useful functions for use with ts objects that can make programming easier. These are window(), start(), end(), and frequency(). These are fairly self-explanatory. The window function is a quick and easy way to obtain a slice of a time series object. For example, look again at our object tseries. Assume that we wanted only the data from the first quarter of 2000 to the last quarter of 2012. We can do so using window():
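A sketch of slicing with window():

```r
set.seed(42)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

# Keep only 2000 Q1 through 2012 Q4
tseries_sub <- window(tseries, start = c(2000, 1), end = c(2012, 4))
```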

The start() function returns the start date of a ts object, end() gives the end date, and frequency() returns the frequency of a given time series:
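For example, on a 100-observation quarterly series starting in 2000 Q1:

```r
tseries <- ts(seq(1, 100), start = c(2000, 1), frequency = 4)

start(tseries)      # 2000 1
end(tseries)        # 2024 4
frequency(tseries)  # 4
```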

That’s all for now. In Part 2, we’ll dive into some of the many transformation functions for working with time series in R. See you then.

Linear Regression from Scratch in Python

In my last post I demonstrated how to obtain linear regression parameter estimates in R using only matrices and linear algebra. Using the well-known Boston data set of housing characteristics, I calculated ordinary least-squares parameter estimates using the closed-form solution. In this post I’ll explore how to do the same thing in Python using numpy arrays and then compare our estimates to those obtained using the OLS function from the statsmodels package.

First, let’s import the modules and functions we’ll need. We’ll use numpy for matrix and linear algebra. In the last post, we obtained the Boston housing data set from R’s MASS library. In Python, we can find the same data set in the scikit-learn module.

Next, we can load the Boston data using the load_boston function. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 tracts in the city of Boston, drawn from the 1970 census, as well as the median home price in each of the tracts.

Now we’re ready to start. Recall from my previous post that linear regression typically takes the form:

y = X\beta + \epsilon

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \epsilon is the error term; it represents features that affect the response, but are not explicitly included in our model.

From the boston object, we will extract the features, which are conveniently already a numpy array, and assign them to X. Our target variable is the median home value for each tract. We assign the target to the variable y.

If we examine the features, we can see that X is the expected shape.

We can also look at the feature names.

If you would like to see more details about these features and the data set, you can view the DESCR attribute of the boston  object.

We will need to add a vector of ones to our feature matrix for the intercept term. We can do this easily in numpy and then add it as the first column in our feature matrix.
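The code isn’t shown in this copy; a sketch using a small stand-in array (the same calls apply to the (506, 13) Boston feature matrix):

```python
import numpy as np

# Toy stand-in for the Boston feature matrix; any 2-D array works the same way
X = np.array([[6.5, 15.3],
              [5.9, 17.1],
              [7.2, 12.8]])

# Prepend a column of ones for the intercept term
ones = np.ones((X.shape[0], 1))
X = np.hstack((ones, X))
print(X.shape)  # (3, 3)
```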

With the preparatory work out of the way, we can now implement the closed-form solution to obtain OLS parameter estimates.

\beta = (X^TX)^{-1}X^Ty

We do this in Python using the numpy arrays we just created, the inv() function, and the transpose() and dot() methods.
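A self-contained sketch of the closed form; synthetic data with known coefficients stands in for the Boston arrays so we can see the estimates recover the truth:

```python
import numpy as np
from numpy.linalg import inv

# Synthetic data with known true coefficients (stand-in for the Boston arrays)
rng = np.random.default_rng(0)
X = np.hstack((np.ones((200, 1)), rng.normal(size=(200, 2))))
true_beta = np.array([1.0, 2.0, -3.0])
y = X.dot(true_beta) + rng.normal(scale=0.1, size=200)

# Closed-form OLS: beta = (X'X)^{-1} X'y
coeffs = inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
print(coeffs)  # close to [1, 2, -3]
```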

Let’s examine them to see if they make sense. Here I convert the coeffs  array to a pandas DataFrame and add the feature names as an index.

Now we have our parameter vector β. These are the linear relationships between the median home value and each of the features. For example, we see that an increase of one unit in the number of rooms (RM) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘CHAS’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘CHAS’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

Now we’ll estimate the same model using the OLS function from statsmodels and assign the results to coeffs_lm.

Last of all, we place our newly-estimated parameters next to our original ones in the results  DataFrame and compare.

Once again, our results are identical. Now you know how these estimates are obtained using the closed-form solution. Like I mentioned in my R post on the same topic, you’d never implement linear regression in this way. You would use the OLS function from statsmodels or the LinearRegression function from the scikit-learn package. These functions are very quick, require very little code, and provide us with a number of diagnostic statistics, including R^2, t-statistics, and p-values.

I have made the code from this post available at my Github here.

Linear Regression from Scratch in R

One of the very first learning algorithms that you’ll encounter when studying data science and machine learning is least squares linear regression. Linear regression is one of the easiest learning algorithms to understand; it’s suitable for a wide array of problems, and is already implemented in many programming languages. Most users are familiar with the lm() function in R, which allows us to perform linear regression quickly and easily. But one drawback to the lm() function is that it takes care of the computations to obtain parameter estimates (and many diagnostic statistics, as well) on its own, leaving the user out of the equation. So, how does the lm() function obtain these parameter estimates?

In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the lm() function is of course much more efficient).

For this example, I’ll use the well-known “Boston” data set from the MASS library. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 tracts in the city of Boston, drawn from the 1970 census. Our target variable will be the median home value for each tract — the column titled ‘medv.’ Let’s get an idea of what this data set looks like:

To learn more about the definition of each variable, type  help(Boston) into your R console.

Now we’re ready to start. Linear regression typically takes the form

y = X\beta + \epsilon

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \epsilon is the error term; it represents features that affect the response, but are not explicitly included in our model.
The first thing we will need is a vector of our response variable, typically called ‘y’. In this case, our response variable is the median home value in thousands of US dollars: ‘medv’:

Next, we’ll need our ‘X’ or ‘design’ matrix of the input features. We can do this with the as.matrix() function, grabbing all the variables except ‘medv’ from the Boston data frame. We’ll also need to add a column of ones to our X matrix for the intercept term.
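The code isn’t shown in this copy; a sketch of building both the response vector and the design matrix:

```r
library(MASS)

# Response vector: median home value in thousands of dollars
y <- Boston$medv

# Design matrix: all features except medv, with an intercept column of ones
X <- as.matrix(Boston[, names(Boston) != "medv"])
X <- cbind(Intercept = 1, X)
```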

Now that we have our response vector and our ‘X’ matrix, we can use them to solve for the parameters using the following closed form solution:

\beta = (X^TX)^{-1}X^Ty

The derivation of this equation is beyond the scope of this post. If you are interested in the derivation, a good place to start is Wikipedia or any good undergraduate textbook on linear regression.

We can implement this in R using our ‘X’ matrix and ‘y’ vector. The %*%  operator is simply matrix multiplication. The t() function takes the transpose of a matrix, and solve()  calculates the inverse of any (invertible) matrix.
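A self-contained sketch of the closed-form estimate:

```r
library(MASS)

y <- Boston$medv
X <- cbind(Intercept = 1, as.matrix(Boston[, names(Boston) != "medv"]))

# Closed-form OLS: beta = (X'X)^{-1} X'y
betas <- solve(t(X) %*% X) %*% t(X) %*% y
round(betas, 4)
```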

Now we have our parameter vector β. These are the linear relationships between the response variable ‘medv’ and each of the features. For example, we see that an increase of one unit in the number of rooms (rm) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘chas’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘chas’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

Now let’s verify that these results are in fact correct. We want to compare our results to those produced by the lm()  function. Most of you will already know how to do this.

The parameter estimates are identical! Now you know how these estimates are obtained. Keep in mind that you’re unlikely to favor implementing linear regression in this way over using lm(). The lm() function is very quick, and requires very little code. Using it provides us with a number of diagnostic statistics, including R^2, t-statistics, and the oft-maligned p-values, among others. Nevertheless, I wanted to provide a window into what lm() does behind the scenes.

There are other ways to solve for the parameters in linear regression. For example, gradient descent can be used to obtain parameter estimates when the number of features is extremely large, a situation that can drastically slow solution time when using the closed-form method. Stochastic gradient descent is often used when both the number of features and the number of samples are very large. I hope to detail these in a future post.

I have made the code from this post available at my Github here. Cheers!

Extracting Tables from PDFs in R using the Tabulizer Package

Recently I wanted to extract a table from a pdf file so that I could work with the table in R. Specifically, I wanted to get data on layoffs in California from the California Employment Development Department. The EDD publishes a list of all of the layoffs in the state that fall under the WARN act here. Unfortunately, the tables are available only in pdf format. I wanted an interactive version of the data that I could work with in R and export to a csv file. Fortunately, the tabulizer package in R makes this a cinch. In this post, I will use this scenario as a working example to show how to extract data from a pdf file using the tabulizer package in R.

The link to the pdf gets updated often, so here I’ve provided the pdf as downloaded from the site on November 29, 2016:



First, we will need to load the tabulizer package as well as dplyr.

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.
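The original call isn’t shown in this copy; a sketch, with a placeholder URL since the real WARN-report link changes often:

```r
library(tabulizer)

# Placeholder path: the real EDD WARN-report URL is updated frequently
location <- "https://example.com/warn-report.pdf"

# guess = TRUE and method = "matrix" are the defaults described above
out <- extract_tables(location)
```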

Now we have a list object called out, with each element a matrix representation of a page of the pdf table. We want to combine these into a single data matrix containing all of the data. We can do so most elegantly by combining do.call and rbind, passing it our list of matrices. Notice that I am excluding the last page here. The final page is the totals and summary information. We don’t need that.
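A sketch of the stacking step; the small stand-in list below takes the place of the real per-page matrices:

```r
# 'out' is the list of per-page matrices returned by extract_tables();
# a stand-in list is used here for illustration
out <- list(matrix(1:4, 2), matrix(5:8, 2), matrix(9:12, 2))

# Stack all pages except the last (the totals/summary page) into one matrix
final <- do.call(rbind, out[-length(out)])
```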

After doing so, the first three rows of the matrix contain the headers, which have not been formatted well since they take up multiple rows of the pdf table. Let’s fix that. Here I turn the matrix into a data.frame dropping the first three rows. Then I create a character vector containing the formatted headers and use that as the column names.

We now have a data.frame of all of the California layoffs. A quick glance at the first few rows:

In order to manipulate the data properly, we will probably want to change the date column to a Date object as well as convert the No.of.Employees column to numeric. Here I do so using dplyr.
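A sketch of the conversion; the Notice.Date column name and the date format are assumptions, while No.of.Employees is named in the text above:

```r
library(dplyr)

# Stand-in frame with the two columns described above
caWarn <- data.frame(Notice.Date = c("11/01/2016", "11/15/2016"),
                     No.of.Employees = c("120", "45"),
                     stringsAsFactors = FALSE)

caWarn <- caWarn %>%
  mutate(Notice.Date = as.Date(Notice.Date, format = "%m/%d/%Y"),
         No.of.Employees = as.numeric(No.of.Employees))
```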

Last of all, I finish up by writing the final table to csv so that I can load it for later use.

I have found the tabulizer package to be wonderfully easy to use. Much of the process of extracting the data and tables from pdfs is abstracted away from the user. This was a very simple example, however if one requires more finely-tuned control of how tables are extracted, the extract_tables function has a lot of additional arguments to tweak to one’s liking. I encourage you to take a look for yourself.

You can find the code for this post on my Github.

Completing the Coursera Data Science Specialization

First of all, I’d like to apologize to anyone who read my last blog post. In it, I mentioned that I would write several follow-up posts in which I would go through the entire modelling process in the Kaggle Animal Shelter competition. Unfortunately, life got in the way and I was unable to complete these posts before the competition ended. Fortunately, I have been hard at work on something that I am now able to share.

Some of you that know me know that for the past year and a half, I have been working on completing all ten courses of the Coursera Johns Hopkins Data Science Specialization. I finally completed it earlier this month. In the final course of the specialization, we are tasked with creating an application that can perform text prediction. Many of you who have smartphones are very familiar with this type of application. When typing a text message for example, your phone may provide you with several options for the next word before you even type it.

I am not going to write a long post about how I did this. I simply wanted to share my final application with you. You can find it here or by clicking the picture below:

text prediction app

If you are interested in how it was done, I put together a few brief slides on my methodology, which you can see below. I plan on eventually releasing my code on github after I make some additional improvements.

Special thanks to Adam Danz, Brandon Danz, Bo Bucklen, Chad McComsey, Anthony Mead, Arthur Mead, and Gabbie Nirenburg for doing some QA testing and providing feedback.

Saving the Kittehs (and Dogies too) with Data (Part 1)

A couple months ago, Kaggle started a new knowledge competition on its site called ‘Animal Shelter Outcomes.’ This is an educational competition, which features no prize pool. The data, from the Austin Animal Center, contains attributes and outcomes of all the cats and dogs that went through the system between October of 2013 and March of 2016. Using the supplied features, competitors are asked to build a model to predict outcomes, which include Adoption, Euthanasia, Died, Return to Owner, and Transfer.


Since I am a big fan of using data science for good rather than helping corporations make more money from our personal information, I found this competition particularly appealing. I decided that I would give it a try and blog about my process along the way (Remember, I’m learning too!). You may have read my previous Kaggle blog post where we built a model to classify crime in San Francisco. This time however, I will give a much more thorough treatment to the modeling process. We will begin by exploring the data to gain some insights in Parts 1 and 2 as well as building some new features. In later posts, we will train several tree-based models starting with a simple decision tree and following with a random forest and a gradient-boosted tree. Let’s get started!


We begin by downloading the training data from the competition page here. In today’s post, we will only be using the training data, but we will need test data later when we are finished building our models. Once the data are saved to your directory of choice, we open it up in R. Notice that the data are in gzip format and that R can read the data directly with no need to unzip it.

Let’s take a look at the data using head():

There are a number of features that will likely be useful for predicting outcomes. We have information on breed, age, sex, and color. In today’s post we will process and explore the age data. I am guessing that age will likely have some predictive power. For example, older animals may be less likely to be adopted than younger animals. Let’s find out by making boxplots of the age distribution by outcome type. Before we can do that, notice that there are several problems with the age data that we will need to address. First, it is a string rather than just a numeric value. Second, it is in different units, with some ages measured in weeks, some in months, and some in years. Let’s fix that.

We are going to get rid of the unit labels and standardize all the age values to years. First we create a named vector called unitMap that maps the various unit labels to a divisor which we will use to translate all the age values to years. Obviously the data labelled ‘year’ or ‘years’ just stays the same, so we divide by 1. Next we split the AgeuponOutcome column on the space character and extract the second element (the unit label), which we then use to index the unitMap vector. Then we create a new column called AgeuponOutcomeNorm which is the first element returned by splitting the AgeuponOutcome column (the value) divided by unitDiv. Finally we take care of some missing values by just plugging in the median. Sorry, that was a mouthful. If you have questions email me or comment below.
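The post’s own code is in R, but the normalization logic is easy to sketch in another language. Here is a minimal Python version of the same idea; the `unit_map` values are my stand-ins for the unitMap divisors described above, and the sample ages are made up:

```python
# Map each unit label to a divisor so every age ends up expressed in years.
# (Sketch of the R unitMap vector described in the post.)
unit_map = {"day": 365, "days": 365, "week": 52, "weeks": 52,
            "month": 12, "months": 12, "year": 1, "years": 1}

def age_in_years(age_str):
    """Convert a string like '2 years' or '3 weeks' to a float in years."""
    value, unit = age_str.split(" ")   # split on the space character
    return float(value) / unit_map[unit]

ages = ["1 year", "2 years", "3 weeks", "6 months"]
normalized = [age_in_years(a) for a in ages]
```

In the real data you would apply this across the AgeuponOutcome column and then impute the median for missing values, as the post describes.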


Let’s take a look again. Notice we now have a new column called AgeuponOutcomeNorm which contains our standardized age data. Double check that it’s correct.

Looks good! Now let’s get to the plot. Here I load the ggplot2 library and then use it to make boxplots of the Age column by outcome type. I will also facet this plot by animal type, so that we can visualize dogs and cats separately:



Our plot shows that age will likely have some predictive power in determining outcomes. Not surprisingly, the median age for dogs that are euthanized is higher than for other outcomes. Dogs that are put to sleep are usually older than cats that are euthanized. For both cats and dogs, the median age of adopted animals is much lower. Interestingly, the median age of animals that are returned to their owners is quite high. I’m also surprised at how low the median age of animals that died is, and how tight that distribution is for felines.

That wraps up Part 1. We gained some insight about the distribution of age across the various outcome types. In addition, the age variable has been standardized and is ready for modelling. We have a whole lot more data exploration to do before we get to that point, so stay tuned for Part 2!

Hacking the Democratic Debate: Word Clouds in R

Word clouds are an easy way to visualize textual data. They are used to distill text, whether it be a speech, a book, or an article, into a form that shows the frequency of words used. Frequency of a word can be emphasized with size, color or orientation. They are also a great early foray for those looking to get a taste of data science. Making a word cloud requires very little technical know-how. Anyone can make one with a minimal amount of programming ability.

Today we will go through the entire pipeline of making a word cloud in R. We will start with getting and cleaning our data and then go through all the processing steps to make our final product. To make things interesting, we will use an example from current events: the first Democratic debate, which aired on October 13th from the Wynn Las Vegas in Nevada. Specifically, we will make a word cloud using a transcript of the debate, and we will focus on my favorite candidate: Bernie Sanders. We will obtain a copy of the debate transcript in html and parse it to obtain everything that Bernie Sanders said. Then we’ll make our word cloud using the data.

We begin by loading the packages that we will need into R. The XML package provides functionality to quickly extract content from html. We will use the stringr package to assist with string parsing. The tm package provides functions for parsing text and getting word frequency counts. Finally, the wordcloud package provides an easy interface for making the word cloud from text.

First, we will need a transcript. There are several sources of full-text transcripts of the debate. I chose the one published by CBS News, which you can find here. Next, we download the file to our local drive like so:

Once we have the file, we can load it into R as an object of type HTMLInternalDocument, using htmlTreeParse(). This is a special version of the more generic xmlParse() function in the XML library. It is particularly well-suited for html, which may be malformed and does not always conform strictly to xml formatting.  This stores our html document in a tree-like format that makes parsing it a snap.

We can take a peek at the html document to get an idea of how to proceed. Looking at the html where the transcript starts, we can see that all of the actual debate speech is wrapped in html ‘p’ tags, which denote paragraphs. So, a good start would be to extract all of the text that resides within the p tags. Again, the XML library makes this very easy. We use the xpathApply function, which is a specialized version of R’s apply function. We provide xpathApply three arguments: our HTMLInternalDocument object that we want to parse, a node type to extract (all p tags in this case), and a function that we wish to perform on each node. The double forward-slash in front of p indicates that we want the p tags from all the nodes of the document. We pass the xmlValue function, which tells xpathApply that we simply want the raw text between each set of p tags. We then unlist() the results. The xpathApply function returns a list, but we prefer to have our results as a character vector.
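The R snippet itself uses the XML package, but the core idea — “pull out the text inside every p tag” — can be sketched with Python’s standard-library html.parser, which, like htmlTreeParse, tolerates real-world HTML. The sample HTML string here is made up for illustration:

```python
from html.parser import HTMLParser

# Minimal sketch of xpathApply(doc, "//p", xmlValue): collect the raw
# text that resides within each <p> tag.
class PTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.texts[-1] += data

parser = PTextExtractor()
parser.feed("<html><body><p>SANDERS: Hello.</p><p>Next line.</p></body></html>")
```

After feeding the document, `parser.texts` plays the role of the unlisted character vector: one string per p tag.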

The results are now in a character vector, where each element of the vector is the text that resided within each p tag. This is the entire transcript of the debate; however, we only want Bernie Sanders’ speech. If you take a look at htmltext, you will notice a common structure that we can exploit. In each element, if a new speaker is speaking, the text begins with the speaker’s name in capital letters followed by a colon. If the element does not represent a new speaker, then the element does not contain the speaker’s name. Also notice that the first time a candidate speaks, the name before the colon is the candidate’s full name and title. After that, only the last name is given prior to the colon.

We first initialize an empty vector named sanders. This will hold all of the sanders quotes that we extract. We then iterate over every element of htmltext and use regexpr() to get the index of the colon, if there is one. If there is no colon, regexpr() returns a -1. If there is a colon, we set the value of the speaker variable to whatever is in front of the colon, and then append everything after the colon to our sanders vector if Sanders is the speaker. If there is no colon and Sanders is still the speaker, then we append the entire quote to our vector. Since the first p tag contains only metadata, and is not part of the transcript, we start with the second element.
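The extraction loop described above can be sketched as follows — in Python rather than the post’s R, with a tiny made-up transcript standing in for htmltext:

```python
# Toy stand-in for the htmltext vector of <p> contents.
transcript = [
    "SEN. BERNIE SANDERS: Thank you.",
    "And let me add one more point.",
    "HILLARY CLINTON: I agree.",
    "SANDERS: Let me respond.",
]

sanders = []
speaker = ""
for quote in transcript:
    idx = quote.find(":")          # -1 when no colon, like regexpr() in R
    if idx != -1:
        # New speaker: whatever is in front of the colon.
        speaker = quote[:idx]
        if "SANDERS" in speaker:
            sanders.append(quote[idx + 1:].strip())
    elif "SANDERS" in speaker:
        # Same speaker continuing: keep the whole quote.
        sanders.append(quote.strip())
```

In the real transcript you would also skip the first p tag, which contains only metadata.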

Great! Now we have a vector containing all of Bernie Sanders’ speech from the debate. Now we are ready to clean it up a bit and get our word frequency counts, so that we can make the cloud. We’ll begin this part by turning our vector of strings into a character vector of length 1 using paste. This collapses everything into a single string separated by spaces.

In order to further process our text data we will use the tm package, which is short for ‘text-mining’. Here we turn our character vector into a Corpus object, using the VectorSource function to load our sanders_corpus vector. A Corpus object is a collection of text documents. In our case, the Corpus will be a single document.

Now we can use the tm_map function. This allows us to apply transformations to the documents in our Corpus. We call tm_map and pass it two arguments. The first is our Corpus object and the second is a function that is applied to each document. Here we do this five times. Each function that we pass to tm_map should be fairly self-explanatory. We first remove all the white space, punctuation, and numbers. We also remove common stop words using removeWords, passing it the built-in SMART stop word list, which was compiled by Gerald Salton and Chris Buckley at Cornell University. You can see the entire list and read more about the SMART information retrieval system here. I also removed all occurrences of “CROSSTALK,” which occurred when more than one person was speaking at once, as well as “LAUGHTER” and “APPLAUSE.”
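Here is a rough Python sketch of the same cleanup steps. The stop-word set below is a tiny illustrative list, not the full SMART list the post uses, and the sample sentence is made up:

```python
import re
import string

# Tiny illustrative stop-word set (the post uses the full SMART list),
# plus the debate annotations the post strips out.
stop_words = {"the", "and", "a", "to", "of",
              "crosstalk", "laughter", "applause"}

def clean(text):
    """Lowercase, drop numbers and punctuation, collapse whitespace,
    and remove stop words -- mirroring the tm_map transformations."""
    text = text.lower()
    text = re.sub(r"[0-9]", "", text)                              # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    words = text.split()                                           # collapses whitespace
    return [w for w in words if w not in stop_words]

tokens = clean("The TOP 1% owns the wealth... (APPLAUSE)")
```

The surviving tokens are what a frequency counter (and ultimately the word cloud) would be built from.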

We arrive at the final step. The wordcloud() function does most of the work for us here. We pass it a number of arguments that control the appearance of the plot. Scale is the range of the word size. Max.words allows you to set how many of the words in the document are plotted. The rot.per argument determines what proportion of the words you wish to appear vertically. Random.order sets whether the words are plotted in random order; we want words to appear by their frequency, so we’ve set this to FALSE here. If you are following along with this tutorial, feel free to play with the various settings to see how the cloud varies.


That’s it! We have our finished product. The above code can be easily adapted to create a similar word cloud for the candidate of your own choosing. You could also apply this approach to one of the Republican or Democratic debates coming up in the near future. Give it a try and let me know about it in the comments below. I have made this post and its code available as a Jupyter notebook on Github here. Cheers!

Let’s do a Kaggle!

If you are interested in machine learning and predictive modeling and you’ve reached the point where you want to test and practice some of your newfound skills on real-world data, then kaggle.com is the place that you want to be. If you’ve never heard of it before, kaggle is a site that hosts data science competitions. For each competition, kaggle provides you with a training data set consisting of features and feature labels, as well as an unlabeled test data set. As a competitor, your job is to explore the data set and build a model on the training data in order to predict the class labels in the test data set. You then submit your answer to the site and your rank among all the other competitors is calculated and displayed on a leaderboard. When the competition ends, a final score is calculated based on a full test set, and the winners are determined. Many competitions even have cash prize pools!


image credit: jessi reel

Today we’ll go over how incredibly easy (and fun!) it is to get started competing on kaggle. There are already lots of tutorials out there on getting started with kaggle, many of which are great. However, it seems like every one of them focuses on the Titanic data competition. This is one of kaggle’s ‘101’ competitions: educational competitions that are based on publicly available data and have no prize pool. Instead, we will work with the San Francisco Crime Classification competition. This is a ‘playground’ competition. It also offers no prize pool, and is aimed at helping you build your skill set. This competition is a bit more challenging due to the difficulty in getting accurate predictions. The training data set is also much larger, with 878,049 observations.

We will use Python’s scikit-learn machine learning library to build a simple prediction model for this task. We will use the logistic regression algorithm. Although logistic regression has been around for quite a long time and may be less powerful than many modern machine learning algorithms like support vector machines, it is easy to understand and quite well-suited to our classification application.

The first thing that you will need to do is register at kaggle.com. Once you have registered and logged in, go to the San Francisco Crime classification page and in the Dashboard on the left-hand side click ‘Data’ and then download the ‘train.csv’ and ‘test.csv’ zip files. Then unzip and save them to your directory of choice.

The competition asks you to predict the category of a reported crime. Categories include things like ‘Larceny/Theft’, ‘Grand theft from locked auto’, and ‘Warrants’, just to name a few of the 39 that are possible. To do so, you are given each incident’s date, time, police district, address, and latitude/longitude. You can use any or all of these to build a model. Today we will focus specifically on the address of each crime, with our hypothesis being that specific crimes may be likely to occur at similar addresses. For example, prostitution is likely to occur in certain hotspots, and theft is more likely to occur in places with a lot of tourists. Although this is probably true of certain crimes, there are likely others where this is less true. Nevertheless, we’ll test our hypothesis out.

The address for each incident is given as either a street and cross street, or a street with a block number. We will turn these addresses into a consistent set of numerical features that we can feed to our logistic regression model. We will build a model with x features, where x is the total number of unique words in the entire set of addresses. Each feature will be the count of a specific word or number in each address. For example, if an address is given as ‘OAK ST/LAGUNA ST’, then it will have a 1 under OAK, a 1 under LAGUNA, and a 2 under ST. The remaining x-3 features will be 0. As an example, the addresses for the first three entries in the training set are ‘OAK ST/LAGUNA ST’, ‘OAK ST/LAGUNA ST’, and ‘VANNESS AVE/GREENWICH ST.’ So, after turning these into a numerical feature array we would have:

AVE  GREENWICH  LAGUNA  OAK  ST  VANNESS
  0          0       1    1   2        0
  0          0       1    1   2        0
  1          1       0    0   1        1

We continue adding features until we have exhausted our observations in the training data. Then we can feed the array, along with our training set class (crime category) labels, to our logistic regression model. The model will estimate a probability that each incident falls under each of the 39 different categories based on the counts of words for that incident.

Fire up your Python IDE, Jupyter notebook, or whatever you prefer to use. We begin by importing some libraries that we will need:

The use of numpy and pandas will be obvious to most of you. We will use the CountVectorizer function from sklearn’s feature_extraction module to build the feature array from the addresses as discussed above. The train_test_split function will be used to split our training data into our own complete train and test sets, so that we may assess the performance of our model before we submit our results to kaggle. Notice that we can’t test the performance of our model using the test set provided by kaggle since it does not contain the crime category labels. The LogisticRegression class will be used to fit our data to a logistic regression model and make predictions. Finally, we will use the log_loss function to assess the performance of our model. We choose log_loss (or logarithmic loss) because this is the metric that kaggle uses to score and rank results for this competition.
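In current versions of scikit-learn, those imports look like the following. Note that `train_test_split` now lives in `model_selection`; posts from this era imported it from the since-removed `cross_validation` module:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split   # was sklearn.cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
```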

Next, we will read in our data that we downloaded earlier using pandas’ read_csv():

Take a peek at the data using the head() method:


Id | Dates               | Category      | Descript                     | DayOfWeek | PdDistrict | Resolution     | Address                   | X           | Y
0  | 2015-05-13 23:53:00 | WARRANTS      | WARRANT ARREST               | Wednesday | NORTHERN   | ARREST, BOOKED | OAK ST / LAGUNA ST        | -122.425892 | 37.774599
3  | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN   | NONE           | 1500 Block of LOMBARD ST  | -122.426995 | 37.800873
4  | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK       | NONE           | 100 Block of BRODERICK ST | -122.438738 | 37.771541

For our purposes, we are only interested in the ‘Category’ and ‘Address’ columns. Let’s begin building our feature array using CountVectorizer. It’s important to note that we need to build our feature array for both the training data and the test data together rather than separately. Failing to do so will result in a feature array that is inconsistent between the train and test data. We will combine the address columns from train and test into a single numpy array:

Then we build our CountVectorizer and fit our data to it. First we build a custom list of ‘stop words.’ These are common words that we do not wish to include in our feature set. For our purposes this includes the abbreviations of street types such as ‘AV’ for avenue and ‘WY’ for way. We include this list as an argument in our CountVectorizer as well as specify a maximum number of features of 300 (We can change this parameter later if we wish).
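As a sketch of these two steps, here is the combine-and-vectorize logic on toy stand-ins for the train and test Address columns. The stop-word list here is illustrative, not the post’s exact list:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the train and test Address columns.
train_addresses = ["OAK ST/LAGUNA ST", "OAK ST/LAGUNA ST"]
test_addresses = ["VANNESS AVE/GREENWICH ST"]

# Fit on train + test together so the feature space is consistent.
all_addresses = np.concatenate([train_addresses, test_addresses])

# Illustrative stop-word list of street-type abbreviations and filler words.
stop_words = ["st", "av", "ave", "wy", "blvd", "block", "of"]

vectorizer = CountVectorizer(stop_words=stop_words, max_features=300)
features = vectorizer.fit_transform(all_addresses)   # sparse count matrix
```

Each row of `features` is one address; each column is the count of one surviving word, with the stop words excluded from the vocabulary.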

Now our feature array is ready. Before we feed it to our model, let’s split our training data (the one with category labels) into our own training and test set. This gives us the ability to test the performance of our model before we submit to kaggle. First we’ll need to take from our features array just those rows corresponding to our training data. We assign that to X and the labels to y:

Then we use the train_test_split function from scikit-learn’s cross validation module. This function randomly partitions our data into a training and test set. We specify that we want to set aside 20% of the data for testing. We assign a value to the random_state argument for reproducibility:
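The split itself can be sketched like so, on a toy feature matrix; the `random_state=42` value is my arbitrary choice, since the post only says a value is assigned for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the address counts
# and the Category column.
X = np.arange(20).reshape(10, 2)
y = np.array(["THEFT"] * 5 + ["WARRANTS"] * 5)

# Hold out 20% of the training data so we can score the model ourselves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```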

Now we are ready to build our model and fit our training data to it:

Then we can use the model to make predictions on our test set that we set aside. We can use the predict() method of our model to predict the class label. However this competition requires that for each observation, we submit a probability that the class label is each of our 39 possible labels. For that, we use the predict_proba() method:
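Fitting and predicting probabilities looks like this on a toy two-class example (the competition has 39 classes, but the mechanics are identical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two word-count features, two classes.
X_train = np.array([[0, 1], [1, 0], [0, 2], [2, 0]])
y_train = np.array(["THEFT", "WARRANTS", "THEFT", "WARRANTS"])

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns one probability per class, per observation;
# each row sums to 1. On the real data each row has 39 entries.
probs = model.predict_proba(np.array([[0, 3]]))
```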

Take a look at the first row of results. It should have 39 values between zero and one:

Now that we have our predicted probabilities, we can see how well we did versus the actual labels. We use the log_loss() function and provide two arguments: our actual class labels for our test data set and the predicted values from our model:
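The scoring step can be sketched with made-up probabilities; `log_loss` compares each row of predicted class probabilities against the true label:

```python
import numpy as np
from sklearn.metrics import log_loss

# Held-out labels and toy predicted probabilities (columns in the
# same sorted-label order the model's predict_proba produces).
y_true = ["THEFT", "WARRANTS"]
y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

score = log_loss(y_true, y_pred, labels=["THEFT", "WARRANTS"])
```

Lower is better: a perfect model that puts probability 1 on the correct class every time scores 0.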

We got a score of 2.48 on our test data. Looking at the leaderboard for the competition, this would put us at around 85th place out of 533 if this score were to hold up when we submit our full results. It won’t exactly, but it should be reasonably close.

Let’s run the model and predict the results for the full unlabeled kaggle test data. We build our model using the full training feature set, and then predict the probabilities for the full test data set:

Our results array should have 884,262 rows and 39 columns. To wrap things up we format our data according to the kaggle submission standards and then write it to a csv. The columns should be named according to the 39 possible categories. Each row should have an index. The index column should be called ‘Id’:
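The formatting step can be sketched as follows on a toy results array; the three category names and the `submission.csv` filename are illustrative stand-ins (the real file has 884,262 rows and all 39 category columns):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: predicted probabilities and a few of the 39 labels.
labels = ["ARSON", "ASSAULT", "WARRANTS"]
results = np.array([[0.1, 0.2, 0.7],
                    [0.5, 0.3, 0.2]])

# One column per category; the row index becomes the 'Id' column.
submission = pd.DataFrame(results, columns=labels)
submission.index.name = "Id"
submission.to_csv("submission.csv")
```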

Last of all, zip the file and then submit the results at the submission page here. Simply drag and drop your zip file where you see ‘Click or drop your submission here,’ and then click the big blue submit button. Kaggle will take a few minutes to upload your submission and score your results. After this, you will see your new position on the leaderboard. If you’ve submitted results by following this post exactly, you should have gotten a log loss score of 2.50732, enough to put you around 104th place out of 533 as of the time of this writing. Not bad for a first try! Congratulations!

Keep in mind that we have covered only a small portion of the data science pipeline in this post, the purpose of which is to get you familiar with building a quick model and submitting results. A full analysis would include extensive data exploration, feature engineering, model building and validation, followed by likely more of the same. I encourage you to do your own exploration and build your own model, incorporating some of the other features. Yours truly is currently in 81st place in this competition. See if you can catch me on the leaderboard! If you do, let me know about it in the comments below. Cheers and have fun kaggling!