Linear Regression from Scratch in Python

In my last post I demonstrated how to obtain linear regression parameter estimates in R using only matrices and linear algebra. Using the well-known Boston data set of housing characteristics, I calculated ordinary least-squares parameter estimates using the closed-form solution. In this post I’ll explore how to do the same thing in Python using numpy arrays and then compare our estimates to those obtained using the linear_model  function from the statsmodels package.

First, let’s import the modules and functions we’ll need. We’ll use numpy for matrix and linear algebra. In the last post, we obtained the Boston housing data set from R’s MASS library. In Python, we can find the same data set in the scikit-learn module.

Next, we can load the Boston data using the load_boston function. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 tracts in the city of Boston from the early 1990s as well as the median home price in each of the tracts.

Now we’re ready to start. Recall from my previous post that linear regression typically takes the form:

 y=\beta X + \epsilon

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \epsilon is the error term; it represents features that affect the response, but are not explicitly included in our model.

From the boston  object, we will extract the features, which are conveniently already a numpy array, and assign it to X .  Our target variable is the median home value for each tract. We assign the target to the variable y .

If we examine the features, we can see that X is the expected shape.

We can also look at the feature names.

If you would like to see more details about these features and the data set, you can view the DESCR attribute of the boston  object.

We will need to add a vector of ones to our feature matrix to for the intercept term. We can do this easily in numpy and then add it as the first column in our feature matrix.

With the preparatory work out of the way, we can now implement the closed-form solution to obtain OLS parameter estimates.

\beta = (X^TX)^{-1}X^Ty

We do this in python using the numpy arrays we just created, the inv() function, and the  transpose() and dot()  methods.

Let’s examine them to see if they make sense. Here I convert the coeffs  array to a pandas DataFrame and add the feature names as an index.

Now we have our parameter vector β. These are the linear relationships between the median home value and each of the features. For example, we see that an increase of one unit in the number of rooms (RM) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘CHAS’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘CHAS’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

Now we’ll estimate the same model using the linear_model  function from statsmodels and assign the results to  coeffs_lm

Last of all, we place our newly-estimated parameters next to our original ones in the results  DataFrame and compare.

Once again, our results are identical. Now you know how these estimates are obtained using the closed-form solution. Like I mentioned in my R post on the same topic, you’d never implement linear regression in this way. You would use the linear_model  function or the LinearRegression  function from the scikit-learn package. These functions are very quick, require, very little code, and provides us with a number of diagnostic statistics, including R^2, t-statistics, and p-values.

I have made the code from this post available at my Github here.

Linear Regression from Scratch in R

One of the very first learning algorithms that you’ll encounter when studying data science and machine learning is least squares linear regression. Linear regression is one of the easiest learning algorithms to understand; it’s suitable for a wide array of problems, and is already implemented in many programming languages. Most users are familiar with the lm() function in R, which allows us to perform linear regression quickly and easily. But one drawback to the lm() function is that it takes care of the computations to obtain parameter estimates (and many diagnostic statistics, as well) on its own, leaving the user out of the equation. So, how does the lm() function obtain these parameter estimates?

In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the lm() function is of course much more efficient).

For this example, I’ll use the well-known “Boston” data set from the MASS library. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 tracts in the city of Boston from the early 1990s. Our target variable will be the median home value for each tract — the column titled ‘medv.’ Let’s get an idea of what this data set looks like:

To learn more about the definition of each variable, type  help(Boston) into your R console.

Now we’re ready to start. Linear regression typically takes the form

 y=\beta X + \epsilon

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \epsilon is the error term; it represents features that affect the response, but are not explicitly included in our model.
The first thing we will need is a vector of our response variable, typically called ‘y’. In this case, our response variable is the median home value in thousands of US dollars: ‘medv’:

Next, we’ll need our ‘X’ or ‘design’ matrix of the input features. We can do this with the as.matrix() function, grabbing all the variables except ‘medv’ from the Boston data frame. We’ll also need to add a column of ones to our X matrix for the intercept term.

Now that we have our response vector and our ‘X’ matrix, we can use them to solve for the parameters using the following closed form solution:

\beta = (X^TX)^{-1}X^Ty

The derivation of this equation is beyond the scope of this post. If you are interested in the derivation, a good place to start is Wikipedia or any good undergraduate textbook on linear regression.

We can implement this in R using our ‘X’ matrix and ‘y’ vector. The %*%  operator is simply matrix multiplication. The t() function takes the transpose of a matrix, and solve()  calculates the inverse of any (invertible) matrix.

Now we have our parameter vector β. These are the linear relationships between the response variable ‘medv’ and each of the features. For example, we see that an increase of one unit in the number of rooms (rm) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘chas’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘chas’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

Now let’s verify that these results are in fact correct. We want to compare our results to those produced by the lm()  function. Most of you will already know how to do this.

The parameter estimates are identical! Now you know how these estimates are obtained. Keep in mind that you’re unlikely to favor implementing linear regression in this way over using lm() . The lm()  function is very quick, and requires very little code. Using it provides us with a number of diagnostic statistics, including R^2, t-statistics, and the oft-maligned p-values, among others. Nevertheless, I wanted to provide a window into what lm()  does behind the scenes.

There are other ways to solve for the parameters in linear regression. For example, gradient descent can be used to obtain parameter estimates when the number of features is extremely large, a situation that can drastically slow solution time when using the closed-form method. Stochastic gradient descent is often used when both the number of features and the number of samples are very large. I hope to detail these in a future post.

I have made the code from this post available at my Github here. Cheers!

Extracting Tables from PDFs in R using the Tabulizer Package

Recently I wanted to extract a table from a pdf file so that I could work with the table in R. Specifically, I wanted to get data on layoffs in California from the California Employment Development Department. The EDD publishes a list of all of the layoffs in the state that fall under the WARN act here. Unfortunately, the tables are available only in pdf format. I wanted an interactive version of the data that I could work with in R and export to a csv file. Fortunately, the tabulizer package in R makes this a cinch. In this post, I will use this scenario as a working example to show how to extract data from a pdf file using the tabulizer package in R.

The link to the pdf gets updated often, so here I’ve provided the pdf as downloaded from the site on November 29, 2016:



First, we will need to load the tabulizer package as well as dplyr.

Next we will the the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.

Now we have a list object called out, with each element a matrix representation of a page of the pdf table. We want to combine these into a single data matrix containing all of the data. We can do so most elegantly by combining and rbind, passing it our list of matrices. Notice that I am excluding the last page here. The final page is the totals and summary information. We don’t need that.

After doing so, the first three rows of the matrix contain the headers, which have not been formatted well since they take up multiple rows of the pdf table. Let’s fix that. Here I turn the matrix into a data.frame dropping the first three rows. Then I create a character vector containing the formatted headers and use that as the column names.

We now have a data.frame of all of the California layoffs. A quick glance at the first few rows:

In order to manipulate the data properly , we will probably want to change the date column to a Date object as well as convert the No.of.Employees column to numeric. Here I do so using dplyr.

Last of all, I finish up by writing the final table to csv so that I can load it for later use.

I have found the tabulizer package to be wonderfully easy to use. Much of the process of extracting the data and tables from pdfs is abstracted away from the user. This was a very simple example, however if one requires more finely-tuned control of how tables are extracted, the extract_tables function has a lot of additional arguments to tweak to one’s liking. I encourage you to take a look for yourself.

You can find the code for this post on my Github.

Completing the Coursera Data Science Specialization

First of all, I’d like to apologize to anyone who read my last blog post. In it, I mentioned that I would write several follow-up posts in which I would go through the entire modelling process in the Kaggle Animal Shelter competition. Unfortunately, life got in the way and I was unable to complete these posts before the competition ended. Fortunately, I have been hard at work on something that I am now able to share.

Some of you that know me know that for the past year and a half, I have been working on completing all ten courses of the Coursera Johns Hopkins Data Science Specialization. I finally completed it earlier this month. In the final course of the specialization, we are tasked with creating an application that can perform text prediction. Many of you who have smartphones are very familiar with this type of application. When typing a text message for example, your phone may provide you with several options for the next word before you even type it.

I am not going to write a long post about how I did this. I simply wanted to share my final application with you. You can find it here or by clicking the picture below:

text prediction app

If you are interested in how it was done, I put together a few brief slides on my methodology, which you can see below. I plan on eventually releasing my code on github after I make some additional improvements.

Special thanks to Adam Danz, Brandon Danz, Bo Bucklen, Chad McComsey, Anthony Mead, Arthur Mead, and Gabbie Nirenburg for doing some QA testing and providing feedback.

Saving the Kittehs (and Dogies too) with Data (Part 1)

A couple months ago, Kaggle started a new knowledge competition on its site called ‘Animal Shelter Outcomes.’ This is an educational competition, which features no prize pool. The data, from the Austin Animal Center, contains attributes and outcomes of all the cats and dogs that went through the system between October of 2013, and March of 2016. Using the supplied features, competitors are asked to build a model to predict outcomes, which include Adoption, Euthanasia, Died, Return to Owner, and Transfer.


Since I am a big fan of using data science for good rather than helping corporations make more money from our personal information, I found this competition particularly appealing. I decided that I would give it a try and blog about my process along the way (Remember, I’m learning too!). You may have read my previous Kaggle blog post where we built a model to classify crime in San Francisco. This time however, I will give a much more thorough treatment to the modeling process. We will begin by exploring the data to gain some insights in Parts 1 and 2 as well as building some new features. In later posts, we will train several tree-based models starting with a simple decision tree and following with a random forest and a gradient-boosted tree. Let’s get started!


We begin by downloading the training data from the competition page here. In today’s post, we will only be using the training data, but we will need test data later when we are finished building our models. Once the data are saved to your directory of choice, we open it up in R. Notice that the data are in gzip format and that R can read the data directly with no need to unzip it.

Let’s take a look at the data using head():

There are a number of features that will likely be useful for predicting outcomes. We have information on breed, age, sex, and color. In today’s post we will process and explore the age data. I am guessing that age will likely have some predictive power. For example, older animals may be less likely to be adopted than younger animals. Let’s find out by making boxplots of the age distribution by outcome type.  Notice that there are several problems with the age data that we will need to address. First, it is a string rather than just a numeric value. Second, it is in different units, with some age measured in weeks, some in months and some in years. Let’s fix that.

We are going to get rid of the unit labels and standardize all the age values to years. First we create a named vector called unitMap that maps the various unit labels to a divisor which we will use translate all the age values to years. Obviously the data labelled ‘year’ or ‘years’ just stays the same, so we divide by 1. Next we split the AgeuponOutcome column on the space character and extract the second element (the unit label), which we then use to index the unitMap vector. Then we create a new column called AgeuponOutcomeNorm which is the first element returned by splitting the AgeuponOutcome column (the value) divided by unitDiv. Finally we take care of some missing values by just plugging in the median. Sorry, that was a mouthful. If you have questions email me or comment below.


Let’s take a look again. Notice we now have a new column called AgeuponOutcomeNorm which contains our standardized age data. Double check that it’s correct.

Looks good! Now let’s get to the plot. Here I load the ggplot2 library and then use it to make boxplots of the Age column by outcome type. I will also facet this plot by animal type, so that we can visualize dogs and cats separately:



Our plot shows that age will likely have some predictive power in determining outcomes. Not surprisingly the median age for dogs that are euthanized is higher than for other outcomes. Dogs that are put to sleep are usually older than cats that are euthanized. For both cats and dogs, the median age of adopted animals is much lower. Interestingly, the median age of animals that are returned to their owners is quite high. I’m also surprised at how low is the median age of animals that died and how tight that distribution is for felines.

That wraps up Part 1. We gained some insight about the distribution of age across the various outcome types. In addition, the age variable has been standardized and is ready for modelling. We have a whole lot more data exploration to do before we get to that point, so stay tuned for Part 2!

Hacking the Democratic Debate: Word Clouds in R

Word clouds are an easy way to visualize textual data. They are used to distill text, whether it be a speech, a book, or an article, into a form that shows the frequency of words used. Frequency of a word can be emphasized with size, color or orientation. They are also a great early foray for those looking to get a taste of data science. Making a word cloud requires very little technical know-how. Anyone can make one with a minimal amount of programming ability.

Today we will go through the entire pipeline of making a word cloud in R. We will start with getting and cleaning our data and then going through all the processing steps to make our final product. To make things interesting, we will use an example from current events; the first Democratic debate, which aired on October 13th from the Wynn Las Vegas in Nevada. Specifically, we will make a word cloud using a transcript of the debate and we will focus on my favorite candidate: Bernie Sanders. We will obtain a copy of the debate transcript in html and parse it to obtain everything that Bernie Sanders said. Then we’ll make our word cloud using the data.

We begin by loading the packages that we will need into R. The XML package provides functionality to quickly extract content from html. We will use the stringr package to assist with string parsing. The tm package provides functions for parsing text and getting word frequency counts. Finally, the wordcloud package provides an easy interface for making the word cloud from text.

First, we will need a transcript. There are several sources of full-text transcripts of the debate. I chose the one published by CBS News, which you can find here. Next, we download the file to our local drive as so:

Once we have the file, we can load it into R as an object of type HTMLInternalDocument, using htmlTreeParse(). This is a special version of the more generic xmlParse() function in the XML library. It is particularly well-suited for html, which may be malformed and does not always conform strictly to xml formatting.  This stores our html document in a tree-like format that makes parsing it a snap.

We can take a peek at the html document to get an idea of how to proceed. Looking at the html where the transcript starts, we can see that all of the actual debate speech is wrapped in html ‘p’ tags, which denote paragraphs. So, a good start would be to extract all of the text that resides within the p tags. Again, the XML library makes this very easy. We use the xpathApply function, which is a specialized version of R’s apply function. We provide xpathApply three arguments: our HTMLInternalDocument object that we want to parse, a node type to extract (all p tags in this case), and a function that we wish to perform on each node. The double forward-slash in front of p indicates that we want the p tags from all the nodes of the document. We pass the xmlValue function, which tells xpathApply that we simply want the raw text between each set of p tags. We then unlist() the results. The xpathApply function returns a list, but we prefer to have our results as a character vector.

The results are now in a character vector, where each element of the vector is the text that resided within each p tag. This is the entire transcript of the debate; however we only want Bernie Sanders’ speech. If you take a look at htmltext, you will notice a common structure that we can exploit. In each element, if a new speaker is speaking, the text begins with the speaker’s name in capital letters followed by a colon. If the element does not represent a new speaker, then the element doest not contain the speaker’s name. Also notice that the first time a candidate speaks, the name before the colon is the candidate’s full name and title. After that, only the last name is given prior to the colon.

We first initialize an empty vector named sanders. This will hold all of the sanders quotes that we extract. We then iterate over every element of htmltext and use regexpr() to get the index of the colon, if there is one. If there is no colon, regexpr() returns a -1. If there is a colon, we set the value of the speaker variable to whatever is in front of the colon, and then append everything after the colon to our sanders vector if Sanders is the speaker. If there is no colon and Sanders is still the speaker, then we append the entire quote to our vector. Since the first p tag contains only metadata, and is not part of the transcript, we start with the second element.

Great! Now we have a vector containing all of Bernie Sanders’ speech from the debate. Now we are ready to clean it up a bit and get our word frequency counts, so that we can make the cloud. We’ll begin this part by turning our vector of strings into a character vector of length 1 using paste. This collapses everything into a single string separated by spaces.

In order to further process our text data we will use the tm package, which is short for ‘text-mining’. Here we turn our character vector into a Corpus object, using the VectorSource function to load our sanders_corpus vector. A Corpus object is a collection of text documents. In our case, the Corpus will be a single document.

Now we can use the tm_map function. This allows us to apply transformations to the documents in our Corpus. We call tm_map and pass it two arguments. The first is our Corpus object and the second is a function that is applied to each document. Here we do this five times. Each function that we pass to tm_map should be fairly self-explanatory. We first remove all the white space, punctuation, and numbers. We also remove common SMART stop words using removeWords and passing it the built-in list of common SMART stop words. SMART stop words are a list built by Gerald Salton and Chris Buckley at Cornell University. You can see the entire list and read more about the SMART information retrieval system here. I also removed  all occurrences of “CROSSTALK” which occurred when more than one person was speaking at once, as well as “LAUGHTER” and “APPLAUSE.”

We arrive arrive at the final step. The wordcloud() function does most of the work for us here. We pass it a number of arguments that control the appearance of the plot. Scale is the range of the word size. Max.words allows you to set how many of the words in the document are plotted. The rot.per argument determines what proportion of the words you wish to appear vertically. Random.order sets whether or not you want the words to be chosen randomly. We want words to appear by their frequency, so we’ve set this to false here. If you are following along with this tutorial, feel free to play with the various settings to see how the cloud varies.


That’s it! We have our finished product. The above code can be easily adapted to create a similar word cloud for the candidate of your own choosing. You could also apply this approach to one of the Republican or Democratic debates coming up in the near future. Give it a try and let me know about it in the comments below. I have made this post and its code available as a Jupyter notebook on Github here. Cheers!

Let’s do a Kaggle!

If you are interested in machine learning and predictive modeling and you’ve reached the point where you want to test and practice some of your new found skills on real-world data, then is the place that you want to be. If you’ve never heard of it before, kaggle is a site that hosts data science competitions. For each competition, kaggle provides you with a training data set consisting of features and feature labels, as well as an unlabeled test data set. As a competitor, your job is to explore the data set and build a model on the training data in order to predict the the class labels in the test data set. You then submit your answer to the site and your rank among all the other competitors is calculated and displayed on a leaderboard. When the competition ends, a final score is calculated based on a full test set, and the winners are determined. Many competitions even have cash prize pools!


image credit: jessi reel

Today we’ll go over how incredibly easy (and fun!) it is to get started competing on kaggle. There are already lots of tutorials out there on getting started with kaggle, many of which are great. However it seems like every one of them focuses on the Titanic data competition. This is one of kaggle’s ‘101’ competitions; educational competitions based on publicly available data and have no prize pool. Instead, we will work with the San Francisco Crime Classification competition. This is a ‘playground’ competition. It also offers no prize pool, and is aimed at helping you build your skill set. This competition is a bit more challenging due to the difficulty in getting accurate predictions. The training data set is also much larger, with 878,049 observations.

We will use Python’ scikit-learn machine learning library to build a simple prediction model for this task. We will use the logistic regression algorithm. Although logistic regression has been around for quite a long time and may be less powerful than many modern machine learning algorithms like support vector machines, it is easy to understand and quite well-suited to our classification application.

The first thing that you will need to do is register at Once you have registered and logged in, go to the San Francisco Crime classification page and in the Dashboard on the left-hand side click ‘Data’ and then download the ‘train.csv’ and ‘test.csv’ zip files. Then unzip and save them to your directory of choice.

The competition asks you to predict the category of a reported crime. Categories include things like ‘Larceny/Theft’, ‘Grand theft from locked auto’, and ‘Warrants’ just to name a few of the 39 that are possible. To do so, you are given each incident’s date, time, police district, address, and latitude/longitude.  You can use any or all of these to build a model. Today we will focus specifically on the address of each crime with our hypothesis being that specific crimes may be likely to occur at  similar addresses. For example, prostitution is likely to occur in certain hotspots, or that theft is more likely to occur in places with a lot of tourists.  Although this is probably true of certain crimes, there are likely others where this is less true. Nevertheless, we’ll test our hypothesis out.

The address for each incident is given as either a street and cross street, or a street with a block number. We will turn these addresses into a consistent set of numerical features that we can feed to our logistic regression model. We will build a model with x features, where x is the total number of unique words in the entire set of addresses. Each feature will be the count of a specific word or number in each address. For example, if an address is given as ‘OAK ST/LAGUNA ST’, then it will have a 1 under OAK, a 1 under LAGUNA, and a 2 under ST. The remaining x-3 features will be 0. As an example, the addresses for the first three entries in the training set are ‘OAK ST/LAGUNA ST’, ‘OAK ST/LAGUNA ST’, and ‘VANNESS AVE/GREENWICH ST.’ So, after turning these into a numerical feature array we would have:

0 0 1 1 2 0
0 0 1 1 2 0
1 1 0 0 1 1

We continue adding features until we have exhausted our observations in the training data. Then we can feed the array, along with our training set class (crime category) labels to out logistic regression model. The model will estimate a probability that each incident falls under each of the 39 different categories based on the counts of words for that incident.

Fire up your Python IDE, Jupyter notebook, or whatever you prefer to use. We begin by importing some libraries that we will need:

The use of numpy and pandas will be obvious to most of you. We will use the CountVectorizer function from sklearn’s feature_extraction module to build the feature array from the addresses as discussed above. The train_test_split function will be used to split our training data into our own complete train and test sets, so that we may assess the performance of our model before we submit our results to kaggle. Notice that we can’t test the performance of our model using the test set provided by kaggle since it does not contain the crime category labels. The LogisticRegression class will be used to fit our data to a logistic regression model and make predictions. Finally, we will use the log_loss function to assess the performance of our model. We choose log_loss (or logarithmic loss) because this is the metric that kaggle uses to score and rank results for this competition.

Next, we will read in our data that we downloaded earlier using pandas’ read_csv():

Take a peak at the data using the head() method:


Id Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y
0 2015-05-13 23:53:00 WARRANTS WARRANT ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.425892 37.774599
3 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Wednesday NORTHERN NONE 1500 Block of LOMBARD ST -122.426995 37.800873
4 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Wednesday PARK NONE 100 Block of BRODERICK ST -122.438738 37.771541

For our purposes, we are only interested in the ‘Category’ and ‘Address’ columns. Let’s begin building our feature array using CountVectorizer. It’s important to note that we need to build our feature array for both the training data and the test data together rather than separately. Failing to do so will result in a feature array that is inconsistent between the train and test data. We will combine the address columns from train and test into a single numpy array:

Then we build our CountVectorizer and fit our data to it. First we build a custom list of ‘stop words.’ These are common words that we do not wish to include in our feature set. For our purposes this includes the abbreviations of street types such as ‘AV’ for avenue and ‘WY’ for way. We include this list as an argument in our CountVectorizer as well as specify a maximum number of features of 300 (We can change this parameter later if we wish).

Now our feature array is ready. Before we feed it to our model, let’s split our training data (the one with category labels) into our own training and test set. This gives us the ability to test the performance of our model before we submit to kaggle. First we’ll need to take from our features array just those rows corresponding to our training data. We assign that to X and the labels to y:

Then we use the train_test_split function from scikit-learn’s cross validation module. This function randomly partitions our data into a training and test set. We specify that we want to set aside 20% of the data for testing. We assign a value to the random_state argument for reproducibility:

Now we are ready to build our model and fit our training data to it:

Then we can use the model to make predictions on our test set that we set aside. We can use the predict() method of our model to predict the class label. However this competition requires that for each observation, we submit a probability that the class label is each of our 39 possible labels. For that, we use the predict_proba() method:

Take a look at the first row of results. It should have 39 values between zero and one:

Now that we have our predicted probabilities, we can see how well we did versus the actual labels. We use the log_loss() function and provide two arguments: our actual class labels for our test data set and the predicted values from our model:

We got a score of 2.48 on our test data. Looking at the leaderboard for the competition, this would put us at around 85th place  out of 533 if this score were to hold up when we submit our full results. It won’t exactly, but it should be reasonably close.

Lets run the model and predict the results for the full unlabeled kaggle test data. We build our model using the full training feature set, and then predict the probabilities for the full test data set:

Our results array should have 884,262 rows and 39 columns. To wrap things up we format our data according to the kaggle submission standards and then write it to a csv. The columns should be named according to the 39 possible categories. Each rows should have an index. The index column should be called ‘Id’:

Last of all, zip the file and then submit the results at the submission page here. Simply drag and drop you zip file where you see ‘Click or drop your submission here,’ and then click the big blue submit button. Kaggle will take a few minutes to upload your submission and score your results. After this, you will see you new position on the leaderboard. If you’ve submitted results by following this post exactly, you should have gotten a log loss score of 2.50732, enough to put you around 104th place out of 533 as of the time of this writing. Not bad for a first try! Congratulations!

Keep in mind that we have covered only a small portion of the data science pipeline is this post, the purpose of which is to get you familiar with building a quick model and submitting results. A full analysis would include extensive data exploration, feature engineering, model building and validation, followed by likely more of the same. I encourage you to do your own exploration and build your own model, incorporating some of the other features. Yours truly is currently in 81st place in this competition. See if you can catch me on the leaderboard! If you do, let me know about it in the comments below. Cheers and have fun kaggling!

Mapping Census Data in R: Pennsylvania’s Oldest Housing Stock

Recently, chloropleth maps have gained popularity on the internet. Data journalists and organizations use them to map data for web visualizations. They are a useful tool for showing how your data vary across regions, whether they be countries, US States, or counties. Although they seem a little intimidating at first glance, they are actually fairly easy to make. Today we’ll make just such a map. We will take a look at the state of Pennsylvania’s oldest housing stock by mapping the percentage of the total stock that was built before 1940 for every Census tract in the state.

Thanks to Kevin Johnson, whose recent blog post taught me how to use ggplot2 to map data, I now have a workflow for mapping U.S. Census data using shapefiles obtained from the US Census TIGER database. This workflow is largely adapted from his, with one major exception. Rather than download a csv of the data that I want to map from the Census American Fact Finder website, I will use the Census API to get the data I need. This will make the workflow for more efficient, since whenever I want to map new data, I can just change the query parameters of my API call rather than sifting through American Fact Finder for the right csv file.

Let’s get started. First, we will load all the libraries that we will need:

We will use rjson to obtain the data from the Census API, which returns json objects. The rgdal library provides us with the readOGR() function, which permits us to read a shapefile into R as a special data.frame called a SpatialPolygonsDataFrame. The dplyr library will be used for data manipulation, as usual. The mapping will be done using ggplot2, and scales and ggmap will provide some functions to assist us in formatting and saving the map.

Next we need to get our data. The Census Bureau conducts a survey called the American Community Survey that collects data on a wide array of characteristics of US households. One of those characteristics is the age of the house. These ages are lumped into categories as shown in this example taken from the American Fact Finder site:


We are interested in the oldest category, 1939 or earlier. We want the data for every Census tract in the state of Pennsylvania. Let’s head over to the developer page for the Census. Specifically, we want the API documentation for the American Community Survey 5-year estimates for 2013 located here. We want to find the API variable name for the “Built 1939 or earlier” category. Clicking on the link for the xml version of the 2009 – 2013 ACS Data Profile API Variables, we can search for what we want. Doing so yields a variable name of “DP04_0025PE”, which is the percentage of the total housing stock that was built in 1939 or earlier.

Now we can construct the pieces of an API query to get the data:

The endpoint is the API endpoint for the 5-year American Community Survey data. The variable called “get” selects the variable and also includes “NAME”, which just returns the geography names in the json response. The variable I’ve named “geo” is a string specifying that we want all of the Census tracts in the state of Pennsylvania. You can check the documentation for further details on how to select the various geographies. Finally, my API key, preceded by “key=” is stored in the “api_key” variable. Note that you’ll need to replace XXX with your key. I’ve not included mine for obvious reasons. Next, we need put it all together into a single string, with an “&” separating the parameters:

Which gives:


Then we use the fromJSON() function to make the request:

This returns a json object, which R parses into a list. Here are the first few entries:

We need to get this data into a data.frame. For each item in the list, each of which is a list itself, we need the value and the geography id of the tract, which are elements 2 and 5. Let’s write a quick helper function to extract these two pieces of data, which we will plug into sapply. This creates two vectors; one vector of the housing data, and one of the geography IDs.

Then we combine the two vectors into a data.frame. The value column is converted to numeric and divided by 100, a step necessary for the mapping which we’ll see shortly.

Finally, we can move on to the shapefiles. I will admit that I know very little about shapefiles and that they are largely a black box to me. Nevertheless, working with them at a basic level is not terribly difficult. I have downloaded the cartographic boundary shapefile for the state of Pennsylvania’s Census tracts from the Census Bureau’s website here. The shapefile is actually a collection of eight files with extensions such as .shp, .prj, and dbf. Save all of these files to a single directory.

We load the shapefiles into R using readOGR(). The dsn parameter takes the directory where the shapefiles are located. The layer parameter takes the name of the file without the extension. The fortify() function turns the object returned by readOGR() and converts it to a data.frame with which ggplot will be able to interface.

Here’s a peek at the tract data.frame:

We now need to merge this shapefile data with the housing data, matching up the geography IDs. The problem is that the IDs in the tract data.frame have extra digits on the left side that the IDs in the housing data do not:

We don’t need those first five digits as they are part of the ID that identifies the state. We’ll get rid of them:

Lastly, we merge the data sets, and finally we are ready to make the map:

We use ggplot’s geom_polygon() function. We feed it the plotData data.frame, specify longitude and latitiude as the x and y coordinates, and tell it to fill the resulting shapes using the value column, which is our housing data.




It looks great, but I’d like to add the county borders to provide some frame of reference. We can do that easily by obtaining the shapefile for the Pennsylvania counties from the Census website, read the data in the same manner as before:

Then we can just add a layer to our current plot:



We’re all done. This was a lot of work. The good news is that we now have a working script to repeat this process at will. If we want to change the variable of interest, we just have to make a change to our API query. If we want to map a different state or a different geography type like counties, all we need to do get the right shapefile. Our script will work for any state and any state-level geography.

I’ve put the entire R script and the shapefiles we used today in a Github repository. Let me know if you have any questions or comments below. Cheers!

Five Million Rows of Philly Parking Ticket Data

Earlier this month, the city of Philadelphia released a gold mine of a dataset: data on all of the parking tickets issued by the Philadelphia Parking Authority from 2012 to March of 2015. This is a large dataset; nearly five million observations showing the location, date, time, amount of fine, and an anonymized license plate id for each ticket! Wow! I couldn’t wait to start digging around in it. Today we’ll do some exploratory analysis in R and try to uncover something interesting.

First, let’s get the data.  The link provided above will take you to the url at where the data are located. I downloaded the Parking_Violations.csv file using the “Download” button at the top right and saved it to my working directory. Once we have that, we’ll load a couple libraries that will likely be useful:

The dplyr library will provide us with some handy tools for data manipulation. Im also loading the data.table library specifically for its fread() function. This is optimized to read data from files much faster than the read.csv() in the R base package. Since our dataset has nearly five million rows and is 767MB in size, this will be useful. (In fact I tried both functions for comparison. The fread() function indeed performed considerably faster).

Let’s load it in. Here we load the data using fread() and then we use tbl_df() to turn our data.frame (named ‘violations’) into a “tbl_df” object. This retains all the characteristics of a data.frame, but is optimized for speed and also prints to the console in a cleaner way.

Now let’s take a quick glance using head():

The data is loaded correctly, but already we see a problem. The column names have spaces in them, which won’t fly once we start manipulating the data frame. As you can tell from the output above, some of the columns are not shown due to out screen real estate not being big enough. Let’s take a look at the column names using colnames():

How can we fix this? We can do it rather easily with a single line of code using gsub(). Even if you aren’t familiar with R, it should be fairly self-explanatory:

Now take another look:

Excellent. The spaces have been replaced with underscores. Unfortunately, cleaning and preparing data is never quite so easy. We have another problem, which is not apparent to you yet because we have yet to see the entire data.frame. If we look at the column named “Fine”, we see that the fines have a dollar sign in front of them. Here’s a peek using dplyr’s select() function:

We’ll have to get rid of them if we want to perform any mathematical functions on that data, and we most certainly do! In addition, that means that the Fine column is likely a character vector rather than numeric. We can confirm using class() to inspect its type:

Luckily, this is another quick fix thanks to gsub(). Here we remove the dollar signs and covert the column to numeric in one fell swoop:

Ok, that’s done. Now we can get started on exploring the data. Let’s see the total amount in fines that the Parking Authority has collected over this time frame. That’s easy. We just sum the Fine column:

In a little over three years, the PPA has raked in nearly $225 million dollars in various fines.

One thing that I was interested in is how much money the PPA typically collects in a single day. We could simply divide the above amount by the number of days in the dataset to get the answer, but let’s dig a little deeper. I want to group the dataset by individual days and then sum all of the fines for each of those days. Once we do, we can plot it and see what it looks like. Before we do, we will need a date for each observation. You may have noticed earlier that there is a column for the date, but that it also includes the time. We will need to extract the date from each. Here we make a new column called ‘Date’ by first extracting the first 10 characters from the ‘Issue_Date_and_Time’ column using substring(). Then we convert the dates into Date objects using the as.Date() function and passing it the desired format (You can learn more about Date format options in the R documentation).

Now if we look our new ‘Date’ column, we see that it looks good.

Now, we can group our data by date and sum up the fines. We can do that by chaining together a couple of dplyr functions together in one line. We create a new data.frame called ‘grouped’. We’ll use group_by() to group the data by date and the summarize() to sum the Fine column for each date group. If you are not familiar with dplyr, the ‘%>%’ operator pipes the output of the previous function into the following function, saving a lot of intermediate steps. After that we simply rename the Fine column to ‘ttl__fine’; otherwise it will be given a default of “V2”, which isn’t very descriptive.

Now we are ready to plot the data. In order to do so, we’ll need to load a few more packages. We are going to use the most excellent ggplot2 package. We will also load a couple of other supporting packages that we’ll need.

Originally I had thought to plot this data as a line, with the date on the x-axis and total fines for each day on the y-axis. As you can imagine, this looked pretty messy. On a whim, I changed the plot type to a scatter, and something interesting appeared. Here is the code I am using to plot the data:

And the plot itself:


As you can see the data appear to be stratified into 3 distinct layers. Now there’s nothing earth-shattering here. It seems likely that the big cloud at the top are weekdays, the middle cloud is weekends, and the bottom layer are holidays. There are also some points stuck in between probably due to variation. Making assumptions however, is usually never a good idea. So let’s do our due diligence here and try to confirm this hypothesis.

Let’s split our days into normal weekdays, weekend, and holidays. The PPA provides a list of their holidays here. They are New Year’s Day, Easter Sunday, Memorial, Independence Day, Labor Day, Thanksgiving, Christmas, and every Saturday after 11am from Thanksgiving to New Year. No metered parking is enforced during these days, but other violations are still enforced.

We can use the timeDate package in R to work with dates:

First, let’s define the PPA holidays. The timeDate package has functions that return the dates of various holidays for any given year. Here we define our years of the dataset, and then feed it into each relevant holiday function and combine the resulting dates into a single vector of class timeDate:

Next, we will use the isBizDay() functions to add a new column to our grouped data.frame called ‘is_holiday’. This will be a logical vector indicating TRUE where the data is a business day and FALSE when it is not. We pass our ppa_holidays object into the function’s ‘holiday’ argument so that it will return FALSE for weekdays that are holidays. We do something similar to create a ‘is_holiday’ column indicating TRUE for days that are PPA holidays and FALSE otherwide. Note that for simplicity I am not counting the Saturdays between Thanksgiving and New Year’s Day where parking is free after 11am as holidays. We would have to incorporate the time and I’m too lazy to do that right now. This should still be fine for confirming our theory. Last of all we create a ‘is_weekend’ column, which is TRUE for all days that are not already defined as business days or holidays:

Great. now we can use dplyr to subset the grouped dataframe by weekends, business days, and holidays and then calculate a means like so:


If you compare these numbers to our plot, it looks like our hypothesis holds up very well. Business days average a little over $216,000 per day. Holidays earn a little over $23,000. The one exception is weekends, where our average seems higher than what the plot indicates; although if enough of those ‘stray’ points are weekends as well, then it may make sense.

That wraps up this post. Hopefully you learned something interesting. This is only scratching the surface of what can be done with this enormous data set. There are a whole lot of other questions we can investigate. We could combine it with other data sets like weather. We could map the data in ways that might provide newer insights. I definitely plan on diving deeper into this data soon. You should too. If you do, let me know about it in the comments below!

I will put all of the code from this post on my Github soon. Last of all, here is plot and link to this data on plotly:

Philadelphia - Total Parking Fines Issued per Day 2012-2015

Looking at Philly Bike Thefts in R

A while ago I read a post by Daniel Forsyth in which he analyzes data on bike thefts in Philadelphia using Python. In his post, he briefly discusses the relationship between the number of bike thefts and temperature, with thefts unsurprisingly rising dramatically in the warmer months. Inspired by his post, I decided to take a deeper look at this and build a data visualization depicting the relationship. Having recently begun learning the plotly API for R, I decided to take it for a spin.

Here is the final product that we will be working to build today:

Philadelphia Bike Thefts by Temperature (2010-2013)

  I want to group the all of the thefts by temperature and create a scatterplot showing total the total number of thefts by temperature. Let’s get started. First, let’s load the R libraries that we’ll need:


The plotly library will provide the tools we need to build our scatter plot and then send it to via their API. This will create a url, where our beautiful interactive plot will live for all the world too see. The data.table library will make working with the data a bit easier than using the functions available in the R standard library.

Next, we need the data. Daniel’s analysis used bike theft data from the City of Philadelphia’s Github page. You can find the date here. Upon examining the page, I noticed that there is now more data available than at the time of Daniel’s analysis. There are two csv files. The first, bicycle_thefts_Part1.csv, contains records from January 1, 2010 to September 16, 2013. The second, bicycle_thefts_Part2.csv, has records from September 15, 2014 to February 8, 2015. For now, we are going to work with only Part 1, but feel free to add Part 2 for you own work. CofPgithubLets grab the csv data directly from the site, rather than downloading files. This will make our analysis more reproducible. We can place the data directly in R data frames using read.csv() in conjunction with url():

thefts_link = ''

thefts = read.csv(url(thefts_link))

Take a look at the data to make sure that it was read into memory correctly. Using the head() command, we can look at the first five rows of data:   thefts_1   Everything looks good. For our analysis, we are primarily concerned with the THEFT_DATE column, which we will use to join the theft data with the weather data. So, now we will need to read in our weather data. For this, I have pilfered Daniel’s weather data from his github page. Thanks Daniel! I saved the file to my local drive and named it weather.csv. Let’s grab it:

weather = read.csv('weather.csv')

Now take a peak at the weather data with head():   weather_1   Notice that the date is divided into three columns; one for the month, one for the day and one for the year. We will need one column that combines all three into a single cohesive date so that the format is the same as the date column in out thefts data. To do this, we will use the paste() function, which will simply concatenate the three pieces into one date:

weather$DATE = paste(weather$MO, weather$DAY, weather$YEAR, sep="/")

Take a look using head():   weather_2   We now have a properly formatted DATE column. Let’s prepare to merge the weather data frame with the thefts data frame. Before we do, we’ll need to rename the THEFT_DATE column in the thefts data frame to ‘DATE’, such that it matches the weather frame. We can access that element of the thefts column names using colnames() like so:

colnames(thefts)[4] = 'DATE'

Now we can use the merge() function to join the two data frames on the DATE column. We’ll name our new data frame ‘data’:

data = merge(thefts, weather, by='DATE')

Take a look:   data_1   We now want to prepare our data for plotting. Recall that we are going to plot the total number of thefts for each temperature. To do so, we will need to group the rows by temperature, and then count the number of rows at each temperature. We have a high and low temperature for each day, but let’s calculate the mean of those to get a better picture of the typical temperature throughout the day. Then we will add a column of ones so that we have something to sum up the thefts. Once that is done, we’ll turn our data frame into an R data table using data.table(). Then we can use the table object to do our group and sum operation:

data$TEMP = (data$HIGH + data$LOW) / 2
data$counts = 1

data = data.table(data)

by_temp = data[, sum(counts), by=TEMP]

Once that is done, we will have a new data table called ‘by_temp’ that shows the count of thefts for each temperature:   by_temp   Great! We now have our data ready to be plotted. We will use the plotly library to build the plot and then send it to the plotly API. Since this post has already gotten quite long, I will not go into detail about this, but if you are curious about plotly, I encourage you to head to their site and check out their R documentation. We begin by instantiating a new plotly object, py, using plotly(). We then define our trace, which is basically defining the data for each axis of the graph and the type of graph. In addition, we create a layout object, which defines much of the formatting for our plot. Lastly, we call the plotly method on our py object. This sends our data to the plotly API for plotting and creates a unique url that we can share:

plot(by_temp, col='blue', main='Bike Thefts by Temperature', xlab='Daily Mean Temperature', ylab='Thefts')

py = plotly()

trace1 = list(
  x = by_temp$TEMP,
  y = by_temp$V1,

  type = 'scatter',

  marker = list(
    color = 'rgb(0, 204, 102)',

plot_data = list(trace1)

layout = list(
  title = 'Philadelphia Bike Thefts by Temperature (2010-2013)',

  xaxis = list(
    title = 'Daily Temperature'

  yaxis = list(

  annotations = list(
    x = 0.9,
    y = -0.17,
    xref = "paper",
    yref = "paper",
    text = "Source: OpenDataPhilly",
    showarrow = FALSE

response = py$plotly(plot_data, kwargs=list(filename='ph_bike_thefts', layout=layout, fileopt='overwrite'))

Now, if R does not open a web browser and automatically navigate to the new plot, you can get the url using:


Here is the plot again with full functionality:

Philadelphia Bike Thefts by Temperature (2010-2013)


Bravo, we’re done! You can find all of the code at my Github page here. Let me know below if you have any questions, comments, or suggestions. Thanks!