Tag Archives: r

Time Series Analysis in R Part 3: Getting data from Quandl

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here:

Generated data like that used in Parts 1 and 2 is great for sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we covert the data to annual as we import it.

We can also transform our data on the fly as its imported. The Quandl function has a argument transform that allows us to specify the type of data transformation we want to perform. There are five options – “diff“, ”rdiff“, ”normalize“, ”cumul“, ”rdiff_from“. Specifying the transform argument as”diff” returns the simple difference, “rdiff” yields the percentage change, “normalize” gives an index where each value is that value divided by the first period value and multiplied by 100, “cumul” gives the cumulative sum, and “rdiff_from” gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.


Linear Regression from Scratch in R

One of the very first learning algorithms that you’ll encounter when studying data science and machine learning is least squares linear regression. Linear regression is one of the easiest learning algorithms to understand; it’s suitable for a wide array of problems, and is already implemented in many programming languages. Most users are familiar with the lm() function in R, which allows us to perform linear regression quickly and easily. But one drawback to the lm() function is that it takes care of the computations to obtain parameter estimates (and many diagnostic statistics, as well) on its own, leaving the user out of the equation. So, how does the lm() function obtain these parameter estimates?

In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the lm() function is of course much more efficient).

For this example, I’ll use the well-known “Boston” data set from the MASS library. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 tracts in the city of Boston from the early 1990s. Our target variable will be the median home value for each tract — the column titled ‘medv.’ Let’s get an idea of what this data set looks like:

To learn more about the definition of each variable, type  help(Boston) into your R console.

Now we’re ready to start. Linear regression typically takes the form

 y=\beta X + \epsilon

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \epsilon is the error term; it represents features that affect the response, but are not explicitly included in our model.
The first thing we will need is a vector of our response variable, typically called ‘y’. In this case, our response variable is the median home value in thousands of US dollars: ‘medv’:

Next, we’ll need our ‘X’ or ‘design’ matrix of the input features. We can do this with the as.matrix() function, grabbing all the variables except ‘medv’ from the Boston data frame. We’ll also need to add a column of ones to our X matrix for the intercept term.

Now that we have our response vector and our ‘X’ matrix, we can use them to solve for the parameters using the following closed form solution:

\beta = (X^TX)^{-1}X^Ty

The derivation of this equation is beyond the scope of this post. If you are interested in the derivation, a good place to start is Wikipedia or any good undergraduate textbook on linear regression.

We can implement this in R using our ‘X’ matrix and ‘y’ vector. The %*%  operator is simply matrix multiplication. The t() function takes the transpose of a matrix, and solve()  calculates the inverse of any (invertible) matrix.

Now we have our parameter vector β. These are the linear relationships between the response variable ‘medv’ and each of the features. For example, we see that an increase of one unit in the number of rooms (rm) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘chas’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘chas’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

Now let’s verify that these results are in fact correct. We want to compare our results to those produced by the lm()  function. Most of you will already know how to do this.

The parameter estimates are identical! Now you know how these estimates are obtained. Keep in mind that you’re unlikely to favor implementing linear regression in this way over using lm() . The lm()  function is very quick, and requires very little code. Using it provides us with a number of diagnostic statistics, including R^2, t-statistics, and the oft-maligned p-values, among others. Nevertheless, I wanted to provide a window into what lm()  does behind the scenes.

There are other ways to solve for the parameters in linear regression. For example, gradient descent can be used to obtain parameter estimates when the number of features is extremely large, a situation that can drastically slow solution time when using the closed-form method. Stochastic gradient descent is often used when both the number of features and the number of samples are very large. I hope to detail these in a future post.

I have made the code from this post available at my Github here. Cheers!

Completing the Coursera Data Science Specialization

First of all, I’d like to apologize to anyone who read my last blog post. In it, I mentioned that I would write several follow-up posts in which I would go through the entire modelling process in the Kaggle Animal Shelter competition. Unfortunately, life got in the way and I was unable to complete these posts before the competition ended. Fortunately, I have been hard at work on something that I am now able to share.

Some of you that know me know that for the past year and a half, I have been working on completing all ten courses of the Coursera Johns Hopkins Data Science Specialization. I finally completed it earlier this month. In the final course of the specialization, we are tasked with creating an application that can perform text prediction. Many of you who have smartphones are very familiar with this type of application. When typing a text message for example, your phone may provide you with several options for the next word before you even type it.

I am not going to write a long post about how I did this. I simply wanted to share my final application with you. You can find it here or by clicking the picture below:

text prediction app

If you are interested in how it was done, I put together a few brief slides on my methodology, which you can see below. I plan on eventually releasing my code on github after I make some additional improvements.

Special thanks to Adam Danz, Brandon Danz, Bo Bucklen, Chad McComsey, Anthony Mead, Arthur Mead, and Gabbie Nirenburg for doing some QA testing and providing feedback.

Hacking the Democratic Debate: Word Clouds in R

Word clouds are an easy way to visualize textual data. They are used to distill text, whether it be a speech, a book, or an article, into a form that shows the frequency of words used. Frequency of a word can be emphasized with size, color or orientation. They are also a great early foray for those looking to get a taste of data science. Making a word cloud requires very little technical know-how. Anyone can make one with a minimal amount of programming ability.

Today we will go through the entire pipeline of making a word cloud in R. We will start with getting and cleaning our data and then going through all the processing steps to make our final product. To make things interesting, we will use an example from current events; the first Democratic debate, which aired on October 13th from the Wynn Las Vegas in Nevada. Specifically, we will make a word cloud using a transcript of the debate and we will focus on my favorite candidate: Bernie Sanders. We will obtain a copy of the debate transcript in html and parse it to obtain everything that Bernie Sanders said. Then we’ll make our word cloud using the data.

We begin by loading the packages that we will need into R. The XML package provides functionality to quickly extract content from html. We will use the stringr package to assist with string parsing. The tm package provides functions for parsing text and getting word frequency counts. Finally, the wordcloud package provides an easy interface for making the word cloud from text.

First, we will need a transcript. There are several sources of full-text transcripts of the debate. I chose the one published by CBS News, which you can find here. Next, we download the file to our local drive as so:

Once we have the file, we can load it into R as an object of type HTMLInternalDocument, using htmlTreeParse(). This is a special version of the more generic xmlParse() function in the XML library. It is particularly well-suited for html, which may be malformed and does not always conform strictly to xml formatting.  This stores our html document in a tree-like format that makes parsing it a snap.

We can take a peek at the html document to get an idea of how to proceed. Looking at the html where the transcript starts, we can see that all of the actual debate speech is wrapped in html ‘p’ tags, which denote paragraphs. So, a good start would be to extract all of the text that resides within the p tags. Again, the XML library makes this very easy. We use the xpathApply function, which is a specialized version of R’s apply function. We provide xpathApply three arguments: our HTMLInternalDocument object that we want to parse, a node type to extract (all p tags in this case), and a function that we wish to perform on each node. The double forward-slash in front of p indicates that we want the p tags from all the nodes of the document. We pass the xmlValue function, which tells xpathApply that we simply want the raw text between each set of p tags. We then unlist() the results. The xpathApply function returns a list, but we prefer to have our results as a character vector.

The results are now in a character vector, where each element of the vector is the text that resided within each p tag. This is the entire transcript of the debate; however we only want Bernie Sanders’ speech. If you take a look at htmltext, you will notice a common structure that we can exploit. In each element, if a new speaker is speaking, the text begins with the speaker’s name in capital letters followed by a colon. If the element does not represent a new speaker, then the element doest not contain the speaker’s name. Also notice that the first time a candidate speaks, the name before the colon is the candidate’s full name and title. After that, only the last name is given prior to the colon.

We first initialize an empty vector named sanders. This will hold all of the sanders quotes that we extract. We then iterate over every element of htmltext and use regexpr() to get the index of the colon, if there is one. If there is no colon, regexpr() returns a -1. If there is a colon, we set the value of the speaker variable to whatever is in front of the colon, and then append everything after the colon to our sanders vector if Sanders is the speaker. If there is no colon and Sanders is still the speaker, then we append the entire quote to our vector. Since the first p tag contains only metadata, and is not part of the transcript, we start with the second element.

Great! Now we have a vector containing all of Bernie Sanders’ speech from the debate. Now we are ready to clean it up a bit and get our word frequency counts, so that we can make the cloud. We’ll begin this part by turning our vector of strings into a character vector of length 1 using paste. This collapses everything into a single string separated by spaces.

In order to further process our text data we will use the tm package, which is short for ‘text-mining’. Here we turn our character vector into a Corpus object, using the VectorSource function to load our sanders_corpus vector. A Corpus object is a collection of text documents. In our case, the Corpus will be a single document.

Now we can use the tm_map function. This allows us to apply transformations to the documents in our Corpus. We call tm_map and pass it two arguments. The first is our Corpus object and the second is a function that is applied to each document. Here we do this five times. Each function that we pass to tm_map should be fairly self-explanatory. We first remove all the white space, punctuation, and numbers. We also remove common SMART stop words using removeWords and passing it the built-in list of common SMART stop words. SMART stop words are a list built by Gerald Salton and Chris Buckley at Cornell University. You can see the entire list and read more about the SMART information retrieval system here. I also removed  all occurrences of “CROSSTALK” which occurred when more than one person was speaking at once, as well as “LAUGHTER” and “APPLAUSE.”

We arrive arrive at the final step. The wordcloud() function does most of the work for us here. We pass it a number of arguments that control the appearance of the plot. Scale is the range of the word size. Max.words allows you to set how many of the words in the document are plotted. The rot.per argument determines what proportion of the words you wish to appear vertically. Random.order sets whether or not you want the words to be chosen randomly. We want words to appear by their frequency, so we’ve set this to false here. If you are following along with this tutorial, feel free to play with the various settings to see how the cloud varies.


That’s it! We have our finished product. The above code can be easily adapted to create a similar word cloud for the candidate of your own choosing. You could also apply this approach to one of the Republican or Democratic debates coming up in the near future. Give it a try and let me know about it in the comments below. I have made this post and its code available as a Jupyter notebook on Github here. Cheers!

Mapping Census Data in R: Pennsylvania’s Oldest Housing Stock

Recently, chloropleth maps have gained popularity on the internet. Data journalists and organizations use them to map data for web visualizations. They are a useful tool for showing how your data vary across regions, whether they be countries, US States, or counties. Although they seem a little intimidating at first glance, they are actually fairly easy to make. Today we’ll make just such a map. We will take a look at the state of Pennsylvania’s oldest housing stock by mapping the percentage of the total stock that was built before 1940 for every Census tract in the state.

Thanks to Kevin Johnson, whose recent blog post taught me how to use ggplot2 to map data, I now have a workflow for mapping U.S. Census data using shapefiles obtained from the US Census TIGER database. This workflow is largely adapted from his, with one major exception. Rather than download a csv of the data that I want to map from the Census American Fact Finder website, I will use the Census API to get the data I need. This will make the workflow for more efficient, since whenever I want to map new data, I can just change the query parameters of my API call rather than sifting through American Fact Finder for the right csv file.

Let’s get started. First, we will load all the libraries that we will need:

We will use rjson to obtain the data from the Census API, which returns json objects. The rgdal library provides us with the readOGR() function, which permits us to read a shapefile into R as a special data.frame called a SpatialPolygonsDataFrame. The dplyr library will be used for data manipulation, as usual. The mapping will be done using ggplot2, and scales and ggmap will provide some functions to assist us in formatting and saving the map.

Next we need to get our data. The Census Bureau conducts a survey called the American Community Survey that collects data on a wide array of characteristics of US households. One of those characteristics is the age of the house. These ages are lumped into categories as shown in this example taken from the American Fact Finder site:


We are interested in the oldest category, 1939 or earlier. We want the data for every Census tract in the state of Pennsylvania. Let’s head over to the developer page for the Census. Specifically, we want the API documentation for the American Community Survey 5-year estimates for 2013 located here. We want to find the API variable name for the “Built 1939 or earlier” category. Clicking on the link for the xml version of the 2009 – 2013 ACS Data Profile API Variables, we can search for what we want. Doing so yields a variable name of “DP04_0025PE”, which is the percentage of the total housing stock that was built in 1939 or earlier.

Now we can construct the pieces of an API query to get the data:

The endpoint is the API endpoint for the 5-year American Community Survey data. The variable called “get” selects the variable and also includes “NAME”, which just returns the geography names in the json response. The variable I’ve named “geo” is a string specifying that we want all of the Census tracts in the state of Pennsylvania. You can check the documentation for further details on how to select the various geographies. Finally, my API key, preceded by “key=” is stored in the “api_key” variable. Note that you’ll need to replace XXX with your key. I’ve not included mine for obvious reasons. Next, we need put it all together into a single string, with an “&” separating the parameters:

Which gives:


Then we use the fromJSON() function to make the request:

This returns a json object, which R parses into a list. Here are the first few entries:

We need to get this data into a data.frame. For each item in the list, each of which is a list itself, we need the value and the geography id of the tract, which are elements 2 and 5. Let’s write a quick helper function to extract these two pieces of data, which we will plug into sapply. This creates two vectors; one vector of the housing data, and one of the geography IDs.

Then we combine the two vectors into a data.frame. The value column is converted to numeric and divided by 100, a step necessary for the mapping which we’ll see shortly.

Finally, we can move on to the shapefiles. I will admit that I know very little about shapefiles and that they are largely a black box to me. Nevertheless, working with them at a basic level is not terribly difficult. I have downloaded the cartographic boundary shapefile for the state of Pennsylvania’s Census tracts from the Census Bureau’s website here. The shapefile is actually a collection of eight files with extensions such as .shp, .prj, and dbf. Save all of these files to a single directory.

We load the shapefiles into R using readOGR(). The dsn parameter takes the directory where the shapefiles are located. The layer parameter takes the name of the file without the extension. The fortify() function turns the object returned by readOGR() and converts it to a data.frame with which ggplot will be able to interface.

Here’s a peek at the tract data.frame:

We now need to merge this shapefile data with the housing data, matching up the geography IDs. The problem is that the IDs in the tract data.frame have extra digits on the left side that the IDs in the housing data do not:

We don’t need those first five digits as they are part of the ID that identifies the state. We’ll get rid of them:

Lastly, we merge the data sets, and finally we are ready to make the map:

We use ggplot’s geom_polygon() function. We feed it the plotData data.frame, specify longitude and latitiude as the x and y coordinates, and tell it to fill the resulting shapes using the value column, which is our housing data.




It looks great, but I’d like to add the county borders to provide some frame of reference. We can do that easily by obtaining the shapefile for the Pennsylvania counties from the Census website, read the data in the same manner as before:

Then we can just add a layer to our current plot:



We’re all done. This was a lot of work. The good news is that we now have a working script to repeat this process at will. If we want to change the variable of interest, we just have to make a change to our API query. If we want to map a different state or a different geography type like counties, all we need to do get the right shapefile. Our script will work for any state and any state-level geography.

I’ve put the entire R script and the shapefiles we used today in a Github repository. Let me know if you have any questions or comments below. Cheers!

Five Million Rows of Philly Parking Ticket Data

Earlier this month, the city of Philadelphia released a gold mine of a dataset: data on all of the parking tickets issued by the Philadelphia Parking Authority from 2012 to March of 2015. This is a large dataset; nearly five million observations showing the location, date, time, amount of fine, and an anonymized license plate id for each ticket! Wow! I couldn’t wait to start digging around in it. Today we’ll do some exploratory analysis in R and try to uncover something interesting.

First, let’s get the data.  The link provided above will take you to the url at data.phila.gov where the data are located. I downloaded the Parking_Violations.csv file using the “Download” button at the top right and saved it to my working directory. Once we have that, we’ll load a couple libraries that will likely be useful:

The dplyr library will provide us with some handy tools for data manipulation. Im also loading the data.table library specifically for its fread() function. This is optimized to read data from files much faster than the read.csv() in the R base package. Since our dataset has nearly five million rows and is 767MB in size, this will be useful. (In fact I tried both functions for comparison. The fread() function indeed performed considerably faster).

Let’s load it in. Here we load the data using fread() and then we use tbl_df() to turn our data.frame (named ‘violations’) into a “tbl_df” object. This retains all the characteristics of a data.frame, but is optimized for speed and also prints to the console in a cleaner way.

Now let’s take a quick glance using head():

The data is loaded correctly, but already we see a problem. The column names have spaces in them, which won’t fly once we start manipulating the data frame. As you can tell from the output above, some of the columns are not shown due to out screen real estate not being big enough. Let’s take a look at the column names using colnames():

How can we fix this? We can do it rather easily with a single line of code using gsub(). Even if you aren’t familiar with R, it should be fairly self-explanatory:

Now take another look:

Excellent. The spaces have been replaced with underscores. Unfortunately, cleaning and preparing data is never quite so easy. We have another problem, which is not apparent to you yet because we have yet to see the entire data.frame. If we look at the column named “Fine”, we see that the fines have a dollar sign in front of them. Here’s a peek using dplyr’s select() function:

We’ll have to get rid of them if we want to perform any mathematical functions on that data, and we most certainly do! In addition, that means that the Fine column is likely a character vector rather than numeric. We can confirm using class() to inspect its type:

Luckily, this is another quick fix thanks to gsub(). Here we remove the dollar signs and covert the column to numeric in one fell swoop:

Ok, that’s done. Now we can get started on exploring the data. Let’s see the total amount in fines that the Parking Authority has collected over this time frame. That’s easy. We just sum the Fine column:

In a little over three years, the PPA has raked in nearly $225 million dollars in various fines.

One thing that I was interested in is how much money the PPA typically collects in a single day. We could simply divide the above amount by the number of days in the dataset to get the answer, but let’s dig a little deeper. I want to group the dataset by individual days and then sum all of the fines for each of those days. Once we do, we can plot it and see what it looks like. Before we do, we will need a date for each observation. You may have noticed earlier that there is a column for the date, but that it also includes the time. We will need to extract the date from each. Here we make a new column called ‘Date’ by first extracting the first 10 characters from the ‘Issue_Date_and_Time’ column using substring(). Then we convert the dates into Date objects using the as.Date() function and passing it the desired format (You can learn more about Date format options in the R documentation).

Now if we look our new ‘Date’ column, we see that it looks good.

Now, we can group our data by date and sum up the fines. We can do that by chaining together a couple of dplyr functions together in one line. We create a new data.frame called ‘grouped’. We’ll use group_by() to group the data by date and the summarize() to sum the Fine column for each date group. If you are not familiar with dplyr, the ‘%>%’ operator pipes the output of the previous function into the following function, saving a lot of intermediate steps. After that we simply rename the Fine column to ‘ttl__fine’; otherwise it will be given a default of “V2”, which isn’t very descriptive.

Now we are ready to plot the data. In order to do so, we’ll need to load a few more packages. We are going to use the most excellent ggplot2 package. We will also load a couple of other supporting packages that we’ll need.

Originally I had thought to plot this data as a line, with the date on the x-axis and total fines for each day on the y-axis. As you can imagine, this looked pretty messy. On a whim, I changed the plot type to a scatter, and something interesting appeared. Here is the code I am using to plot the data:

And the plot itself:


As you can see the data appear to be stratified into 3 distinct layers. Now there’s nothing earth-shattering here. It seems likely that the big cloud at the top are weekdays, the middle cloud is weekends, and the bottom layer are holidays. There are also some points stuck in between probably due to variation. Making assumptions however, is usually never a good idea. So let’s do our due diligence here and try to confirm this hypothesis.

Let’s split our days into normal weekdays, weekend, and holidays. The PPA provides a list of their holidays here. They are New Year’s Day, Easter Sunday, Memorial, Independence Day, Labor Day, Thanksgiving, Christmas, and every Saturday after 11am from Thanksgiving to New Year. No metered parking is enforced during these days, but other violations are still enforced.

We can use the timeDate package in R to work with dates:

First, let’s define the PPA holidays. The timeDate package has functions that return the dates of various holidays for any given year. Here we define our years of the dataset, and then feed it into each relevant holiday function and combine the resulting dates into a single vector of class timeDate:

Next, we will use the isBizDay() functions to add a new column to our grouped data.frame called ‘is_holiday’. This will be a logical vector indicating TRUE where the data is a business day and FALSE when it is not. We pass our ppa_holidays object into the function’s ‘holiday’ argument so that it will return FALSE for weekdays that are holidays. We do something similar to create a ‘is_holiday’ column indicating TRUE for days that are PPA holidays and FALSE otherwide. Note that for simplicity I am not counting the Saturdays between Thanksgiving and New Year’s Day where parking is free after 11am as holidays. We would have to incorporate the time and I’m too lazy to do that right now. This should still be fine for confirming our theory. Last of all we create a ‘is_weekend’ column, which is TRUE for all days that are not already defined as business days or holidays:

Great. now we can use dplyr to subset the grouped dataframe by weekends, business days, and holidays and then calculate a means like so:


If you compare these numbers to our plot, it looks like our hypothesis holds up very well. Business days average a little over $216,000 per day. Holidays earn a little over $23,000. The one exception is weekends, where our average seems higher than what the plot indicates; although if enough of those ‘stray’ points are weekends as well, then it may make sense.

That wraps up this post. Hopefully you learned something interesting. This is only scratching the surface of what can be done with this enormous data set. There are a whole lot of other questions we can investigate. We could combine it with other data sets like weather. We could map the data in ways that might provide newer insights. I definitely plan on diving deeper into this data soon. You should too. If you do, let me know about it in the comments below!

I will put all of the code from this post on my Github soon. Last of all, here is plot and link to this data on plotly:

Philadelphia - Total Parking Fines Issued per Day 2012-2015

Looking at Philly Bike Thefts in R

A while ago I read a post by Daniel Forsyth in which he analyzes data on bike thefts in Philadelphia using Python. In his post, he briefly discusses the relationship between the number of bike thefts and temperature, with thefts unsurprisingly rising dramatically in the warmer months. Inspired by his post, I decided to take a deeper look at this and build a data visualization depicting the relationship. Having recently begun learning the plotly API for R, I decided to take it for a spin.

Here is the final product that we will be working to build today:

Philadelphia Bike Thefts by Temperature (2010-2013)

  I want to group the all of the thefts by temperature and create a scatterplot showing total the total number of thefts by temperature. Let’s get started. First, let’s load the R libraries that we’ll need:


The plotly library will provide the tools we need to build our scatter plot and then send it to plotly.com via their API. This will create a url, where our beautiful interactive plot will live for all the world too see. The data.table library will make working with the data a bit easier than using the functions available in the R standard library.

Next, we need the data. Daniel’s analysis used bike theft data from the City of Philadelphia’s Github page. You can find the date here. Upon examining the page, I noticed that there is now more data available than at the time of Daniel’s analysis. There are two csv files. The first, bicycle_thefts_Part1.csv, contains records from January 1, 2010 to September 16, 2013. The second, bicycle_thefts_Part2.csv, has records from September 15, 2014 to February 8, 2015. For now, we are going to work with only Part 1, but feel free to add Part 2 for you own work. CofPgithubLets grab the csv data directly from the site, rather than downloading files. This will make our analysis more reproducible. We can place the data directly in R data frames using read.csv() in conjunction with url():

thefts_link = 'https://raw.githubusercontent.com/CityOfPhiladelphia/phl-open-geodata/master/bicycle_thefts/bicycle_thefts_Part1.csv'

thefts = read.csv(url(thefts_link))

Take a look at the data to make sure that it was read into memory correctly. Using the head() command, we can look at the first five rows of data:   thefts_1   Everything looks good. For our analysis, we are primarily concerned with the THEFT_DATE column, which we will use to join the theft data with the weather data. So, now we will need to read in our weather data. For this, I have pilfered Daniel’s weather data from his github page. Thanks Daniel! I saved the file to my local drive and named it weather.csv. Let’s grab it:

weather = read.csv('weather.csv')

Now take a peak at the weather data with head():   weather_1   Notice that the date is divided into three columns; one for the month, one for the day and one for the year. We will need one column that combines all three into a single cohesive date so that the format is the same as the date column in out thefts data. To do this, we will use the paste() function, which will simply concatenate the three pieces into one date:

weather$DATE = paste(weather$MO, weather$DAY, weather$YEAR, sep="/")

Take a look using head():   weather_2   We now have a properly formatted DATE column. Let’s prepare to merge the weather data frame with the thefts data frame. Before we do, we’ll need to rename the THEFT_DATE column in the thefts data frame to ‘DATE’, such that it matches the weather frame. We can access that element of the thefts column names using colnames() like so:

colnames(thefts)[4] = 'DATE'

Now we can use the merge() function to join the two data frames on the DATE column. We’ll name our new data frame ‘data’:

data = merge(thefts, weather, by='DATE')

Take a look:   data_1   We now want to prepare our data for plotting. Recall that we are going to plot the total number of thefts for each temperature. To do so, we will need to group the rows by temperature, and then count the number of rows at each temperature. We have a high and low temperature for each day, but let’s calculate the mean of those to get a better picture of the typical temperature throughout the day. Then we will add a column of ones so that we have something to sum up the thefts. Once that is done, we’ll turn our data frame into an R data table using data.table(). Then we can use the table object to do our group and sum operation:

data$TEMP = (data$HIGH + data$LOW) / 2
data$counts = 1

data = data.table(data)

by_temp = data[, sum(counts), by=TEMP]

Once that is done, we will have a new data table called ‘by_temp’ that shows the count of thefts for each temperature:   by_temp   Great! We now have our data ready to be plotted. We will use the plotly library to build the plot and then send it to the plotly API. Since this post has already gotten quite long, I will not go into detail about this, but if you are curious about plotly, I encourage you to head to their site and check out their R documentation. We begin by instantiating a new plotly object, py, using plotly(). We then define our trace, which is basically defining the data for each axis of the graph and the type of graph. In addition, we create a layout object, which defines much of the formatting for our plot. Lastly, we call the plotly method on our py object. This sends our data to the plotly API for plotting and creates a unique url that we can share:

plot(by_temp, col='blue', main='Bike Thefts by Temperature', xlab='Daily Mean Temperature', ylab='Thefts')

py = plotly()

trace1 = list(
  x = by_temp$TEMP,
  y = by_temp$V1,

  type = 'scatter',

  marker = list(
    color = 'rgb(0, 204, 102)',

plot_data = list(trace1)

layout = list(
  title = 'Philadelphia Bike Thefts by Temperature (2010-2013)',

  xaxis = list(
    title = 'Daily Temperature'

  yaxis = list(

  annotations = list(
    x = 0.9,
    y = -0.17,
    xref = "paper",
    yref = "paper",
    text = "Source: OpenDataPhilly",
    showarrow = FALSE

response = py$plotly(plot_data, kwargs=list(filename='ph_bike_thefts', layout=layout, fileopt='overwrite'))

Now, if R does not open a web browser and automatically navigate to the new plot, you can get the url using:


Here is the plot again with full functionality:

Philadelphia Bike Thefts by Temperature (2010-2013)


Bravo, we’re done! You can find all of the code at my Github page here. Let me know below if you have any questions, comments, or suggestions. Thanks!