Category Archives: R

Time Series Analysis in R Part 3: Getting data from Quandl

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here:

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use, including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:
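A quick way to do that from the R console:

```r
# Install the Quandl package from CRAN (if not already installed)
install.packages("Quandl")
library(Quandl)
```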

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.
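The call might look something like this; the series code "YALE/NHPI" is my assumption for the Shiller housing price index, since the original snippet isn't shown here:

```r
library(Quandl)

# Shiller home price index (code "YALE/NHPI" is an assumption).
# type = "ts" imports the data as a base R ts object; no end_date
# means we get the most recent data available.
hpi <- Quandl("YALE/NHPI", type = "ts", start_date = "1990-01-01")
```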

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.
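A sketch of those calls; the three series codes below are assumptions (FRED and University of Michigan series hosted on Quandl), not necessarily the ones used in the original post:

```r
library(Quandl)

# US GDP, US personal income, and the UMich consumer survey on
# home-selling conditions (codes are assumptions)
gdp  <- Quandl("FRED/GDP",    type = "ts")
inc  <- Quandl("FRED/PI",     type = "ts")
mich <- Quandl("UMICH/SOC43", type = "ts")
```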

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.
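For example (again assuming the "FRED/GDP" code):

```r
library(Quandl)

# collapse = "annual" aggregates the quarterly series to annual
# frequency as it is imported
gdp_annual <- Quandl("FRED/GDP", type = "ts", collapse = "annual")
```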

We can also transform our data on the fly as it’s imported. The Quandl function has an argument, transform, that allows us to specify the type of data transformation we want to perform. There are five options: “diff”, “rdiff”, “normalize”, “cumul”, and “rdiff_from”. Specifying the transform argument as “diff” returns the simple difference, “rdiff” yields the percentage change, “normalize” gives an index where each value is that value divided by the first period value and multiplied by 100, “cumul” gives the cumulative sum, and “rdiff_from” gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:
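Something like the following (series code assumed as before):

```r
library(Quandl)

# transform = "rdiff" returns the period-over-period percentage change
gdp_pct <- Quandl("FRED/GDP", type = "ts", transform = "rdiff")
```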

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.

 

Time Series Analysis in R Part 1: The Time Series Object

Many of the methods used in time series analysis and forecasting have been around for quite some time but have taken a back seat to machine learning techniques in recent years. Nevertheless, time series analysis and forecasting are useful tools in any data scientist’s toolkit. Several time series-based competitions have recently appeared on Kaggle, such as one hosted by Wikipedia where competitors are asked to forecast web traffic to various pages of the site. As an economist, I have been working with time series data for many years; however, I was largely unfamiliar with (and a bit overwhelmed by) R’s functions and packages for working with them. From the base ts objects to a whole host of other packages like xts, zoo, TTR, forecast, quantmod and tidyquant, R has a large infrastructure supporting time series analysis. I decided to put together a guide for myself in Rmarkdown. I plan on sharing this as I go in a series of blog posts. In part 1, I’ll discuss the fundamental object in R – the ts object.

The Time Series Object

In order to begin working with time series data and forecasting in R, you must first acquaint yourself with R’s ts object. The ts object is a part of base R. Other packages such as xts and zoo provide other APIs for manipulating time series objects. I’ll cover those in a later part of this guide.

Here we create a vector of simulated data that could potentially represent some real-world time-based data generation process. It is simply a sequence from 1 to 100 with some random normal noise added to it. We can use R’s base plot() function to see what it looks like:
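A minimal sketch (the seed and the noise scale are assumptions; the original snippet isn't shown):

```r
set.seed(123)  # seed value is an assumption, for reproducibility

# A sequence from 1 to 100 with random normal noise added
t <- seq(1, 100) + rnorm(100, sd = 5)

plot(t)
```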

This could potentially represent some time series, with time represented along the x-axis. However, it’s hard to tell. The x-axis is simply an index from 1 to 100 in this case.

A vector object such as t above can easily be converted to a time series object using the ts() function. The ts() function takes several arguments, the first of which, data, is the data itself. We can look at all of the arguments of ts() using the args() function:
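For example:

```r
# Print the full argument list of ts()
args(ts)
```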

To begin, we will focus on the first four arguments – data, start, end and frequency. The data argument is the data itself (a vector or matrix). The start and end arguments allow us to provide a start date and end date for the series. Finally the frequency argument lets us specify the number of observations per unit of time. For example, if we had monthly data, we would use 12 for the frequency argument, indicating that there are 12 months in the year.

Let’s assume our generated data is quarterly data that starts in the first quarter of 2000. We would turn it into a ts object as below. We specify the start argument as a two element vector. The first element is the year and the second element is the observation of that year in which the data start. Because our data is quarterly, we use 4 for the frequency argument.
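A sketch of that conversion (the simulated data is regenerated here so the block is self-contained; seed and noise scale are assumptions):

```r
set.seed(123)
t <- seq(1, 100) + rnorm(100, sd = 5)

# start = c(year, period-within-year); frequency = 4 for quarterly data
tseries <- ts(t, start = c(2000, 1), frequency = 4)

plot(tseries)
```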

Notice that now when we plot the data, R recognizes that it is a ts object and plots the data as a line with dates along the x-axis.

Aside from creating ts objects containing a single series of data, we can also create ts objects that contain multiple series. We can do this by passing a matrix rather than a vector to the data argument of ts().
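One way to sketch this; the three series below are invented for illustration (the third is deliberately on a much larger scale):

```r
set.seed(123)

# Three illustrative series combined into a matrix
mat <- cbind(ts1 = seq(1, 100) + rnorm(100, sd = 5),
             ts2 = cos(seq(1, 100)) + rnorm(100, sd = 0.5),
             ts3 = seq(1, 100)^2 + rnorm(100, sd = 50))

# Passing a matrix to ts() yields a multivariate ("mts") object
tsm <- ts(mat, start = c(2000, 1), frequency = 4)

plot(tsm)
```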

Now when we plot the ts object, R automatically facets the plot.

At this point, I should mention what really happens when we call the plot() function on a ts object. R recognizes when the x argument is a ts object and actually calls the plot.ts() function under the hood. We can verify this by using it directly. Notice that it produces an identical graph.
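For example:

```r
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

# plot() dispatches to plot.ts() for ts objects; calling plot.ts()
# directly produces an identical graph
plot.ts(tseries)
```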

The plot.ts() function has different arguments geared towards time series objects. We can look at these again using the args() function.
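For example:

```r
# Print the argument list of plot.ts()
args(plot.ts)
```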

Notice that it has an argument called plot.type that lets us indicate whether we want our plot to be faceted (multiple) or single-panel (single). Although in the case of our data above we would not want to plot all three series on the same panel given the difference in scale for ts3, it can be done quite easily.
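A sketch of a single-panel plot, using the same invented three-series object as above:

```r
set.seed(123)
mat <- cbind(ts1 = seq(1, 100) + rnorm(100, sd = 5),
             ts2 = cos(seq(1, 100)) + rnorm(100, sd = 0.5),
             ts3 = seq(1, 100)^2 + rnorm(100, sd = 50))
tsm <- ts(mat, start = c(2000, 1), frequency = 4)

# plot.type = "single" draws all three series on one panel
plot(tsm, plot.type = "single", col = 1:3)
```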

I am not going to go in-depth into using R’s base plotting capability. Although it is perfectly fine, I strongly prefer to use ggplot2 as well as the ggplot-based graphing functions available in Rob Hyndman’s forecast package. We will discuss these in parts 2 and 3.

Convenience Functions for Time Series

There are several useful functions for use with ts objects that can make programming easier. These are window(), start(), end(), and frequency(). These are fairly self-explanatory. The window function is a quick and easy way to obtain a slice of a time series object. For example, look again at our object tseries. Assume that we wanted only the data from the first quarter of 2000 to the last quarter of 2012. We can do so using window():
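A sketch of the window() call (tseries is regenerated here so the block runs on its own):

```r
set.seed(123)
tseries <- ts(seq(1, 100) + rnorm(100), start = c(2000, 1), frequency = 4)

# Slice the series from 2000 Q1 through 2012 Q4
tsub <- window(tseries, start = c(2000, 1), end = c(2012, 4))
```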

The start() function returns the start date of a ts object, end() gives the end date, and frequency() returns the frequency of a given time series:
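For example:

```r
tseries <- ts(seq(1, 100), start = c(2000, 1), frequency = 4)

start(tseries)      # 2000 1
end(tseries)        # 2024 4  (100 quarters from 2000 Q1)
frequency(tseries)  # 4
```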

That’s all for now. In Part 2, we’ll dive into some of the many transformation functions for working with time series in R. See you then.

Extracting Tables from PDFs in R using the Tabulizer Package

Recently I wanted to extract a table from a pdf file so that I could work with the table in R. Specifically, I wanted to get data on layoffs in California from the California Employment Development Department. The EDD publishes a list of all of the layoffs in the state that fall under the WARN act here. Unfortunately, the tables are available only in pdf format. I wanted an interactive version of the data that I could work with in R and export to a csv file. Fortunately, the tabulizer package in R makes this a cinch. In this post, I will use this scenario as a working example to show how to extract data from a pdf file using the tabulizer package in R.

The link to the pdf gets updated often, so here I’ve provided the pdf as downloaded from the site on November 29, 2016:

warn-report-for-7-1-2016-to-11-25-2016

 

First, we will need to load the tabulizer package as well as dplyr.
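For example (note that tabulizer requires a working Java/rJava setup):

```r
library(tabulizer)  # pdf table extraction; requires Java via rJava
library(dplyr)      # for data manipulation later on
```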

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.
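A sketch of the call; the URL below is illustrative, since the EDD link changes often (and newer versions of tabulizer moved this behavior to an `output` argument):

```r
library(tabulizer)

# Illustrative URL; the real EDD WARN report link changes often
pdf_url <- "https://www.edd.ca.gov/jobs_and_training/warn/WARN-Report.pdf"

# guess = TRUE lets tabulizer locate the tables on its own;
# method = "matrix" returns a list of character matrices, one per page
out <- extract_tables(pdf_url, guess = TRUE, method = "matrix")
```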

Now we have a list object called out, with each element a matrix representation of a page of the pdf table. We want to combine these into a single data matrix containing all of the data. We can do so most elegantly by combining do.call and rbind, passing it our list of matrices. Notice that I am excluding the last page here. The final page is the totals and summary information. We don’t need that.
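The combining step can be sketched as follows; a small mock list of matrices stands in for the real `out` returned by extract_tables():

```r
# Mock stand-in for tabulizer's output: a list of character matrices,
# one per pdf page, with a totals page at the end
out <- list(matrix(as.character(1:6),  ncol = 3),
            matrix(as.character(7:12), ncol = 3),
            matrix(c("Totals", "", ""), ncol = 3))

# Drop the final page (totals/summary) and stack the rest row-wise
final <- do.call(rbind, out[-length(out)])
```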

After doing so, the first three rows of the matrix contain the headers, which have not been formatted well since they take up multiple rows of the pdf table. Let’s fix that. Here I turn the matrix into a data.frame dropping the first three rows. Then I create a character vector containing the formatted headers and use that as the column names.
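Something like the following; the mock matrix and the column names below are assumptions based on the WARN report's layout, not the exact names from the original post:

```r
# Mock stand-in for the stacked matrix: 3 header rows plus 1 data row
final <- matrix(c(rep("header", 21),
                  "07/05/2016", "08/31/2016", "07/05/2016",
                  "Acme Corp", "Sacramento", "150", "Layoff"),
                ncol = 7, byrow = TRUE)

# Drop the three wrapped header rows and supply clean column names
data <- as.data.frame(final[-(1:3), , drop = FALSE],
                      stringsAsFactors = FALSE)
colnames(data) <- c("Notice.Date", "Effective.Date", "Received.Date",
                    "Company", "City", "No.of.Employees", "Layoff.Closure")
```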

We now have a data.frame of all of the California layoffs. A quick glance at the first few rows:

In order to manipulate the data properly, we will probably want to change the date column to a Date object as well as convert the No.of.Employees column to numeric. Here I do so using dplyr.
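A sketch of those conversions (a tiny mock data frame stands in for the full table; column names are assumptions as above):

```r
library(dplyr)

# Mock stand-in for the extracted table
data <- data.frame(Effective.Date  = "08/31/2016",
                   No.of.Employees = "150",
                   stringsAsFactors = FALSE)

data <- data %>%
  mutate(Effective.Date  = as.Date(Effective.Date, format = "%m/%d/%Y"),
         No.of.Employees = as.numeric(No.of.Employees))
```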

Last of all, I finish up by writing the final table to csv so that I can load it for later use.
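For example (the file name is illustrative):

```r
data <- data.frame(Company = "Acme Corp", No.of.Employees = 150)

# Write the cleaned table out for later use; file name is illustrative
write.csv(data, "ca_layoffs.csv", row.names = FALSE)
```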

I have found the tabulizer package to be wonderfully easy to use. Much of the process of extracting the data and tables from pdfs is abstracted away from the user. This was a very simple example; however, if one requires more finely tuned control of how tables are extracted, the extract_tables function has a lot of additional arguments to tweak to one’s liking. I encourage you to take a look for yourself.

You can find the code for this post on my Github.