Thursday, April 28, 2016

Shades of Blue: How blue is Austria really?

On April 24, 2016, the Austrian presidential election caused a political earthquake: for the first time in history, no candidate from the SPÖ or ÖVP made it into the run-off. Instead, the run-off will be decided between the FPÖ candidate Hofer and the (more or less, at least officially independent) Green candidate Van der Bellen. The result weighed all the more heavily because Hofer achieved a clear lead: his roughly 35% stood against Van der Bellen's 21%.

Soon, maps of Austria were making the rounds on the internet1, coloring almost the entire country blue. How so? For each municipality, the color was determined by the candidate with the most votes: municipalities where Hofer led were colored blue, those won by Van der Bellen green. A few other splashes of color marked the remaining candidates.
The example shows that, apart from a few green islands, Austria had turned into a blue sea. This form of presentation, however, suffers from some rather glaring errors:
  1. Ignoring the postal votes
  2. Winner takes it all
  3. Ignoring voter turnout
  4. Area is not population density
Taken together, these errors in presentation color Austria uniformly blue without any justification and suggest to the public a clarity that simply isn't there. Of course Hofer won the first round of the election, but that does not make Austria blue by a long shot.

So how blue is Austria really?

What follows is a review and correction of the errors in the original map listed above. Shown are the results for Hofer (blue) and Van der Bellen (orange)2. Orange was chosen because it is a complementary color to blue, unlike green. White will be mixed into the two extreme colors to represent the uncertainty of the result, for reasons that differ from map to map.

Ignoring the postal votes

The original map only shows the votes cast at polling stations. The postal votes -- roughly 640,000 voting cards had been issued -- were not taken into account: firstly because they had not yet been counted on election Sunday, and secondly because they are only available at the district level, not at the municipality level. But since postal voters traditionally do not vote blue, this is a first source of error.

Here, then, is a map of Austria that also takes the postal votes into account. Since these are not available at the municipality level, the presentation is coarser, at the level of the political district. Districts where Hofer won the most votes outright are blue, those where Van der Bellen prevailed are orange. White as the color of uncertainty does not yet appear here.

Winner takes it all

The kind of presentation chosen in the original map always assigns a municipality to the winner, regardless of how large the lead actually is -- even though Austrian electoral law offers no basis for this whatsoever. While other countries do use the winner-takes-it-all principle, it is completely foreign to Austria. Here, voting is strictly proportional, even if that plays only a minor role in a presidential election. The map, however, creates the impression that municipalities vote homogeneously, opening the door wide to the ecological fallacy. Far better suited would be a presentation that takes the ratio of the votes into account.
To correct this error, the ratio of Hofer to Van der Bellen votes is shown. The bluer a district, the larger the share of votes for Hofer; the more orange, the more people voted for Van der Bellen. The lighter the respective color, the less decisive the result: white is mixed in. In white districts, Hofer and Van der Bellen are nearly tied.

Ignoring voter turnout

Another shortcoming of the original map is that it ignores voter turnout. If, as is often suggested, the map is supposed to show that the attitude of the population is "blue", then turnout is of essential importance as well. In the following map, the ratio of Hofer to Van der Bellen votes is therefore additionally weighted by turnout. The more saturated the color, the higher the turnout in the district. White now also becomes the color of non-voters.

Area is not population density

Finally -- and this is a mistake that happens frequently with map-based presentations -- areas are given meaning. Areas certainly matter in geography and, for instance, in agriculture: forest density can sensibly be shown on an area. For election results, however, this kind of presentation is only of limited use. In any case, population density must be taken into account. The country is not more blue just because large but sparsely populated districts voted more blue. And since large parts of Austria are only sparsely populated, this factor carries considerable weight.

To conclude, here is a map that corrects all of these errors. As before, the bluer a district, the clearer the vote for Hofer. The more white the color coding contains, the lower the turnout and the smaller the population.
Austria is thus still largely blue, but the blue comes in many different shades, and it is dominated by the white of the non-voters and of the sparsely populated mountain valleys. The artifact of the Vorarlberg Green voters in the west has also disappeared entirely.
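
For readers who want to tinker with the idea, here is a minimal sketch in R of the color-blending logic described above. The function, the variable names and the exact blending rule are made up for illustration; this is not the code behind the actual maps.

# blend blue/orange by vote share, then mix in white for low turnout and small population
blendColor <- function(hofer, vdb, turnout, pop, pop_max) {
  share <- hofer / (hofer + vdb)              # 1 = all Hofer, 0 = all Van der Bellen
  base  <- colorRamp(c("orange", "blue"))(share) / 255
  certainty <- turnout * (pop / pop_max)      # the smaller, the more white gets mixed in
  mixed <- certainty * base + (1 - certainty) * c(1, 1, 1)
  rgb(mixed[1], mixed[2], mixed[3])
}

# a hypothetical district: Hofer 3500 votes, Van der Bellen 2100, 68% turnout
blendColor(hofer = 3500, vdb = 2100, turnout = 0.68, pop = 20000, pop_max = 1800000)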

Update

As critically noted in the comments, a small error had crept in: for districts that are completely enclosed by others, such as Graz enclosed by Graz-Umgebung, the coloring was not quite right. This has now been corrected. Thanks for the hints.

Footnotes

1: Representative of the numerous maps in circulation, this one was shared, without attribution, by Liza Ulitzka on Facebook. Another map, though at the district level, can be found in Der Standard. Referring to such maps, Hans Rauscher also wrote a commentary in Der Standard.

2: The analysis is based on the preliminary official election result. The map material was provided by Statistik Austria under a CC-BY license.

Thursday, November 26, 2015

Accessing APIs from R (and a little R programming)

APIs are the driving force behind data mash-ups. It is APIs that allow machines to access data programmatically – that is, automatically from within a program – and to make use of the functionality and data a service provides. Without APIs, much of today's Web 2.0, app and data landscape would be outright impossible.

This post is about using APIs with R. As an example, we'll use the API to the EU's EurLex1 data base as provided by Buhl Rassmussen. This API is a good example of the kind of API you might find in the wild. Of course, there are the APIs of large vendors, like Google or Facebook, that are well thought out and well documented. But then there is the vast majority of smaller APIs for special applications that often lack structure or documentation. Nevertheless, these APIs often provide access to valuable resources.

Background on APIs

API is short for Application Programming Interface. Basically, it means a way of accessing the functionality of a program from inside another program. So instead of performing an action using an interface that was made for humans, a point-and-click GUI for instance, an API allows a program to perform that action automatically. The power of this concept only becomes visible when you imagine that you can mesh the API call with anything else the calling program might want to do. Some examples from data science:

  • Retrieve data and produce a visualization from it that gets updated every time someone looks at it
  • Have tweets automatically translated and entities reported
  • Have additional nodes in a computer cluster launched as soon as tasks become cumbersome, to ensure fast data processing

While an API can be any defined interface between two programs, today the term usually refers to a special kind of API based on the WWW's HyperText Transfer Protocol (HTTP), the same protocol web servers and browsers use to exchange data. Indeed, one might consider browsing the web as using APIs: a program (the browser) uses a defined set of commands and conventions to retrieve data (the webpage) from a remote server (the website) and renders it locally in the browser (the thing you see).

All web-based2 APIs share the same basic structure: they consist of a URL to a domain and a path to an endpoint. For instance, in http://example.com/api, http://example.com is the URL and /api is the path to the endpoint.

Web-based APIs used for data science usually come in two flavors, named after the HTTP verbs they are built around3:

  • GET - sends a set of parameters, a query, to an endpoint and then receives an answer.
  • POST - sends a data payload to an endpoint to be processed at the remote system, usually receiving only a success message as an answer.

There are other verbs defined in HTTP, like DELETE, but they are less common in APIs. By far the largest group of APIs makes use only of the GET verb. Let's look at that flavor in greater detail.

A canonical example would be an API that allows you to retrieve data from some data base, where the API's query can be used to narrow down the selection. Let's say an API provides access to newspaper articles. By specifying the parameter year the API returns not all articles, but only those that were written in a specific year. Let's say we are only interested in articles written in 2014. The corresponding API call would thus look like: http://example.com/api?year=2014. We already know which part is the URL and which part is the path to the endpoint. What's new is the query year=2014. Note that it's separated from the path by a question mark. In this example, year is the name of the parameter, and 2014 is its value.

Upon receiving the API call, the remote system crafts an answer. The answer can be in any format: an image file, a movie, text, … In recent years, JSON has become by far the most common answer format. JSON is a simple text format that uses special characters and conventions to bring structure into its contents. You can find more info on the Wikipedia page on JSON. For now it suffices to know that it is a popular format for storing data, that the data can be nested and delivered together with metadata, and that R can process it quite easily.
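
To get a feeling for this, here is a minimal sketch of how a small, made-up JSON snippet is parsed into R objects with the jsonlite package that we will meet again below:

library(jsonlite)

# a tiny, made-up JSON snippet: a named object containing a nested array
example.json <- '{"year": 2014, "articles": [{"title": "A"}, {"title": "B"}]}'

parsed <- fromJSON(example.json)
parsed$year            # a number: 2014
parsed$articles$title  # a character vector: "A" "B"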

The big problem with APIs is that they are always designed by humans, so APIs vary wildly in logical structure and in the quality of their documentation. This unfortunately means that there is no simple catch-all solution for working with APIs: every program needs to be custom tailored to the API it uses. It also means that using an API almost always requires some degree of programming.

Accessing APIs from R

In this example, we'll use R to retrieve data from an API and process it. The API we'll query provides data on EU legislative documents. More specifically, we are interested in which weekday is most popular for EU energy legislation to enter into force. Ok, perhaps that's not a mind-blowing research question, but it will allow us to demonstrate the use of APIs and the required data processing quite nicely.

Required packages

There are many facilities in R that can be used to access APIs. The one package I find most useful is Hadley4's httr. It allows for easy crafting of API calls and also handles the more intricate aspects of APIs, like authentication.

Working with JSON data is facilitated a lot by the jsonlite package. It does a good job translating JSON’s nested data structures into sensible R objects. Well, most of the time, anyway.

Since in this example we are going to work with dates, let’s use another of Hadley’s packages: lubridate. If you work with dates frequently, it’s a package that might be a valuable addition to your toolbox.

If you don’t yet have these packages installed, you can use this R code to obtain them:

install.packages(c("httr", "jsonlite", "lubridate"))

Outline of the example

In this subsection I'll outline the steps required to answer our question (which weekday is most popular for energy-related documents to enter into force?).

EurLex documents all bear a document classifier (directory code in EurLex parlance) that can be used to single out documents relating to a specific topic. EurLex classifiers are always four dot-separated pairs of digits. For instance, the classifier 07.40.30.00 identifies documents relating to air traffic safety. We will use the appropriate classifiers to retrieve the data on energy-related documents.

So, these are the required steps we’ll need to take to get our answer:

  1. Retrieve a list of all the valid classifiers
  2. Extract from that answer only those classifiers that relate to energy, i.e. start with 12..5
  3. Retrieve the documents’ meta data that are classified with one of the classifiers we’ve found to be relevant.
  4. Work with the data we’ve retrieved to find out which weekday is most frequent.

Steps (1) and (3) will involve calling the API and (2) and (4) are just local data processing chores.

Before doing anything else, we need to load the required packages:

library(httr)
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
library(lubridate)

One more thing before we get started: R has the “feature” of automatically turning character strings into factor variables. This is great when doing actual statistical work. It is this magic that allows R to turn multinomial variables into dummy variables in regression models and to produce nice cross tables. When working with APIs, however, this “feature” becomes a hindrance. Let's just turn it off. Note: this call only affects the current session; when you restart R, all settings will be back to normal.

options(stringsAsFactors = FALSE)
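
A quick illustration of what this setting changes (note that since R 4.0, data.frame() no longer converts strings by default, so on current R versions the global option matters less):

class(data.frame(x = "a", stringsAsFactors = TRUE)$x)   # "factor"
class(data.frame(x = "a", stringsAsFactors = FALSE)$x)  # "character"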

Retrieving valid classifiers

Calling the /eurlex/directory_code endpoint directly retrieves a list of all valid classifiers. Let's obtain that list. First, we need to set up the URL and path part of the API call. A query is not required at this point, as the API provides the answer directly.

url  <- "http://api.epdb.eu"
path <- "eurlex/directory_code"

Executing an API call with the GET flavor is done using the GET() function.

raw.result <- GET(url = url, path = path)

Let’s explore what we’ve got back:

names(raw.result)
##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"

The result we got back from the API is a list of length 10. Of these ten parts, two are important for us: the status code and the content.

If we examine the status code,

raw.result$status_code
## [1] 200

we see that we’ve got 200, which means, all worked out fine. Note that this status code only tells us, that the server recieved our request, not if it was valid for the API or found any data.

Let’s look at the actual answer or data payload we’ve got back. Let’s just look at the first few elements:

head(raw.result$content)
## [1] 7b 22 30 31 2e 30

That’s useless, unless you speak Unicode. Let’s translate that into text.

this.raw.content <- rawToChar(raw.result$content)

Let’s see how large that is in terms of characters:

nchar(this.raw.content)
## [1] 121493

That’s rather large. Let’s look at the first 100 characters:

substr(this.raw.content, 1, 100)
## [1] "{\"01.07.00.00\":{\"directory_code\":\"01.07.00.00\",\"url\":\"http:\\/\\/api.epdb.eu\\/eurlex\\/directory_code\\/"

So the result is a single character string that contains a JSON file. Let’s tell R to parse it into something R can work with.

this.content <- fromJSON(this.raw.content)

What did R make out of it?

class(this.content) #it's a list
## [1] "list"
length(this.content) #it's a large list
## [1] 462
this.content[[1]] #the first element
## $directory_code
## [1] "01.07.00.00"
## 
## $url
## [1] "http://api.epdb.eu/eurlex/directory_code/?dc=01.07.00.00&key="
## 
## $number_of_documents
## [1] "126"
## 
## $list_of_acts_inforce_eurlex
## [1] "http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=0107*&repihm="
this.content[[2]] #the second element
## $directory_code
## [1] "01.10.00.00"
## 
## $url
## [1] "http://api.epdb.eu/eurlex/directory_code/?dc=01.10.00.00&key="
## 
## $number_of_documents
## [1] "191"
## 
## $list_of_acts_inforce_eurlex
## [1] "http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=011*&repihm="

So, apparently R makes a list out of it, with one element per classifier. Each element has:

  • the directory code document classifier
  • a URL where one can retrieve more details
  • the number of documents with that classifier
  • another URL with yet more details

So, essentially, the result is a list of lists. Lists are not (always) very nice to work with, and lists of lists are usually despicable. Let's turn it into a data frame:

this.content.df <- do.call(what = "rbind",
                           args = lapply(this.content, as.data.frame))

This call does a number of things. lapply(this.content, as.data.frame) turns each of the 462 list elements into a mini single-row data frame. This is required so that we can then combine (rbind) them all into a single data frame.

In case you are interested in the gory details:

  1. The function rbind takes any number of data frames as arguments, and turns them into a single data frame, by just stacking one on top of the next. So C <- rbind(A, B) will yield a data frame that has first the contents of A and then those of B stacked on top of each other.
  2. The function lapply takes a list (the first argument) and applies the function that is the second argument (here: as.data.frame) to each of its elements. So, the call to lapply turns our list of lists into a list of single row data frames.
  3. The function do.call is a true wonder girl. She uses her args argument as the arguments to the function named in the what argument. So here, it executes rbind with all the elements of our list of single-row data frames (the ones we created in (2)). It is the same as typing:
    rbind(OurDfList[[1]], OurDfList[[2]], OurDfList[[3]], ...) where ... would need to be replaced with all the other list items. A toy example of this pattern follows right after this list.
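
To see these three functions play together, here is a tiny toy example that has nothing to do with our API data (the names are made up for illustration):

# a toy list of lists, similar in shape to this.content
toy <- list(list(id = "a", n = 1),
            list(id = "b", n = 2),
            list(id = "c", n = 3))

# the same lapply/do.call/rbind pattern as above
do.call(what = "rbind", args = lapply(toy, as.data.frame))
##   id n
## 1  a 1
## 2  b 2
## 3  c 3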

What have we got now?

class(this.content.df) #a single data frame
## [1] "data.frame"
dim(this.content.df)   #with 462 rows and 4 variables
## [1] 462   4
head(this.content.df)
##             directory_code
## 01.07.00.00    01.07.00.00
## 01.10.00.00    01.10.00.00
## 01.20.00.00    01.20.00.00
## 01.30.00.00    01.30.00.00
## 01.40.00.00    01.40.00.00
## 01.40.10.00    01.40.10.00
##                                                                       url
## 01.07.00.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.07.00.00&key=
## 01.10.00.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.10.00.00&key=
## 01.20.00.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.20.00.00&key=
## 01.30.00.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.30.00.00&key=
## 01.40.00.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.40.00.00&key=
## 01.40.10.00 http://api.epdb.eu/eurlex/directory_code/?dc=01.40.10.00&key=
##             number_of_documents
## 01.07.00.00                 126
## 01.10.00.00                 191
## 01.20.00.00                  69
## 01.30.00.00                  24
## 01.40.00.00                 382
## 01.40.10.00                 514
##                                                                list_of_acts_inforce_eurlex
## 01.07.00.00  http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=0107*&repihm=
## 01.10.00.00   http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=011*&repihm=
## 01.20.00.00   http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=012*&repihm=
## 01.30.00.00   http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=013*&repihm=
## 01.40.00.00   http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=014*&repihm=
## 01.40.10.00 http://eur-lex.europa.eu/Result.do?RechType=RECH_repertoire&rep=01401*&repihm=

That’s nice and we can work with it to extract all the classifiers of energy topics.

Extracting energy classifiers

We’ve almost found the classifiers for energy topics. We just need to filter them out of all the other classifiers that are available. Remember, Energy classifiers start with 12.. We can use that fact:

headClass <- substr(x     = this.content.df[, "directory_code"],
                    start = 1,
                    stop  = 2)

headClass is now just a character vector containing the first two characters of the directory_code for each of the 462 different classifiers.

length(headClass)
## [1] 462
head(headClass)
## [1] "01" "01" "01" "01" "01" "01"

If these first two characters equal 12, it’s an energy topic:

isEnergy <- headClass == "12"
table(isEnergy) # 19 of the topic classifiers start with 12
## isEnergy
## FALSE  TRUE 
##   443    19

Let’s use this logical vector to index our data frame:

relevant.df <- this.content.df[isEnergy, ]

And let’s narrow that down to solely the document identifiers:

relevant.dc <- relevant.df[, "directory_code"]

relevant.dc is now a character vector with all the directory codes that relate to energy topics.

length(relevant.dc)
## [1] 19
relevant.dc
##  [1] "12.07.00.00" "12.10.00.00" "12.10.10.00" "12.10.20.00" "12.20.10.00"
##  [6] "12.20.20.00" "12.20.30.00" "12.20.40.00" "12.30.00.00" "12.40.00.00"
## [11] "12.40.10.00" "12.40.20.00" "12.40.30.00" "12.40.40.00" "12.40.50.00"
## [16] "12.50.10.00" "12.50.20.00" "12.50.30.00" "12.60.00.00"

We’ve come a long way. We’ve retrieved all possible classifiers from the API, parsed the answer so that we can work with it, and, finally, extracted all those classifiers that are relating to energy topics.

Retrieving energy documents’ meta data

In this step, we use the identified classifiers to retrieve the meta data of all documents that relate to energy topics. We'll use the API again. The base parts of the API call have not changed.

Now to the query: we cannot pass all the relevant classifiers in a single call. Rather, we need to create 19 queries, one for each identified classifier. There are many ways to do that. Let’s do it the pretty way with our own function:

makeQuery <- function(classifier) {
  this.query <- list(classifier)  # wrap the classifier in a list
  names(this.query) <- "dc"       # name it after the API's parameter
  return(this.query)
}

Remember that the query part of an API call is a named list: the name is the name of the API parameter, and its value is, well, its value. Our function makeQuery() takes a single argument, classifier, turns it into a single-element list, and sets the name of that list's single element to dc. Let's try it out with some nonsense input:

makeQuery("foo")
## $dc
## [1] "foo"

We discover that the function indeed returns a list with a single element; that element is named dc and its value is the string we've specified.

Let’s apply our new function to all of our relevant classifiers from above to turn them into queries. Remember the lapply function we can use to apply a function to each element of a list:

queries <- lapply(as.list(relevant.dc), makeQuery)

We now have a list (queries) composed of all the individual queries produced by our function, and we are good to go to actually execute the API calls.

It’s time now to execute the API calls we’ve created. Let’s start out with the first element of our query list.

this.raw.result <- GET(url = url, path = path, query = queries[[1]])

What did we get back?

this.result <- fromJSON(rawToChar(this.raw.result$content))

We’ve got back the meta data for all 11 documents that are classified with our first relevant classifier (12.07.00.00). For each document, we get:

names(this.result[[1]])
##  [1] "form"                "title"               "api_url"            
##  [4] "eurlex_perma_url"    "doc_id"              "date_document"      
##  [7] "of_effect"           "end_validity"        "oj_date"            
## [10] "directory_codes"     "legal_basis"         "addressee"          
## [13] "internal_ref"        "additional_info"     "text_url"           
## [16] "prelex_relation"     "relationships"       "eurovoc_descriptors"
## [19] "subject_matter"

Apparently, our call does work just fine. Executing the first query stored in queries results in 11 documents and their respective meta data.

Let’s execute the query for each element in queries. How to go about this? Why not use lapply to execute each of the queries in the list? We could do that, but let’s try another approach: a loop. A loop – loops – over a set of numbers, and at each iteration executes some code. Let’s try that:

First, we need something where we can store the results. For that, we create an empty list with just enough room to store each of the queries’ results:

all.results <- vector(mode   = "list",
                      length = length(relevant.dc))

This made a new, empty list called all.results. It has as many empty slots as we have energy related classifiers.

for (i in 1:length(all.results)) {
  this.query       <- queries[[i]]
  this.raw.answer  <- GET(url = url, path = path, query = this.query)
  this.answer      <- fromJSON(rawToChar(this.raw.answer$content))
  all.results[[i]] <- this.answer
  message(".", appendLF = FALSE)
  Sys.sleep(time = 1)
}

This loop iterates over the numbers 1 to 19 (the number of relevant classifiers). At each iteration, it:

  • loads the appropriate query (whose time has come)
  • executes the API call with that query
  • extracts the content of the response and converts it from JSON
  • writes the results to the empty list we had created beforehand
  • prints a dot, so we don’t get bored waiting
  • waits for a second (because we are polite and don't want to bog down the EU's server)

all.results is no longer an empty list. It is filled with the answers the API produced in response to our 19 queries.
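
As an aside, the same result could also have been produced with the lapply approach mentioned above; a sketch of the equivalent call (minus the progress dots):

all.results <- lapply(queries, function(this.query) {
  this.raw.answer <- GET(url = url, path = path, query = this.query)
  Sys.sleep(time = 1)  # stay polite and pause between calls
  fromJSON(rawToChar(this.raw.answer$content))
})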

Now we have results. The next step is to beat these results into a shape we can actually use to find our coveted answer. Of all the parts of the answer, we are interested in form, date_document and of_effect. Let's create another function that returns just these parts as a data frame.

parseAnswer <- function(answer) {
  this.form   <- answer$form            # the form of the document
  this.date   <- answer$date_document   # the document's date
  this.effect <- answer$of_effect       # the date it entered into force
  result <- data.frame(form   = this.form,
                       date   = this.date,
                       effect = this.effect)
  return(result)
}

Let’s try our function on one of the results. Remember, that the results we’ve retrieved is a list (19 elements, one for each classifier) of lists (one for each document; the numbers of documents varies from classifier to classifier).

parseAnswer(all.results[[1]][[2]])

This took the second document (the [[2]]) from the first classifier (the [[1]]). We see that it's a data frame with one row and three columns.

Now, to apply our function to our whole list-of-lists result, we need to go a little deeper and combine all the skills we've learned so far. We could use a loop to iterate over the 19 classifiers first and then a second loop to iterate over all the documents in each classifier. But that's rather verbose and cumbersome. Let's use lapply instead:

parsedAnswers <- lapply(all.results,
                        function(x) do.call("rbind", lapply(x, parseAnswer)))

What’s happening here? Let’s start from the inside out:

  1. We apply our function parseAnswer to each document in a classifier.
  2. Inside each classifier, we rbind the single-row data frames together to form one data frame with one row per document.
  3. We do this for each of the 19 classifiers in all.results.

We get back a list of data frames, each data frame having as many rows as there are documents in that classifier.

class(parsedAnswers) #list
## [1] "list"
length(parsedAnswers) #19
## [1] 19
sapply(parsedAnswers, nrow) #11, 15, 107, ...
##  [1]  11  15 107 110 172  41  16  22  55  42  16  60  62 143  84  18  11
## [18]  65  28

Let’s combine these 19 data frames in a single one. How can we do that? Of course just like before using do.call and rbind.

finalResult <- do.call("rbind", parsedAnswers)
class(finalResult) #data.frame
## [1] "data.frame"
dim(finalResult) # 1078 rows, 3 columns
## [1] 1078    3

All the final results are now contained neatly in a single data frame. Note that the data frame’s row names are actually the document IDs. We can use them to retrieve the actual document’s meta data.

Working with dates

The data we’ve retrieved from the API is still all only characters. If we tell R that the date columns (date and effect) are actually dates, R can calculate with these dates.

First, we need to convert characters to dates. Let’s try this out with some arbitrary date.

date.character <- "1981-05-02"
date.POSIXct <- ymd(date.character)

class(date.character) #character
## [1] "character"
class(date.POSIXct) #POSIXct
## [1] "POSIXct" "POSIXt"

That worked just fine. Let’s do this for the date columns in our final results data frame:

finalResult$date <- ymd(finalResult$date)
finalResult$effect <- ymd(finalResult$effect)

At last, we’ve retrieved all the data we need to answer our question and brought it into a format we can work with. Let’s answer our question, which day of the week is most popular for letting laws become effective:

finalResult$effectDay <- wday(finalResult$effect, label = TRUE)
table(finalResult$effectDay) #Most documents went into effect on a Friday
## 
##   Sun   Mon  Tues   Wed Thurs   Fri   Sat 
##    31   160   132   172   145   384    54

We see that Fridays are by far the most popular.
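
If you prefer a picture over a table, a quick base R bar plot of the same counts will do:

barplot(table(finalResult$effectDay),
        xlab = "Weekday of entering into force",
        ylab = "Number of documents")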

This concludes this little tour de force introduction to working with APIs from R. We covered not only how to craft API calls, but also how to use R's (list) programming features to deal with API answers and beat the data into shape.


  1. EurLex documents have been used in the past as text-book examples for statistical programming and machine learning. See for instance TU Darmstadt’s project.

  2. well, most anyway

  3. Actually, it’s a little bit more complicated. GET and POST are methods an API might implement. Often, the same API will provide GET and POST methods side by side for different purposes. If taking the HTTP standard as the proverbial letter of the law, GET methods should not change anything on the remote system, i.e. only return data, while POST methods should change a state or a file on the remote system. In practice, POST methods are also used to provide data to the remote system that it can use to work with, e.g. a text that should be automatically translated.

  4. Hadley Wickham is perhaps one of the most prolific R developers. He's responsible for a great wealth of packages, among them the visualization package ggplot2 and the data munging facilities of dplyr. Check out Hadley's personal website to get a glimpse of all the projects he's involved with.

  5. Finding out that 12. is the directory code prefix that identifies energy-related documents actually took quite some research. For easy access, Project Mulan provides a list of all EurLex directory codes.

Saturday, August 9, 2014

Visualizing Geo-Referenced Data With R

Christoph Waldhauser

07/01/2014

A key issue in data science is visualization: it aids in telling stories and exploring patterns in data. In this blog post I'm going to look at producing maps that display data with R, giving two examples. A large portion of the data of interest to the social sciences is actually geo-referenced: election results of electoral districts in a country, worldwide country indicators, locations of conflict hot spots on a map. All these data are screaming to be visualized on maps. I will first outline the steps required for producing maps with geo-referenced data. Then I'll walk through two examples for two different kinds of geo-referenced data.

Steps

Producing publication quality maps entails a number of steps:

  1. Finding maps. While perhaps sounding trivial, it is not so easy to come up with public domain map data that can be used with R. What is required is a data frame that contains the corner points of any borders that should be printed on the map. These borders surround named areas, for example provincial borders around the provinces of a country. Such a data frame can be produced easily from shape files, the format used with the popular (but proprietary and quite expensive) ArcGIS software package. The maps R package contains a number of maps for the US, Italy, France and New Zealand. It also provides a map of the world.

  2. Finding geo-referenced data. If you came this far, then you probably already have some sort of geo-referenced data. Broadly speaking, these can fall into two categories:

    • Areas surrounded by borders in the region and identified by name
    • Points identified by coordinates in degrees of longitude and latitude
  3. Drafting a plot to tell a story.

  4. Implementing the plot using (e.g.) ggplot2.

Area data example

In these examples I'll produce a number of different maps using R's standard data sets. The first map will put key indicators of US states on a map. In terms of the definition sketched above, these are area data.

Some of R’s maps are organized in the maps R package. If you don’t have it installed yet, it’s only a quick install.packages("maps") away. After installing it, it is still necessary to load it:

library(maps)

As noted above, the maps package provides a number of maps at varying levels of detail. However, these maps are stored in a binary format on disk. In order to turn them into a data frame that can be used for plotting, they need to be converted. This can be achieved with ggplot2's map_data function. Let's first load the ggplot2 package; we will need it later for plotting the maps as well.

library(ggplot2)

The map_data function requires only the name of the map that should be converted, for the case at hand "state". Conceptually, the resulting data frame is simply a number of points describing the corners, every corner, of a state: a set of points identified by longitude and latitude, plus an order describing the path that is the border around a state. This is accompanied by a group identifier. The group identifier binds together all those points that describe a single area. In most cases this is identical with a state. However, if a state's territory also encompasses islands, then these islands each form their own group. This explains why the 49 regions in this data set (the contiguous states plus the District of Columbia) yield 63 unique groups. The region identifier, finally, provides the states' names. This is the variable on which we will later merge the geo-referenced data. There is also another variable, subregion, which is not used in this map.

us <- map_data("state")
head(us)
First six entries of the US State map data
long lat group order region subregion
-87.46 30.39 1 1 alabama
-87.48 30.37 1 2 alabama
-87.53 30.37 1 3 alabama
-87.53 30.33 1 4 alabama
-87.57 30.33 1 5 alabama
-87.59 30.33 1 6 alabama
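
As a quick sanity check of the group and region counts discussed above, you can count the unique values directly (the exact numbers may vary with the version of the maps package):

length(unique(us$group))   # 63 polygons (states plus their islands)
length(unique(us$region))  # 49 regions: the contiguous states plus DC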

In a next step, let's look at the geo-referenced data. R comes with the state data set, which contains a number of objects with statistical data. The object of interest here is the matrix state.x77. Let's take a look at it:

data(state)

The matrix state.x77 contains the following variables for each state: Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost and Area. For this map, let's consider Income:

Min. 1st Qu. Median Mean 3rd Qu. Max.
3100 3990 4520 4440 4810 6320

After having identified a variable, we need to create a data set that merges the shape information from above with the variable of interest. The merge function can be used to this end. However, the map data only covers the continental US and therefore does not contain the states of Alaska and Hawaii. On the other hand, the state data set does not contain data for the District of Columbia. The respective entries from both data sets will therefore be lost in the merge operation.

# match the map's lower-case region names and keep only the Income column
tmp <- data.frame(region=tolower(rownames(state.x77)),
                Income=state.x77[,"Income"])
# merge map polygons and income data on their common region column
dta <- merge(x=us, y=tmp)
# restore the original drawing order of the border points
dta <- dta[order(dta$order),]
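
To see exactly which entries get lost in the merge, a quick check with setdiff helps; the names are lower case because we converted them above:

# states with statistical data but no map polygons
setdiff(tmp$region, unique(us$region))      # "alaska" "hawaii"

# map regions without statistical data
setdiff(unique(us$region), tmp$region)      # "district of columbia"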

Let’s start creating the map by setting up the aesthetics. Remember that in ggplot2 aesthetics are used to map data to plot properties. Here, we state that the longitude goes on the x-axis and the latitude on the y-axis. Further, the group identifier separates the individual polygons.

p1 <- ggplot(data=dta, aes(x=long, y=lat, group=group))

The ggplot2 geom we need for a map is a polygon. Basically, a polygon is a free shape object that is defined by a path along its borders; exactly what the map data provides. By adding the geom_polygon to the previously set up aesthetics, the map is plotted:

p1 + geom_polygon(colour="white")

[Figure: simple-map-plot]

The geom_polygon has two interesting properties: colour to specify the color of the line depicting the border, and fill to control the polygon's fill color. It is fill that we will use to bring our data onto the map. In order to add the area data to the plot, the geom_polygon needs to be beefed up with an aesthetics call of its own that maps fill.

p1 + geom_polygon(colour="white", aes(fill=Income))

[Figure: data-map-plot]

We can complete the plot using ggplot2’s usual annotation features:

p1 + geom_polygon(colour="white", aes(fill=Income)) + 
  xlab("Longitude") + 
  ylab("Latitude") + 
  ggtitle("Per capita income distribution of US States, 1974")

[Figure: data-map-plot-complete]

Point data example

The second broad type of geo-referenced data is points of interest. Here we will plot the approximate locations of US nuclear power plants. The size of the plotting marks will correspond to each power station's licensed output.

First, the data needs to be retrieved. The US Nuclear Regulatory Commission publishes, among other things, annual reports on all nuclear power plants. Their Appendix A contains some data on all the nuclear power stations in the US and is available as an Excel file.

There are multiple ways of working with Excel files in R; I find the xlsx package quite handy, so we'll use that to read the data into R. After reading it in, let's clean up the data a little as well.

library(xlsx)
# read Appendix A; the table (including its header) starts in the sheet's second row
nuclear.raw <- read.xlsx(file = "appa.xls", sheetIndex = 1, startRow = 2, header = TRUE)

# keep only the columns of interest and give them readable names
nuclear <- nuclear.raw[,c(1, 4, 7, 11, 13, 15, 16, 18:23, 25:27)]
colnames(nuclear) <- c("Name", "Location", "Type", "Construction", "Operating", 
                       "Expiring", "Output", paste0("Capacity", c(12:7,5:3)))

Location in this data set is given by city names plus these cities' relative locations to larger, better-known population centers. We first need to extract the city names and then translate them into longitude and latitude. This can be done using the zipcode R package. Let's extract the city names first.

library(stringr)
# keep only the part before the opening parenthesis
citiesStates <- sapply(strsplit(as.character(nuclear$Location), 
                                split = "(", fixed=TRUE), function(x) return(x[1]))
# split the remainder into city and state and trim surrounding whitespace
citiesStates <- t(sapply(strsplit(citiesStates, split=",", fixed=TRUE), 
                         function(x) return(str_trim(x[1:2]))))
colnames(citiesStates) <- c("city", "state")
citiesStates[,"state"] <- toupper(citiesStates[,"state"])
nuclear <- data.frame(nuclear, citiesStates)

This parsing works for most of the names provided by the NRC. Unfortunately, they are occasionally inconsistent. With a little bit more effort, even those cases could be mapped. As this post is more about visualization than regular expressions, let's settle for the ones where the parsing worked out.

The zipcode package provides a database of zip codes, city names and their geographic locations. We merge these locations into our nuclear data set, using city and state as the key. Unfortunately, not all US nuclear power stations are contained in that database.

data(zipcode, package="zipcode")
# keep one row per city/state combination, drop the zip column, and merge by city and state
nuclear <- merge(nuclear, 
                 zipcode[!duplicated(zipcode[,c("city", "state")]),-1], 
                 all.x=TRUE)
# drop power stations we could not geo-locate
nuclear <- nuclear[!is.na(nuclear[,"longitude"]),]

Now that we’ve built our data set, let’s get about plotting. Basically, we just need to plot points at the correct locations. Let’s start out with an empty map:

p2 <- p1 + geom_polygon(colour="white")   
p2

[Figure: empty-location-plot]

We then add a layer of points to that map. The coordinates of the points come from the power stations' locations. As the map polygons are defined using a group aesthetic, we need to add a constant group variable to the nuclear data as well. It won't be used, but it needs to be present to appease ggplot.

p2 + geom_point(data=data.frame(nuclear, group=NA), 
                aes(x=longitude, y=latitude), colour="red")

[Figure: basic-location-plot]

Finally, we can add the Output variable to the aesthetics call to map each power plant's output to the size of its point on the map. To finalize the plot, we apply the usual beautifications:

p2 + geom_point(data=data.frame(nuclear, group=NA), 
                aes(x=longitude, y=latitude, size=Output), colour="red") +
  xlab("Longitude") + ylab("Latitude") + scale_size_continuous(name="Output (MWh)") +
  ggtitle("Location of Commercial Nuclear Power Plants in the US")

[Figure: final-location-plot]

Summary

In this post, I’ve looked at producing attractive maps in R using ggplot2. Two kinds of geo-referenced data has been added to the maps: areas and points. Integrating both visualizations to create a story, e.g. the connection of nuclear power and income, is left as an exercise to the reader. Using even more granular data on say county level, could be used to tell story that nuclear power stations are never located in wealthy communities. Alas, the not-in-my-backyard attitude’s effectiveness is highly dependent on the wealth of their holders. As always, I’ll be excited to read any thoughts you might have on this in the comments.