A lot of websites generate data which could be really useful outside of a web browser. Think of the weather shown on your smartphone’s lock screen, or the delay until your next bus… How can we use this data from an application?
This article will explain what’s behind these hidden data flows, and how to use them. For this purpose, I’ll use Trakt.tv as an example. If you don’t know it: Trakt allows you to manage your movie/TV series library, keep track of watch progress, your ratings, comments… and see those of other people.
Some code will show how to send such requests. It will be written in Ruby.
Planet of the APIs
The concept behind these data flows is called an API - Application Programming Interface. While API is a pretty generic term, in this specific case it specifies how a website exposes its data to a client (and possibly receives some from it, too). A client can be any kind of application sending requests to the website (the server). It is not to be confused with the user, who uses a client, which in turn sends requests to a server.
Trakt’s API uses JSON. As a matter of fact, this JSON bit comes from the API. You can find the same details about the movie on the Trakt website itself; they are just not directly usable from an application there.
Trakt’s API documentation is available here.
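To give an idea of what consuming such a JSON document looks like from code, here is a minimal Ruby sketch. The sample document is hand-written for illustration: it only uses attribute names that appear in this article (title, url, imdb_id), its values are assumptions, and it is not a verbatim Trakt response.

```ruby
require 'json'

# Hand-written sample, NOT a verbatim Trakt response: the attribute names
# come from this article, the values are illustrative assumptions.
SAMPLE_MOVIE_JSON = <<~DOC
  {
    "title": "Finding Nemo",
    "url": "http://trakt.tv/movie/finding-nemo-2003",
    "imdb_id": "tt0266543"
  }
DOC

# JSON.parse turns the text into plain Ruby hashes, arrays and strings.
MOVIE = JSON.parse(SAMPLE_MOVIE_JSON)

# Return the last path segment of the movie's url attribute.
def url_slug(movie)
  movie['url'].split('/').last
end

puts MOVIE['title']
puts url_slug(MOVIE)
```

Once parsed, the document is just a Hash: no Trakt-specific library is needed to read it.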
REST in pieces
If an API were made for a single goal, like indicating the weather in a given city, it would be pretty simple: just access a URL like http://myapi.net/weather?city=Toulouse, return the corresponding data, done. However, a fully-featured API dealing with multiple elements like movies, series, series episodes, users… needs something a little more complex than that. And that’s where concepts like REST - Representational State Transfer - come to help.
REST is a list of five (and an optional sixth) constraints for the API architecture:
- Client/server oriented: data is stored on a server, and displayed on a client
- Stateless: every request from a client must contain all the information needed to be handled by the server. If any kind of state management is needed, the client is in charge
- Cacheable: responses from the server must specify how long they are valid (we’ll get back to that a bit later)
- Uniform interface: this one is a list of four sub-constraints to guide the API organization:
  - Identification of resources: the way an item is identified in a response should be consistent with the request, not with the database behind the API
  - This identification must be sufficient: a client must be able to modify or delete a resource from the identification given by the server
  - Self-descriptive requests: a request must contain everything needed to be processed by the server. Similar to the stateless constraint
  - Simple client transitions: if a client has to send a separate request to access more related information, the way to get it must be described in the initial response
- Layered system: a client cannot tell whether it is talking directly to the final server or to an intermediary (a proxy, a cache, a load balancer), so layers can be added or removed without changing the client
- Code on demand (optional): the server can send a bit of script, which will be run by the client, either to limit the server’s load, or to occasionally change the client behaviour.
A REST architecture has multiple advantages. The simple fact of being stateless is probably the most important to the end user: the client doesn’t have to maintain a connection to the server between requests, which allows a huge reduction in power consumption on mobile devices. It also simplifies the load-balancing on the server side (any server can process any request without the need of an existing context).
From a developer’s point of view, a layered architecture tends to make the system easier to maintain, and easier to use.
Authentication covers several needs when it comes to an API. A common one is to identify the user. Who’s trying to mark this movie as watched? Another one is to identify the application using the API. Users may want to see the list of applications accessing their account, remove access for a specific one…
Identifying the application is pretty easy. For APIs where this identification is in place, the developer must register the application. The website then provides a unique identifier for this application. This is called an API key. This key must be placed within every request from this application. The API website can reject every request without a valid key, revoke a specific key…
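As a rough illustration of "placing the key within every request", here is a small Ruby sketch. Everything in it is hypothetical: the host api.example.com, the key value, and the choice of sending the key as an apikey query parameter. Real APIs may expect the key in the path or in an HTTP header instead.

```ruby
require 'uri'

# Hypothetical values: use the key and host given by your API provider.
API_KEY  = 'my-api-key'
API_HOST = 'api.example.com' # placeholder host, not a real API

# Build a request URI that always carries the application's API key.
# ASSUMPTION: the key travels as an "apikey" query parameter.
def uri_with_key(path, params = {})
  query = URI.encode_www_form(params.merge('apikey' => API_KEY))
  URI::HTTP.build(host: API_HOST, path: path, query: query)
end

puts uri_with_key('/weather', 'city' => 'Toulouse')
```

Routing every request through one helper like this means the key is never forgotten, and it only has to be changed in one place if it gets revoked.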
We now know how a website exposes its data, how to parse them, and how they are organized. Here’s a concrete use case: we want to retrieve the list of the directors of the Finding Nemo movie.
We first need to find the movie. We’ll use the search/movies API method. It needs four parameters, the last one being optional:
- format: how the response should be formatted. Only JSON is supported for this method
- apikey: we need to identify ourselves as a Trakt.tv API user
- query: what are we looking for?
- limit: maximum number of results. Defaults to 30
And here’s what the request should look like:
Let’s call it in Ruby:
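A minimal sketch of such a call, using only Ruby’s standard library. The URL layout (format as a file extension, API key as a path segment), the host and the key are assumptions made for illustration; check Trakt’s API documentation for the actual ones.

```ruby
require 'json'
require 'net/http'
require 'uri'

# ASSUMPTIONS: host, key and URL layout are illustrative, not confirmed.
TRAKT_HOST = 'api.trakt.tv'
TRAKT_KEY  = 'YOUR_API_KEY'

# Build the URI for the search/movies method from its four parameters.
def search_movies_uri(query, limit: nil, format: 'json', apikey: TRAKT_KEY)
  params = { 'query' => query }
  params['limit'] = limit if limit # optional: the server has a default
  URI::HTTP.build(
    host: TRAKT_HOST,
    path: "/search/movies.#{format}/#{apikey}",
    query: URI.encode_www_form(params)
  )
end

# Send the request and parse the JSON body into Ruby objects.
def search_movies(query, limit: nil)
  response = Net::HTTP.get_response(search_movies_uri(query, limit: limit))
  raise "HTTP error #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Only prints the URI; call search_movies('finding nemo') to hit the network.
puts search_movies_uri('finding nemo', limit: 2)
```

Keeping the URI construction separate from the network call makes it easy to check what is about to be sent before sending it.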
And here’s the result, truncated to the first two movies:
Look at the first one: it is the movie we were looking for! Now, we want to display its details. To achieve this, we’ll need to use another method: movie/summary. Here are its parameters:
- format and apikey: same as above
- title: could be the last part of the url attribute we got earlier (for Finding Nemo, it would be finding-nemo-2003), the IMDB ID (imdb_id attribute), or the TMDB ID (tmdb_id attribute). We’ll use the IMDB ID.
Here’s what the request should look like:
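Here is a hedged Ruby sketch of that request. The host, the key, the URL layout (format as an extension, then the API key and the title as path segments) and the IMDB ID value used in the example are assumptions for illustration, not taken from Trakt’s documentation.

```ruby
require 'json'
require 'net/http'
require 'uri'

# ASSUMPTIONS: host, key and URL layout are illustrative, not confirmed.
TRAKT_HOST = 'api.trakt.tv'
TRAKT_KEY  = 'YOUR_API_KEY'

# Build the URI for movie/summary; `title` may be a slug, IMDB or TMDB ID.
def movie_summary_uri(title, format: 'json', apikey: TRAKT_KEY)
  URI::HTTP.build(
    host: TRAKT_HOST,
    path: "/movie/summary.#{format}/#{apikey}/#{title}"
  )
end

# Fetch the summary and parse the JSON body into a Ruby Hash.
def movie_summary(title)
  response = Net::HTTP.get_response(movie_summary_uri(title))
  raise "HTTP error #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# 'tt0266543' is assumed to be the IMDB ID of Finding Nemo.
puts movie_summary_uri('tt0266543')
```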
And the truncated output:
We have our directors in the people object, which contains a directors array. Let’s extract it:
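A minimal sketch of the extraction, working on an already parsed response. It assumes each entry of the directors array carries a name attribute; the sample Hash below is hand-written for illustration and is not actual API output.

```ruby
# Return the directors' names from a parsed movie/summary Hash.
# ASSUMPTION: each director entry has a "name" attribute.
def director_names(movie)
  directors = movie.dig('people', 'directors') || []
  directors.map { |director| director['name'] }
end

# Hand-written sample with the structure described above (not API output).
SAMPLE_SUMMARY = {
  'title' => 'Finding Nemo',
  'people' => {
    'directors' => [
      { 'name' => 'Andrew Stanton' },
      { 'name' => 'Lee Unkrich' }
    ]
  }
}.freeze

puts director_names(SAMPLE_SUMMARY)
```

Using dig with a fallback to an empty array keeps the helper from crashing on a movie that has no people or directors entry.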
And here’s the output:
We can check the result directly on Trakt: looks like we’re good!
Cache me if you can
Now, we know how to retrieve data about movies. We can build an awesome application, using Trakt.tv lists to keep a list of movies we want to watch. However, because we’re spending so much time working on this application, we don’t have any time left to watch those movies. Every time we fire up the application, we fetch all the information for every movie from the API again. Even if we’re pretty sure the data hasn’t changed (I mean, have you ever seen a movie’s director change after its release?), we use bandwidth, the data plan on a smartphone, battery…
This is bad.
We need a way to save this kind of data offline to use less bandwidth, but we also need to refresh it when it has changed on the server. This is called caching.
“A web cache stores copies of documents passing through it; subsequent requests may be satisfied from the cache if certain conditions are met.” (Wikipedia)
There are multiple ways to cache accesses to an API. The simplest is to rely on HTTP caching, if the API is accessed over HTTP.
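One common form of HTTP caching is the max-age directive of the Cache-Control response header, which tells the client how many seconds a response stays valid. The sketch below is a deliberately simplified freshness check: the helper names are hypothetical, and it ignores other directives such as no-cache or no-store that a real HTTP cache must honour.

```ruby
# Extract max-age (in seconds) from a Cache-Control header value.
# Returns nil when the header is missing or has no max-age directive.
def max_age(cache_control)
  match = cache_control.to_s.match(/max-age=(\d+)/)
  match && match[1].to_i
end

# A stored copy is fresh while its age is below max-age.
# Simplification: no-cache, no-store, Expires... are ignored here.
def fresh?(stored_at, cache_control, now = Time.now)
  limit = max_age(cache_control)
  !limit.nil? && (now - stored_at) < limit
end

puts max_age('public, max-age=3600')
puts fresh?(Time.now - 10, 'max-age=60')
```

While a copy is fresh, the application can use it without contacting the server at all.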
Another mechanism, probably more convenient for an API like Trakt, is the ETag. An ETag identifies a specific version of a response. When we make a request (for example, asking for a movie’s details), the server can add an ETag header. If it’s present, we should store it alongside an offline copy of the movie details. Next time we need these details, we ask the server: “Hey, I have a copy of these details matching this ETag! Do you have anything newer for me?” If a new version is available, the server sends it over, with a new ETag. We have lost two dozen bytes of bandwidth. Damn. However, if the details haven’t changed, the server simply answers with an empty 304 Not Modified response. And that will be the case most of the time for this kind of request. Hooray for the planet. And our batteries.
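The exchange described above can be sketched in Ruby as follows. This is only an illustration under stated assumptions: the cache is a plain in-memory Hash, the helper names are hypothetical, and the HTTP call is injectable so the logic can be tried without a network. The stored ETag is sent back in the standard If-None-Match request header.

```ruby
require 'net/http'
require 'uri'

# What we keep offline: the ETag and the body it identifies.
CachedEntry = Struct.new(:etag, :body)

# Real HTTP GET, returning [status code, ETag header, body].
def plain_http_get(uri, headers)
  request = Net::HTTP::Get.new(uri, headers)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  [response.code.to_i, response['ETag'], response.body]
end

# Fetch a URI, revalidating our cached copy with its ETag if we have one.
# `cache` is any Hash-like store; `http_get` defaults to the real call.
def fetch_with_etag(uri, cache, http_get: method(:plain_http_get))
  cached = cache[uri.to_s]
  headers = cached ? { 'If-None-Match' => cached.etag } : {}
  status, etag, body = http_get.call(uri, headers)

  return cached.body if status == 304 && cached # nothing changed

  cache[uri.to_s] = CachedEntry.new(etag, body) if etag
  body
end
```

On a 304 answer the body comes from the local copy, so only the headers travel over the network. A real application would persist the cache (to a file or a database, for instance) instead of keeping it in memory.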
There are tons of ways of caching content. Nearly every API will use a different mechanism, suitable to its data lifespan.
On a fast ADSL connection, a simple request to the Trakt.tv API takes a couple hundred milliseconds. If you run multiple queries in parallel, it will be slower. Now, imagine the same thing on a 3G network. Cache does matter.
Requiem for an API
As you can see, once an API has been written, using it is pretty easy. With standards like JSON and XML, REST… they basically all work in the same way.
The crucial point when developing an application is, however, whether an API is actually available. Not all sites provide one. On the other hand, some sites provide excellent ones, some of them being just a client of their own API (that’s a principle called dogfooding).
Code samples presented here are in Ruby, but you can of course use any language you want to access an API (well, good luck using an API in brainfuck). Some API providers maintain libraries in several languages to ease the use of their API, providing high-level objects (example here, a library provided by GitHub to access its API in Ruby).