Currently, TfL have an in-house live tube departures server based on the Trackernet system. This server couldn’t support the volume of external demand when opened through the London DataStore in July and the service had to be taken offline when the load overwhelmed the server. TfL need an interim solution that takes a feed from the internal server and redistributes it.

There are various conceivable solutions to this problem if it is not done in house. For example, a commercial API service like Mashery can do it, and TfL have discussed Google taking the feed and redistributing it for ‘free’. The problem with commercial services is that they cost money directly, and taking up Google’s offer brings Google’s terms and conditions plus data formats into all business models built on top of the service.
We would like to see a solution that creates London jobs and showcases London technology while following open data principles. Why not simply allow some London SMEs access to the raw feed and allow them to cache and redistribute it as real time open feeds? This would be ‘crowd-serving’ open data and could use permissive open licences like the MIT licence. The SMEs would innovate around the raw feed, building additional API calls which they could monetise any way they could. Since this would be an intrinsically competitive space, developers wanting access to the live data would have access to the open feed and the choice of suppliers for additional services.
How could this be done in practice? Technically, we would foresee TfL authorising selected SMEs to access the raw feed and pull the tube data into a web, database or memory cache. In practice, we think that updates any more frequent than every 15 seconds would be pointless as system latency and limits to train movement reporting make it unlikely that you could see change in the service at any higher rate. With the tube running 20 hours per day this generates c. 4800 updates per day into the cache. Each SME would then offer developer access to their 15 second cache using whatever system they saw fit to provide e.g. via
- an authenticated API key up to a limit of users to preserve service quality,
- an open API service in which demand would be limited by the physical limits on the server bandwidth, or
- a Squid proxy to simply pass on the raw data.
If a developer found an open service or proxy to be too slow they would have to move to an authenticated service or moderate their usage.
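To make the 15 second cache concrete, here is a minimal Python sketch of what a distributor might run. It is an illustration only, not a description of any existing TfL or Trackernet interface: the `FeedCache` class, the `fetch_raw_feed` callable and the injectable `clock` are hypothetical names, and the only figures taken from the proposal are the 15 second interval and the 20 hour service day.

```python
import time
from typing import Callable, Optional

REFRESH_SECONDS = 15        # update interval suggested in the proposal
SERVICE_HOURS_PER_DAY = 20  # tube running hours assumed in the proposal


def updates_per_day(hours: int = SERVICE_HOURS_PER_DAY,
                    interval: int = REFRESH_SECONDS) -> int:
    """Number of upstream pulls one distributor makes per service day."""
    return hours * 3600 // interval


class FeedCache:
    """Serve a cached copy of the raw feed, pulling upstream at most once per interval."""

    def __init__(self,
                 fetch_raw_feed: Callable[[], bytes],
                 interval: float = REFRESH_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # fetch_raw_feed is a hypothetical callable that pulls the raw feed
        # from the upstream server; how it authenticates is left open.
        self._fetch = fetch_raw_feed
        self._interval = interval
        self._clock = clock
        self._payload: Optional[bytes] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> bytes:
        """Return the cached feed, refreshing it first if it is older than the interval."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._interval:
            self._payload = self._fetch()
            self._fetched_at = now
        return self._payload
```

However many developers call `get()`, the upstream server sees at most one pull per interval from each distributor, which is `updates_per_day()` pulls (4800 with the figures above). A real service would also need locking for concurrent requests and some handling of upstream failures, both omitted here.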
The costs of providing one node of access to this kind of service are not huge. If a full ‘all stations’ read from the API is c. 200k uncompressed (we estimate), then getting this 4800 times per day requires nearly a gigabyte of data transfer per full user per day. A full user is equivalent in data transfer terms to about half a million users getting one station, once a day. Each SME in the scheme would have to work out what sort of server package would provide processor, data transfer and bandwidth capacity sufficient to offer a distribution service. But we believe that the direct costs are affordable, in the order of £100 server costs per month to provide service to 50 full users… or 49 full users and half a million one hit end users. Of course each SME would also incur labour and overhead costs, but most would have the means to develop a freemium business model alongside their open feed.
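The arithmetic behind these estimates can be checked with a few lines of Python. This is a back-of-envelope sketch using only the figures quoted above; the decimal units (1 MB = 1000 KB) and the function names are our assumptions, and the 200k read size is itself an estimate.

```python
FULL_READ_KB = 200                 # estimated size of one 'all stations' read
UPDATES_PER_DAY = 4800             # one read every 15 seconds for 20 hours
FULL_USERS = 50                    # full users served by one distributor
SERVER_COST_GBP_PER_MONTH = 100    # estimated direct server cost
ONE_HIT_USERS = 500_000            # one-station, once-a-day users equal to one full user


def full_user_mb_per_day() -> float:
    """Daily transfer for one user pulling every update (decimal MB)."""
    return FULL_READ_KB * UPDATES_PER_DAY / 1000


def distributor_gb_per_day() -> float:
    """Daily outbound transfer for a distributor serving all its full users."""
    return full_user_mb_per_day() * FULL_USERS / 1000


def implied_one_hit_kb() -> float:
    """Response size per one-hit user implied by the half-million equivalence."""
    return full_user_mb_per_day() * 1000 / ONE_HIT_USERS


def cost_per_full_user_gbp() -> float:
    """Monthly server cost attributable to each full user."""
    return SERVER_COST_GBP_PER_MONTH / FULL_USERS


if __name__ == "__main__":
    print(full_user_mb_per_day(), "MB per full user per day")
    print(distributor_gb_per_day(), "GB per distributor per day")
    print(implied_one_hit_kb(), "KB per single-station request")
    print(cost_per_full_user_gbp(), "GBP per full user per month")
```

On these assumptions one full user moves 960 MB a day, a distributor with 50 full users moves about 48 GB a day, the half-million equivalence implies a single-station response of roughly 2 KB, and the server cost works out at about £2 per full user per month.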
The six million dollar question would be… who would qualify to be a distributor? In our view the criteria should be that:
- you meet the EU definition of an SME
- you agree to offer the service for at least 6 months
- you redistribute the raw feed no slower than the ?15 second update guideline, and
- there are no access, charging or licensing constraints other than basic misuse terms and conditions.
TfL can probably handle 50 distributors pulling the raw feed on the 15 second interval basis. Participants could simply be the first 50 SMEs to apply that meet the basic criteria. In time TfL could buy into the ‘best’ service or simply adopt the most efficient distribution method itself.
There are bound to be lots more questions, but the core principle of crowd-serving seems appropriate and suitable for one of the crown jewels of London’s open data. Let’s see if there are enough SMEs out there who can step up to the plate to offer an open feed and innovate around the data.
Jonathan Raper, http://www.placr.co.uk/
Provided an alternative way of getting quality data (under some definition of QoS) is available, I believe that a SETI-like approach would suit the distribution of feeds and would be a very interesting experiment of making the public somewhat responsible for the dissemination of public data.
Comment by Giuseppe Sollazzo — August 23, 2010 @ 9:32 am
As the trains run so often, what is the added value of an API providing train timings? Most people using the tube are unlikely to plan their journey based on tube timings.
A utility/app that can provide the current out of service issues and best routes to get around the problem might be beneficial. Would this data have that information?
Comment by Paul Aujla — August 23, 2010 @ 10:11 am
The main uses of the tube API are likely to be for Journey Planner apps and QoS indicators for tube services. Also, early in the morning and late at night actual tube departure times might become important. With the raw data ‘out there’, it will be up to developers to provide innovative services like ‘route-around’.
There seem to be plenty of people who want to use an API… the TfL machine offering the service was swamped when it was opened last month…
Jonathan
Comment by Jonathan Raper — August 23, 2010 @ 11:43 am
Firstly, is TfL really so poor that it can’t run more servers, i.e. do the proxying itself?
Secondly, what data exactly is it that is causing a whopping 200k to need to be downloaded? Is that much really needed for things like journey planner apps and QoS indicators?
Thirdly, can’t a standard API key system with a rate limiter prevent abusive use levels, and effectively force client users to implement caching? For instance, was the traintimes.org.uk API using caching? If it was, it seems odd that requests should balloon from “180,000 to 10m” a day. [Actually I see it says "10 requests every 2 minutes, so don't think I was responsible for the 10m hits".] Caching even at once every five seconds (17,280 requests a day) would get nothing like that number of requests to the API, and there would need to be a lot of different apps requesting things to get to 10m. If there is demand for this data, the corollary is that upstream clients are actually *saving* TfL the traffic by TfL not having to host these services directly.
I don’t think crowd-serving ought to be disallowed, but it seems over-complex to rely on if these three points aren’t considered first. I think that properly-resourced data feeds from public bodies will increasingly be seen to be as much of a standard requirement as a public website offering the same thing but as a GUI.
Comment by Martin — August 24, 2010 @ 8:24 pm
How many of those people were mindlessly grabbing *all* the data, and how many were making actual use of it? I know of at least one person who polled data for most of the stations every 30 seconds, resulting in 3.3Gb of data over a 24h period. He told me he didn’t actually know what he was going to do with the data. It’s now in my hands, and I’m analysing it to build up a detailed map of the tube network, down to track circuit level – and an interface to produce KPI-style reports, such as frequency at stations, quickest and slowest trains between stations etc. I have a plan for a more detailed real-time track circuit diagram, but that’s more show than substance.
My point is that there was a tonne of initial excitement over the data being available, which I really feel will trail off as time goes on. I’d really like to see something more low-tech, such as API keys and feedback on individual developers’ impacts, tried before resorting to inventing a very ‘new’ distribution mechanism for real-time data.
Comment by Peter Hicks — August 24, 2010 @ 9:21 pm
There’s been a whole conversation on Twitter about this proposal… here are some of the comments back and forth:
puntofisso Fri 20 Aug: I agree with this article> News on the tube API in @londondatastore and a proposal on ‘crowd-serving’ http://bit.ly/dnsUuF (via @MadProf)
paul_clarke Fri 20 Aug: RT @MadProf: Plan to crowd serve the tube API. Comments please on: http://bit.ly/dnsUuF < fascinating approach!
londondatastore Mon 23 Aug: @MadProf interesting idea and look forward to response of the SME's happy to point TfL in your direction if there is a head of steam
rollohome Mon 23 Aug: @MadProf sounds like an interesting approach to OD. I liked the SETI suggestion by 1 commentator: a 'safe' data set on which to experiment!
drewsonix Mon 23rd Aug: @MadProf Hi Jonathan - just read your posting re crowd serving. Wasn't a big part of the problem the sheer size of data for all stations...
drewsonix Mon 23rd Aug: @MadProf ..and simply due to format? Couldn't it be just "each train & its track locn" which we could crossref against cached tables?
drewsonix Mon 23rd Aug: @MadProf Something like Base64 representation of Set#, track loc, destcode, ismoving(y/n) for each train would mean a tiny fraction of data
MadProf Mon 23rd Aug: @drewsonix Data for all tube stations is not that large, nor too complex. But lots of users & some pulling whole feed at infeasible rate
MadProf Mon 23rd Aug: @drewsonix Challenge for tube API is finding a way to distribute data using open principles- hence crowdserving idea at http://bit.ly/dnsUuF
bensmithuk Mon 23rd Aug: @MadProf It’s an interesting idea, but given the loads it seems a shame TfL can’t just host and serve this properly from a CDN.
MadProf Mon 23rd Aug: @bensmithuk Absolutely agree, but meantime need a digital Dunkirk to rescue the feed! Rare opportunity for SMEs to get users & innovate
bensmithuk Mon 23rd Aug: @MadProf True, but isn’t complexity of proposed approach > than just proxying via Google AppEngine or Amazon?
MadProf Mon 23rd Aug: @bensmithuk Talking public sector procurement in a spending firestorm: not happening. If we take it over, its gets done fast, free & smart
bensmithuk Mon 23rd Aug: @MadProf I didn’t mean them, I meant you / SMEs by collaborating.
MadProf Mon 23rd Aug: @bensmithuk Delighted if people join the party & pay to proxy a few million hits; assumed most would want to offer API calls to generate biz
countculture Mon 23rd Aug: @MadProf Like idea, but wonder if enough SME’s willing to do it; conversely if there were lots, whether it would be economic for each
countculture Mon 23rd Aug: @MadProf …Thus mean that they would drop out, and meaning only small number again…
countculture Mon 23rd Aug: @MadProf However, SME with business idea could always apply to join service, I suppose….
MadProf Mon 23rd Aug: @CountCulture Thanx 4 points on tube API crowdserving plan: idea is that in return for distributor status SME wud hav a chance to advertise
iapainter Mon 23rd Aug: @MadProf Looks great and a great idea. Nothing better than the buzz of a new startup despite its all consuming nature Good luck #placr
daveaddey Mon 23rd Aug: @MadProf Problem is, SMEs (like ourselves) are hosting consumers, not hosting providers. We’re not set up to support high-load feeds.
daveaddey Mon 23rd Aug: @MadProf Also, demand for real-time transport data is inconsistent, eg v high demand during bad weather. Requires scalability we don’t have.
kemp_harper Mon 23rd Aug: Plan to crowd serve the tube API. Comments please on: http://bit.ly/dnsUuF
poggs Mon 23rd Aug: @MadProf Read that. In short, very fluffy and cute, ignores the fact we don’t know detail about what the issue is – bandwidth or API calls
poggs Tue 24th Aug: @MadProf I can’t see the business model in ‘reselling’ TfL’s data. Initial excitement will tail off; there’s only so much analysis we can do
MadProf Tue 24th Aug: @poggs what is biz model for any open data? No ‘white space’ coz no existing need. Need to make a market: transp apps sell by bucketload
poggs Tue 24th Aug: @MadProf I can only see two: a casual/simple “next trains” app on iOS/Android/web and a few people taking lots of data for detailed analysis
MadProf Tue 24th Aug: @poggs What bout live journey planners/ alerts/ oyster station footfall/ incident impact maps/ season ticket rebates/ station accessibility
poggs Tue 24th Aug: @MadProf Interesting uses, but I still can’t see more than a handful of people wanting to use the ‘big feed’ of raw data
MadProf Tue 24th Aug: @poggs Probably only a handful of SME tube API distributors needed. TfL will be able to it serve themselves eventually
The essence of these comments is that we shouldn’t have to do it ourselves but if we do do it then the business model for SMEs has to be clearer. I plan to post further on this… but the SME opportunity is to get users through their distribution of the TfL feed. Getting users is the hardest part of being a digital services SME, so if you get users but can’t monetise them then that is a more general challenge for open data. Persuading TfL to release data in this way… and then creating business for SMEs in London… would be a fantastic achievement for the open data movement.
Time is short… so let me have your offers to serve the tube feed if you can supply some bandwidth to join London’s great crowd-serving initiative!
Jonathan
Comment by Jonathan Raper — August 26, 2010 @ 9:54 am