Social media aggregator Gnip recently announced its plans to resell portions of its access to the Twitter timeline of public Tweets (better known as the Twitter Firehose). It asks $60,000 per year for 10% of the firehose and $360,000 for half of it.
When I mention this to people, I am mostly met with blank stares, even from people who are extremely Twitter-savvy. In fact, it seems that the more you know about Twitter, the less sense this makes. How can you resell something that Twitter is giving away for free, at THAT price? All of Twitter’s data is publicly accessible through their API, right?
Right, and wrong.
The Twitter API is consumed in many ways. Most applications, including user-facing clients like Hootsuite and Tweetdeck, only need to deal with the data of one user or a few users at a time, and they have specific parts of the API dedicated to their needs.
Some applications need everything though – all of Twitter’s data. The Firehose. These applications include Google, which includes Tweets in search results; Datasift, which gives you the ability to search Tweets in advanced ways; and the above-mentioned Gnip.
Let’s consider using the statuses/public_timeline function of the Twitter API. Makes sense, right?
Consider the following piece of documentation, at the top of that function’s documentation page:
[This function] Returns the 20 most recent statuses, including retweets if they exist, from non-protected users.
The public timeline is cached for 60 seconds. Requesting more frequently than that will not return any more data, and will count against your rate limit usage.
Now, it’s hard to keep track of an unpublished number that grows exponentially, but last I heard, the data rate on the Twitter Firehose was approaching 1000 tweets per second. Let’s have a look at the Twitter API rate limits:
The current technical limits for accounts are:
- Direct Messages: 250 per day.
- API Requests: 150 per hour.
It is quite clear from this that we’re not going to get the full public timeline at 20 tweets per call and 150 calls per hour. That gives us a retrieval rate of at most 3000 tweets per hour, against the 1000 tweets per second (3.6 million per hour) that we need. That is roughly 0.08%. A bit off the mark, then.
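The arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, the 1000 tweets per second figure is the rough estimate quoted above, and the second bound assumes that the 60-second cache means at most one fresh response per minute.

```python
# Back-of-the-envelope bound on polling statuses/public_timeline.
# The figures are the ones quoted in the text; the names are illustrative.
TWEETS_PER_CALL = 20
CALLS_PER_HOUR = 150               # REST API rate limit
CACHE_SECONDS = 60                 # public timeline cache lifetime
FIREHOSE_TWEETS_PER_SECOND = 1000  # rough estimate, not an official number

firehose_per_hour = FIREHOSE_TWEETS_PER_SECOND * 3600

# Bound from the rate limit alone.
rate_limited_per_hour = TWEETS_PER_CALL * CALLS_PER_HOUR

# Bound once the cache is considered (assumption: one fresh response
# per cache window, so additional calls return nothing new).
fresh_calls_per_hour = min(CALLS_PER_HOUR, 3600 // CACHE_SECONDS)
cache_limited_per_hour = TWEETS_PER_CALL * fresh_calls_per_hour

rate_limited_share = rate_limited_per_hour / firehose_per_hour
cache_limited_share = cache_limited_per_hour / firehose_per_hour

print(f"rate limit only: {rate_limited_per_hour} tweets/hour ({rate_limited_share:.3%})")
print(f"with 60 s cache: {cache_limited_per_hour} tweets/hour ({cache_limited_share:.3%})")
```

If the cache assumption holds, the usable share is even smaller than the rate limit alone suggests, since only 60 of the 150 hourly calls can return anything new.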
Let’s consider using the search API. After all, the search API is made to do searches on the public timeline, right?
On the rate limiting documentation page, we see the following piece of information (under the heading Search API Rate Limiting):
Requests to the Search API, hosted on search.twitter.com, do not count towards the REST API limit. However, all requests coming from an IP address are applied to a Search Rate Limit. The Search Rate Limit isn’t made public to discourage unnecessary search usage and abuse, but it is higher than the REST Rate Limit. We feel the Search Rate Limit is both liberal and sufficient for most applications and know that many application vendors have found it suitable for their needs.
This means the search API is also rate limited, and the rate limit is unknown. My guess is that, while this will yield a slightly higher return than the statuses/public_timeline method, it will still be way below 1%. Selling access to the Firehose, together with Promoted Tweets, is the sum total of Twitter’s revenue, after all.
This also explains why, as Twitter grows, the time a Tweet is searchable for is getting progressively shorter. It used to be longer than a month, but now a tweet decomposes in a matter of days.
Let’s consider using one of the streaming APIs. With them, instead of making successive calls, one call is made and the connection stays open, delivering Tweets as they arrive. Sounds good, right?
Uh oh! – on the Streaming API Concepts page, we read the following:
The current sampling rate is ~1% of public statuses by default (aka Spritzer), and ~5% of public statuses for the Gardenhose role. The algorithm is exactly as follows:
The status id modulo 100 is taken on each public status, that is, from the Firehose. Modulus value 0 is delivered to Spritzer, and values 0-4 are delivered to Gardenhose. Over a significant period, a 1% and a 5% sample of public statuses is approached. This algorithm, in conjunction with the status id assignment algorithm, will tend to produce a random selection.
This is the answer to all of our searching and deliberating. A maximum of 5% of the Firehose can be consumed without paying. Furthermore, the algorithm is deterministic: every connection receives the same Tweets, so combining more than one such stream will still yield the same 5% of the Firehose.
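A small simulation makes the point about combining streams concrete. It is a sketch under stated assumptions, not Twitter's implementation: the function `sampled_ids` and the variable names are hypothetical, the status ids are random stand-ins rather than real ids, and the thresholds (residue 0 for roughly 1%, residues 0-4 for roughly 5%) are chosen to match the percentages quoted above.

```python
import random


def sampled_ids(status_ids, max_residue, modulus=100):
    """Keep the ids whose residue modulo `modulus` is at most `max_residue`."""
    return {sid for sid in status_ids if sid % modulus <= max_residue}


# Hypothetical stand-in for the Firehose: 100,000 distinct random ids.
rng = random.Random(42)
firehose = rng.sample(range(10**9), 100_000)

# Assumed thresholds: residue 0 for ~1%, residues 0-4 for ~5%.
spritzer = sampled_ids(firehose, max_residue=0)
gardenhose_a = sampled_ids(firehose, max_residue=4)
gardenhose_b = sampled_ids(firehose, max_residue=4)  # a second "connection"

# Merging two streams built by the same rule adds nothing new.
combined = gardenhose_a | gardenhose_b

print(f"spritzer share:   {len(spritzer) / len(firehose):.2%}")
print(f"gardenhose share: {len(gardenhose_a) / len(firehose):.2%}")
print(f"two gardenhose streams combined: {len(combined) / len(firehose):.2%}")
```

Because the selection depends only on the status id, a second connection sees exactly the same subset, and the union of the two is no larger than either one.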
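Now you might be thinking: how is this possible? Aren’t thousands of people using clients like Hootsuite and Tweetdeck to access their own and their friends’ Tweets all the time? How do they manage that? They manage because, as noted earlier, those clients only ever need the data of one user or a few users at a time, and the parts of the API dedicated to that are enough for them. None of them needs the whole Firehose.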
Thus, while at face value the Twitter Firehose seems to be free and open, it is, on the contrary, a very expensive luxury, even if it is difficult for most people to understand that it costs anything at all.
This is how Twitter makes money.
Posted by Adriaan Pelzer