Here at Twitter HQ, we're not blind to the flurry of discussion over the past weeks about our architecture. For many of our technically-minded users, Twitter downtime is an opportunity to muse about what the source of our problems might be, and to propose creative solutions. I sympathize, as I clearly find our problems interesting enough to work on them every day.
Part of the impetus for this public discussion extends from the sense that Twitter isn't addressing our architectural flaws. When users see downtime, slowness, and instability of the sort that we've exhibited this week, they assume that our engineering progress must be stagnant. With the Twitter team working on these issues on and off for over a year, surely downtime should be a thing of the past by now, right? Shouldn't we be able to just "throw more machines at it"?
To both rhetorical questions, the answer is "not quite yet". We've made progress, and we're more scalable than we were a year ago, but we're not yet reliably horizontally scalable. Why? Because there are significant portions of our system that need to be rewritten to meet that goal.
Twitter is, fundamentally, a messaging system. Twitter was not architected as a messaging system, however. For expediency's sake, Twitter was built with technologies and practices that are more appropriate to a content management system. Over the last year and a half we've tried to make our system behave like a messaging system as much as possible, but that's introduced a great deal of complexity and unpredictability. When we're in crisis mode, adding more instrumentation to help us navigate the web of interdependencies in our current architecture is often our primary recourse. This is, clearly, not optimal.
Our direction going forward is to replace our existing system, component-by-component, with parts that are designed from the ground up to meet the requirements that have emerged as Twitter has grown. First and foremost amongst those requirements is stability. We're planning for a gradual transition; our existing system will be maintained while new parts are built, and old parts swapped out for new as they're completed. The alternative - scrapping everything for "the big rewrite" - is untenable, particularly given our small (but growing!) engineering and operations team.
We keep an eye on the public discussions about what our architecture should be. Our favorite post from the community is by someone who's actually tried to build a service similar to Twitter. Many of the best practices in scalability are inapplicable to the peculiar problem space of social messaging. Many off-the-shelf technologies that seem like intuitive fits do not, on closer inspection, meet our needs. We appreciate the creativity that the technical community has offered up in thinking about our issues, but our issues won't be resolved in an afternoon's blogging.
We'd like people to know that we're motivated by the community discussion around our architecture. We're immersed in ideas about improving our system, and we have a clear direction forward that takes into account many of the bright suggestions that have emerged from the community.
To those taking the time to blog about our architecture, I encourage you to check out our jobs page. If you want to make Twitter better, there's no more direct way than getting involved in our engineering efforts. We love kicking around ideas, but code speaks louder than words.
Thursday, May 22, 2008
67 comments:
Awesome post. Thanks for taking the time to write clearly and candidly about this. Good luck! (Disclaimer: I am a Twitter investor. But, I am also a huge f'ing fan of the service and am looking forward to it kicking more ass.)
Hi Alex,
You guys have a really good problem. It's actually enviable from the point of view of someone who enjoys that sort of challenge. You have enough usage data to know exactly what to optimize. At the same time, the clock is ticking.
By now I'm sure you have figured out a decent architecture for the problem (what to cache, what should be in memory as opposed to in a db, etc), and now it's all about execution. As a Twitter fan, I appreciate the post and I wish you good luck.
Great post! Being candid and at the same time defending your team is the right move. I agree that this is complex and you have your work cut out for you, but never doubted that you knew that and were working on it...
Wonderfully worded post Al3x!
This is good to hear. Most of the people who complain really have no clue about the complexities in building/maintaining a large, scalable system. So letting everyone know what's going on, and what's being done to address the issues is a great way to put the clamps on the complainers.
Here's looking into the future for a stable, scalable Twitter.
You are rewriting piece by piece. Does this mean you are making a slow exodus from ruby?
Thanks Alex. Twitter, or a Twitter-like application is the future I believe. However, as I point out in my blog about Twitter, reliability is key to successfully leveraging the business potential.
Of course, you know that. However, you need to stay ahead of the game--an application this popular will soon be prey to copycats. Note the fate of Friendster and others...
Thank you for the information. It is so much easier being patient and understanding if one is not left out in the cold wondering what the heck is going on. This helps!
Great to see you engage with the community. It would be great of course if you would even be more open and talk about the specifics :-)
As for scaling, I saw the same thing happen with Second Life, and the discussion seemed to be the same too ("is nobody working on that?" etc.). There it also meant refactoring and replacing many elements to cope with the growth and, as with Twitter, probably not every past decision was the right one, since nobody knew the future.
Right now Linden Lab seems to be quite open when it comes to the technologies used, what problems have been there and how they have been solved. It would be great to hear that from you, too as I guess many people could also learn from it.
Great first step in being open about the issues.
I'd encourage rampant transparency as you move forward.
great post. the realities of clusterable, horizontalable, "throw more machines at it," scale are non-trivial. i think much of the fervor over it comes from the industry coming out of an era where "throwing more machines" at relatively static websites (including shopping sites like ebay/amazon) was indeed as easy as it sounds (often). we're now in a phase where true apps (e.g. messaging systems like twitter) are being built and heavily used.
the game for the next 5-10 years is scaling.
This is a great post and honestly y'all should think about how to post a link to it to the twitter home page.
As someone who works services in a software-as-a-service company, what resonated with me is the issue that while it's used as a social messaging system, it wasn't originally created to do that. To change, it's like rebuilding a boat while still at sea...replacing everything plank by plank. I know how hard it is to alter your tech architecture while still moving at the speed of light. That's a heck of a challenge.
Again, the transparency of this blogpost is kind of thing that wins passionate advocates...and makes us a little more patient when things go wrong. I fall in this category even more so now. Thanks again for making Twitter what it is and what it will be in the future.
Chris
How are you instrumenting the app? Writing code (ye olde PRINT statements technique) or using a more highly evolved technology? I wonder if the issues that you are having are not indicative of a larger issue with your choice of language platform. Is this not a RoR app? I can't imagine that MySQL or Linux are having scalability/reliability/availability issues, but I can see that with RoR. Not knocking RoR, but seeing as Twitter is probably the largest user base (if you include API load) that any RoR app has had, might this not indicate an issue with RoR? My apologies if Twitter is built in something else.
*LOVE* the 3-part Nouncer reference. Serious. This is complicated stuff-- OZ "the man behind the curtain" -- that on the service seems so simple. Simple, it ain't. Thanks for the education. Thanks for your and your team's *passion*, vision, and commitment. Wishing you all the best of continued success!
It'd be interesting to see an overview of what your architecture is headed towards, and why you rejected certain solutions.
Thanks for this post. As a fellow developer, I am constantly amazed by the comments on Tech Crunch, etc., that trivialize the enormous challenges you guys face. I wish you the best of luck.
Guys, thanks for the post. Remember it's always good to communicate, even if it's just a simple paragraph. It keeps everybody happy and especially to fend-off those 'ungrateful' users, yes I'm talking to 'you' !
I understand how hard it must have been in there for all of you, and I want you to know that we are behind you in this. Just do your best mate ! That's all you can do :)
Thank you for this information. This kind of transparent discussion helps me understand and more easily accept the downtime.
Honestly, twitter downtime is frustrating, but knowing that stability and reliability is your focus is comforting.
Thanks for the update. Are there any user-side solutions to help balancing the load in the meantime? Like asking people to use SMS and IM rather than the website during peak times? Would privileging third-party API-based apps over the website help?
I think this kind of open and honest post was the best thing you guys could have done. There's a surprising amount of misunderstanding about what it really means to scale a service. Good luck! I think hiring the right kind of scaling people is exactly what you guys need to do.
We (at grazr.com) are hoping to contribute to the Twitter ecosystem with some new projects we're working on.
I crossposted this to that hueniverse blog, but I think it would be applicable here as well:
Twitter is likely falling into the trap of optimizing for write speed, rather than optimizing for read speed.
Let's break down this paragraph:
"Going through a timeline request, the server first looks up the list of people the user is following. Then for each person, checks if their timeline is public or private, and if private, if the requesting user has the rights to view it. If the user has rights, the last few messages are retrieved, and the same is repeated for each person being followed. When done, all the messages are collected, sorted, and the latest messages are converted from their internal representation to the requested format (JSON, XML, RSS, or ATOM)."
First of all, dealing with a push service or not, the server does not need to look up the list of people the user is following. In fact, it doesn't need to do any of those queries at all. It just needs to know the past X messages sent to me (where X is set by the business; my guess is a value of 15 is perfectly valid for 99% of all Twitter access).
Take the following scenario: I follow you, you write a message, then I stop following you.
Push service or no, I was "sent" that message during that time period, and so that message gets put onto my queue. If you block me, or leave the service, or what have you, that message *at that time period* was still in my queue, and so long as that's within the past X messages, that's going to show up every time I get my most recent messages.
This is how more traditional messaging systems work, and why they're so A) reliable, and B) fast. Instead of writing a query with so many joins against an author table, a permissions table, a "following" table, etc etc, all you need to do is one query against one table, with descending ordering on the primary key.
The penalty should be on message write. Run your calculations there, don't run them on message read. Have a transaction queue, submit jobs to the queue and return control to the user quickly. "Thanks for changing your real name!" becomes a large transaction (update all previous messages to set the name differently) but with a queueing system, people will still find the service to be fast.
This isn't really a "push" system I'm describing; it's more of a hybrid system that is, again, optimized for reading the same data over and over and over again.
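The write-time approach described in this comment can be sketched in a few lines. This is only a minimal in-memory illustration, not Twitter's implementation: the names `follow`, `unfollow`, `post`, `timeline` and `INBOX_SIZE` are hypothetical, and a real system would use persistent storage and a job queue instead of Python dictionaries.

```python
from collections import defaultdict, deque

INBOX_SIZE = 15  # the "X" from the comment; 15 is the commenter's guess

# Hypothetical in-memory stand-ins for the "following" table and per-user queues.
followers = defaultdict(set)                             # author -> set of followers
inboxes = defaultdict(lambda: deque(maxlen=INBOX_SIZE))  # user -> newest-first messages


def follow(user, author):
    followers[author].add(user)


def unfollow(user, author):
    followers[author].discard(user)


def post(author, text):
    """Pay the cost on write: copy the message into every current follower's inbox."""
    for user in followers[author]:
        inboxes[user].appendleft((author, text))


def timeline(user):
    """Read path: a single lookup, with no joins and no permission checks."""
    return list(inboxes[user])


# The scenario from the comment: I follow you, you write, then I stop following you.
follow("me", "you")
post("you", "hello")
unfollow("me", "you")
post("you", "sent after the unfollow")
print(timeline("me"))  # [('you', 'hello')]
```

The message delivered while the follow was active stays in the inbox, and the later one never arrives, which is the behaviour the comment describes. The bounded `deque` drops the oldest entry once more than `INBOX_SIZE` messages have been delivered.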
Communication is the lifeblood of every relationship. Thanks for keeping us in the loop. That's all I'm looking for...
"Hey, Twitter's probably going to take a break this afternoon. I know it's a bummer. We're working on it as hard and as fast as possible. Promise. But wanted to let you know..."
It's only because I'm addicted.
Twitter's been down? Really? hmm... I use Thunderbird, GApps, a smartphone and the NYC Subways.. I can't tell that Twitter's down any more than any of the other services that I've come to know and love every day.
Plus, if it's down, I just get work done.
What REALLY scared me was when Typepad used to go down.
for the record, alex is the only one who seems to know what's going on.
although i liked reading what you had to say and although there are great posts, it does not help your cause when we see you guys Tweet the vacation time you take, the beer you drink or whatever else you are doing besides trying to fix the problems. in fact, it just upsets/annoys more.
my recognition of startup life although never tried it yet is one of 24/7 work. in corporate life, i work till at least 11, in the office by 8. not saying you should too, but your words are meaningless by the same token code speaks louder than words; so do actions and your casual demeanor and bemusing tweets show a lack of seriousness.
if you fail, fail gracefully. right now, just seems like you're jerking around for having the hottest sh*t on the block right now.
I've never used Twitter. But I enjoy reading blogs written by armchair clowns. You know, the bozos claiming your problems are due to RoR. LOL!
Our web site is going through similar transitions. We designed it one way, and now need it designed another way. Pretty standard stuff for most software projects.
Of course, its easier to do when the site growth is slow and steady, rather than viral.
Port your core to Java and use JMS? I can't be the first one to say this :)
anyway, I've been there, keep up the work because I love using twitter.
great post. thanks for opening twitter's kimono on this (just a little anyway).
the real issue/challenge facing y'all is that since twitter is so intertwined with SMS, it creates this unrealistic expectation that twitter will be as reliable as SMS itself.
we all know this will never happen due to the nature of the beast. it's a rough spot to be in but i'm rooting for your team.
I am curious as to why it was written as a content management system and not a messaging system?
It is a legitimate question. I am not being disrespectful, I am very interested in the answer.
Call the guys at TIBCO and ask about their Rendezvous product.
When you think about stocks quotes, Google, email/SMTP, caching, etc., the Twitter problem just doesn't seem too difficult.
Btw, are the key Twitter problems related to Ruby ones?
"Btw, are the key Twitter problems related to Ruby ones?"
Ruby is just a programming language - a nice one at that. The problems Twitter are seemingly facing don't relate to those of a programming language itself.
@elithrar Is there a heavy site like Twitter on Ruby without analogous problems?
Scale is hard. Not from design but given the resources at hand while assuming some ballpark growth rates. It's very easy to scale given unlimited resources, however this results in sloppy waste.
What I don't understand is your comment re initial build. What the fuck did you think you were building from day one?
Does anybody actually think that the Twitter folk knew how popular the service would be when they started? Do you think they could have forecast the rate of growth?
No way in 'ell would I have thought something like sending little snippets of text, throughout the day, to lots of folks, over sms would be so popular.
Sure, they probably need to make some architectural changes. I'm sure they already know what changes need to be made.
But right now, they are probably just trying to keep up with growth. Where is the time for rebuilding?
Maybe that new $$$ will help.
But seriously, unless you have faced a problem of the same magnitude directly, you probably shouldn't fling mud.
My recommendation is to look at the tech offered by kxsystems (www.kxsystems.com). They have an extremely fast, lightweight data processing engine that is used all over Wall St to build program trading systems. Their software is basically a data processing language (a la matlab) and database engine which can handle hundreds of thousands of messages per second.
I can vouch for these numbers because I am a happy customer.
I'm not going anywhere. (Besides, I feel like I've been through this before with Twitter.)
Microblogging is an incredible way to teach the average Web user about feeds and blogs and all this crazy Web 2.0 stuff. And Twitter is *still* the best service to get them started.
Not. Going. Anywhere.
I'm not a tech person, and so I admit that I'm not qualified to understand all the internal processes that go into making Twitter work. But I do have a suggestion that, it seems to me, might do some good towards easing the burden on Twitter's systems. I've been thinking about the blacklisted users -- people who are following thousands upon thousands of users, with very few following back. These people must be using up valuable system resources, and clearly they aren't adding to the community or they would be followed in return. Why doesn't Twitter put a limit on how many people you can follow without reciprocation? If people were paying for the service, then I could see the logic in letting them use it any way they like... but Twitter is a free service, so why let people abuse it this way? Eliminate the abusers and save system resources for the people who want genuine dialogue with the community.
I do wish you'd hurry on the transition to a proper messaging-oriented architecture. Until uptime is more reliable, I'm not comfortable with the plan to use Twitter accounts as the service bus for our SOA enterprise apps.
Oh, and can you allow jms: URIs as usernames? That'd be convenient.
Check out ESPER: http://esper.codehaus.org/index.html
Have you guys ever considered putting a limit on the number of people a user can follow? Seems to me when that number gets up into the thousands, you have a whole different model going on. I would hope messages from such users (e.g. political candidates) are prioritized WAAAAAY lower than the "average Joe" user.
@Scabr - I think 37signals' Basecamp (as the inaugural Rails app) would qualify as a Ruby application without problems - but without knowing their user-base it's hard to quantify or compare to Twitter.
It's just definitely not a 'language' issue in my mind - horizontal scaling isn't a deficiency of a programming language. A framework, absolutely: Rails brings a lot of benefits but it's well known that *any* ORM won't be as perfect as hand-coded SQL, and the way Twitter uses their database introduces further complexities.
Very nice job on the post. It's nice to have a better understanding of things behind the scenes at Twitter HQ.
Cheers!
@davedelaney
Best wishes. I have some idea of how hard this is. Keep on pounding at it and you'll get there.
Thanks very much for this. I was feeling disheartened this week. I've said it on there, and I'll say it here, I'd gladly pay $5 a month to help Twitter be stable as time goes on, and keep it ad free. Provide a revenue source, without disturbing the pure awesome experience that is Twitter. Thanks for making this!
"given our small (but growing!) engineering and operations team"
This is the statement I don't understand. The freaking eyes of the tech world have been focused on your failure for months now. You should have an army of geniuses working day and night. I can't think of a more public failure of technical leadership than twitter. Get help.
I understand all the criticism of twitter. The recent failures seemed designed to cure what had all the signs of becoming a twitter addiction.
What I don't understand is how you guys actually pay for this stuff. What kind of business model are you using? Where is the revenue stream?
You guys have come in for some pressure in the last few days. Everyone seems to want to be an expert. Good luck, I think you sound like you know where you are going. As a developer, I know only too well how it can be when everyone thinks they have a "simple" solution that you won't have thought of.
Kudos, Alex and the Twitter engineering team, for diagnosing the problem and making the tough call.
Beware the Second System Effect!
As a geek, I'd love to see a technical explanation of Twitter's architecture and its current problems.
The longer I have worked as a web application developer, the more technology agnostic I have become. I have seen problems and benefits with just about every development environment / language out there. Most of my professional background is in Perl, but the last couple of years I have been working on Ruby on Rails applications. My current opinion is that RoR is the most enjoyable and rapid to develop in. I have seen some amazing sites built in Ruby on Rails ... not only are they heavy traffic sites, but they were rapidly built.
To answer scabr's question, some large scale Ruby on Rails sites that are out there are:
http://www.yellowpages.com
http://www.basecamphq.com
http://www.shopify.com
http://pv.webbyawards.com
http://www.backpackit.com
http://www.43things.com
If that isn't large scale enough, I don't know what is. Also, a big part of scalability depends on your hosting solution. If you are serious about building a large scale Ruby on Rails application, you should look at Engine Yard, http://www.engineyard.com. They can throw slices at your application, and they also have a team of Tech Support folks that will help you work out your scaling issues down to the code level.
If you are still unsure about getting locked into Ruby on Rails, you should check out Merb, http://www.merbivore.com/, which is an MVC framework that is ORM-agnostic, JavaScript library agnostic, and template language agnostic.
All in all, I think a development team needs to find out what framework will work best for their skills, background and project requirements, so there really isn't a single framework that will replace all others in my opinion.
I'm not a twitter user, just curious about the architecture issue.
Assume we can simplify the problem as follows: one user posts a short message, and this message is forwarded to many subscribers at various addresses and devices.
To scale it up horizontally, first you need to partition the database, otherwise it will become the bottleneck. A typical database partition is one db for user profiles and many other dbs for user data like text messages, photos, video, etc. A user belongs to one db. As users grow, more dbs are added. A lightweight user profile might need to be replicated to the user db for performance, e.g. for table joins.
When a user logs in, they are directed to a user app connected to their own db. Once the user creates a message, the message is put on a queue. Based on the subscribers' addresses and routing methods, queue apps simply forward the message to the subscriber's db or an external system. More messages? Just add more user apps and more queue apps.
There're many places you could optimize, for example, if the message is read only, then one copy per db is enough. The above approach should allow you to scale and be reliable.
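As a rough sketch of the partition-plus-queue layout described above, under stated assumptions: the dictionaries stand in for real databases, and `NUM_SHARDS`, `shard_for`, `create_message`, `run_queue_app` and `read_messages` are hypothetical names chosen for illustration. Nothing here reflects how Twitter actually partitions its data.

```python
import zlib
from collections import deque

NUM_SHARDS = 4  # hypothetical shard count

shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one user db
subscribers = {}                              # author -> list of subscriber names
queue = deque()                               # stands in for the message queue


def shard_for(user):
    """Map a user to one shard. crc32 is stable across runs, unlike built-in hash()."""
    return zlib.crc32(user.encode("utf-8")) % NUM_SHARDS


def create_message(author, text):
    """User app: enqueue the message and return to the user immediately."""
    queue.append((author, text))


def run_queue_app():
    """Queue app: forward each queued message to every subscriber's shard."""
    delivered = 0
    while queue:
        author, text = queue.popleft()
        for user in subscribers.get(author, []):
            shards[shard_for(user)].setdefault(user, []).append((author, text))
            delivered += 1
    return delivered


def read_messages(user):
    """Read path: touches only the one shard that owns this user."""
    return shards[shard_for(user)].get(user, [])


subscribers["alice"] = ["bob", "carol"]
create_message("alice", "hi")
delivered = run_queue_app()
print(delivered, read_messages("bob"))  # 2 [('alice', 'hi')]
```

One caveat on the design: plain modulo placement moves most users to a different shard whenever `NUM_SHARDS` changes, so adding dbs as users grow would in practice need a lookup directory or consistent hashing instead.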
I'm no techie, just a lover of Twitter. You have created something that's really valuable to many people (hence the outrage when it goes away), which is a huge achievement in itself. So kudos for that, and good luck for all the hard rebuilding work ahead.
Code? It's the database and infrastructure supporting it that you need to look at. Designed as a CMS but used as a messaging system? Bring in IBM -they know a thing or two about messaging based operating systems.
If Twitter hadn't been programmed in RoR, I think Twitter would not be what it is nowadays
I know it's not a problem with a particular programming language (Ruby), but I do wonder how much the "greenness" of RoR's scalability tools affected Twitter's reliability.
Also, Al3x, I am curious if knowing what you know now, would you use RoR again?
Cheers,
-Daniel
Getting better and better with the communication guys!!
But please, seriously consider adding a field on that Jobs page that reads "Community Manager" in the near future so you can focus more of your efforts on scaling and less on having to nurture the community.
@ColleenCoplick just wrote you a brilliant "open letter" about it. Seriously, consider it?
http://www.buzznetworker.com/an-open-letter-to-twitter/
Hmm,
XMPP is a communication architecture. There was even a google tech talk about how Google Talk scales. Think presence (online/away/etc) messages as tweets - in fact, they're roughly as long (100-200 unicode characters per change)
http://video.google.com/videoplay?docid=6202268628085731280
Would it help if twitter would move to some XMPP PubSub based architecture? (Like the erlang-based ejabberd's pubsub module)
PubSub protocol is detailed here:
http://www.xmpp.org/extensions/xep-0060.html
The ejabberd module is detailed here:
http://www.ejabberd.im/mod_pubsub-usage
http://www.techcrunch.com/2008/05/22/twitter-at-scale-will-it-work/#comment-2325662
Post 123
I have written a small article on a potential solution.
I too am interested in how you finally implement it if a cache-based solution seems to work. But the issue someone mentioned of firing a cache back up after a crash wouldn't necessarily be too bad. I'm assuming that you are all well aware of memcache servers? If not, take a look, it's neat stuff.
If you're willing to throw some custom built FPGA, Network Processors, and Cache at the problem like the folks on Wall Street do with Market Data and Program Trading then you can quickly get to 8-10 Million tweets per second on a single box.
http://www.solacesystems.com/
The problem is Ruby on Rails!!!!
Have you ever considered Oracle?
the problem is not ruby on rails!
chris @ www.gofrostfire.com
It works fine for me!
Can you give a comment to my architecture.
See the diagram
http://www.ijoinu.com
let's discuss!
thanks a lot for this post
it made me think of starting to work on this problem as it looks like an interesting one
good luck in your work!
------------------------------------------------
Dave Winer, father of RSS says “Twitter, as it was conceived, was never meant to live.”
“It’s very possible with better engineering its architecture might have gone on for a few more years, but eventually it would have hit this wall, where there were too many people posting too many twits to too many followers. The scale of the system as conceived rises exponentially.”
So is the end of Twitter getting near? I hope not. Twitter I hope that you are listening and you better start taking things more seriously.
-----------------------------------------------
Here's my two cents.
For instance, there are about 100m users of Yahoo Messenger, and usually 2-3 of them talk at a time; that means a scalability of 300m conversations. On the other hand, with 100m Twitter users who usually send messages to 100-10,000 other users, the scalability required is 10,000m to 10^6m. I have never known any current architecture based on webservers to handle such a scale. So, according to me, Twitter was never meant to live. It is like a concept car that will never see production. Users of Twitter don't understand this and they don't care.
They don't know what's happening when the website is down. The sad part is that the best analysts claim that Twitter is a billion dollar company in one year of operations. There is an old story from the days before people understood permutations and combinations. A peasant asked a king to give him rice equal to the total amount obtained by placing on each square of a chessboard double the number of grains on the previous square, starting with one grain. There are 8x8=64 squares. We seriously need to revisit grade 7 mathematics.
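The arithmetic in this comment can be checked directly. The sketch below takes the commenter's figures as given (100m users, fan-outs of 3, 100 and 10,000, all hypothetical round numbers) and also sums the chessboard series; the function names are made up for illustration.

```python
USERS = 100_000_000  # the comment's round figure of 100m users


def deliveries(users, recipients_per_message):
    """One message per user, copied once to each recipient."""
    return users * recipients_per_message


def chessboard_total(squares=64):
    """1 grain on the first square, doubling on each following square."""
    return sum(2 ** square for square in range(squares))


for fanout in (3, 100, 10_000):
    print(fanout, deliveries(USERS, fanout) // 1_000_000, "million")
# 3 300 million
# 100 10000 million
# 10000 1000000 million

print(chessboard_total())  # 18446744073709551615, i.e. 2**64 - 1
```

Note that the delivery count grows as the product of users and recipients per message, while the chessboard total doubles with each square, so the two calculations illustrate different kinds of growth.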
I know of only one news/messaging system that supports around 1 billion users sending messages to all 1 billion users each. That's a scalability of 10^12m. It is not Web based but rather based on a massively scalable serverless P2P architecture. The team is soft spoken, and when I last talked to them I was told that they don't care about money or hype or fame but rather just for the passion of next generation global systems that will stand the test of worldwide use. It's called Mermaid News.
They have other software too, but this post is about Twitter and messaging. Once everyone comprehends the basic mathematics behind scalable algorithms, they would go past the flashy screen and hype to actually want a system they can trust. To the analysts I would say: it is easy to create a business plan, create hype and raise $20m in funding; it is far more difficult to create something of use.