Monday, February 11, 2008

Solving "The Case of the Missing Updates"

The Twitter engineering and operations team had a long week. Just a few days after moving our cluster of servers to a new host, we started to get reports of users missing updates. This, of course, is exactly the sort of thing we've been working to avoid by increasing our capacity and operating in a predictable, thoroughly tested, reliable environment. So we got on the case like the Scooby gang.

Tracing

While the Twitter code base has thorough test coverage for the majority of the application, many of our daemons (small programs that work on tasks retrieved from Starling queues) lack transparency into their inner workings. Tracing a message as it passes through the Twitter stack, traversing a variety of machines and services, is no mean feat. But trace we did, and we now have more "observability" built into Twitter's internals than ever before. (Plus, watching a message wind its way through our servers isn't just useful, it's also kinda neat.)

Mediating Between Starling and memcache-client

The insight gained from the tracing work helped us narrow down the potential culprits to Starling and the memcache-client library. Our newest hire, Robey, discovered that the event-driven version of Starling disconnects idle clients after 60 seconds of inactivity. Makes sense, right? Don't want to keep a zillion old connections around when you're a frequently-contacted network service...

Problem is, memcache-client doesn't check for dropped connections when doing a SET operation. We do a lot of those. Not only do we use memcache for typical cache duties, but we employ the memcache-client library as the base of our own StarlingClient library. Diagnosis: we'd inherited a couple of nasty bugs that were wrecking communication between client and server ends of the queuing system that we rely on to keep everything going. Yikes!

The solution ended up being a patch to the memcache-client library that raises a MemCacheError if an operation cannot be completed. This allows code that wraps memcache-client to rescue the error and retry. Additionally, Starling has been improved with more logging and the ability to set the client timeout duration. A new public release of Starling that incorporates these fixes and improvements is coming soon.
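
To make the rescue-and-retry idea concrete, here is a minimal sketch in Ruby. It is not the actual patch or our StarlingClient code: the MemCacheError class below is a local stand-in for the library's error, and FlakyClient, set_with_retry, and the three-attempt limit are hypothetical names and values invented for illustration. The fake client pretends the server has already dropped the idle connection for its first few SET calls.

```ruby
# Local stand-in for the error raised by the patched memcache-client
# (assumption: the real class lives inside the library).
class MemCacheError < StandardError; end

# Hypothetical fake client: the first `failures` calls to set behave as if
# the server had already closed the idle connection.
class FlakyClient
  attr_reader :store, :attempts

  def initialize(failures)
    @failures = failures
    @attempts = 0
    @store = {}
  end

  def set(key, value)
    @attempts += 1
    raise MemCacheError, "connection dropped" if @attempts <= @failures
    @store[key] = value
  end
end

# Hypothetical wrapper: try the SET up to max_attempts times, and re-raise
# the last error if every attempt fails.
def set_with_retry(client, key, value, max_attempts = 3)
  tries = 0
  begin
    tries += 1
    client.set(key, value)
  rescue MemCacheError
    retry if tries < max_attempts
    raise
  end
end

client = FlakyClient.new(1)   # first SET fails, second succeeds
set_with_retry(client, "timeline:42", "hello")
puts "stored after #{client.attempts} attempts"
```

The point of the sketch is that the wrapper can only retry because the failure is surfaced as an exception. A SET that fails silently on a dropped connection gives the calling code nothing to rescue.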

Wrapping Up

We also fixed a handful of other minor issues with our system that should generally improve the stability of our database and memcache connections, and identified areas in which we were relying on replicated databases that contained stale or inaccurate data. All this should contribute to more solid and dependable service from Twitter. With these bugs behind us, we're hoping that users and third-party developers can really see our new cluster shine!

3 comments:

Damon said...

Great work, guys! Always nice to have a new pair of eyes too.

BTW, if you'd like some *ahem* cool graphs or visualizations of those messages flowing through your servers...you just let me know, mmkay? :-D

Augie Schwer said...

Wouldn't the problem with memcache have shown up before the move?

Great job and great project; I use Twitter every day and I don't know what I would do without it now.

Alan Stevens said...

Sounds like you have found the culprit, but as of this morning, I'm still missing tweets in my timeline. :-(

@alanstevens