
Tracing
While the Twitter code base has thorough test coverage for the majority of the application, many of our daemons (small programs that work on tasks retrieved from Starling queues) lack transparency into their inner workings. Tracing a message as it passes through the Twitter stack, traversing a variety of machines and services, is no mean feat. But trace we did, and we now have more "observability" built into Twitter's internals than ever before. (Plus, watching a message wind its way through our servers isn't just useful, it's also kinda neat.)
Mediating Between Starling and memcache-client
The insight gained from the tracing work helped us narrow down the potential culprits to Starling and the memcache-client library. Our newest hire, Robey, discovered that the event-driven version of Starling disconnects idle clients after 60 seconds of inactivity. Makes sense, right? Don't want to keep a zillion old connections around when you're a frequently contacted network service...
Problem is, memcache-client doesn't check for dropped connections when doing a SET operation. We do a lot of those. Not only do we use memcache for typical cache duties, but we also employ the memcache-client library as the base of our own StarlingClient library. Diagnosis: we'd inherited a couple of nasty bugs that were wrecking communication between the client and server ends of the queuing system that we rely on to keep everything going. Yikes!
The solution ended up being a patch to the memcache-client library that raises a MemCacheError if an operation cannot be completed. This allows code that wraps memcache-client to rescue the error and retry. Additionally, Starling has been improved with more logging and the ability to set the client timeout duration. A new public release of Starling that incorporates these fixes and improvements is coming soon.
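To make the rescue-and-retry idea concrete, here is a minimal sketch of what a wrapper might look like. It is not the actual StarlingClient code: FlakyClient and set_with_retry are hypothetical names invented for illustration, and MemCache::MemCacheError is redefined as a stand-in so the example runs without the memcache-client gem installed.

```ruby
# Stand-in for the error class raised by memcache-client. It is defined
# here only so the sketch runs without the gem installed.
class MemCache
  class MemCacheError < RuntimeError; end
end

# Hypothetical fake client: the first `failures` calls to #set raise, as if
# the server had dropped an idle connection; later calls succeed.
class FlakyClient
  attr_reader :attempts, :store

  def initialize(failures)
    @failures = failures
    @attempts = 0
    @store = {}
  end

  def set(key, value)
    @attempts += 1
    raise MemCache::MemCacheError, "connection dropped" if @attempts <= @failures
    @store[key] = value
  end
end

# Hypothetical wrapper: try the SET, retry up to max_retries more times on
# MemCacheError, then re-raise the last error if it still fails.
def set_with_retry(client, key, value, max_retries = 3)
  tries = 0
  begin
    client.set(key, value)
  rescue MemCache::MemCacheError
    tries += 1
    retry if tries <= max_retries
    raise
  end
end

client = FlakyClient.new(2)
set_with_retry(client, "queue", "message")
puts client.attempts # prints 3: two failed attempts, then one success
```

Capping the number of retries keeps a genuinely dead server from turning into an infinite loop; a real wrapper would probably also reconnect or pause between attempts.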
3 comments:
Great work, guys! Always nice to have a new pair of eyes too.
BTW, if you'd like some *ahem* cool graphs or visualizations of those messages flowing through your servers...you just let me know, mmkay? :-D
Wouldn't the problem with memcache have shown up before the move?
Great job and great project; I use Twitter every day and I don't know what I would do without it now.
Sounds like you have found the culprit, but as of this morning, I'm still missing tweets in my timeline. :-(
@alanstevens