Once diagnosed, the problem was easy to fix: a simple reboot, or the even simpler step of resetting Linux’s timekeeping (e.g., via date `date +"%m%d%H%M%C%y.%S"`), was enough. The only difficulty was in determining the cause.
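The reset command nests one date call inside another: the inner call prints the current time in the MMDDhhmmCCYY.ss form, and the outer call sets the clock to that same value. Here is a minimal sketch that separates the two steps, assuming a POSIX shell. The helper name build_stamp is hypothetical, and the line that actually sets the clock is left commented out because it requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the current time as MMDDhhmmCCYY.ss,
# the format that `date` accepts as a set-time argument.
build_stamp() {
    date +"%m%d%H%M%C%y.%S"
}

stamp=$(build_stamp)
echo "Would run: date $stamp"

# To actually reset the clock (requires root), uncomment:
# date "$stamp"
```

Setting the clock to the time it already shows looks like a no-op, but it goes through the kernel's time-setting path, which is what the reset relies on.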
Initial reporting often fingered Java or even Cassandra as the culprit, a testament to the popularity of these systems on high-traffic web sites, but the actual problem was a kind of livelock in the Linux system calls responsible for timers. What made this non-obvious (unless you were one of the unlucky admins whose servers actually crashed) is that tools like top reported the application in question as the one consuming the CPU; digging deeper to see that the real culprit was misbehaving system calls like futex_wait is beyond the scope of most systems administration.
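One rough way to dig a level below top is to compare how much of a process's CPU time was spent in user mode versus in the kernel. On Linux, /proc/<pid>/stat exposes these as the utime and stime fields (fields 14 and 15, in clock ticks). The sketch below, with the hypothetical helper name cpu_split, only parses those two fields. It does not identify which system call is responsible, and the sample line is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: given the contents of /proc/<pid>/stat,
# print the user-mode and kernel-mode CPU ticks.
cpu_split() {
    # The command name (field 2) may contain spaces, so drop
    # everything up to and including the last ") " first.
    rest=${1##*) }
    # Split the remainder on whitespace; field 3 (state) becomes $1,
    # so utime (field 14) is ${12} and stime (field 15) is ${13}.
    set -- $rest
    echo "user=${12} system=${13}"
}

# Made-up sample line, truncated after stime.
sample="1234 (java) S 1 1234 1234 0 -1 4202496 500 0 0 0 150 9000"
cpu_split "$sample"
```

On a live system the input would come from something like cpu_split "$(cat /proc/1234/stat)". A system figure that dwarfs the user figure suggests the time is going into kernel calls rather than application code.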
This affected Java systems software like Cassandra, Hadoop, ElasticSearch, and Jetty, as well as non-Java code like MySQL or even client software like Firefox.
A fix for the Linux kernel is in progress as of t
(Curated by Dennis Moore. Read the complete article here)
