Slaying the system_server Bug

(The following is a technical glimpse at a current XDAndroid development topic aimed at intermediate or advanced users and developers.)

Chances are if you’ve been using XDAndroid on a Rhodium, you’ve been hit by this annoying bug: the device is painfully slow from the time the XDAndroid boot animation begins. This is a bug seen mostly be Rhodium users and of varying degrees of consistency. Some people see it nearly every boot, others see it once in a handful of boots. When investigating the running environment while it’s happening, there is no interesting logcat or dmesg output, and top shows system_server hogging at least 95% of the CPU time.

An effective workaround has been to place a short phone call to the voicemail service and hang up. For whatever reason, this would cause system_server to calm down and act normal for the rest of the XDAndroid session. A more radical (and perhaps less effective) solution was to re-enable Dalvik’s JIT execution and put up with the various bugs it would potentially reintroduce. Somehow, JIT was able to mitigate the bug to the point that casual testing did not reproduce the system_server issue.

Upon further investigation by several XDAndroid team members (most notably arrrghhh and WisTilt2), the bug had been tracked down to the userland libraries used by XDAndroid for hardware support. This meant it was either an issue in the RIL, GPS or sensors (accelerometer) drivers. After even more testing by arrrghhh, who readily volunteered to test an unfinished internal Gingerbread build, it was determined that the likely culprit was the sensors driver, which remains missing in Gingerbread.

At my request, arrrghhh performed repetitive testing on our latest Froyo release (FRX04) with JIT disabled. With the sensors driver in place, the issue was reproduceable on essentially every boot. After removing the sensors driver, it could not be reproduced once. This was a very telling result, so it was time to figure out where the issue was in the sensors driver.

Finding the issue was actually relatively trivial. Such a runaway process usually indicates an uncontrolled loop. The sensors driver continuously checks for data from the sensors devices while Android is running. In our driver, the code responsible for that check is seemingly prone to infinite looping in a specific case where incomplete data is received from the sensors device. In practice, this case occurs frequently on Rhodium and sends system_server into fits. I’m guessing that making a phone call causes Android to query for a proximity sensor, which bails the sensors driver out of that loop to process the request (and then it goes back into the loop and is able to read data normally).

So, since the logic generally looked sound in the loop, the simplest and most likely solution was to add a delay while handling that corner case of incomplete data. For testing, we used an unreasonably large delay and added it in the general case for the loop, ensuring that the loop could never become a runaway in proper runtime conditions. Through more tedious testing, arrrghhh was able to confirm that the delay solved the runaway system_server issue (while making the sensors unusable).

After making that loop delay much shorter and placing it in the corner case exclusively, testing continued to show success. So it seems like the system_server bug is finally defeated. We’ve already released a new rootfs with the relevant change integrated. Give it a try and let us know how it works out, via the IRC channel or the aforementioned bug. Thanks for reading!

2 thoughts on “Slaying the system_server Bug”

  1. Does this mean that as often as the system server bug used to hit, that the sensors driver is still probably stuck in the infinite loop, only with all of the harmful side effects mitigated by the fact that it now sleeps between polling?

    Any other harmful side effects you could see of having the code stuck in that loop?

  2. The sensors driver is in the loop as long as Android is asking for data. If the polling fails, the loop continues until it works. Due to what appears to be improper hardware initialization, it will never read from the sensors device properly without an artificial delay.

    In our systems before the delay was introduced, making a call likely results in Android asking for proximity sensor data, which the sensors driver will initialize differently for, and then the read succeeds and the problem is mitigated for the remainder of that Android session.

Leave a Reply