Sunday, March 4, 2012

Details on the Disappearing SD Media Issue

As recently noted, we're currently testing a set of fixes for the disappearing SD media/ringtone issue that has been plaguing CM users for quite some time.  First of all, we'd like to thank all the users who submitted details, logs, and dumpstates concerning this problem (and others) and being patient while we investigated the cause.  In particular, if you sent any logs/dumpstates to us, we have read them, and although we haven't been able to personally respond to every report, we still want to say "thanks!" for your instrumental role in helping us figure it out.  So, thanks!

Second, this bug was a real challenge to fix.  Given the strange, "perfect storm"-like conditions needed for it to manifest, and the fact that it affects every CM device with removable storage, we feel it's worth a solid write-up.  So, here we go.

Symptoms

Users have been reporting at least since our CM9 alpha1 release that ringtones, notifications, and other media installed to the SD card occasionally "disappear".  In each instance, the actual SD card files remain intact, but they're lost from the media database that's used by Android for selecting ringtones, notifications, Gallery media, etc.  Usually the problem would be resolved in the short term by rebooting or manually forcing a media scan via Dev Tools.  However, the problem would inevitably come back for most users in a couple of days.  Meanwhile, other users (including myself) experienced lost SD media not even once.

Background

An Android provider application, MediaProvider, is responsible for maintaining databases that provide details on all media stored on the device.  There are separate databases for media on "internal" and "external" (e.g., SD card) storage volumes.

When a device is first booted, the MediaScannerService runs to create (if it doesn't already exist) and populate the "internal" media database by scanning every file in /system/media/.  The MediaScannerService also runs every time an SD card is mounted on the device where it creates and populates the "external" media database by scanning every file in /mnt/sdcard/ (except for directories containing a .nomedia file).

There's three other important features to note about MediaProvider:
  1. While a "card-wide" scan is performed on boot and insertion of SD media, individual applications can register new media files in the media database, at any time, by calling the MediaScannerConnection::scanFile method.
  2. MediaProvider is designed to support multiple "external" media databases on devices that feature removable storage.  Specifically, it's designed to maintain the three most recent external databases, one for each of the three most-recently inserted SD cards, so that one could cycle through a few different SD cards on their device without having to rebuild the database each time.
  3. MediaProvider runs in its own process, "android.process.media", which like any non-critical process, is subject to being killed by ActivityManager so that its resources may be reclaimed.

Bug #1: Google+'s Repeated Registration of "hangout_ringtone.m4a"

It turns out that the trigger for SD media disappearance, in every case we've observed, is actually due to the "odd behavior" of a proprietary third-party application, Google+.  In particular, Google+ installs a media file, "hangout_ringtone.m4a", to /mnt/sdcard/Notifications/ on every (or at least, most) device boot-ups.  Its procedure is to first copy "hangout_ringtone.m4a" from its internal resources to /mnt/sdcard/Notifications/, then call the MediaScannerConnection::scanFile method on the new file to register it in the media database.

We're not exactly certain why Google+ does this on each boot-up, but we do know that often when the copy routine is run, Vold is still performing a file-system consistency check on the SD card.  So the SD card has yet to mount, and thus, the "hangout_ringtone.m4a" appears to not be already installed.

It's probably a bug that Google+ doesn't check Environment::getExternalStorageState to see if external media is mounted yet before performing the file copy.  However, it's definitely a bug that on receipt on a FileNotFoundException during the copy, that Google+ calls MediaScannerConnection::scanFile on the (non-existent) file anyways.

Now, this kind of bug would normally be harmless in the absence of additional bugs within Android itself.  Certainly the operating system should be able to tolerate applications doing "odd", but otherwise harmless activities like attempting to media scan a non-existent file.  However, this bug is still notable in understanding the underlying problem as it serves as a trigger condition for underlying OS bugs that we would otherwise not observe in the absence of this program.  In particular, CM users with Google+ installed appear to have experience disappearing SD media rather frequently, whereas users without Google+ experienced it never.  Even among us developers and testers, it was roughly a 50/50 split of who experienced the problem and who didn't, leaving us a bit baffled.

Bug #2: Media Scanning of (Non-Existent) Files, When the SD Card isn't Mounted, Has the Side Effect of Creating a Second External Media Database

A private method, MediaProvider::attachVolume is called anytime a media scan is performed, in order to "attach" to either the internal, or an external media database, creating it if necessary, prior to performing the scan.  When attachVolume is called on behalf of an application invoking the MediaScannerConnection::scanFile routine, there's no explicit check if external media is mounted before attaching to the external media database.  For the most part, this isn't problematic itself as the database is eventually created and written to storage once the media is mounted.

However, on devices featuring removable storage, the attachVolume method selects a unique database name, "external-XXXXXXXX.db" where "XXXXXXXX" is the hexadecimal representation of the SD card's FAT volume ID, e.g., "external-6997f084.db".  The method responsible for looking up the volume ID, FileUtils::getFatVolumeId, returns the ID as an integer quality, and may return "-1" in the event of an error.  Unfortunately, attachVolume doesn't check for this error condition, and thus, if the method is called when an SD card is not (yet) mounted, it will attach to (and later create if necessary) the database "external-ffffffff.db", where "ffffffff" is the 32-bit hexadecimal representation of "-1".

(An aside: It's not clear me that 0xffffffff isn't a valid FAT volume ID that could be present on an SD card, particularly one that's not formatted by Android.  However, Android implicitly treats it as an invalid ID by reserving that value for the condition of FileUtils.getFatVolumeId's returning error.)

Now, if the above problem were the only media database-related bug in Android, it wouldn't be a particularly big issue.  The "wrong" database is only created when an SD card isn't mounted, and the media scanner scans the entire contents of the SD card on media mount events.  Furthermore, MediaProvider is designed to maintain the three most-recently used databases, so even if the "right" database were to be attached on a subsequent reboot (or other) event, it would only be slightly stale and the mount-time SD card media scan would bring it up to date.

Bug #3: The Media Database Garbage Collector is, Well, Broken

Despite the fact that MediaProvider is supposed to preserve the three most-recently used external media databases, and the fact that Bug #2 erroneously creates only two such databases with a single SD card, we observed that these external databases were being created fresh on attach events, implying that the older of the two databases was constantly being deleted.

The method responsible for deleting old databases, MediaProvider.DatabaseHelper::onCreate, implements an LRU (least-recently-used) garbage collection policy.  The first thing it does is call android.app.Context::databaseList to find the list of MediaProvider's databases.  The databaseList method, as implemented by android.app.ContextImpl::databaseList, literally returns the list of files in /data/data/com.android.providers.media/databases/.  Next, the garbage collector checks the list for databases last modified over two months ago and deletes them.  Finally, it checks the remaining external databases, preserves the three most-recently used (including the one being attached) and deletes the rest.

Unfortunately, the directory /data/data/com.android.providers.media/databases/ contains SQLite temporary files (e.g., ".db-shm" and ".db-wal" files) in addition to the SQLite databases themselves.  These files are returned by the ContextImpl::databaseList method, and are considered as distinct "databases" subject to the garbage collector's deletion policies.

It turns out this breaks the garbage collector in at least two ways.  The first is that the garbage collector may delete one or more of these temporary files, including those associated with the actively-in-use "internal" database.  The other is that these extra files inflate the total number of databases, and so external databases end up being deleted well before the three-database limit is reached.  In short, the garbage collector code works only, and coincidentally, in the single circumstance where exactly one external database exists alongside the internal database.  In the presence of two or more external databases, which we have due to bug #2, the older of the two is inevitably deleted.

(Aside #2: Although we consider bug #3 to exist within MediaProvider's garbage collector, the "actual" bug may lie in the ContextImpl::databaseList and ContextImpl::deleteDatabase methods, depending on one's perspective.  It's quite reasonable to assume that these methods should return only actual databases, hiding their associated temporary files and handling their deletion internally.  If these methods, which are part of Android's public API and used by third-party applications, were to be modified in this fashion, then the MediaProvider database garbage collector would be correct as is.)

The "Perfect Storm" Condition

Even with the above two OS bugs, if MediaProvider were to only attach once to an external database for the duration that SD media is mounted, these problems wouldn't be observed by users.  That's because, regardless of which database is attached to, and whether it's up-to-date, stale, or brand new, the mount-time SD card media scan would still bring it up to date.

For SD media to disappear, MediaProvider has to attach to external databases at least twice during the same mount period, which as it turns out, does happen.  The exact sequence of events that results in the disappearing of SD media is as follows:
  • Shortly after boot, a third-party application (e.g., Google+) issues a MediaScannerConnection::scanFile on a (non-existent) SD card media file, prior to the SD card being mounted (bug #1).
  • When the scanFile call is issued, MediaProvider attaches to the wrong external database, "external-ffffffff.db", due to the missing error check on the FileUtils.getFatVolumeId call (bug #2).
  • Once the SD card is mounted, MediaProvider creates the "external-ffffffff.db" database and populates it by scanning every file in /mnt/sdcard/.
  • When creating the new, wrong, "external-ffffffff.db" database, the garbage collector deletes the old, right, "external-XXXXXXXX.db" database (bug #3).
  • Some time later, the MediaProvider process is killed by ActivityManager so that its resources may be reclaimed ("perfect storm" condition).
  • Later yet, a new MediaProvider process is spawned to handle a media database query for an incoming call, notification, etc., at which point SD media disappears.
Now, in the last event, since the SD card is already mounted, the new MediaProvider instance attaches to the right, previously-deleted external database "external-XXXXXXXX.db" and creates it.  Since this happens outside a media mount event, the media scanner does not run and the database remains empty.  And the database will continue to remain empty while processing subsequent ringtone, notification, and other media events until reboot or a forced media scan.

So that's the story on disappearing SD media, in a nutshell. =)

Aftermath

We've created two issues in CM gerrit that patch bugs #2 and #3.  We're currently testing these patches and expect them to be included in the Epic CM9 alpha6 release.  Once they're committed from gerrit, they should make it into following round of CM9 nightlies for other devices.  We're also looking into backporting these patches to CM7 as well.  Hopefully, that'll be the end of it.

If you appreciate our work, please support the Epic CM team.  Follow us on Google+, Twitter or Facebook.

9 comments:

  1. mkasick, you guys are brilliant. I've often wondered how many devs know what's going on under the hood, and you guys obviously do.

    I also commend you for the clear way in which you were able to spell this out. Bravo! I'm sure you've helped out more than our little corner of the Android arena.

    Three cheers for the CM team.

    ReplyDelete
  2. I have to give much praise for catching that one.

    I want to add though, it's not just a perfect storm. I'd call it a shell game played from inside a dark windowless room inside an unstable pirate ship DURING a perfect storm in the Bermuda Triangle...But I guess you can downplay it if you like ;)

    ReplyDelete
  3. You guys are amazing. The Epic 4G Touch is a great phone but I miss the developers for the OG Epic more than anything else.

    ReplyDelete
  4. Mkasick, you are the rare example of someone who is able to understand and solve a complicated technical issue and then turn around and explain it in terms that anyone can understand. A double threat, so to speak. You, and the whole team, never cease to amaze me.

    ReplyDelete
  5. Great job guys! That's ridiculous. One always assumes that the API/OS you are dealing with is not the problem or others would have made a stink about it already. Sometimes though, it really is the operating system that's wrong.

    ReplyDelete
    Replies
    1. Well, ICS /is/ the only OS that results in random reboots for me when running stock -_-

      Delete
  6. This comment has been removed by a blog administrator.

    ReplyDelete
  7. I just figured out that hangout_ringtone.m4a was the issue with dumping the media scanner on Saturday. I wish I had posted about it sooner. :-)

    At any rate, I'm testing putting a dummy file in its place hoping that the Google+ init script is dumb enough to NOT -f the copy...


    M.

    ReplyDelete
  8. This is a very good fix! I'm eagerly waiting for the day when CM9 is out of beta (releasing on time? ;-) to install on my Epic 4G. I might even have some cake after the install...

    Currently I'm running stock rooted Gingerbread and it periodically forgets the custom ringtones on the SD except for the one I have as default, which is one I made from a recording of a real Western Electric phone bell ringing. (Why? Because I'm 40+ and a phone ought to sound like a phone.)

    I don't have Google+ so it can't be that causing the bug with FroYo and GB on this phone.

    No problem now since ICS *won't* have this bug on this phone!

    ReplyDelete