Files are fraught with peril

This is a psuedo-transcript for a talk given at Deconstruct 2019. To make this accessible for people on slow connections as well as people using screen readers, the slides have been replaced by in-line text (the talk has ~120 slides; at an average of 20 kB per slide, that's 2.4 MB. If you think that's trivial, consider that half of Americans still aren't on broadband and the situation is much worse in developing countries.

Let's talk about files! Most developers seem to think that files are easy. Just for example, let's take a look at the top reddit r/programming comments from when Dropbox announced that they were only going to support ext4 on Linux (the most widely used Linux filesystem). For people not familiar with reddit r/programming, I suspect r/programming is the most widely read English language programming forum in the world.

The top comment reads:

I'm a bit confused, why do these applications have to support these file systems directly? Doesn't the kernel itself abstract away from having to know the lower level details of how the files themselves are stored?

The only differences I could possibly see between different file systems are file size limitations and permissions, but aren't most modern file systems about on par with each other?

The #2 comment (and the top replies going two levels down) are:

#2: Why does an application care what the filesystem is?

#2: Shouldn't that be abstracted as far as "normal apps" are concerned by the OS?

Reply: It's a leaky abstraction. I'm willing to bet each different FS has its own bugs and its own FS specific fixes in the dropbox codebase. More FS's means more testing to make sure everything works right . . .

2nd level reply: What are you talking about? This is a dropbox, what the hell does it need from the FS? There are dozenz of fssync tools, data transfer tools, distributed storage software, and everything works fine with inotify. What the hell does not work for dropbox exactly?

another 2nd level reply: Sure, but any bugs resulting from should be fixed in the respective abstraction layer, not by re-implementing the whole stack yourself. You shouldn't re-implement unless you don't get the data you need from the abstraction. . . . DropBox implementing FS-specific workarounds and quirks is way overkill. That's like vim providing keyboard-specific workarounds to avoid faulty keypresses. All abstractions are leaky - but if no one those abstractions, nothing will ever get done (and we'd have billions of "operating systems").

In this talk, we're going to look at how file systems differ from each other and other issues we might encounter when writing to files. We're going to look at the file "stack" starting at the top with the file API, which we'll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we'll see has serious bugs that cause data loss and data corruption; and then we'll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.

File API

Writing one file

Let's say we want to write a file safely, so that we don't want to get data corruption. For the purposes of this talk, this means we'd like our write to be "atomic" -- our write should either fully complete, or we should be able to undo the write and end up back where we started. Let's look at an example from Pillai et al., OSDI’14.

We have a file that contains the text a foo and we want to overwrite foo with bar so we end up with a bar. We're going to make a number of simplifications. For example, you should probably think of each character we're writing as a sector on disk (or, if you prefer, you can imagine we're using a hypothetical advanced NVM drive). Don't worry if you don't know what that means, I'm just pointing this out to note that this talk is going to contain many simplifications, which I'm not going to call out because we only have twenty-five minutes and the unsimplified version of this talk would probably take about three hours.

To write, we might use the pwrite syscall. This is a function provided by the operating system to let us interact with the filesystem. Our invocation of this syscall looks like:

pwrite(
  [file], 
  “bar”, // data to write
  3,     // write 3 bytes
  2)     // at offset 2

pwrite takes the file we're going to write, the data we want to write, bar, the number of bytes we want to write, 3, and the offset where we're going to start writing, 2. If you're used to using a high-level language, like Python, you might be used to an interface that looks different, but underneath the hood, when you write to a file, it's eventually going to result in a syscall like this one, which is what will actually write the data into a file.

If we just call pwrite like this, we might succeed and get a bar in the output, or we might end up doing nothing and getting a foo, or we might end up with something in between, like a boo, a bor, etc.

What's happening here is that we might crash or lose power when we write. Since pwrite isn't guaranteed to be atomic, if we crash, we can end up with some fraction of the write completing, causing data corruption. One way to avoid this problem is to store an "undo log" that will let us restore corrupted data. Before we're modify the file, we'll make a copy of the data that's going to be modified (into the undo log), then we'll modify the file as normal, and if nothing goes wrong, we'll delete the undo log.

If we crash while we're writing the undo log, that's fine -- we'll see that the undo log isn't complete and we know that we won't have to restore because we won't have started modifying the file yet. If we crash while we're modifying the file, that's also ok. When we try to restore from the crash, we'll see that the undo log is complete and we can use it to recover from data corruption:

creat(/d/log) // Create undo log
write(/d/log, "2,3,foo", 7) // To undo, at offset 2, write 3 bytes, "foo"
pwrite(/d/orig, “bar", 3, 2) // Modify original file as before
unlink(/d/log) // Delete log file

If we're using ext3 or ext4, widely used Linux filesystems, and we're using the mode data=journal (we'll talk about what these modes mean later), here are some possible outcomes we could get:

d/log: "2,3,f"
d/orig: "a foo"

d/log: ""
d/orig: "a foo"

It's possible we'll crash while the log file write is in progress and we'll have an incomplete log file. In the first case above, we know that the log file isn't complete because the file says we should start at offset 2 and write 3 bytes, but only one byte, f, is specified, so the log file must be incomplete. In the second case above, we can tell the log file is incomplete because the undo log format should start with an offset and a length, but we have neither. Either way, since we know that the log file isn't complete, we know that we don't need to restore.

Another possible outcome is something like:

d/log: "2,3,foo"
d/orig: "a boo"

d/log: "2,3,foo"
d/orig: "a bar"

In the first case, the log file is complete we crashed while writing the file. This is fine, since the log file tells us how to restore to a known good state. In the second case, the write completed, but since the log file hasn't been deleted yet, we'll restore from the log file.

If we're using ext3 or ext4 with data=ordered, we might see something like:

d/log: "2,3,fo"
d/orig: "a boo"

d/log: ""
d/orig: "a bor"

With data=ordered, there's no guarantee that the write to the log file and the pwrite that modifies the original file will execute in program order. Instesad, we could get

creat(/d/log) // Create undo log
pwrite(/d/orig, “bar", 3, 2) // Modify file before writing undo log!
write(/d/log, "2,3,foo", 7) // Write undo log
unlink(/d/log) // Delete log file

To prevent this re-ordering, we can use another syscall, fsync. fsync is a barrier (prevents re-ordering) and it flushes caches (which we'll talk about later).

creat(/d/log)
write(/d/log, “2,3,foo”, 7)
fsync(/d/log) // Add fsync to prevent re-ordering
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig) // Add fsync to prevent re-ordering
unlink(/d/log)

This works with ext3 or ext4, data=ordered, but if we use data=writeback, we might see something like:

d/log: "2,3,WAT"
d/orig: "a boo"

Unfortunately, with data=writeback, the write to the log file isn't guaranteed to be atomic and the filesystem metadata that tracks the file length can get updated before we've finished writing the log file, which will make it look like the log file contains whatever bits happened to be on disk where the log file was created. Since the log file exists, when we try to restore after a crash, we may end up "restoring" random garbage into the original file. To prevent this, we can add a checksum (a way of making sure the file is actually valid) to the log file.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7) // Add checksum to log file to detect incomplete log file
fsync(/d/log)
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

This should work with data=writeback, but we could still see the following:

d/orig: "a boo"

There's no log file! Although we created a file, wrote to it, and then fsync'd it. Unfortunately, there's no guarantee that the directory will actually store the location of the file if we crash. In order to make sure we can easily find the file when we restore from a crash, we need to fsync the parent of the newly created log.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7)
fsync(/d/log)
fsync(/d) /// fsync parent directory
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

There are a couple more things we should do. We shoud also fsync after we're done (not shown), and we also need to check for errors. These syscalls can return errors and those errors need to be handled appropriately. There's at least one filesystem issue that makes this very difficult, but since that's not an API usage thing per se, we'll look at this again in the Filesystems section.

We've now seen what we have to do to write a file safely. It might be more complicated than we like, but it seems doable -- if someone asks you to write a file in a self-contained way, like an interview question, and you know the appropriate rules, you can probably do it correctly. But what happens if we have to do this as a day-to-day part of our job, where we'd like to write to files safely every time to write to files in a large codebase.

API in practice

Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.

When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LBDM, etc., know more about filesystems than the vast majority programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why the file API so hard to use that even experts make mistakes?

Concurrent programming is hard

There are a number of reasons for this. If you ask people "what are hard problems in programming?", you'll get answers like distributed systems, concurrent programming, security, aligning things with CSS, dates, etc.

And if we look at what mistakes cause bugs when people do concurrent programming, we see bugs come from things like "incorrectly assuming operations are atomic" and "incorrectly assuming operations will execute in program order". These things that make concurrent programming hard also make writing files safely hard -- we saw examples of both of these kinds of bugs in our first example. More generally, many of the same things that make concurrent programming hard are the same things that make writing to files safely hard, so of course we should expect that writing to files is hard!

Another property writing to files safely shares with concurrent programming is that it's easy to write code that has infrequent, non-deterministc failures. With respect to files, people will sometimes say this makes things easier ("I've never noticed data corruption", "your data is still mostly there most of the time", etc.), but if you want to write files safely because you're working on software that shouldn't corrupt data, this makes things more difficult by making it more difficult to tell if your code is really correct.

API inconsistent

As we saw in our first example, even when using one filesystem, different modes may have significantly different behavior. Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using ext3 or ext4 with data=writeback, or ext2 in any mode and directory operations can't be re-ordered w.r.t. any other operations, except on btrfs. In theory, we should all read the POSIX spec carefully and make sure all our code is valid according to POSIX, but if they check filesystem behavior at all, people tend to code to what their filesystem does and not some abtract spec.

If we look at one particular mode of one filesystem (ext4 with data=journal), that seems relatively possible to handle safely, but when writing for a variety of filesystems, especially when handling filesystems that are very different from ext3 and ext4, like btrfs, it becomes very difficult for people to write correct code.

Docs unclear

In our first example, we saw that we can get different behavior from using different data= modes. If we look at the manpage (manual) on what these modes mean in ext3 or ext4, we get:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

If you want to know how to use your filesystem safely, and you don't already know what a journaling filesystem is, this definitely isn't going to help you. If you know what a journaling filesystem is, this will give you some hints but it's still not sufficient. It's theoretically possible to figure everything out from reading the source code, but this is pretty impractical for most people who don't already know how the filesystem works.

For English-language documentation, there's lwn.net and the Linux kernel mailing list (LKML). LWN is great, but they can't keep up with everything, so LKML is the place to go if you want something comprehensive. Here's an example of an exchange on LKML about filesystems:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
FS dev: Probably from some 6-8 years ago, in e-mail postings that I made.

While the filesystem developers tend to be helpful and they write up informative responses, most people probably don't keep up with the past 6-8 years of LKML.

Performance / correctness conflict

Another issue is that the file API has an inherent conflict between performance and correctness. We noted before that fsync is a barrier (which we can use to enforce ordering) and that it flushes caches. If you've ever worked on the design of a high-performance cache, like a microprocessor cache, you'll probably find the bundling of these two things into a single primitive to be unusual. A reason this is unusual is that flushing caches has a significant performance cost and there are many cases where we want to enforce ordering without paying this performance cost. Bundling these two things into a single primitive forces us to pay the cache flush cost when we only care about ordering.

Chidambaram et al., SOSP’13 looked at the performance cost of this by modifying ext4 to add a barrier mechanism that doesn't flush caches and they found that, if they modified software appropriately and used their barrier operation where a full fsync wasn't necessary, they were able to achieve performance roughly equivalent to ext4 with cache flushing entirely disabled (which is unsafe and can lead to data corruption) without sacrificing safety. However, making your own filesystem and getting it adopted is impractical for most people writing user-level software. Some databases will bypass the filesystem entirely or almost entirely, but this is also impractical for most software.

That's the file API. Now that we've seen that it's extraordinarily difficult to use, let's look at filesystems.

Filesystem

If we want to make sure that filessystems work, one of the most basic tests we could do is to inject errors are the layer below the filesystem to see if the filesystem handles them properly. For example, on a write, we could have the disk fail to write the data and return the appropriate error. If the filesystem drops this error or doesn't handle ths properly, that means we have data loss or data corruption. This is analogous to the kinds of distributed systems faults Kyle Kingsbury talked about in his distributed systems testing talk yesterday (although these kinds of errors are much more straightforward to test).

Prabhakaran et al., SOSP’05 did this and found that, for most filesystems tested, almost all write errors were dropped. The major exception to this was on ReiserFS, which did a pretty good job with all types of errors tested, but ReiserFS isn't really used today for reasons beyond the scope of this talk.

We (Wesley Aptekar-Cassels and I) looked at this again in 2017 and found that things had improved significantly. Most filesystems (other than JFS) could pass these very basic tests on error handling.

Another way to look for errors is to look at filesystems code to see if it handles internal errors correctly. Gunawai et al., FAST’08 did this and found that internal errors were dropped a significant percentage of the time. The technique they used made it difficult to tell if functions that could return many different errors were correctly handling each error, so they also looked at calls to functions that can only return a single error. In those cases, depending on the function, errors were dropped roughly 2/3 to 3/4 of the time, depending on the function.

Wesley and I also looked at this again in 2017 and found significant improvement -- errors for the same functions Gunawi et al. looked at were "only" ignored 1/3 to 2/3 of the time, depending on the function.

Gunawai et al. also looked at comments near these dropped errors and found comments like "Just ignore errors at this point. There is nothing we can do except to try to keep going." (XFS) and "Error, skip block and hope for the best." (ext3).

Now we've seen that while filesystems used to drop even the most basic errors, they now handle then correctly, but there are some code paths where errors can get dropped. For a concrete example of a case where this happens, let's look back at our first example. If we get an error on fsync, unless we have a pretty recent Linux kernel (Q2 2018-ish), there's a pretty good chance that the error will be dropped and it may even get reported to the wrong process!

On recent Linux kernels, there's a good chance the error will be reported (to the correct process, even). Wilcox, PGCon’18 notes that an error on fsync is basically unrecoverable. The details for depending on filesystem -- on XFS and btrfs, modified data that's in the filesystem will get thrown away and there's no way to recover. On ext4, the data isn't thrown away, but it's marked as unmodified, so the filesystem won't try to write it back to disk later, and if there's memory pressure, the data can be thrown out at any time. If you're feeling adventurous, you can try to recover the data before it gets thrown out with various tricks (e.g., by forcing the filesystem to mark it as modified again, or by writing it out to another device, which will force the filesystem to write the data out even though it's marked as unmodified), but there's no guarantee you'll be able to recover the data before it's thrown out. On Linux ZFS, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.

In general, there isn't a good way to recover from this on Linux. Postgres, MySQL, and MongoDB (widely used databases) will crash themselves and the user is expected to restore from the last checkpoint. Most software will probably just silently lose or corrupt data. And fsync is a relatively good case -- for example, syncfs simply doesn't return errors on Linux at all, leading to silent data loss and data corruption.

BTW, when Craig Ringer first proposed that Postgres should crash on fsync error, the first response on the Postgres dev mailing list was:

Surely you jest . . . If [current behavior of fsync] is actually the case, we need to push back on this kernel brain damage

But after talking through the details, everyone agreed that crashing was the only good option. One of the many unfortunate things is that most disk errors are transient. Since the filesystem discards critical information that's necessary to proceed without data corruption on any error, transient errors that could be retried instead force software to take drastic measures.

And while we've talked about Linux, this isn't unique to Linux. Fsync error handling (and error handling in general) is broken on many different operating systems. At the time Postgres "discovered" the behavior of fsync on Linux, FreeBSD had arguably correct behavior, but OpenBSD and NetBSD behaved the same as Linux (true error status dropped, retrying causes success response, data lost). This has been fixed on OpenBSD and probably some other BSDs, but Linux still basically has the same behavior and you don't have good guarantees that this will work on any random UNIX-like OS.

Now that we've seen that, for many years, filesystems failed to handle errors in some of the most straightforward and simple cases and that there are cases that still aren't handled correctly today, let's look at disks.

Disk

Flushing

We've seen that it's easy to not realize we have to call fsync when we have to call fsync, and that even if we call fsync appropriately, bugs may prevent fsync from actually working. Rajimwale et al., DSN’11 into whether or not disks actually flush when you ask them to flush, assuming everything above the disk works correctly (their paper is actually mostly about something else, they just discuss this briefly at the beginning). Someone from Microsoft anonymously told them "[Some disks] do not allow the file system to force writes to disk properly" and someone from Seagate, a disk manufacturer, told them "[Some disks (though none from us)] do not allow the file system to force writes to disk properly". Bairavasundaram et al., FAST’07 also found the same thing when they looked into disk reliability.

Error rates

We've seen that filessystems sometimes don't handle disk errors correctly. If we want to know how serious this issue is, we should look at the rate at which disks emit errors. Disk datasheets will usually an uncorrectable bit error rate of 1e-14 for consumer HDDs (often called spinning metal or spinning rust disks), 1e-15 for enterprise HDDs, 1e-15 for consumer SSDs, and 1e-16 for enterprise SSDs. This means that, on average, we expect to see one unrecoverable data error every 1e14 bits we read on an HDD.

To get an intuition for what this means in practice, 1TB is now a pretty normal disk size. If we read a full drive once, that's 1e12 bytes, or almost 1e13 bits (technically 8e12 bits), which means we should see, in expectation, one unrecoverable if we buy a 1TB HDD and read the entire disk ten-ish times. Nowadays, we can buy 10TB HDDs, in which case we'd expect to see an error (technically, 8/10th errors) on every read of an entire consumer HDD.

In practice, observed data rates are are significantly higher. Narayanan et al., SYSTOR’16 (Microsoft) observed SSD error rates from 1e-11 to 6e-14, depending on the drive model. Meza et al., SIGMETRICS’15 (FB) observed even worse SSD error rates, 2e-9 to 6e-11 depending on the model of drive. Depending on the type of drive, 2e-9 is 2 gigabits, or 250 MB, 500 thousand to 5 million times worse than stated on datasheets depending on the class of drive.

Bit error rate is arguably a bad metric for disk drives, but this is the metric disk vendors claim, so that's what we have to compare against if we want an apples-to-apples comparison. See Bairavasundaram et al., SIGMETRICS'07, Schroeder et al., FAST'16, and others for other kinds of error rates.

One thing to note is that it's often claimed that SSDs don't have problems with corruption because they use error correcting codes (ECC), which can fix data corruption issues. "Flash banishes the specter of the unrecoverable data error", etc. The thing this misses is that modern high-density flash devices are very unreliable and need ECC to be usable at all. Grupp et al., FAST’12 looked at error rates of the kind of flash the underlies SSDs and found errors rates from 1e-1 to 1e-8. 1e-1 is one error every ten bits, 1e-8 is one error every 100 megabits.

Power loss

Another claim you'll hear is that SSDs are safe against power loss and some types of crashes because they now have "power loss protection" -- there's some mechanism in the SSDs that can hold power for long enough during an outage that the internal SSD cache can be written out safely.

Luke Leighton tested this by buying 6 SSDs that claim to have power loss protection and found that four out of the six models of drive he tested failed (every drive that wasn't an Intel drive). If we look at the details of the tests, when drives fail, it appears to be because they were used in a way that the implementor of power loss protection didn't expect (writing "too fast", although well under the rate at which the drive is capable of writing, or writing "too many" files in parallel). When a drive advertises that it has power loss protection, this appears to mean that someone spent some amount of effort implementing something that will, under some circumstances, prevent data loss or data corruption under power loss. But, as we saw in Kyle's talk yesterday on distributed systems, if you want to make sure that the mechanism actually works, you can't rely on the vendor to do rigorous or perhaps even any semi-serious testing and you have to test it yourself.

Retention

If we look at SSD datasheets, a young-ish drive (one with 90% of its write cycles remaining) will usually be specced to hold data for about ten years after a write. If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive. I think people are often surprised to find that it's within spec for a drive to lose data three months after the data is written.

These numbers all come from datasheets and specs, as we've seen, datasheets can be a bit optimistic. On many early SSDs, using up most or all of a drives write cycles would cause the drive to brick itself, so you wouldn't even get the spec'd three month data retention.

Corollaries

Now that we've seen that there are significant problems at every level of the file stack, let's look at a couple things that follow from this.

What to do?

What we should do about this is a big topic, in the time we have left, one thing we can do instead of writing to files is to use databases. If you want something lightweight and simple that you can use in most places you'd use a file, SQLite is pretty good. I'm not saying you should never use files. There is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, considering using a database to store data instead of using files.

FS support

At the start of this talk, we looked at this Dropbox example, where most people thought that there was no reason to remove support for most Linux filesystems because filesystems are all the same. I believe their hand was forced by the way they want to store/use data, which they can only do with ext given how they're doing things (which is arguably a mis-feature), but even if that wasn't the case, perhaps you can see why software that's attempting to sync data to disk reliably and with decent performance might not want to support every single filesystem in the universe for an OS that, for their product, is relatively niche. Maybe it's worth supporting every filesystem for PR reasons and then going through the contortions necessary to avoid data corruption on a per-filesystem basis (you can try coding straight to your reading of the POSIX spec, but as we've seen, that won't save you on Linux), but the PR problem is caused by a misunderstanding.

The other comment we looked at on reddit, and also a common sentiment, is that it's not a program's job to work around bugs in libraries or the OS. But user data gets corrupted regardless of who's "fault" the bug is, and as we've seen, bugs can persist in the filesystem layer for many years. In the case of Linux, most filesystems other than ZFS seem to have decided it's correct behavior to throw away data on fsync error and also not report that the data can't be written (as opposed to FreeBSD or OpenBSD, where most filesystems will at least report an error on subsequent fsyncs if the error isn't resolved). This is arguably a bug and also arguably correct behavior, but either way, if your software doesn't take this into account, you're going to lose or corrupt data. If you want to take the stance that it's not your fault that the filesystem is corrupting data, your users are going to pay the cost for that.

FAQ

While putting this talk to together, I read a bunch of different online discussions about how to write to files safely. For discussions outside of specialized communities (e.g., LKML, the Postgres mailing list, etc.), many people will drop by to say something like "why is everyone making this so complicated? You can do this very easily and completely safely with this one weird trick". Let's look at the most common "one weird trick"s from two thousand internet comments on how to write to disk safely.

Rename

The most frequently mentioned trick is to rename instead of overwriting. If you remember our single-file write example, we made a copy of the data that we wanted to overwrite before modifying the file. The trick here is to do the opposite:

Make a copy of the entire file
Modify the copy
Rename the copy on top of the original file

This trick doesn't work. People seem to think that this is safe becaus the POSIX spec says that rename is atomic, but that only means rename is atomic with respect to normal operation, that doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.

The most mainstream exception where rename is atomic on crash is probably btrfs, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, rename is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on btrfs, some quite old and some introduced the same year as the paper, so you want not want to rely on this without extensive testing, even if you're writing btrfs specific code.

And even if this worked, the performance of this technique is quite poor.

Append

The second most frequently mentioned trick is to only ever append (instead of sometimes overwriting). This also doesn't work. As noted in Pillai et al., OSDI’14 and Bornholt et al., ASPLOS’16, appends don't guarantee ordering or atomicity and believing that appends are safe is the cause of some bugs.

One weird tricks

We've seen that the most commonly cited simple tricks don't work. Something I find interesting is that, in these discussions, people will drop into a discussion where it's already been explained, often in great detail, why writing to files is harder than someone might naively think, ignore all warnings and explanations and still proceed with their explanation for why it's, in fact, really easy. Even when warned that files are harder than people think, people still think they're easy!

Conclusion

In conclusion, computers don't work (but you probably already know this if you're here at Gary-conf). This talk happened to be about files, but there are many areas we could've looked into where we would've seen similar things.

One thing I'd like to note before we finish is that, IMO, the underlying problem isn't technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They'll make sure that they have disks where power loss protection actually work, they'll have patches into the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It's hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.

If you ask someone who works on that kind of thing why they spend mind boggling sums of money to ensure (or really, increase the probability of) correctness, you'll often get an answer like "we have a zillion machines and if you do the math on the rate of data corruption, if we didn't do all of this, we'd have data corruption every minute of every day. It would be totally untenable". A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they're typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without "extreme" protections, we should expect to see data corruption all the time.

But if we look at how consumer software works, it's usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that's data on who's likely to click on which ads or user emails, the company pays the cost, directly or indirectly and the cost is large enough that it's obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they're mostly not sophisticated enough to know who's at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same -- of course we should put more effort into protecting data on consumer machines. But if we're a company that's locally optimizing for our own benefit, the math works out differently and maybe it's not worth it to spend a lot of effort on avoiding data corruption.

Yesterday, Ramsey Nasser gave a talk where he made a very compelling case that something was a serious problem, which was followed up by a comment that his proposed solution will have a hard time getting adoption. I agree with both parts -- he discussed an important problem, and it's not clear how solving that problem will make anyone a lot of money, so the problem is likely to go unsolved.

With GDPR, we've seen that regulation can force tech companies to protect people's privacy in a way they're not naturally inclined to do, but regulation is a very big hammer and the unintended consequences can often negate or more than negative the benefits of regulation. When we look at the history of regulations that are designed to force companies to do the right thing, we can see that it's often many years, sometimes decades, before the full impact of the regulation is understood. Designing good regulations is hard, much harder than any of the technical problems we've discussed today.

Acknowledgements

Thanks to Leah Hanson, Gary Bernhardt, Kamal Marhubi, Rebecca Isaacs, Jesse Luehrs, Tom Crayford, Wesley Aptekar-Cassels, Rose Ames, chozu@fedi.absturztau.be, and Benjamin Gilbert for their help with this talk!

Sorry we went so fast. If there's anything you missed you can catch it in the pseudo-transcript at danluu.com/deconstruct-files.

This "transcript" is pretty rough since I wrote it up very quickly this morning before the talk. I'll try to clean it within a few weeks, which will include adding material that was missed, inserting links, fixing typos, adding references that were missed, etc.

Thanks to Anatole Shaw, Jernej Simoncic, @junh1024, Yuri Vishnevsky, and Josh Duff for comments/corrections/discussion on this transcript.