Maximizing Flash Lifetimes

Flash memory is cool to keep around because it’s non-volatile, it’s small, it’s convenient, and, dressed up in all manner of stickwear, well, who can deny that it’s just so darn cute. But peel back those outer attractive layers, and you find much complexity lurking within. For, while we like to think of them as just another memory that we can plug into our PCs, in fact they’re nothing like it. A surprising amount of work goes into writing and reading the gotcha picture you took of grandma stashing her weed below the false bottom of the cookie jar.

Unlike rotating or DRAM storage, we know we can’t write to Flash memory infinitely many times. At some point it craps out, and at some time before that it becomes unreliable. A given cell can be written only 100,000 or maybe a million times before it starts threatening to flake out. Does that mean we can write to our memory stick only a million times? (Horrors!) Well, no, actually.

The answer would be “yes” if we always wrote from the bottom up. Let’s say we took one picture of grandma every week and processed it (meaning we took it off the stick and lost it deep within the unfathomable depths of Photoshop, from which most pictures never re-emerge). Because we’re always putting the new picture where the old one was, we could do this for only 1,000,000 weeks. Which is, like, approximately, 19,230 years, 9 months, 7 days, 4 hours, 48 minutes, and 0 seconds. Um… ok, bad example. Let’s say it could be rewritten only 100,000 times. No, that’s still almost 2000 years.

Ah, OK. Here’s a better example. Let’s say I’m using Flash as a hard drive. And all I’m doing is saving my articles there. If I’m doing that manually, it will never wear out because I would always forget to save and be totally hosed when the power went out. Which can happen at any time here. Seriously, we’re in California. We can lose power because it’s hot. Or cold. Or because the wind blew. A little. Or not. Anyway, so thankfully Microsoft has noticed my slacking ways and does a background save for me every so often as long as I’m making changes. And I make a lot of changes. At this rate, I could use up a memory in a week. (For those of you keeping score, that’s approximately 1.653 changes per second. You’re welcome.)

Which would be a waste, since these articles aren’t that long (with apologies to those of you who think it’s gone on too long already), so most of the Flash cells would never be written to: while the bottom wears out, the top would still be fresh as a daisy. Instead, “wear-leveling” policies have been established to even out the usage. This could mean much longer cell life, since each cell isn’t being written to each time a write occurs. If a given cell is written only one in ten times, then the useful life of the memory has gone up by a factor of 10. Obviously the actual numbers here will vary by the kind of stuff being stored.

So now we need some magic to level the wear. But, on the other side of things, something tells me that Windows and Linux don’t really want to be bothered by the details of wear-leveling when writing files to storage that happens to be Flash. And even if they did, you don’t want them to, since that would mean you’d have to have special drivers for Flash, which might be different for different kinds or brands of Flash. Imagine plugging your memory stick in and getting a message like, “Unreadable media, install driver from mfr.” Only probably not quite so helpful. More likely it would say something like “Error. Solve problem and try again.” Or “FAILED CODE 4D65AB19”

Hopefully not lost in translation

So you Flash makers have kindly taken it upon yourselves to make your storage look like a hard disk to the operating system. That means you take on the burden of wear-leveling in the Flash Translation Layer (FTL). Which would sound simple enough: just a matter of adding circuitry to the Flash chip, perhaps some other chips to the module, and it’s straightforward. Right?

Wrong. In fact, new approaches to this are still being proposed. This thing is way more complicated than you would think. Or than I would think, anyway. You maybe already knew. The deal is, from an OS standpoint, you’re writing to some address – the logical block address (LBA). But because of the wear-leveling, that may end up pretty much anywhere in the Flash – there’s a separate physical block address (PBA). Exactly how the next physical address to be written is chosen is based on whatever the wear-leveling policy is; we won’t worry about that here. The trick we want to pay attention to is how you get from LBA to PBA when using the memory.

There’s one more complicating factor: Flash memories aren’t random-access. They’re divided into blocks, which are divided into pages. You read and write a page at a time; you erase a block at a time. So a physical location for reading and writing would consist of a block number and a page number within that block. When erasing, you’d be erasing every page in the block, which could be problematic if done willy-nilly. In addition, you can’t rewrite a page without erasing it. Since you have to erase an entire block, it’s better to “move” a page than actually rewrite it. So if some data was written in page x and then that data needs to be updated, page x is invalidated and the new data is written to page y. The translation mechanism has to keep track of these moving targets.

The oldest and most straightforward way of doing this is simply referred to as FTL. It, like all of these techniques, involves some DRAM and processing. The DRAM contains the mapping table from LBA to PBA. With FTL, a simple pair of numbers is stored, reflecting the block and page number for a given LBA. When trying to read an address, it’s pretty straightforward to go to the LBA in the table and from there go directly to the block and page; this happens quickly. If a page is rewritten (hence, if it moves), the table can simply be updated with the new page location.

This is all easy to manage while the system is up. But what happens when the power goes off? The Flash keeps its contents, of course, but the translation table dies. How to recover from that? Well, actually, each Flash page has “spare” storage. In addition to the basic data area, a page has extra cells for storing things like ECC bits and, more importantly here, the LBA to which the page pertains. So on power up, the management system can read through the Flash chip and rebuild the translation table.

So FTL is pretty straightforward, but it gets to be problematic with larger and larger memories. Because the table has to have one entry for each written page, it can get into the megabytes for large Flash memories. This is an issue for small, cheap implementations. So a new approach was devised, called NFTL (No, “N” doesn’t stand for “New”; it stands for “NAND.”) Where FTL is fine-grained, NFTL is coarse-grained; it has one entry per block, not per page. So the table gets much smaller.

Of course, this comes at a cost. Since each page isn’t specified, there’s no explicit way to note when a page is moved due to a rewrite. So the address translation becomes more complex. Each LBA gets a virtual block address (VBA); this location is derived by dividing the LBA by the number of pages in a block: the quotient becomes the VBA – that is, the address in the translation table, and the remainder becomes the offset into the block of the desired page. The VBA gets you to the right row in the translation table, where you find the actual target block address.

But there’s a catch here – since the table has only block information, it can’t store which pages have changed; it doesn’t have page-level granularity. Instead, two blocks are designated: a primary physical block address (PPBA) and a replacement physical block address (RPBA). The primary block is used the first time a particular page is written, and the actual page is found by indexing into the primary block by the offset. But when a rewrite happens, then the new write happens in the replacement block at the first available page; the offset isn’t used anymore.

This means that, when writing, first you have to check the primary block to see if it’s already written. If not, then write and you’re done. If so, then go to the replacement block and search through it linearly until you find the first available free cell and write it there, noting the LBA in the spare cells of the page. When reading, first you go to the primary block location and then scan through the replacement block, picking the most recently written page that has the desired LBA. Yeah, it takes a while. That’s the tradeoff for the smaller translation table.

There are also complications with the fact that, sooner or later, you will have filled your memory with rewrites, meaning that much of it is full of useless invalid data; you have to reclaim that at some point – your basic garbage collection job. Since most blocks will have a mixture of valid and invalid data, there aren’t many blocks you can just erase; you have to copy the valid stuff into new blocks so that you can reclaim the old blocks.

This is required for both FTL and NFTL, but the nuances are such that you can end up having less available space before garbage collection with NFTL than with FTL. For example, if you had only one page being rewritten a bunch of times for a block, then it would be written the first time into the primary block and then over and over in the replacement block until the replacement block was full. Recycling of both the primary and replacement blocks then happens even though the primary block had only one page written; the other pages were still empty. Some folks might consider that wasteful.

Splitting the difference

In order to improve upon NFTL and strike a happy medium between it and FTL, two new hybrid approaches have been put forth. One is called “Adaptive” FTL (AFTL), from Prof. Tei-Wei Kuo et al at National Taiwan University; the other, published this last December, is called “Clustered” FTL (CFTL), proposed by Park et al from Inha University in South Korea. The goal of both is to improve performance (in terms of read and write times as well as utilization) and keep the translation tables to a reasonable size.

Both use hash tables instead of linear tables for translation. This makes the table sizes independent of any physical memory attribute; the size can be adjusted for the best tradeoff with performance. AFTL uses one fine-grained hash table and one coarse-grained hash table. This gets a little hairy, so stick with me folks. To begin with, a coarse-grained approach is used. But when the replacement block fills up, it isn’t recycled right away. The theory is that any of the valid data in the replacement block is probably frequently-used data, so pointers to it are put in the fine-grained table, and the block is left alone. Subsequent reads of that data will go through the fine-grained table and will happen much more quickly. Meanwhile, the primary block remains in place, with, for the moment, no replacement block – speeding up reads from the primary block, since no replacement block has to be scanned. Of course, when another write happens to data in the primary block, then a new replacement block is allocated, but without erasing the primary block, so better usage can be made of the primary block.

In this manner, data that started out being addressed through the coarse-grained approach is switched to fine-grained. This can happen until the fine-grained table fills up. And here’s where we start borrowing from conventional memory caching techniques. A double-linked least-recently-used (LRU) list is maintained so that when the fine-grained table is full, the oldest data can be put back into the coarse-grained table. This means rewriting it (using the normal approach) and then nullifying the page pointed to by the fine-grained table.

In other words, when a coarse-grained replacement table is full, valid data is switched to the fine-grained approach. When the fine-grained table is full, then the oldest data is switched back to the coarse-grained approach. There’s overhead in all of this, but the intent is that there’s less overhead than in the plain-old coarse-grained approach.

The CFTL idea takes this one step further. Instead of one fine-grained table, there are two fine-grained tables: a short one for the most frequently accessed pages and a longer one for the next-most frequently accessed pages. “Clustered” hash tables are used, reducing the need for doubly-linked lists (and their attendant pointer storage). But along with the location information, the tables store hit counts and not-recently-used (NRU) information. When data is accessed in the coarse-grained table, the hit count is updated and, if it goes over a threshold, the data is moved into the long fine-grained table. If its hit count there goes over another threshold, it’s promoted to the short fine-grained table.

The NRU status is made up of a couple of bits indicating whether a page has been read or written to recently, as determined by a timer. If not, the page can be demoted back into the coarse-grained world. The details on all of this (hashing schemes, promotion and demotion thresholds, not to mention overall wear-leveling policies) get kind of mind-numbing, and I defer it to minds less numb than my own.

CFTL also does a couple other things. First, it keeps track of “continuity” in the coarse-grained hash table: when contiguous memory is stored, the first page can be put in the table along with the number of contiguous following pages so that each page doesn’t have to be explicitly tagged in the table, thus reducing the size of the table. Pre-fetching is also provided: when data from a contiguous set is requested, the entire “bucket” of data is pre-fetched so that, in the case where the next request is for a related piece of data, it’s already been fetched and can be accessed much more quickly.

So… in case you were getting bored with the standard old ways of managing Flash data, there are plenty of head-spinning new ideas to keep your neurons fresh and engaged. At about this point, my neurons haven’t been appropriately wear-leveled and are starting to show signs of data retention failure and random firing. Time to go see whether brain function relocation works as well as Flash block recycling.

Links:

ICCAD AFTL presentation (PPT)

ETRI Journal CFTL paper (PDF)