Communication Solves Flash Unpredictability

That flash thumb drive you have in your pocket is a beast of a memory. We store so much stuff in so little space; that form factor has completely democratized the storing of masses of data that would once have been assigned to a giant data center of yore. And all in your pocket now.

Of course, this comes at a cost. This is NAND flash, and when your main goal is to minimize memory area – and maximize capacity – you must trade some things off. Most notably, you can’t randomly access an individual cell – or word or any other small unit of data – for programming or erasing. (OK, you can’t erase a NOR flash memory randomly either, but you can program and read the cells randomly). You must handle NAND flash a page (for reading and programming) or a block (for erasing) at a time.

For you flash experts, this is obvious – as will be some of what follows. But let’s work through one aspect of what’s needed in order to motivate why NAND flash can be a challenge for safety-critical systems and why HCC Embedded has launched management software that can is suitable for use in such systems.

A Lot of Work for a Little Thing

All this page- and block-level stuff adds management overhead. Need to change a bit somewhere from a 0 back to a 1? (Yeah, the default “erased” state isn’t 0… it’s 1…) No can do, buckaroo… Changing a 0 to a 1 is erasing, and you do that only for an entire block. So what you end up having to do is to copy the entire block containing the one bit you want to change into a fresh new block – copying everything except for that one bit, of course. It’s already erased in the fresh block, and you want to keep it that way.

Wait – you say there are no free blocks left to put this in? Oh, well, in that case we need to look for a block that’s no longer used (just like our current block soon will be), erase it, and then do the copy thing.

Or, perhaps there are no available blocks? Then you need to copy the block out, erase the block, and then rewrite it the way you want it. (And hope that there’s no power outage while this is happening, since you’ll probably be storing the contents in volatile memory until you can get it written back in.)

And all of this simply to turn one bit from a 0 to a 1.

And that’s only where the management part starts. Flash cells can wear out if programmed too many times – the so-called endurance spec tells you how many chances you get to write to a cell. But it’s not black-and-white: it’s not like a cell will up and stop working one day. The contents of a cell will also leak out over time – that’s the data retention spec. As you program more times, you create nano-damage (or is it pico-damage?) that accelerates the data loss.

So a worn-out cell isn’t one that stops working: it’s one that can no longer hold data for the specified time. It might still hold data; just not for as long. Or you might get some errors here and there as noise makes the read process less certain.

This noise can also happen at some level on a fresh memory, so error-correcting codes (ECC) are used to correct for a certain number of failures. At some point, however, the failures will be more than the ECC can correct, and that’s another symptom of wear-out.

Naturally, then, if you treat a NAND flash memory like a DRAM, you’ll have a few lower addresses you bombard constantly with new data. Assuming you’re not completely filling the memory, you’ll have the high-address cells that you never really get to. So, while the low-address cells are wearing out, the high-address ones are untouched – and the entire memory may be deemed worn out based only on some cells, while many other perfectly good cells remain.

Or you might have some cells that have values that change infrequently while others change a lot. Those infrequently-changed cells aren’t wearing out nearly as fast as the quick-change ones are.

I’ve Been Moved

Which is why wear-leveling was invented. Not only do you put new data into random different blocks; you might even need to pick up a static block and rewrite it somewhere else to keep things moving around. The idea is that, over the life of the memory, all cells should more or less wear out at the same rate so that when the memory is finally done, there are no islands of untouched, wasted cells.

But here’s the thing: wear-leveling means that you have to have a table mapping the addresses the system knows to whatever the actual cell addresses are at any given moment – since they move around. This is done by low-level software. And how long it takes will vary, depending on what needs moving and how many operations are needed. For instance, can you move to a fresh block? Or do you need to find and erase a block first?

A deeply embedded application – especially one for a safety-critical system – will run an RTOS, so there is some level of timing predictability as compared to a non-real-time OS. But still, they get lots of interrupts, and it can be impossible for the OS to predict how long these flash management operations will take. And, when it comes to safety, predictability is important. This has made NAND flash tough to use where safety is a requirement.

“What about NOR flash?” you may ask. NOR is often the preferred choice for safety-critical systems; it’s also used for executing code in place. Because reading can be done in random locations, you simply read and execute – a useful thing for a stored BIOS. If the code doesn’t change, then you’re not programming or wearing out the device. That said, a flash cell is – in theory – a flash cell, and it’s not immediately clear why hooking them up differently would change any reliability characteristics.

Well, it turns out that, yes, the cells are the same in theory, but they’ve been designed with different goals. The NAND goal was capacity – meaning super small cells. The NOR goal is more about reliability, since critical code like BIOS boot code relies on a faithful reading of the contents each and every time. So the cells are larger, and no ECC is used. Which is why it usually gets the nod for safety-critical systems.

Translation Please

This and other related functions are part of a low-level software layer called the Flash Translation Layer (FTL). It handles all of the management overhead so that the rest of the system doesn’t have to worry about it. And it involves more than just wear leveling; HCC Embedded’s FTL also handles:

The ECCs
Bad-block management – if some block has a failure for some reason, then it can be removed from the map so that you don’t have to throw away the rest of the memory
Read-disturb management: if you read one bit, you can, in principle, accidentally change (or “disturb”) another unrelated bit. The more times you read in a row without erasing and refreshing the block, the greater is this risk. HCC Embedded helps to manage this issue.

So that’s all well and good, but we are still challenged to use this in a safety-critical environment. The goal here, as described by HCC Embedded, is referred to as Safety Elements Out of Context, or SEOOC. The idea is to reduce the amount of common system-component redesign from scratch for each new safety-critical project. Instead, the intent is to have modular pieces – like a memory or an FTL – that have been shown to be safe – provided certain assumptions are adhered to.

Those assumptions are documented in an accompanying safety manual. If you use the module in accordance with that manual, then you can pass safety muster without redoing all of the work necessary to requalify the module.

It’s for this reason that HCC Embedded has developed a failsafe FTL. Which might seem to be a non-starter, since, by their very nature, these management functions aren’t predictable. You don’t know ahead of time when you’re going to need them, and, when you do need them, you don’t know how long it will take.

But let’s be more specific about this: the RTOS doesn’t know how long an FTL function will take. And the FTL has no idea how many times it might be interrupted before finishing a task. Does that make this completely unsuitable for safety?

No. The problem isn’t that the timing might be different every time; the problem is that the OS doesn’t know what the timing is going to be, so it can’t handle FTL functions deterministically. And determinism is the name of the game here.

But, as with so many problems – especially human ones – good communication can save the day. Here’s the deal: the FTL can look at what needs doing and determine how many cycles it will take. The OS won’t know that – unless the FTL tells it. And that’s the difference with a failsafe FTL: the OS can ask the FTL how many cycles it needs for some operation, and the OS can then schedule that operation according to the priorities in play at the moment.

Yeah, on a large scale, there’s still some unpredictability here, but, at a local scale, where it really matters, the OS now gets enough visibility into what’s happening to calmly, predictably manage the operations.

Meaning that the FTL (and its associated safety manual) are now suitable for safety-critical systems. It can be leveraged for both NAND and NOR memories And it makes NAND memory much more viable when safety matters.

More info:

HCC Embedded Failsafe FTL