SSD block fragmentation and benchmarking

Discussion in 'Storage & Backup' started by oupimiquo, Mar 12, 2009.

  1. oupimiquo

    oupimiquo Member

    Joined:
    Sep 20, 2007
    Messages:
    520
    With the number of SSDs coming out, and the discussion of the various performance issues as the drives age, I thought it'd be good to make some diagrams of what exactly is going on to cause these issues. While I do not know the exact algorithms being used inside the Intel/Vertex/etc drives, there are some basic limitations imposed by the nature of flash memory. The discussion below takes a "best case" approach, where the processing power of the controller is assumed to be infinite, the mapping tables are somehow magically stored and available for instant access, etc. In the real world, these assumptions don't hold, and performance is obviously worse. Also, real SSDs have a large number of individual flash devices (the discussion below is for a single device). Although this introduces some complexity in the controller, the overall operation of the drive is still much the same.

    The smallest unit of allocation in a solid state drive is the page. For MLC drives this is typically 2048 bytes. Pages are grouped together into blocks. Within a block, pages must be written sequentially. Once a page has been written to, it cannot be written to again without being erased. However, the erasure operation is done at the block level - the entire block of pages must be erased together. A block typically contains somewhere around 64 pages, give or take a few factors of two. To a first approximation, an SSD is simply a large collection of these blocks, as shown below.

    [image: an SSD as a collection of blocks, each made up of pages]
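
    For the programmers out there, here's roughly what those rules look like in code. This is just a toy Python sketch of the constraints a single flash block imposes (the 2048-byte page and 64-page block figures are the ones used above), not how any real controller is implemented:

    PAGE_SIZE = 2048          # bytes per page (typical MLC figure, as above)
    PAGES_PER_BLOCK = 64

    class FlashBlock:
        """One erase block: pages must be written in order, a written page
        cannot be rewritten, and the only way back is erasing the whole block."""
        def __init__(self):
            self.pages = [None] * PAGES_PER_BLOCK   # None = erased/unwritten
            self.next_page = 0                      # next sequential write slot

        def write_page(self, data):
            if self.next_page >= PAGES_PER_BLOCK:
                raise IOError("block is full - erase it before writing again")
            self.pages[self.next_page] = data
            self.next_page += 1
            return self.next_page - 1               # index the data landed in

        def read_page(self, index):
            return self.pages[index]                # reads are unrestricted

        def erase(self):
            # Erasure is all-or-nothing: every page in the block is wiped,
            # including any still-valid data.
            self.pages = [None] * PAGES_PER_BLOCK
            self.next_page = 0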

    All flash devices have an abstraction layer over the top which presents the device as a sequence of logical blocks (here assumed to be 2048 bytes to match the underlying allocation size, but for SSDs it's usually 512 bytes). The naive way of mapping logical blocks to physical flash is to assign logical block 0 to physical page 0, logical block 1 to physical page 1, and so on. This creates a problem, as writing to logical block 0 (for example) requires reading all 64 pages from physical block 0, erasing the physical block, and then writing the data back with the updated physical page 0. Besides the low performance of this method (and the rapid consumption of erase cycles), there is the danger of data loss if power is lost before everything is written back.
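
    As a sketch of what that naive scheme costs, here's the read-erase-rewrite dance for a single 2 KB update (hypothetical Python again, with a physical block modelled as a plain list of 64 page buffers):

    PAGES_PER_BLOCK = 64

    def naive_overwrite(block, page_index, new_data):
        # Naive 1:1 mapping: to change one page we have to read the whole
        # block into RAM, erase it, and program every page back again.
        saved = list(block)                    # read back all 64 pages
        saved[page_index] = new_data           # patch the one page we care about
        block[:] = [None] * PAGES_PER_BLOCK    # erase the entire physical block
        for i, data in enumerate(saved):       # ...then re-program all of it
            block[i] = data
        # Cost: 64 page reads, 1 block erase and 64 page programs for a single
        # 2 KB logical write - plus one erase cycle burned on the block.

    # Example: rewriting "logical block 5" in the first physical block.
    block0 = [b"old page %d" % i for i in range(PAGES_PER_BLOCK)]
    naive_overwrite(block0, 5, b"new data for logical block 5")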

    A slightly more advanced method is to allocate at the physical block level. This still has much the same problems as the naive method, though it goes some way towards reducing them.

    A much better solution is to allocate at a much finer level, shown in the diagram below as allocations at the page level. As writes come in to the controller, they are sequentially written out to the currently "active" (partially full) block. Once the block is full, a new (completely empty) block is chosen and writes are directed into that one. Various behind-the-scenes record keeping (not shown) is done to keep track of the assignment of physical pages to logical blocks.

    [images: page-level allocation, ending with a simple sequential mapping of logical blocks to physical pages]

    A simple mapping is shown above. This sort of layout (logical blocks assigned to their corresponding physical pages) would come about if a new drive was sequentially written to. Now, consider what would need to be done if logical block 5 was written to again. Since we cannot write directly into the first physical block, we must follow what we always have to do in this model - write into the currently partially full block. This leaves us with a "hole" in the first physical block where the old data still sits. This physical page is wasted space that cannot be reclaimed without erasing the whole block (which would also wipe the still-valid pages in it). This new layout is shown below.

    [images: the layout after rewriting logical block 5, leaving a stale page behind, and later the drive down to its last empty block]

    Eventually, the drive is down to its last completely empty block, shown in the figure above (again, we're wanting to write to logical block 5). It can't simply be used as before, since once that block is full the drive would most likely become stuck, with no way to write any more data (barring the possibility that the final few writes leave some other block completely full of stale pages). Instead, a block that has some stale pages must be chosen for copying. The choice is typically made based on the number of erase cycles each block has endured, and/or the number of stale pages in the block.

    [image: the used pages of the chosen block being copied into the last empty block]

    The "used" pages in this selected block are copied to the empty block as shown above. Stale pages do not need to be copied, so there will be some unused pages in the destination block. The new data for logical block 5 is appended to this new partially-filled block, and the old block is erased. This final state is shown below. There's still space for one more logical block to be written, after which the copy-erase process must be done again.

    [image: the final state after the copy-erase step]

    This process is clearly much slower than simply writing the logical block into an empty slot. The write amplification factor (how many physical page writes are needed to update a single logical block) can go up to a significant fraction of the number of pages per block. And since write speed is roughly inversely proportional to the write amplification factor, performance falls off dramatically.
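
    To tie the whole sequence above together, here's a toy page-level remapper in Python. It's only my guess at the general shape of such a thing (greedy "most stale pages" victim selection, no wear leveling, no power-loss handling), not anybody's actual firmware, but it shows the mapping table, the "active" block, the stale holes and the copy-erase step, and it counts physical page writes so the write amplification can be measured:

    import random

    PAGES_PER_BLOCK = 64
    NUM_BLOCKS = 128                                          # toy-sized device
    LOGICAL_PAGES = int(NUM_BLOCKS * PAGES_PER_BLOCK * 0.87)  # ~87% utilization

    class ToySSD:
        def __init__(self):
            # Each physical page just records which logical page lives there
            # (None = erased, "stale" = superseded data still taking up space).
            self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
            self.l2p = {}                  # logical page -> (block, page)
            self.active, self.fill = 0, 0  # block being filled, and how far
            self.physical_writes = 0

        def write(self, logical):
            if logical in self.l2p:        # the old copy becomes a stale "hole"
                b, p = self.l2p[logical]
                self.blocks[b][p] = "stale"
            self._append(logical)

        def read(self, logical):
            # An unwritten logical page has no mapping at all; a real drive
            # would just return zeros without touching the flash.
            return self.l2p.get(logical)   # (block, page) holding the data

        def _append(self, logical):
            if self.fill == PAGES_PER_BLOCK:
                self._new_active_block()
            self.blocks[self.active][self.fill] = logical
            self.l2p[logical] = (self.active, self.fill)
            self.fill += 1
            self.physical_writes += 1

        def _new_active_block(self):
            # Use a completely empty block if there is one...
            for b in range(NUM_BLOCKS):
                if all(p is None for p in self.blocks[b]):
                    self.active, self.fill = b, 0
                    return
            # ...otherwise copy-erase the block with the most stale pages:
            # copy its live pages out of the way, then erase it and reuse it.
            victim = max(range(NUM_BLOCKS),
                         key=lambda b: sum(p == "stale" for p in self.blocks[b]))
            live = [p for p in self.blocks[victim] if p not in (None, "stale")]
            self.blocks[victim] = [None] * PAGES_PER_BLOCK    # block erase
            self.active, self.fill = victim, 0
            for logical in live:                              # copy live data back
                self._append(logical)

    ssd = ToySSD()
    for lp in range(LOGICAL_PAGES):        # a linear pass over the whole drive
        ssd.write(lp)
    before = ssd.physical_writes
    N = 20000                              # then sustained single-page random writes
    for _ in range(N):
        ssd.write(random.randrange(LOGICAL_PAGES))
    print("write amplification: %.1f" % ((ssd.physical_writes - before) / N))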

    To improve the situation, more free space is left at the physical level. A "30 GB" SSD, for example, has ~30*10^9 bytes of logical space, but 32*1024^3 = 34.36*10^9 physical bytes of space. While some of this space is used for the record keeping, having only 87.3% of the physical flash in use at any one point in time makes empty or nearly-empty blocks more likely (so speeds up the copy-erase process).
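
    Spelling that arithmetic out (these are just the numbers from the example above):

    logical = 30 * 10**9      # a "30 GB" drive: ~30*10^9 bytes of logical space
    physical = 32 * 1024**3   # 32 GiB of raw flash
    print(physical)           # 34359738368, i.e. ~34.36*10^9 bytes
    print(logical / physical) # ~0.873 - only 87.3% of the flash holds user data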


    At this point, I'd like to show a little movie I made showing the operations occurring on a simulated SSD. The simulated SSD doesn't do wear leveling, so it can make better choices for the copy-erase step. Additionally, since I've assumed infinite controller computational power, the blocks it selects are probably better than those chosen by a controller with limited power. The video is encoded using the MSU Screen Capture Lossless Codec to make it a sensible size, and looks shockingly bad if uploaded to YouTube. If there's somewhere to upload a 316 KB AVI file, let me know :) At the moment I've just stuck it on Mininova: http://www.mininova.org/tor/2369653

    The colors are, before I forget:
    Blue to green = "used", the color indicating the position of the mapped logical block. Blue = the start of the drive, green = the end.
    Grey = unused page
    Yellow = stale page
    White/grey flashing = block erase in progress
    Red = orphaned block (temporary state, basically ignore it)
    It's fast-forwarded in some parts to avoid boredom :) It shouldn't be too hard to tell where ...

    The movie starts off with two linear passes through the entire logical space (logical to physical space ratio = 87%) on an initially empty SSD. You can see how, as new data is written, entire blocks become stale and are reused. After the two linear passes, 512 single-page random writes are done. Although the random writes are initially fast due to streaming into the completely stale blocks, they rapidly slow down as these blocks run out. The controller has to start doing copy-erase cycles, and since there's only a couple of free pages per block, the write amplification factor goes way up, to roughly 1/(1-0.873) + 1 = 8.9 in theory. Note that this write amplification factor is only related to the relative "fullness" of the SSD.

    The low write amplification factor advertised by Intel for their drives is for a "typical" workload, almost certainly on a nearly-empty drive. In these cases, there's plenty of empty or nearly-empty blocks to go around, so there's little to no copying in the copy-erase step. Looking at the Intel 80 GB drive in particular, it has 80 * 10^9 logical bytes and 10 * 8 * 1024^3 = 85.9 * 10^9 physical bytes, so assuming no space for mapping tables that gives 93.1% utilization. So an unfriendly workload (such as IOMeter) will have a write amplification factor of around 15.6 (actually a bit higher because of the mapping tables).
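
    If you want to plug in your own numbers, the same back-of-the-envelope estimate as above looks like this (it's only an estimate for the full, unfriendly-workload case, and it ignores the space taken by mapping tables):

    def wa_estimate(logical_bytes, physical_bytes):
        # Rough worst-case figure: pages copied per new page written, plus the
        # new page itself, once every block carries its share of live data.
        u = logical_bytes / physical_bytes        # fraction of flash that is live
        return 1 / (1 - u) + 1

    print(wa_estimate(30 * 10**9, 32 * 1024**3))      # ~8.9 for the 30 GB example
    print(wa_estimate(80 * 10**9, 10 * 8 * 1024**3))  # ~15.6 for the Intel 80 GB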

    After the random writes are done, the test pattern goes back to linear writes. This isn't visible at first, because a linear write only ends up being a couple of pages per block. However, it leads to larger and larger stale regions being opened up, so longer linear portions in a block, and so on. This is visible as smooth-gradient stripes appearing in the layout. Eventually, the linear writes bring the drive back to the original state of smooth gradients across every block. The vertical layout is all jumbled, but the performance is the same.

    Note that if there's insufficient free space (i.e. too high a utilization of the physical space), the drive can get stuck in a cycle where the streaming writes don't fully pull the drive out of the hole again. Consider two blocks of 32 pages. The first block contains 4 linear streams, each of 8 pages. Another block has 3 streams, each of 8 pages, and 8 stale pages. One of the 8-page regions from the first block gets written again. What happens is that the second block has a copy-erase cycle, and ends up with 4 streams of 8 pages each. The first block now has 3 streams of 8 pages, and 8 stale pages. Obviously, this pattern can continue indefinitely. All that happens is that the clumps of stale pages bounce around, and in this case the write amplification factor stays at 4. This is my guess as to what is happening with the Intel drive, as shown by pcper. The marketing department at Intel presumably decided that 80 GB was a nice round number over the protestations of the engineers, with the result that they cut it just a little too fine.
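
    Putting numbers on that two-block example:

    PAGES_PER_BLOCK = 32                        # 32-page blocks, as in the example
    STREAMS, STREAM_LEN = 4, 8                  # four 8-page linear streams per block

    live_copied = (STREAMS - 1) * STREAM_LEN    # 24 still-valid pages moved by the copy-erase
    new_written = STREAM_LEN                    # the 8 pages of fresh data
    print((live_copied + new_written) / new_written)   # 4.0, every single time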


    Finally, I'd like to mention benchmarking of SSDs a bit. First of all, I'd say forget all the filesystem-level "low-level" tests like ATTO and Crystal. These are dramatically affected by the caching and filesystem handling that occurs in Windows, and don't really benchmark the underlying disk very well. HDTach/HDTune are much better, as is IOMeter (though only if you use IOMeter in raw disk mode, not the filesystem-level mode). Essentially, if you're pointing the low-level benchmark at a drive letter then you're not benchmarking the actual drive. Note that I'm not against "high level" tests like level loading times and the like, just low-level tests that aren't actually testing the drive very well.

    Secondly, running read tests on a brand new (or secure erased, in the case of Intel) SSD is pointless. The flash isn't even getting accessed in these cases - there's just a mapping table lookup, which shows that there's no mapping for that logical block, and the controller returns all zeros.

    Thirdly, write tests on new SSDs don't tell an accurate story either, since the data can simply be written out linearly into the large amount of free space. Prior to doing any benchmarks, a linear write should be done across the whole SSD. This can easily be done with any of the "secure wipe" tools, dd if you use *nix, etc. This will put the SSD into a state which is about the best that you could hope for after a couple of months of use. A somewhat more realistic case would be to fire up IOMeter and run some write benchmarks. Doing a pure 2 KB random write benchmark is probably overkill if you're trying to get the disk into a "desktop used" state, but would be useful for doing worst-case benchmarking. Setting it up to do a selection of writes from 2 KB to 4 MB or so would result in a much more "realistic" SSD state.
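
    For the *nix route, something along these lines would do the trick - sdX being whatever your SSD shows up as, and with the obvious caveat that it destroys everything on the drive:

    dd if=/dev/zero of=/dev/sdX bs=1M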

    Once you've done this prepping of the SSD, run whatever benchmarks you want. One good thing is that, by definition, doing things like installing software onto the drive won't affect your ability to do "used drive" benchmarks, since this write pattern is exactly what we're trying to emulate. Note that writing to a drive prepped for worst-case benchmarking will "improve" it a bit, so writes (and especially streaming writes like re-imaging) should be minimized between the conditioning of the drive and the benchmark.
     
  2. Dezza Bot

    Dezza Bot Member

    Joined:
    Aug 16, 2004
    Messages:
    4,609
    Location:
    Central Coast, NSW
    It's a good explanation of the problems that SSDs face, and why I'm not yet convinced that my money is better spent on a decent SSD rather than a Raptor.
     
  3. The Mafia

    The Mafia Member

    Joined:
    Mar 26, 2003
    Messages:
    1,469
    Location:
    Mackay
    Thanks a lot for this - it gives me a greater understanding of how this technology works.

    Looking forward to some wicked SSDs when this problem is fixed.
     
