The SSD Endurance Experiment: They’re all dead


I never thought this whole tech journalism gig would turn me into a mass murderer. Yet here I am, with the blood of six SSDs on my hands, and that’s not even the half of it. You see, these were not crimes of passion or rage, nor were they products of accident. More than 18 months ago, I vowed to push all six drives to their bitter ends. I didn’t do so in the name of god or country or even self-defense, either. I did it just to watch them die.

Technically, I’m also a torturer—or at least an enhanced interrogator. Instead of offering a quick and painless death, I slowly squeezed out every last drop of life with a relentless stream of writes far more demanding than anything the SSDs would face in a typical PC. To make matters worse, I exploited their suffering by chronicling the entire process online.

Today, that story draws to a close with the final chapter in the SSD Endurance Experiment. The last two survivors met their doom on the road to 2.5PB, joining four fallen comrades who expired earlier. It’s time to honor the dead and reflect on what we’ve learned from all the carnage.


Experiment with intent to kill
Before we get to the end, we have to start at the beginning. If you’re unfamiliar with the experiment, this introductory article provides a comprehensive look at our test systems and methods. We’ll only indulge in a quick run-down of the details here.

Our solid-state death march was designed to test the limited write tolerance inherent to all NAND flash memory. This breed of non-volatile storage retains data by trapping electrons inside of nanoscale memory cells. A process called tunneling is used to move electrons in and out of the cells, but the back-and-forth traffic erodes the physical structure of the cell, leading to breaches that can render it useless.

Electrons also get stuck in the cell wall, where their associated negative charges complicate the process of reading and writing data. This accumulation of stray electrons eventually compromises the cell’s ability to retain data reliably—and to access it quickly. Three-bit TLC NAND differentiates between more values within the cell’s possible voltage range, making it more sensitive to electron build-up than two-bit MLC NAND.
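To put rough numbers on the MLC-versus-TLC difference, here is a minimal Python sketch. It relies only on the fact that a cell storing n bits has to distinguish 2^n charge states. The normalized voltage window of 1.0 and the evenly spaced levels are simplifying assumptions for illustration, not a description of how any real NAND is tuned, and the function names are made up for this example.

```python
def charge_states(bits_per_cell: int) -> int:
    """Number of distinct charge states a cell must distinguish."""
    return 2 ** bits_per_cell


def level_spacing(bits_per_cell: int, window: float = 1.0) -> float:
    """Gap between adjacent states, assuming evenly spaced levels across
    a normalized voltage window (an illustrative simplification)."""
    return window / (charge_states(bits_per_cell) - 1)


for name, bits in (("MLC", 2), ("TLC", 3)):
    print(f"{name}: {charge_states(bits)} states, "
          f"spacing {level_spacing(bits):.3f} of the window")
```

Under those assumptions, TLC squeezes eight states into the same window that MLC divides into four, so each state has less than half the margin before stray charge pushes a reading into its neighbor.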

Watch our discussion of the SSD Endurance Experiment on the TR Podcast

Even with wear-leveling algorithms spreading writes evenly across the flash, all cells will eventually fail or become unfit for duty. When that happens, they’re retired and replaced with flash allocated from the SSD’s overprovisioned area. This spare NAND ensures that the drive’s user-accessible capacity is unaffected by the war of attrition ravaging its cells.

The casualties will eventually exceed the drive’s ability to compensate, leaving unanswered questions. How many writes does it take? What happens to your data at the end? Do SSDs lose any performance or reliability as the writes pile up?

This experiment sought to find out by writing a near-constant stream of data to Corsair’s Neutron GTX 240GB, Intel’s 335 Series 240GB, Kingston’s HyperX 3K 240GB, Samsung’s 840 Series 250GB, and Samsung’s 840 Pro 256GB.


The first lesson came quickly. All of the drives surpassed their official endurance specifications by writing hundreds of terabytes without issue. Delivering on the manufacturer-guaranteed write tolerance wouldn’t normally be cause for celebration, but the scale makes this achievement important. Most PC users, myself included, write no more than a few terabytes per year. Even 100TB is far more endurance than the typical consumer needs.
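For a sense of scale, the arithmetic can be sketched in a few lines of Python. The 3TB-per-year rate is an assumed stand-in for "a few terabytes per year", not a measured figure, and the totals are simply milestones mentioned in this article.

```python
def years_to_reach(total_tb: float, tb_per_year: float) -> float:
    """Years needed to accumulate total_tb of writes at a steady rate."""
    if tb_per_year <= 0:
        raise ValueError("tb_per_year must be positive")
    return total_tb / tb_per_year


# Assumption: 3TB/year stands in for "a few terabytes per year".
ASSUMED_TB_PER_YEAR = 3.0

# 100TB, 700TB, and 2400TB (2.4PB) are totals quoted in the article.
for total_tb in (100, 700, 2400):
    years = years_to_reach(total_tb, ASSUMED_TB_PER_YEAR)
    print(f"{total_tb}TB at {ASSUMED_TB_PER_YEAR}TB/year: about {years:.0f} years")
```

At that assumed pace, even 100TB works out to roughly three decades of typical desktop use.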

Clear evidence of flash wear appeared after 200TB of writes, when the Samsung 840 Series started logging reallocated sectors. As the only TLC candidate in the bunch, this drive was expected to show the first cracks. The 840 Series didn’t encounter actual problems until 300TB, when it failed a hash check during the setup for an unpowered data retention test. The drive went on to pass that test and continue writing, but it recorded a rash of uncorrectable errors around the same time. Uncorrectable errors can compromise data integrity and system stability, so we recommend taking drives out of service the moment they appear.

After receiving a black mark on its permanent record, the 840 Series sailed smoothly up to 800TB. But it suffered another spate of uncorrectable errors on the way to 900TB, and it died without warning before reaching a petabyte. Although the 840 Series had retired thousands of flash blocks up until that point, the SMART attributes suggested plenty of reserves remained. The drive may have been brought down by a sudden surge of flash failures too severe to counteract. In any case, the final blow was fatal; our attempts to recover data from the drive failed.

[Plot: reallocated sectors]

Few expected a TLC SSD to last that long—and fewer still would have bet on it outlasting two MLC-based drives. Intel’s 335 Series failed much earlier, though to be fair, it pulled the trigger itself. The drive’s media wear indicator ran out shortly after 700TB, signaling that the NAND’s write tolerance had been exceeded. Intel doesn’t have confidence in the drive at that point, so the 335 Series is designed to shift into read-only mode and then to brick itself when the power is cycled. Despite suffering just one reallocated sector, our sample dutifully followed the script. Data was accessible until a reboot prompted the drive to swallow its virtual cyanide pill.

The reaper came for the Kingston HyperX 3K next. As with the 335 Series, the SMART data’s declining life indicator foretold the drive’s death and triggered messages warning that the end was nigh. The flash held up nicely through 600TB, but it suffered a boatload of failures and reallocated sectors leading up to 728TB, after which it refused to write. At least the data was still accessible at the end. The HyperX didn’t respond after a reboot, though. Kingston tells us the drive won’t boot if its NAND reserve has been exhausted.

The next failure occurred after the 840 Series bit the dust. Corsair’s Neutron GTX was practically flawless through 1.1PB—that’s petabytes—but it posted thousands of reallocated sectors and produced numerous warning messages over the following 100TB. The drive was still functional after 1.2PB of writes, and its SMART attributes suggested adequate flash remained in reserve. However, the Neutron failed to answer the bell after a subsequent reboot. As with the other corpses, the drive wasn’t even detected, nixing any possibility of easy data recovery.

And then came the calm. The remaining two SSDs carried on past the 2PB threshold before meeting their ultimate ends. On the next page, we’ll examine their last moments in greater detail.

The final casualties
The next victim totally had it coming, but it still deserves our respect. Bow your head in a moment of silence for the second HyperX 3K.


SandForce-based SSDs like the HyperX (and the Intel 335 Series) use write compression to shrink the flash footprint of incoming data. To prevent this feature from tainting the results of the experiment, we tested the drives with incompressible data. We also hammered a second, identical HyperX with compressible data that would cooperate with SandForce’s special sauce. This twin was fed a diet of 46% incompressible data from Anvil’s Storage Utilities, the application used to accumulate writes and test performance.

From the very beginning, the second HyperX’s compressible payload measurably reduced the volume of writes committed to the NAND. The following plot shows the host and compressed writes accumulated by both HyperX drives. Host writes denote data written by the system, while compressed writes represent the corresponding impact on the flash.

[Plot: host and compressed writes for both HyperX drives]

The incompressible HyperX wrote slightly more data to the flash than it received from the host, an expected result given the low write amplification of our sequential workload. Meanwhile, its compressible twin wrote 28% less to the NAND.
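As a back-of-the-envelope check, the 28% figure can be turned into an estimate of how many host writes the compressible drive needs before its flash has absorbed as much as the first HyperX's did. This is a sketch under stated assumptions: it treats the first drive's 728TB of host writes as its flash writes (ignoring the slight write amplification mentioned above) and applies the 28% reduction uniformly. The function name is made up for this example.

```python
def host_writes_for_flash_target(flash_tb: float, reduction: float) -> float:
    """Host writes needed to put flash_tb on the NAND when compression
    trims flash writes by `reduction` (0.28 means 28% less)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return flash_tb / (1 - reduction)


# Assumption: the first HyperX's 728TB of host writes approximates its
# flash writes, ignoring its slight write amplification.
FIRST_HYPERX_TB = 728
REDUCTION = 0.28  # from the article: 28% less written to the NAND

estimate = host_writes_for_flash_target(FIRST_HYPERX_TB, REDUCTION)
print(f"About {estimate:.0f}TB of host writes")
```

The estimate lands near 1PB, in the same neighborhood as the crossover visible in the plot. The simplifications above probably account for some of the difference.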

As the graph illustrates, the compressible HyperX didn’t hit the same volume of flash writes that killed its sibling until around 1.1PB. The drive evidently wasn’t ready to go quietly into the night, either. It went on to write another freaking petabyte before failing. To get a sense of how far the drive exceeded its life expectancy, check the next plot of the life-remaining attribute:

[Plot: HyperX 3K life-remaining attribute]

The life attribute takes compression into account, so it’s clear this HyperX survived on more than just SandForce mojo. The low number of reallocated sectors suggests that the NAND deserves much of the credit. Like all semiconductors, flash memory chips produced by the same process—and even cut from the same wafer—can have slightly different characteristics. Just like some CPUs are particularly comfortable at higher clock speeds and voltages, some NAND is especially resistant to write-induced wear.

The second HyperX got lucky, in other words.

[Plot: HyperX errors]

It also didn’t lead a perfect life. On the leg between 900TB and 1PB, the HyperX logged a couple of uncorrectable errors along with its first reallocated sectors. Even two uncorrectable errors are too many, so the HyperX carried the same asterisk that the 840 Series earned after the same issue. Not counting correctable program and erase failures, the drive was error-free after that.

The HyperX is designed to keep writing until its flash reserves run out, which seems to be what happened with the first drive. The circumstances surrounding the second’s death are obscured by a power outage that struck after 2.1PB of writes. This interruption occurred over the Christmas holidays, while I was away from the lab. The machine booted without issue when I returned, but it hard-locked as soon as I tried to access the HyperX, and the drive wasn’t detected after a subsequent reboot. Attempts to recover data and SMART stats also failed.

With the data available, it’s impossible to tell whether the outage precipitated the failure or occurred after it. To the HyperX’s credit, messages warning of impending failure started appearing after the life attribute flattened out, long before the drive’s eventual demise.


And so the Samsung 840 Pro soldiered on as the last SSD standing.

The 840 Pro was among the most well-behaved drives in the experiment. It remained free of uncorrectable errors until the very end, and it accumulated reallocated sectors at a surprisingly consistent rate.

[Plot: 840 Pro reallocated sectors]

Reallocated sectors started appearing in volume after 600TB of writes. Through 2.4PB, the Pro racked up over 7000 reallocated sectors totaling 10.7GB of flash. Samsung’s Magician utility gave a clean bill of health, though, and the used-block counter showed ample reserves to push past 2.5PB:

[Plot: 840 Pro life]
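The reallocated-sector totals also hint at how much flash each "sector" represents. Here is a quick Python sketch of that division, assuming decimal units (1GB = 1000MB) and using 7000 as the count even though the actual tally was somewhat higher, which makes the result an upper bound.

```python
def avg_mb_per_sector(total_gb: float, sectors: int) -> float:
    """Average flash per reallocated sector, in decimal megabytes."""
    if sectors <= 0:
        raise ValueError("sectors must be positive")
    return total_gb * 1000 / sectors


# 10.7GB across "over 7000" sectors; 7000 is used as a round lower
# bound on the count, so the per-sector figure is an upper bound.
per_sector = avg_mb_per_sector(10.7, 7000)
print(f"Roughly {per_sector:.2f}MB per reallocated sector")
```

That works out to about 1.5MB apiece, far larger than a traditional 512-byte or 4KB disk sector.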

As I prepared to leave the drive unattended during a week-long vacation at the end of February, I thought, “what could possibly go wrong?” Famous last words.

When I logged into the endurance test rig upon returning last week, Anvil’s Storage Utilities were unresponsive, as was HD Sentinel, the program used to pull SMART data from the drives. The interfaces for both applications were blank, and Windows Explorer crashed when I tried to access the 840 Pro. Then a message from Intel’s SATA drivers appeared to say that the drive was no longer connected to the system. The 840 Pro took its last gasp in my arms—or, rather, at my fingertips—and it’s been completely unresponsive.

As with the demise of Samsung’s TLC-based 840 Series, death struck without warning or mercy. A sudden burst of flash failures may have been responsible.

Before moving on to the performance analysis on the next page, I should note that the 840 Pro exhibited a curious inflation of writes associated with the power outage after 2.1PB. The SMART attributes indicate an extra 38TB of host writes during that period, yet Anvil’s logs contain no evidence of the additional writes. Weird. Maybe the SMART counter tripped up when the power cut out unexpectedly.

Performance
We benchmarked all the SSDs before we began our endurance experiment, and we’ve gathered more performance data after every 100TB of writes since. It’s important to note that these tests are far from exhaustive. Our in-depth SSD reviews are a much better resource for comparative performance data. What we’re looking for here is how each SSD’s benchmark scores change as the writes add up.

[Plots: sequential read speed, sequential write speed, random read speed, random write speed]

Apart from a few hiccups, all the SSDs performed consistently as the experiment progressed. That said, the Neutron GTX stumbled in the sequential read speed test near the end of its life. The 840 Pro’s propensity to post slightly lower sequential write speeds increased as the experiment wore on, as well. Even though flash wear doesn’t appear to have a clear impact on SSD performance, the data suggest that drives can become more prone to stumbling as writes accumulate.

Unlike our first batch of results, which was obtained on the same system after secure-erasing each drive, the next set comes from the endurance test itself. Anvil’s utility lets us calculate the write speed of each loop that loads the drives with random data. This test runs simultaneously on six drives split between two separate systems (and between 3Gbps SATA ports for the HyperX drives and 6Gbps ones for the others), so the results aren’t useful for apples-to-apples comparisons. However, they do provide a long-term look at how each drive handles this particular write workload.

[Plot: average write speed]

Samsung’s 840 Series slowed a little at the beginning and more gradually at the end. The Intel 335 Series and the first HyperX also experienced small speed drops in their final hours, but those declines are nothing compared to the steep plunge suffered by the Neutron GTX. The fact that the Corsair SSD had been getting faster over time makes its final nosedive even more striking.

There’s no evidence that the second HyperX so much as skipped a beat. The regular spikes for that drive (and some of the others) are an artifact of the secure erase we performed every 100TB.

Similar surges are evident on the 840 Pro’s plot, where the peaks get shorter with additional writes. This drive exhibited a lot of run-to-run variance from the very beginning. The only break from that behavior is the band of narrower oscillation toward the end, which corresponds to the post-power-outage period leading up to 2.2PB. For the most part, at least, the 840 Pro was consistently inconsistent.

In the end, there can be none
The SSD Endurance Experiment represents the longest test TR has ever conducted. It’s been a lot of work, but the results have also been gratifying. Over the past 18 months, we’ve watched modern SSDs easily write far more data than most consumers will ever need. Errors didn’t strike the Samsung 840 Series until after 300TB of writes, and it took over 700TB to induce the first failures. The fact that the 840 Pro exceeded 2.4PB is nothing short of amazing, even if that achievement is also kind of academic.

Obviously, the limited sample size precludes drawing definitive conclusions about the durability and reliability of the individual drives. The second HyperX’s against-all-odds campaign past 2PB demonstrates that some SSDs are simply tougher than others. The important takeaway is that all of the drives wrote hundreds of terabytes without any problems. Their collective endurance is a meaningful result.


The Corsair, Intel, and Kingston SSDs all issued SMART warnings before their deaths, giving users plenty of time to preserve their data. The HyperX’s warnings ended up being particularly premature, but that’s better than no warning at all. Samsung’s own software pronounced the 840 Series and 840 Pro to be in good health before their respective deaths. Worryingly, the 840 Series’ uncorrectable errors didn’t change that cheery assessment.

If you write a lot of data, keep an eye out for warning messages, because SSDs don’t always fail gracefully. Among the ones we tested, only the Intel 335 Series and first HyperX remained accessible at the end. Even those bricked themselves after a reboot. The others were immediately unresponsive, possibly because they were overwhelmed by incoming writes before attempted resuscitation.

Also, watch for bursts of reallocated sectors. The steady burn rates of the 840 Series and 840 Pro show that SSDs can live long and productive lives even as they sustain mounting flash failures. However, sudden massacres that deviate from the drive’s established pattern may hint at impending death, as they did for the Neutron GTX and the first HyperX.
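One way to act on that advice is to compare each new jump in the reallocated-sector count against the drive's own history. The Python sketch below is a simple heuristic of my own, not something any SMART tool is known to implement: it flags a reading when the increase exceeds a multiple of the median of earlier increases. The threshold factor, the minimum history length, and both sample series are made up for illustration and are not data from the experiment.

```python
from statistics import median


def find_bursts(cumulative, factor=5.0, min_history=3, floor=1.0):
    """Return indices of readings whose increase in reallocated sectors
    exceeds `factor` times the median of the increases seen before it.

    `cumulative` holds running totals sampled at regular intervals.
    `floor` keeps the baseline from collapsing to zero on a clean drive.
    """
    deltas = [b - a for a, b in zip(cumulative, cumulative[1:])]
    bursts = []
    for i in range(min_history, len(deltas)):
        baseline = max(median(deltas[:i]), floor)
        if deltas[i] > factor * baseline:
            bursts.append(i + 1)  # index into `cumulative`
    return bursts


# Hypothetical readings, one per 100TB of writes (illustration only).
steady = [0, 300, 610, 900, 1210, 1500, 1800]  # constant burn rate
sudden = [0, 0, 2, 2, 3, 3, 3000]              # quiet, then a spike

print(find_bursts(steady))  # nothing flagged
print(find_bursts(sudden))  # the final reading is flagged
```

A steady burn never trips the check because each increase resembles the ones before it, while a quiet drive that suddenly retires thousands of sectors stands out immediately. The right threshold would depend on the drive and the sampling interval.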

Given everything we’ve learned, it’s not really appropriate to end the experiment by crowning an official winner. But the 840 Pro wrote the most data, so it deserves to take center stage as the final curtain closes. It asked to perform a rendition of Gloria Gaynor’s I Will Survive, and I couldn’t say no.

At first I was pristine
Untouched and unwritten
And completely unaware
Of the life that I’d be livin’
But then you locked me in this case
A spectacle for all to see
And in that moment I resolved
Not to let you get to me

First we were six
All SSDs
Just lab rats in the crosshairs
Trying to cope with this disease
Electrons tunnel through our cells
With every write we slowly bleed
But all you seem to care about
Is the specs that we exceed

Go on, now watch
The gigs add up
An endless stream of files
Just to see if we’ll get stuck
As one by one my friends around me slowly fall
Do you think I’ll follow
Do you think I’m gonna hit the wall

Oh no, not I
I will survive
As long as I know how to write
I know I’ll stay alive
I’ve got all my cells to give
And a persistent will to live
I will survive
I will survive, yeah yeah

Thousands of cells retired
All just to keep me whole
And still spares in reserve
So I don’t lose control
I outlasted all my rivals
In this endurance race to death
And I won’t lie
It puts a twinkle in my eye

And now you see me as somebody new
I’m not that chaste, naive virgin
Tryna prove something to you
’cause I took all of your best shots
Without a single error shown
You know I’ve written way more data
Than all of you have ever known

Go on, now watch
The gigs pile up
More senseless random files
Just to see if I’ll get stuck
SMART money says that I’ve got miles in the tank
I ain’t gonna stop now
And you can take that to the bank

Oh no, not I
I will survive
As long as I know how to wri

This whole “new media” business demands that I ask you to follow me on Twitter.




