An SSD Revolution?

June 9th, 2008

I just read a very interesting article on Computerworld.com written by Jim Damoulakis: http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9092918

Jim suggests that we’re on the verge of an “SSD Revolution” because of the significant performance advantage of Flash over disk drives. He makes several compelling and informative points regarding the pros and cons of SSD technology; however, I really wish he hadn’t begun the article by discussing SSD in laptops. I believe the enterprise storage industry is missing an important distinction by making comparisons to consumer SSD products when discussing the adoption of the technology in the enterprise. There are fundamental differences between commodity consumer-grade and enterprise-class products. Comparing consumer and enterprise flash-based products is a little like comparing myself to Tiger Woods; we both play golf but, believe me, the similarities end there.

The intense I/O and reliability needs of 24/7 enterprise data centers are the most demanding applications and require more than the products found in Jim’s laptop. Enterprise Flash Drives (EFDs) that will meet these requirements will not be the same as those used in notebooks, cameras or MP3 players.

-Amyl Ahola.

Storage reliability for the enterprise

May 20th, 2008

I’ve written a lot about I/O performance on this blog, and with good reason.  When I discuss Pliant’s EFD device and enterprise IT system performance issues with partners and the press, one of the questions that almost always comes up is about performance.  But I often point out that, just as when considering a sports car, performance is only part of the equation.  Reliability is equally important.

Enterprise storage applications are demanding, and it is essential that reliability specifications are met at a 100-percent duty cycle operation on a 24/7/365 basis.  Those in the industry know that true enterprise-class disk drives are required for this environment, and that disk drives designed for low cost and low duty cycle laptop/desktop applications literally fall apart when employed in an enterprise application.  Likewise, SSDs designed for laptop/desktop applications also do not even come close to meeting the need.  So, for Enterprise Flash Drives to be accepted in the enterprise they must meet or exceed enterprise class HDD reliability. This is not a trivial task. 

The primary enterprise reliability specifications take the form of MTBF (or more meaningfully: annualized failure rate) and non-recoverable error rates (lost data).  Flash technology has three primary failure phenomena that have a significant impact on reliability:

  • Write Endurance – the limit on how many times a cell can be written/erased before it becomes damaged
  • Write/Program Disturb –  writing to a given page in a Flash chip can alter bit(s) in a page that is not being written (does not damage the cell); this is sometimes referred to as “bit flip”
  • Read Disturb – similar to Write Disturb, reading a page in a Flash chip can alter bit(s) in a page not being read (does not damage the cell)

A further complication is that these failure modes are not independent.  For example, the read disturb error rate is related to the number of writes or erases so that write endurance and read disturbs (and write disturbs) must be holistically considered.  It is obvious that they all contribute to non-recoverable errors, but perhaps not as obvious that they contribute to MTBF as well.  MTBF is a measurement of performance to specification, not just to some catastrophic event, as is typical with a disk drive.  This includes meeting performance and capacity specifications.

A common approach used in typical SSDs to deal with write endurance is to incorporate a wear-leveling algorithm to distribute writes across blocks within the chip(s), together with error correction (ECC), so that any damaged cells can be corrected when read.  This same ECC can then be applied for all reads to detect and correct altered bits (‘bit flips’) independently of how they became defective, i.e., write endurance, read disturb, or write disturb.  If the number of defective bits exceeds the ECC threshold, the sector(s) being read would then have to be marked as defective (non-recoverable error) and made unavailable to the system.  Depending on the amount of spare Flash capacity, at some point the resulting system capacity may well drop below the specification.
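To make the mechanism above concrete, here is a minimal sketch in Python. It is a toy model, not how any particular SSD controller works: the class names, the "least-erased block first" policy, and the single `ecc_limit` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One erase block; erase_count tracks wear (hypothetical model)."""
    block_id: int
    erase_count: int = 0
    retired: bool = False


@dataclass
class ToyFlash:
    """Toy flash device: wear-leveled writes plus an ECC correction limit."""
    num_blocks: int
    ecc_limit: int                      # max correctable bad bits per read (assumed)
    blocks: list = field(default_factory=list)

    def __post_init__(self):
        self.blocks = [Block(i) for i in range(self.num_blocks)]

    def pick_block_for_write(self) -> Block:
        # Wear leveling: always write to the least-erased usable block.
        usable = [b for b in self.blocks if not b.retired]
        if not usable:
            raise RuntimeError("no usable blocks left")
        target = min(usable, key=lambda b: b.erase_count)
        target.erase_count += 1
        return target

    def read(self, block: Block, bad_bits: int) -> str:
        # ECC corrects up to ecc_limit bad bits, whatever caused them.
        if bad_bits <= self.ecc_limit:
            return "corrected"
        block.retired = True            # non-recoverable: take it out of service
        return "non-recoverable"

    def usable_capacity(self) -> int:
        # Number of blocks still available to the system.
        return sum(1 for b in self.blocks if not b.retired)


flash = ToyFlash(num_blocks=4, ecc_limit=8)
for _ in range(8):
    flash.pick_block_for_write()
print([b.erase_count for b in flash.blocks])   # [2, 2, 2, 2]
```

Eight writes spread evenly over four blocks, which is the point of wear leveling. Each read that exceeds the ECC limit retires a block, so `usable_capacity()` only ever shrinks, which mirrors the capacity erosion described above.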

As an example, a well-known supplier of SSDs advertises an ECC that corrects up to 8 bytes in 1024 bytes, while another supplier advertises 6 bytes in 528 bytes.  At the same time, both talk about program erase/write cycles well in excess of 1 million.  However, tests show that both ECC levels would frequently result in non-recoverable errors after as few as 200,000 write/erase cycles.  These error rates result in SSD reliability falling far short of disk drive reliability in terms of non-recoverable error rate.  At the same time, overall capacity begins to erode and eventually falls below the device specification, resulting in an MTBF failure. 

And, that’s not all.  There is also a significant performance impact resulting from the management of these high error rates (It drops dramatically!).

The primary point is that enterprise-level reliability, whether it’s MTBF or non-recoverable error rate, cannot be addressed with traditional ECC alone.  Other techniques must be employed alongside ECC to manage errors, and these techniques cannot be allowed to significantly impact performance (IOPs or bandwidth).

Sounds like a daunting task…or is it??  Stay tuned.

 Amyl Ahola

The energy of Earth Day

April 22nd, 2008

My guess is that today’s gaggle of green events, speeches and articles will focus on inspiring each of us to raise our environmental consciousness by rethinking the way we use energy.  No question, a noble and necessary exercise. 

However, one topic that I’m afraid may not receive its fair share of MSM attention is the rapidly growing problem of data center power consumption. 

Here’s the issue.  According to a recent report (http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf), servers and data centers account for about 1.5 percent of all U.S. electricity consumption, or 61 billion kilowatt-hours (kWh).  This is more than the electricity consumed by the nation’s color televisions in a year, and about as much as is used to power 5.8 million average U.S. households.  And, at the rate our digital information requirements are growing, server/data center energy consumption will nearly double to 100 billion kWh by 2011, which represents about $7.4 billion in annual electricity costs. 
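For readers who like to check the arithmetic, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above and reads the 2011 projection as 100 billion kWh; the derived values (growth factor, implied price per kWh, per-household use) are my own division, not numbers taken from the report.

```python
# Figures quoted in the post (attributed to the cited EPA report).
current_kwh = 61e9        # 61 billion kWh per year
projected_kwh = 100e9     # 2011 projection, read as 100 billion kWh
projected_cost = 7.4e9    # projected annual electricity cost, USD
households = 5.8e6        # households said to use about the same energy

# Derived values: simple ratios, not figures from the report itself.
growth_factor = projected_kwh / current_kwh
implied_price = projected_cost / projected_kwh
kwh_per_household = current_kwh / households

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied price: ${implied_price:.3f} per kWh")
print(f"implied use per household: {kwh_per_household:,.0f} kWh per year")
```

On these inputs the growth factor works out to roughly 1.6, and the implied electricity price is a little over seven cents per kWh.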

So what can we do to move rapidly to greener data centers?  The worst offending part of the system, the misuse of HDDs, should be among the first to be dealt with.  The fact is that many IT managers are using 3 to 4 times more HDDs than they need from a capacity perspective just to meet growing I/O performance requirements.  This “over provisioning” not only fails to meet I/O performance needs, as I noted in earlier posts, but is also probably one of the most inefficient uses of IT technology I’ve ever seen.  Talk about a waste of space and power (not to mention money)!

With data centers under constant pressure to operate more efficiently and reduce costs, this type of waste is ridiculous, especially when there are other viable alternatives available.  One technology that deserves serious attention is the Enterprise Flash Drive, which is based on solid state technology to offer extremely high data I/O performance. 

Here’s an example of the benefits of deploying EFDs in the enterprise, without breaking the bank.  A hybrid solution combining existing hard drives (preserving some of the initial investment) with selectively deployed EFDs can greatly enhance I/O performance while eliminating the need for HDD over-provisioning.  Best of all, this type of approach can slash data center energy consumption – up to 80 percent in some cases.

Let’s face it, enterprise data centers will continue to push the envelope in terms of performance and capacity requirements.  The trick is finding ways to meet these demands in the most efficient and cost-effective way possible, and EFDs can be a great option for many organizations.

Amyl Ahola 

Never send HDD to do the job fit for EFD…

April 14th, 2008

Who could ask for more than seeing a new storage industry product announcement to highlight the points you’ve been trying to make?
 
I found myself in that position, and was quite surprised (well not really surprised…more like incredulous) to see a recent announcement of what had been frequently referred to as the Seagate “brick” project (not related to MiniScribe), but minimally disguised within a Seagate-funded private company.  The product that was announced is another version of a sealed unit consisting of multiple hard drives “purpose-built to maximize performance and reliability.”  The announcement makes it clear that many new techniques must have been employed to achieve “self-healing,” and to enable the product to essentially repair itself in place “to the equivalent of a fresh, factory-manufactured drive.”  Wow!  I will leave it up to people smarter than me to respond to this.

What I’d like to discuss is the price/performance aspect of this announcement.  The systems tested were fully mirrored, so comparisons are never quite “apples to apples.”  However, one needs to keep in mind that the MTBF of the drives employed requires mirroring to reach any reasonable reliability level.  While I could not find any real price or performance data on the company’s web site, the reference to their SPC benchmarks provided considerable data.
 
From a pricing standpoint, the 1.03TB configuration sells for more than $36 per gigabyte (after a 40% discount from $60/GB)…and, flash-based SSD at $30/GB is considered expensive?
 
This benchmark is also said to be record-breaking with the lowest cost per SPC-1 IOPs.  I’m not suggesting that $36/GB is unreasonable, only that it illustrates the true cost of hard drives in high-performance environments.  A closer look at the benchmark is even more telling.  This “record-breaking” performance correlates to a response time of nearly 30 milliseconds.  In fact, response time increases dramatically starting at about 50% of the max IOPs, which is certainly troublesome for high transaction-rate systems.

This project was started a few years ago, apparently to address the growing price, performance and reliability gap in enterprise applications, as we have been talking about, and to hold off the encroachment of solid state storage devices.  However, with today’s technology, well designed Enterprise Flash Drives will not only be lower in cost per GB, but also less than one quarter the cost per IOP, and more reliable.  And did I mention power?  EFDs will be well under 1/100th the watts per IOP.  I cannot help but be reminded of the Anderson Cooper segment on CNN:  “What were they thinking!”

Amyl Ahola

Hard disk is free…hardly!

March 26th, 2008

The dramatic reductions in HDD cost per GB have resulted in many system/storage architects (and application/operating system programmers) treating primary storage as though it is free.

Some of the results are:

  • Exponential increases in the size of operating systems and applications
  • Mass deployment of low-end and midrange servers with multiple copies of data (and applications)
  • Over-provisioning of storage to satisfy future needs projections (which also likely adopt the concept of free storage)
  • Adoption of power-hungry DRAM cache appliances to mask HDD performance shortfalls
  • Over-provisioning of HDDs to mask HDD performance shortfalls

These all result in inefficient use of storage that has many costs, not the least of which is the increasing cost of energy consumption.  Some of the energy data becoming available paints a sobering picture:

  • Data centers account for 1.5% of ALL U.S. electrical consumption, and this is expected to double in a few years
  • Power consumption per $1,000 of server spending has increased by a factor of 4 since 2000
  • Power failures and limits on power availability are expected to halt data center operations at more than 90% of all companies over the next few years
  • Fifty percent of current data centers will have insufficient power and cooling capacity this year

HDDs are clearly not the only contributor to the rapid acceleration of data center power consumption, but their inefficient use is likely one of the largest contributors.  Data suggests that more than one third of data center power consumption is storage related.

Trends and techniques such as consolidation, virtualization and thin provisioning should all contribute to improved efficiencies.  But while doing so, these approaches will put increased performance demands on the HDDs.  The result:  an increased need for higher performance (i.e., higher RPM……read that as ‘power consuming’) drives and even further over-provisioning for performance – and therefore once again increased energy consumption.

It’s time for new metrics to be considered in the data centers, which take into account energy usage to aid the system designers as they optimize their systems.  Several metrics are identified at the www.greendatastorage.com website; examples cited include activity per watt, such as transactions/Watt, IOPs/Watt, and bandwidth/Watt.
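As a rough illustration of how such activity-per-watt metrics could be computed and compared, here is a small Python sketch. The `Device` class and its field names are hypothetical, and the two sets of numbers are arbitrary placeholders chosen for easy arithmetic; they are not measurements of any real HDD or EFD.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical storage device described by throughput and power draw."""
    name: str
    iops: float            # random I/O operations per second
    bandwidth_mb_s: float  # sustained bandwidth in MB/s
    watts: float           # power draw under load

    def iops_per_watt(self) -> float:
        return self.iops / self.watts

    def bandwidth_per_watt(self) -> float:
        return self.bandwidth_mb_s / self.watts


# Placeholder values for illustration only; not data for any real product.
device_a = Device("device A", iops=1_000, bandwidth_mb_s=50, watts=10)
device_b = Device("device B", iops=8_000, bandwidth_mb_s=200, watts=4)

for d in (device_a, device_b):
    print(f"{d.name}: {d.iops_per_watt():.0f} IOPs/Watt, "
          f"{d.bandwidth_per_watt():.0f} MB/s per Watt")
```

The value of metrics like these is that they fold power into the comparison directly, so a device that is faster but also hungrier does not automatically look better.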

I believe that Enterprise Flash Drives (EFDs) will play a major role in reversing these trends. EFDs can provide over 1000x improvement in IOPs/Watt, and an order of magnitude or more improvement in bandwidth/Watt over the highest performing HDDs.

Amyl Ahola

Addressing the enterprise performance gap

March 17th, 2008

Ok, I want to get back to the issue I was discussing a couple of posts ago.

The question I’m exploring is this:  what can be done about the growing gap between disk drive and enterprise network performance, as well as the escalating inefficiencies?   One only has to look at the root cause:  the mechanical nature of disk drives.  The solution is obvious:  eliminate the mechanics. 

Easier said than done! 

The Holy Grail for primary storage has always been directly addressable, low latency, non-volatile random access memory.  This remains a long way off, but it is time to begin the next evolutionary step.  Solid state technology (particularly Flash) cost and performance continue to improve geometrically, and new and even more competitive semiconductor storage technologies are around the corner.  Meanwhile, disk drive performance (seek, latency) is stagnating, with only limited foreseeable improvements and with cost per I/O leveling off or even beginning to increase with time. 

Last year Greg Schulz of the StorageIO Group predicted the increasing use of solid state technology in enterprise storage applications, saying (paraphrased) that 2008 will be the year of awareness and early adoption by vendors and early deployment by customers, while 2009 will be the broader adoption phase.   Supporting that projection, EMC has recently announced their commitment to Flash and is the first major enterprise storage company to do so (http://www.emc.com/about/news/press/us/2008/011408-1.htm).  Although it was a limited announcement with what I consider an ‘entry level’ SSD technology, it is the first step towards validation of Flash technology as an enterprise primary storage device.

Solid state storage has the potential to be transformational, relegating disk drives to applications that better match their strengths:  low cost per GB and large block sequential access (for the old timers amongst us, it should be noted this is similar to the role magnetic tape took on years ago with respect to disk).

But Flash comes with its own set of problems…(Stay tuned)

Amyl Ahola

A different “Green IT” point of view

March 10th, 2008

I was intrigued by a blog post by Chuck Hollis of EMC last week (“Chuck’s Blog”) offering an interesting perspective on the whole “green IT” issue:  http://chucksblog.typepad.com/chucks_blog/2008/03/green-it—-are.html

Chuck suggests that the IT industry may be missing the point on “green IT.”  Specifically, while it is important to pursue green IT goals from an energy efficiency perspective, the real goal should be “efficient IT,” which can, as a result, generate a number of green benefits, including improved power consumption and footprint reductions.  He further suggests that just because something is green, it doesn’t necessarily mean it is efficient.

I couldn’t agree more.  While I believe that green IT is a critical objective for virtually all enterprise IT environments, there are a number of IT efficiency issues that must be addressed now for their own sake.  Take the use of enterprise HDDs for example.  Many of today’s IT managers are using 3 to 4 times more HDDs than they need from a capacity perspective just to meet growing I/O performance requirements.   This “over provisioning” is at best a band-aid approach to improving I/O, and is probably one of the most inefficient uses of IT technology I’ve seen.

I’ll have more to say on this later, but for now I just wanted to note the importance of pursuing efficient IT for its own sake.  The benefits can be many, not the least of which is a greener IT environment.

Amyl Ahola

Point made

March 3rd, 2008

I have decided to depart from my planned comments because I can’t help but mention a recent storage announcement to help make the point of how far one can go to try and overcome HDD shortfalls.  The announcement references an array of small form factor drives (2 ½”) configured to achieve a very high ‘Actuator Density’ array — a noble objective.  According to the company’s white paper, the unit is sealed and contains 160 or more drives in a 3RU rack.  The drives apparently counter rotate and are offset from each other to overcome shock and vibration issues.

The white paper also talks about 160 drives plus 10 spares in this sealed unit with a three-year minimum life, resulting from the ‘Failure in Place’ ability to dynamically swap failed drives with self-contained spare drives.  Using their own MTBF data and typical failure statistics, this configuration would result in more than 10 failures in a three-year period in more than 50 percent of the installations.  Putting this in perspective:  more than 10 failures means replacing the entire sealed unit, and in the best case, requiring many hours recovering tens-of-TBs of data.  I would not want their warranty bill!
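For anyone who wants to try this kind of estimate themselves, here is a simplified sketch in Python. It is not the calculation behind the figure above: it assumes independent failures, a constant annualized failure rate (AFR), and that all 170 drives are exposed for the full three years. The AFR values in the loop are hypothetical inputs, not the vendor's MTBF data.

```python
from math import comb


def prob_more_than_k_failures(n_drives: int, afr: float, years: float, k: int) -> float:
    """P(more than k of n drives fail within `years`), assuming independent
    failures and a constant annualized failure rate (AFR)."""
    # Per-drive probability of failing at least once over the whole period.
    p = 1.0 - (1.0 - afr) ** years
    # Binomial probability of at most k failures among n drives.
    p_at_most_k = sum(
        comb(n_drives, i) * p**i * (1.0 - p) ** (n_drives - i)
        for i in range(k + 1)
    )
    return 1.0 - p_at_most_k


# 170 drives (160 plus 10 spares), three years, spares exhausted after 10 failures.
# The AFR values below are hypothetical inputs, not the vendor's data.
for afr in (0.01, 0.02, 0.03, 0.05):
    p = prob_more_than_k_failures(170, afr, 3, 10)
    print(f"AFR {afr:.0%}: P(>10 failures in 3 years) = {p:.2f}")
```

The useful thing about running it over a range of AFR values is seeing how quickly the probability of exhausting ten spares climbs once the real-world failure rate drifts above the specification.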

Given the actual failure rate of HDDs (especially consumer grade HDDs used in high duty cycle environments), and not the inflated specification numbers from HDD suppliers, I can’t imagine anyone who has experienced a catastrophic HDD failure event willing to take the risk on such a ‘sealed’ configuration that doesn’t allow hot swaps.  This seems to be marketing 101 at its best:  if you have a downside, feature it!

The white paper also claims improvements in performance and power that, at least in this author’s opinion, do not hold up when subjected to similar objective analysis.

My point is not to throw stones at this particular approach, but rather use it to illustrate the magnitude of the challenge in overcoming the inherent performance and reliability shortcomings of HDDs.  When the fundamental problem is the mechanical nature of the beast, the solution is not to keep adding more of the same. 

In one sense, this is not all that different from the political discussions of the day…the question boils down to, do you want more of the same or is it time for a new paradigm?

Amyl Ahola

The storage industry has come a long way, but… (Part 2)

February 25th, 2008

As I thought more about the topic of my last post – how far the storage industry has come since its inception – another point occurred to me.  While disk drive capacity and cost achievements have been incredible, orders of magnitude improvement, disk drive performance gains are unremarkable – especially when you compare them to the significant advances in CPU and network performance.

Now, I don’t want to receive an inbox full of angry emails (angry comments are welcomed!) about this, so let me make it clear that I truly appreciate the technological challenges and the progress that has been made towards reducing disk latency and positioning times.  But, at the end of the day performance improvement is less than 40x in nearly 50 years!  This compares to multiple orders of magnitude improvement in CPU performance during the same period.

Amdahl’s (other) Law requires that I/O performance improve at the same rate as CPU performance to maintain balanced system performance.  However, with the lag in disk drive I/O performance over the years, what we have now is a growing gap that system designers have had to cope with in their attempt to balance system performance.   The result:  the birth of new industries so that the system designers can add additional hardware – such as cache and RAID together with short-stroking and over-provisioning the disk drives – in an attempt to overcome the performance and reliability shortfalls due to the mechanical nature of HDDs.

While these approaches do improve performance to some degree, they also carry a significant cost to customers.  This is due not only to the cost of the additional hardware and software but increased system complexity, increased power consumption, reduced reliability, increased floor space, increased maintenance expense, and on and on.  What is the true cost of HDD performance….it is anybody’s guess, but I’d argue that it is far greater than what is generally believed!

I believe that this is the most important data storage issue that needs to be addressed.   In particular, how can the industry solve the I/O performance problem without even more patches (e.g., more cache) and ever increasing over provisioning? 

I have some thoughts that I’ll share next time.  In the meantime, I’d love to hear from you.

Amyl Ahola

The Storage Industry has come a long way…

February 15th, 2008

I was recently going through some storage Web sites and came across a press release from 1961 about a new file I remember seeing when I first worked in the storage industry.  It was an announcement from Bryant about a new “High-Speed Parallel-Access Disk File.”  It was a disk drive with a capacity of nearly 80 MB that weighed 1700 lbs and sold for over $100,000.  And this was back when 100 grand was real money.

The release got me thinking about the history of the storage industry and all the major milestones and breakthroughs over the past four decades.  All I can say, is WOW, we’ve come a very long way!  When you look at the capacity that can be stored on a device today, with over 10,000 times the capacity in less than one thousandth of the space, it’s really remarkable what this industry has accomplished.

Which brings me to the point of this post:  today, we launched a new – and I think pretty exciting – storage company, Pliant Technology (more on us later), and this blog.  I plan to use this blog as our outlet to share the experiences that other storage veterans and I have gathered over the past 40+ years (has it been that long?) in the storage industry, and provide our take on industry news, current trends and the latest technologies that are making waves.

My particular passion is analyzing where storage technology needs to go from here, particularly how it needs to evolve to higher levels of performance and efficiency to keep up with the advances of other enterprise technologies, while at the same time contributing to the worldwide Green effort by providing solutions that enable dramatic reductions in data center power consumption.

My goal here is to provide information that interests, challenges and even bothers you.  I’ve been in this business a long time and have seen many changes, and I’ve had the good fortune to work with some of the best in the business.  My plan is to bring all of this and more out on this blog.

I hope you’ll take the time to stop by to read, comment and even criticize on the blog.  If something strikes a chord – good or bad – I’d love to hear from you. 

Amyl Ahola