Enterprise Strategy Group | Getting to the bigger truth.TM

Are we missing the point with SSD/Flash?

Sometimes we do things “just because” it’s how we’ve always done things – regardless of whether or not it makes sense to do. It seems to me that all the excitement, money, and noise around Flash these days is creating one of those vacuums that sucks everyone along, but I’m not sure the end game is going to make life better.

Here’s what I mean: Flash or SSDs make an “element” of infrastructure faster. They do so because the sub-component of the element in question does its I/O on mechanical disk, and Flash/SSD is memory, which is much faster than disk. That part I get – and – I’m all for it. Go nuts. Kill the disk. Sorry Seagate. I am not at all saying faster stuff is bad. I’m saying that in context, it isn’t as good as it needs to be.

Why do we need those elements to go faster? The answer is – eventually – that we want an application to go faster – which means, eventually – that what we REALLY want is for the USER (be it a person, server, process, etc.) to be able to access (to manipulate or display) the DATA associated with the application being run, and to do so faster.

Therefore – the only relationship we should ultimately be concerned with is the relationship between the requester (user) and their data. Everything else in the middle is infrastructure. Infrastructure breaks into zillions of ELEMENTS. Thus, making some of the elements faster CAN be good – but unless you make EVERY element faster, it won’t ever be perfect or optimized.
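To put rough numbers on that, here is a minimal sketch in Python. It assumes, purely for illustration, that a request passes through each element in series, so the end-to-end response time is simply the sum of the per-element times. The element names and millisecond figures are invented, not measured.

```python
# Hypothetical per-element service times (ms) for one user request.
# Assumption: the elements are traversed in series, so the times add up.
path_ms = {
    "client": 2.0,
    "network": 3.0,
    "app_server": 5.0,
    "file_server_cpu": 2.0,
    "disk": 8.0,
}


def end_to_end_ms(latencies_ms):
    """Total response time seen by the user: the sum over all elements."""
    return sum(latencies_ms.values())


def speed_up(latencies_ms, element, factor):
    """Return a copy of the path with one element made `factor` times faster."""
    faster = dict(latencies_ms)
    faster[element] = latencies_ms[element] / factor
    return faster


def slowest_element(latencies_ms):
    """Name of the element that contributes the most time."""
    return max(latencies_ms, key=latencies_ms.get)


before_ms = end_to_end_ms(path_ms)
flash_path_ms = speed_up(path_ms, "disk", 10)  # swap disk for something 10x faster
after_ms = end_to_end_ms(flash_path_ms)

print(f"before: {before_ms:.1f} ms, after: {after_ms:.1f} ms")
print(f"end-to-end speedup: {before_ms / after_ms:.2f}x")
print(f"slowest element now: {slowest_element(flash_path_ms)}")
```

With these made-up numbers, making the disk 10x faster improves the user's experience by only about 1.56x, and the slowest element becomes the application server. The bottleneck has moved, not disappeared.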

Thus, my contention is that Flash/SSDs are great as a natural evolutionary component upgrade in the sub-relevant world of gizmos, but will never be completely relevant to the real mission of globally enhancing and automatically optimizing the performance (and availability) of the primary relationship between the user and their associated data. Read that again if you must. It can be a piece of the puzzle, but will never be the whole enchilada. Therefore, it should not carry such outrageous exuberance beyond its legitimate business opportunities.

Case in point – the processor. Do you need bigger, faster processors today? Not really. Does the fact that the economics of processors is such that we can afford to have super mega huge processors on every element in the infrastructure guarantee that the HOLISTIC infrastructure dynamically optimizes the performance/availability relationship between the user and their data? No. That’s not to say processors are bad – far from it – but it makes the point that you never solve any systemic issue with only a single view. Faster processors and VMware enable us to get rid of many elements!!! But if/when we do, we expose all sorts of new problems that didn’t exist prior – like I/O, and all the operational baggage that goes along with any change in the data center.

If you are a company with over 100TB of file data, growing at 50% or more per year, our research says that you support a single critical application with 39 file serving devices on average. The reason you have 39 of these file serving devices is A: that’s just the way it happened, B: you need all of these for performance/load balancing, C: capacity requirements caused it, or D: see A. The ONE application using this data may have 39 of its own servers associated with it – supporting X number of users. Adding Flash/SSD to 1 of the file servers might make the data on that 1 server faster – fact. But it won’t do jack for the other 38 file servers, nor will it do diddly for the application servers, nor will it do diddly to reconfigure the network to take advantage of it. It will make the files that sit on that one file server accessible faster, but only if the data you need happens to be the data you stuffed into the Flash or SSD.
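The arithmetic behind the 1-of-39 point can be sketched in a few lines of Python. The assumptions are mine and deliberately simple: requests are spread evenly across the file servers, a disk read costs 8 ms, a Flash read costs 0.5 ms, and `flash_hit_rate` is the fraction of reads on an upgraded server that actually land in Flash. None of these figures come from the research cited above.

```python
def mean_storage_latency_ms(total_servers, upgraded_servers,
                            disk_ms=8.0, flash_ms=0.5, flash_hit_rate=1.0):
    """Average storage latency across all file servers.

    Assumes requests are spread evenly over the servers (hypothetical),
    and that only reads hitting Flash on an upgraded server get flash_ms.
    """
    # Latency on an upgraded server: a blend of Flash hits and disk misses.
    upgraded_ms = flash_hit_rate * flash_ms + (1.0 - flash_hit_rate) * disk_ms
    untouched_servers = total_servers - upgraded_servers
    total_ms = untouched_servers * disk_ms + upgraded_servers * upgraded_ms
    return total_ms / total_servers


baseline_ms = mean_storage_latency_ms(39, 0)    # no Flash anywhere
one_server_ms = mean_storage_latency_ms(39, 1)  # Flash in 1 of 39, every read hits
all_servers_ms = mean_storage_latency_ms(39, 39)

improvement = (baseline_ms - one_server_ms) / baseline_ms
print(f"baseline: {baseline_ms:.2f} ms")
print(f"Flash in 1 of 39: {one_server_ms:.2f} ms ({improvement:.1%} better)")
print(f"Flash in all 39: {all_servers_ms:.2f} ms")
```

Under these assumptions, upgrading one server improves average storage latency by roughly 2.4%, and that is before counting the application servers and network, which are untouched. A `flash_hit_rate` below 1.0 shrinks the gain further.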

I would argue that stuffing Flash into that 1 file server/array/gizmo will cause MORE work to be done operationally – because now you have to move and migrate things to take advantage of that new sub-component – and every time you change anything, you open yourself up to new SYSTEMIC issues. Data/System migrations for capacity optimization or load balancing are a never-ending tactical nightmare of IT operations. Flash may have kept someone sane by masking the need to do one of these migrations on that system, but only for a little while. In reality, it probably takes more work to leverage the benefits of the Flash than it’s worth.

Thus, since no human can really optimize the overall environment the way things currently happen, the only real way to drive value from this exercise seems to be that you would have to make sure you upgrade the sub-component that you believe causes you the problem du jour – storage I/O for example – systemically (add it to EVERY file server), and then manually tune/alter the rest of the network and server infrastructure accordingly.

I contend that there must be a better way. It is simply not practical or reasonable to systemically apply higher performing sub-components at all levels in today’s world and expect the ultimate issue to be solved. Even if you could afford to do so, the incremental operational burden of manually optimizing the environment would take an army of PhDs in a STATIC environment – and I don’t think it’s possible in the dynamic, always-changing real world of never-ending data growth in which we live.

Therefore, what we need is a way to automatically/dynamically optimize the connection performance/availability of the user/data relationships on what we have today, tomorrow, and the next day – no matter what changes or when. The only way I can think of for that to happen is to CENTRALIZE the control (for availability and routing – and application prioritization/policy) and the performance. We need to take the intelligence from all the elements and create uber control and cache systems. Central Flash married with real application/infrastructure intelligence – sort of how a contained system operates. Sort of what VMware aspires to do. The problem is that even if VMware pulls off the data center coup of the century, they can only puppet master the individual elements – not control the optimization flow between those elements.

So, there you have it. Go make that. Send me a check.

All views and opinions expressed in ESG blog posts are intended to be those of the post's author and do not necessarily reflect the views of Enterprise Strategy Group, Inc., or its clients. ESG bloggers do not and will not engage in any form of paid-for blogging.

4 Responses to “Are we missing the point with SSD/Flash?”

  1. Tony Asaro says:

    Steve – disk drives are the slowest thing in the performance chain, which is why solving this problem does become valuable. There already is caching in multiple places – the host systems and the storage controllers – and cache + cache is additive. However, there will be cache misses and that is why technologies like wide striping make a real impact on overall performance. Additionally, there is a need for faster processors – if Intel didn’t drive this then VMware couldn’t exist. Who knows what unforeseen value improving performance at all levels will manifest? I agree that SSD doesn’t necessarily solve all problems, and remember that price/performance is what ultimately matters, and right now SSD has not hit the “no-brainer” curve. There is no panacea for solving this problem when all the parts of the data center are so disaggregated and heterogeneous. That is why we need the market to drive improvement, with each vendor motivated to do so.
    -- I agree that we absolutely need to constantly improve at the component level, and apologize if I didn’t state that clearly. However, the concept of wide-striping is a great example of my point – it is only relevant if the data I’m striping is within that system. If the data most important at this point in time sits elsewhere, the fact that array A is wide striped is meaningless to me. The same holds true if that array contains zero disks – and all SSD. Success is still predicated on the data required BEING in that place.
    With data growth never abating, it follows that you will never have all your data in any one container, thus there will always be a growing percentage of that data that is sub-optimized. I guess what I’m rambling about is that even if EVERY storage array were 100% SSD – it would still not make for 100% optimized IT infrastructure holistically – it would just remove the storage end of the wire from being the issue and push it elsewhere. Thanks – Steve

  2. the storage anarchist says:

    Pragmatically speaking, I agree with you that centralizing storage resources is more efficient than distributing them out into the servers. This is the foundation of the external storage value proposition.
    And I believe that centralizing expensive resources designed to improve I/O performance is also most prudent – we’ve done that for years with large DRAM caches and the compute power to fuel intelligent pre-fetch to increase cache hit rates, for example.
    But I take exception with the assertion that adding Flash to the centralized resource pool in an external array necessarily increases management complexity – in fact, I believe that the opposite is true.
    I contend that with the appropriate intelligent algorithms and architecture, the external array can leverage Flash to improve performance in much the same way that it leverages DRAM cache today.
    Furthermore, extensive installed-base analysis shows that the vast majority of workloads are far more static than people imagine. This is especially true the larger your granularity of analysis: DRAM cache is appropriate for block-level optimization because the “busy” block set will always be larger than DRAM. But when you move up in granularity, you find that the vast majority of systems service 80-90% of all IO cache-miss requests from only 5-10% of the LUNs.
    And perhaps surprisingly, the set of “workhorse” LUNs doesn’t really change all that much over time.
    This fact is indeed the foundational reason why tiering across 15K, 10K and 7200rpm SATA drives works for so many customers. And it is ALSO the foundational reason why wide-striping can work – each drive has to support a small fraction of the workhorse LUNs, and the rest of the LUNs don’t see much I/O anyway so their impact on the workhorses is rare.
    What most people haven’t considered is that taking the HEAVIEST workhorse LUNs – the ones that service the MOST IOs per day/week/month – and moving just these to Flash has two immediate benefits: 1) these busy LUNs get the benefit of the drastically reduced response times afforded by Flash, and 2) the rest of the workload left on spinning rust ALSO gets better response time BECAUSE THE DISK DRIVES DON’T HAVE TO WORK AS HARD! Less head contention = better performance!
    In fact, customers often find that they can wide-stripe the remaining workload on SATA drives, using a combination of Flash (lowest $/IOP) and SATA (lowest $/GB) to reduce their overall purchase cost and environmental footprint – without sacrificing performance OR increasing operational overhead!
    And given that the workload distribution of most arrays is typically very stable, you usually won’t have to add any overhead to constantly rebalance or redistribute the data in/out of Flash.
    And for those environments where the workloads ARE more dynamic, EMC has announced FAST – Fully Automated Storage Tiering. FAST will dynamically analyze running workloads and adjust the distribution of LUNs and/or sub-LUNs across the available storage tiers.
    FAST is probably the “that” you’re looking for, but it’s not necessarily the case that everyone needs FAST to benefit from centralized Flash.
    -- Well at least pragmatically you agree with me! A few points:
    1. I am NOT suggesting that improving ANY end node (array, server, filer, etc.) element is bad – I like “better” given the alternative. If the problem can be contained to a single system and isolated to a single tier, then go nuts. If you can fix any problem by adding some memory, why wouldn’t you?
    2. Your argument is completely valid (albeit only for the storage end of the equation) PRESUMING that all my data is in your stuff – and the magic that your stuff contains makes sure that the right data is in the right tier at all times. If all I had were super-mega Symms and FAST, then yes, I could theoretically solve all of my optimization/automation issues at the STORAGE layer. I also acknowledge that this scenario would be wicked good – for EMC! However, this is not the case with MOST of the real world, who do deal with (gasp!) non-EMC systems, and even those who deal with multiple EMC systems – it’s a fact that there are lots and lots of disparate elements – and that is the real issue. Plus, your argument falls down if the real issue turns out to be external to the actual storage layer. If the problem is at the load balancer, the server, or the network, you could sell me an array made out of pure gold that squirts Silver Oak and it won’t do me any good. Until I get drunk and sell the thing, of course.
    3. In one sense the unrealistic answer to this is that we take all of the individual elements we deal with and throw them out and buy a 1973 Mainframe. Surely as an industry we are attempting to do that by virtualizing the disparate elements. Pragmatically however, I’m realistic enough to know that we aren’t going to be able to do that, not in short order anyway, and as such will continue to have a zillion elements that comprise our “infrastructure”. Storage is only 1 layer. Storage by itself is useless. We either implode back to a single system or we deal with the distributed nature of the stack as best we can.
    4. Finally, in terms of making things harder, I revert to the statements above. FAST is an awesome concept but is predicated upon having the data under its control – and on the storage layer being the entire issue. What happens if after I have FAST put my proper data into my properly sized SSD and Caches – and the user experience still sucks? What happens is the storage guy says, “ain’t me man”, and drinks the Silver Oak while the server, application, and network guys sweat out the problem. They ultimately find out that because of the VMware experiment that moved Exchange from 25 shitty little servers down to 4 mongo ass-kicking blade servers, along with the 11 SharePoint servers they piled on but forgot to upgrade to SQL 2007, the workload got all funky, and even though CPU is only at 45% now, performance still sucks. So it ain’t the storage, and it ain’t the processor – but it sure is something. Your system is spitting love on demand, but do you think the user cares? Do you think the operations staff is psyched with their Flash investment now or the magic new Nehalems? Nope. They want to shoot someone.
    Which I guess is my point here – systemic issues can only be masked by point improvements – but as long as change is the rule of the day, eventually the next weakest link will be exposed. Neither Flash, nor FAST, nor Silver Oak can solve every problem – only the ones they can touch. – Steve

  3. the storage anarchist says:

    A very thought provoking conversation – thank you sir.
    (shrugging off my storage-centric blinders)…and yes, I have to agree, in the big picture, all we can struggle to accomplish is to relocate the bottleneck somewhere else.
    And we (by definition) will NEVER succeed in eliminating all the bottlenecks, unless we stop “improving” software, that is.
    Heady stuff – I think there’s a bottle of Silver Oak waiting at home that needs my attention.
    Thanks for making Friday Fun!

  4. Mike Workman says:

    Steve,
    I think the guys above said it very well, except I think they were too nice :-)
    Until we come up with a breakthrough architecture that solves ALL the problems, we make incremental improvements. Always have, always will. Faster processors, more cache, higher RPMs and the like.
    Adding SSD to the storage pool just gives us one more weapon in the war on storage *subsystem* performance. It is probably, for now, not as great a weapon as the hype would suggest, but it is certainly a good one, and it will get better over time. Someday, it will be stupid not to have it in your array. In the meantime people will argue against it only if their subsystem design cannot make effective use of it, or if their management system turns it into a huge kludge to manage. We all know who these folks are by the structure of their arguments for or against SSD.
    In my first Blog on this topic I made exactly your first point – put the SSD closest to the processors doing the work – Fusion IO for example connected with PCIe buses – far better than accessing a shared network with SSDs hanging off the back of it. That is all fine if we are willing to give up the shared storage architecture. In some cases, it is the best weapon for performance.
    Finally, I am curious as to how all of a sudden people think FAST is so cool, when Pillar has done this Tiering in the box for the last 3+ years? Our QoS does exactly that, automated movement of data between different Tiers of non-volatile storage components. Our management structure makes addition of SSD into the storage pool trivial. Before people beat me up over this – I didn’t say using SSD effectively is trivial, the technology certainly presents its challenges.
    Now for the Silver Oak: I made mine Shafer One Point Five. If you haven’t tried it, you should. At least you should if you like big wines. If you are partial to things closer to standard Pinots then stay away from any Shafer – it will cause you to convulse. If you have a big budget, go for Hillside Select.
    Have a great weekend.
    Mike
