An all-flash data center can respond instantly to user requests for information, and it requires less power and less physical floor space. The problem is that moving to an all-flash storage infrastructure is expensive. It can become affordable, however, if the data center is only responsible for its active data. In reality, most of an organization’s data is not active and should not be on flash.
Of course, deleting the non-flash data set is not an option. This data still has to be stored, managed and maintained, and if it stays in the data center it immediately eliminates those reductions in power and floor space requirements. Moving non-flash data to the cloud brings those reductions back into play, but it also introduces challenges of its own.
What is Non-Flash Data?
Non-flash data is data that has not been accessed in the last 90 days. 90% or more of the data that organizations store falls into this category. In fact, most data is never modified again after the first few weeks following its creation and is rarely accessed after that. With rare exceptions, flash is not the place to store this data. But just because data has been inactive for more than 90 days does not mean it will never be accessed again. Some percentage of it will be accessed at some point in the future, and when it is, users will want to access it seamlessly and quickly.
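To make the 90-day rule concrete, here is a minimal sketch (in Python) of what such a policy test might look like. The function name and the reliance on the POSIX last-access timestamp are illustrative assumptions, not a description of any particular product:

```python
import os
import time

# Illustrative threshold taken from the definition above: 90 days without access.
NINETY_DAYS = 90 * 24 * 60 * 60  # seconds

def is_non_flash(path: str, now: float | None = None) -> bool:
    """Return True if the file at `path` has not been accessed in 90 days.

    Relies on the POSIX last-access timestamp (st_atime), which many systems
    update lazily (e.g. with relatime), so a real tiering product would use
    richer metadata than this sketch does.
    """
    now = time.time() if now is None else now
    return (now - os.stat(path).st_atime) > NINETY_DAYS
```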
The unpredictable need to access some portion of non-flash data is why the typical practice for most IT professionals is to continue expanding primary storage, either by adding more shelves to existing storage systems or by adding more storage systems to the data center. While this looks like the path of least resistance, the continued expansion of primary storage significantly increases capital and operating costs just so the “just in case” demand for that data can be met.
A Whiteboard Full of Failures
In reality, inactive data has been a problem for IT professionals for almost as long as there have been data centers. The rapid increase in capacity requirements, plus the desire to move active data to all-flash storage, has only exacerbated the issue. In the past, technologies like hierarchical storage management (HSM), information lifecycle management (ILM), archiving and file virtualization have tried to solve the inactive data problem.
The legacy solutions to the inactive data problem faced four key challenges. The first was latency: the time it takes to return data to the requesting user or application. In most cases the response time, or time to data, was too slow, causing users to complain and applications to hang. The latency of secondary storage in responding to user requests led either to passive archiving, where data had to be inactive for years before being removed from primary storage, or, more typically, to the abandonment of the project.
The second challenge was creating a seamless link between the original file location and the secondary file location. These systems used stub files or symbolic links to create the relationship between the two storage areas. Those stub files caused a variety of problems. Users didn’t know what the files were, so they assumed they could delete them. Backup applications didn’t know what to do with them either, so they copied them into backup storage, which meant tracking stub files in addition to the original files.
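For readers unfamiliar with stubbing, the sketch below shows the general idea under simple assumptions: the real file is moved to a secondary location and a tiny pointer file is left at the original path. The JSON format and function name are purely illustrative; actual products used proprietary stub formats or file system links:

```python
import json
import shutil
from pathlib import Path

def stub_out(original: Path, secondary_root: Path) -> Path:
    """Move a file to secondary storage and leave a small stub in its place.

    The stub here is just a tiny JSON pointer recording where the data went.
    To a user or a backup application it looks like any other file, which is
    exactly why stubs caused the confusion described above.
    """
    secondary_root.mkdir(parents=True, exist_ok=True)
    target = secondary_root / original.name
    shutil.move(str(original), str(target))      # relocate the actual data
    original.write_text(json.dumps({             # leave the pointer behind
        "stub": True,
        "moved_to": str(target),
        "size_bytes": target.stat().st_size,
    }))
    return target
```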
A third challenge was the tertiary storage system that held the inactive data. It was still on-site and, as a result, had to be powered, cooled and managed, which lowered the potential savings an archive system should deliver. It may have been less expensive per GB than primary storage, but it was not operationally less expensive.
Finally, these systems had to continuously inspect every file stored on primary storage. They typically did this by crawling through the file system, file by file, looking for files that met a certain criterion, most often the last-accessed date.
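The sketch below illustrates that crawl pattern under the same assumptions as earlier (a 90-day last-access criterion read from POSIX metadata). The point is not the specific code but the cost profile: every pass has to examine every file, even though only a tiny fraction will have newly crossed the threshold:

```python
import os
import time

CUTOFF = 90 * 24 * 60 * 60  # the last-accessed criterion, in seconds

def crawl_for_inactive(root: str) -> tuple[list[str], int]:
    """Walk everything under `root`, file by file, as legacy crawlers did.

    Every pass must stat() every file just to find the few that newly
    crossed the inactivity threshold -- the overhead that made this approach
    struggle on large primary storage systems.
    """
    now = time.time()
    inactive, examined = [], 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            examined += 1
            try:
                if now - os.stat(path).st_atime > CUTOFF:
                    inactive.append(path)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
    return inactive, examined
```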