VMworld this year feels like a confessional: operators coming clean about their lack of data visibility, heads hung in shame. The reasons are many, but foremost is workload. Once you provision a data volume, you never have time to get back to it and ask why, or to update it to mesh with current initiatives. Nowhere is this more apparent than with semi-orphaned data that is still sitting on a disk somewhere (you hope).

Face it, if you had to catalog all your data right now off the top of your head, you’d probably fail. The biggest culprit is that LUN, old RAID array, or disk hanging off some USB port for ‘temporary’ data transfer that you forgot about. And by temporary, we mean you got distracted and forgot it was still attached, even though you were going to remove it “Real Soon Now”. But soon never comes, or if it does, it comes months or years later.

So this year at VMworld, folks finally admitted publicly that as developers they have the same data orphan bloat as the rest of us.

How to solve it? First, by cataloging what you actually have on disk, and where. Then, by determining the context for that data, especially the stuff no one remembers you have. Next, find ways to schedule its migration and integrate it into data stores that make more sense and are guided by policy.
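
Cataloging doesn’t have to wait for a fancy tool. Below is a minimal sketch in Python, assuming POSIX-style mount points that you list yourself (the paths shown are hypothetical); it walks each volume and reports total size plus the newest modification time, which is usually enough to flag the disks nobody has touched in months.

```python
#!/usr/bin/env python3
"""Rough volume inventory: total bytes and newest mtime per mount point."""
import os
import time

# Hypothetical mount points -- substitute the volumes you actually need to catalog.
MOUNT_POINTS = ["/mnt/bulk01", "/mnt/old-raid", "/media/usb-transfer"]

def survey(root):
    """Walk a volume and return (total bytes, most recent modification time)."""
    total_bytes, newest_mtime = 0, 0.0
    for dirpath, _dirs, files in os.walk(root, onerror=lambda err: None):
        for name in files:
            try:
                st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            except OSError:
                continue  # dead symlinks, permission errors, etc.
            total_bytes += st.st_size
            newest_mtime = max(newest_mtime, st.st_mtime)
    return total_bytes, newest_mtime

for mount in MOUNT_POINTS:
    if not os.path.isdir(mount):
        print(f"{mount}: not mounted?")
        continue
    size, mtime = survey(mount)
    last = time.strftime("%Y-%m-%d", time.localtime(mtime)) if mtime else "empty"
    print(f"{mount}: {size / 1e9:.1f} GB, last modified {last}")
```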

"Getting policy that makes technical sense means you have to have technical chops high up the organizational flowchart to get buy-in along with budget."

What, you don’t have a policy? There’s a confession booth here for that, too. You see, getting policy that makes technical sense (especially across large enterprises and networks) means you need technical chops high up the organizational chart to get buy-in along with budget. And once you do, it can take months to roll that policy out to the front-line technical folks building the boxes.

But even then, do you really have the tools? One session I attended here focused on creating an abstraction layer that can apply context to your data and, tied to a meaningful policy framework, provision volumes based on the type of data you have. For example, large bulk storage has very different I/O needs than a heavily used database that demands very low latency.
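
The session didn’t hand out code, but the shape of such an abstraction layer is easy to sketch: tag each data set with a context, and let a policy table translate that context into provisioning parameters. The contexts, tiers, and numbers below are hypothetical placeholders, not anything a vendor ships; a real implementation would call your array or hypervisor API where the print statement sits.

```python
from dataclasses import dataclass

@dataclass
class StoragePolicy:
    tier: str              # backing storage class
    min_iops: int          # performance floor
    max_latency_ms: float  # latency ceiling
    dedup_inline: bool     # de-duplicate as data lands

# Hypothetical policy table: data context -> provisioning parameters.
POLICIES = {
    "bulk-archive":  StoragePolicy("capacity-hdd", 100, 50.0, True),
    "oltp-database": StoragePolicy("nvme-flash", 20000, 1.0, False),
}

def provision(volume_name: str, context: str) -> StoragePolicy:
    """Resolve a policy from the data's context; a real version would call the array API here."""
    policy = POLICIES[context]
    print(f"Provision {volume_name} on {policy.tier}: "
          f">= {policy.min_iops} IOPS, <= {policy.max_latency_ms} ms, inline dedup={policy.dedup_inline}")
    return policy

provision("finance-archive-01", "bulk-archive")
provision("orders-db-01", "oltp-database")
```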

In the end, though, if you can start to develop this sort of framework, you can re-integrate your orphaned data, gain visibility into what you actually have, and then reclaim a whole ton of space from underutilized or redundant data, especially if you de-duplicate it inline.

I’m going to try an experiment on some data I have sitting in bulk on a few volumes: keep the original, provision some new volumes with de-duplication enabled, and see just how much disk I’m wasting. My guess is I’ll recover at least a third of the space, maybe more.
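
If you’d rather get a rough number before copying anything, you can approximate the savings offline by hashing whole files and totaling the bytes held by duplicate content. This is only a sketch under simplifying assumptions (whole-file matching, not the block-level de-duplication a real array performs), so treat the result as a floor; the ROOT path is hypothetical.

```python
import hashlib
import os
from collections import defaultdict

ROOT = "/mnt/bulk01"  # hypothetical volume to sample

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Group file sizes by content hash; every copy past the first is reclaimable.
sizes_by_digest = defaultdict(list)
for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            sizes_by_digest[file_digest(path)].append(os.path.getsize(path))
        except OSError:
            continue

total = sum(sum(sizes) for sizes in sizes_by_digest.values())
unique = sum(sizes[0] for sizes in sizes_by_digest.values())
if total:
    print(f"On disk: {total / 1e9:.1f} GB, unique content: {unique / 1e9:.1f} GB "
          f"({100 * (total - unique) / total:.0f}% reclaimable by whole-file dedup)")
```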

And that’s sort of the point: if you really know where your data is and what its context is, it becomes much simpler to justify a capacity request to upper management. But my guess is that if you put an intelligent abstraction layer in play, you’ll find, like I will, that you have tons of wasted space, costing you power, rack space, and bandwidth, plus clutter in your head and in your documentation.

So if you think you can prove this to management, saving a bunch of money, time, and wasted resources while gaining better visibility into your data for disaster recovery or rapid response to threats, you might want to start down this path. Judging by the swarm of attendees at these sessions (sitting on the floor, hanging out the doors), you most certainly won’t be alone.