Software or hardware dedupe?

By Juan Orlandini

One of the questions we're asked all the time is whether customers should implement software or hardware dedupe in their data protection environments. It's a good question, and as with most good questions, the answer is simple: it depends.

"Gee, thanks!" you say. Well, the truth is that it really does. It's even worse than that. There is no "right" answer regardless of what manufacturers tell you. If you talk to a software-only vendor, they'll tell you "SOFTWARE!" If you talk to a hardware vendor, their unequivocal answer is "HARDWARE!" The rub is that they both have good reasons. Interestingly, some of the big players are starting to straddle the fence. More and more, we're seeing software vendors selling hardware, and hardware vendors selling software. I'm not so smart, but if you read between the lines, it means the answer is in the middle. My take is that you probably need both.

Yup, you read that right. I'm taking a stance. You probably need both.

So how do you choose the right solution if the answer is "probably both"?

You have to spend some time figuring out what your environment really needs for recovery. I first talked about that here. You see, the reason we have both hardware and software solutions is that neither is the ideal solution for every problem. Software solutions tend to be more flexible, but at the cost of complexity and, in most cases, performance. Hardware solutions are likely simpler to deploy but have much more rigorous configuration options. And if you're not careful, neither will recover your data the way you want.

Some good rules of thumb to help you out:

  • Small remote sites tend to do well with software-based dedupe back to the central mothership.
  • Larger remote sites are usually better with hardware (or a dedicated software "appliance") that replicates back to the central mothership.
  • Corporate environments tend to do well with software-based client-side dedupe for desktops and things like web farms (things that change very little).
  • Traditional large iron servers (e.g., big databases) tend to do better with hardware appliances.
  • Everything else depends on the cost of restore.
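For readers who like to see rules written down as a lookup, here is a minimal sketch of the list above. The `Workload` names, the `RULES` table, and the `recommend` function are all made up for illustration; they are not part of any product, and the rules of thumb are starting points, not verdicts.

```python
from enum import Enum


class Workload(Enum):
    # Hypothetical categories that mirror the rules of thumb above.
    SMALL_REMOTE_SITE = "small remote site"
    LARGE_REMOTE_SITE = "large remote site"
    DESKTOPS_AND_WEB_FARMS = "desktops and web farms"
    LARGE_DATABASE_SERVER = "large database server"
    OTHER = "other"


# Each rule of thumb expressed as a starting recommendation.
RULES = {
    Workload.SMALL_REMOTE_SITE: "software dedupe back to the central site",
    Workload.LARGE_REMOTE_SITE: "hardware (or software appliance) replicating to the central site",
    Workload.DESKTOPS_AND_WEB_FARMS: "software client-side dedupe",
    Workload.LARGE_DATABASE_SERVER: "hardware appliance",
}


def recommend(workload: Workload) -> str:
    """Return the rule-of-thumb starting point for a workload."""
    # Anything without a specific rule falls through to the last bullet.
    return RULES.get(workload, "depends on the cost of restore")


for w in Workload:
    print(f"{w.value}: {recommend(w)}")
```

Note that the fallback is the interesting case: anything the table doesn't cover lands on "depends on the cost of restore".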

What do I mean by cost of restore? Restore is unfortunately not the primary design goal of dedupe solutions. Look for restore speeds on spec sheets from any manufacturer. They're not there. Don't believe me? Try it. For a long time, just getting dedupe accepted was hard enough. You had to compete with fast and cheap tape drives. For the simple cases, such as restoring small sets of files, restore was more than good enough. The vendors focused on that and used it as the bright shiny object to keep you from noticing one thing: recovering an entire server or data center. Recovering a single file from a dedupe disk solution is way faster than from tape. Common sense. However, recovering an entire server, or even worse an entire data center, is another matter. Recovering bulk volumes of data from dedupe can be very expensive (in both hard and soft dollars). Only recently have vendors started to focus on this side of the equation. Oh boy. That means you have to be careful with this now.

If you build your environment to have dedupe as the sole recovery technology, it had better be the right dedupe technology for the recovery. It's not uncommon for software-based dedupe solutions to restore at a tenth (yup, 1/10th) of their backup speed. Hardware appliances tend to do much better. Depending on the manufacturer (and many crazy details), we see as little as a 20% decrease in restore performance versus backup performance (this is a broad statement painted with a huge brush; just take it for what it's worth right now). There are cases where it's worse, and some cases where it's even better. The message is that hardware tends to do better than software on the recovery side (for large volumes of data in a single data center).
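To see what those ratios can mean on the clock, here is a back-of-the-envelope sketch. The 10 TB data set and the 500 MB/s backup throughput are made-up numbers for illustration, and `restore_hours` is a hypothetical helper; only the two ratios (restore at 1/10th of backup speed, and restore at 20% below backup speed) come from the paragraph above. Plug in your own measured figures.

```python
def restore_hours(data_tb: float, backup_mb_per_s: float, restore_ratio: float) -> float:
    """Estimate hours to restore data_tb terabytes.

    restore_ratio is restore speed as a fraction of backup speed,
    e.g. 0.1 for "a tenth" or 0.8 for "a 20% decrease".
    """
    if backup_mb_per_s <= 0 or restore_ratio <= 0:
        raise ValueError("throughput and ratio must be positive")
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    restore_mb_per_s = backup_mb_per_s * restore_ratio
    return data_mb / restore_mb_per_s / 3600


# Assumed figures, for illustration only.
DATA_TB = 10
BACKUP_MB_PER_S = 500

software = restore_hours(DATA_TB, BACKUP_MB_PER_S, 0.1)  # restore at 1/10th of backup speed
hardware = restore_hours(DATA_TB, BACKUP_MB_PER_S, 0.8)  # restore 20% slower than backup

print(f"software-style restore: {software:.1f} hours")  # 55.6 hours
print(f"hardware-style restore: {hardware:.1f} hours")  # 6.9 hours
```

Under these assumed numbers, the same 10 TB comes back in about seven hours in one case and more than two days in the other. That gap is the cost of restore.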

So balance the flexibility of software, the rigor of hardware, and the speed of recovery of both to determine which one to use. In many cases, the right answer is both.

Like I said. It depends.