On 1/19/07, Joe Landman <landman at scalableinformatics.com> wrote:

> Malay wrote:
>
> > Interesting discussion everyone.  My limited experience says given
> > the price of redundant but cheap systems and a reliable but expensive
> > system, one should go for the cheapest systems that serve your
> > purpose plus redundancy rather than a reliable and more expensive
> > system.  To elaborate, two
>
> There is a story running around about this.  Some airplane manufacturer
> built a small plane with your choice of engines.  The first option was
> a single (unknown manufacturer) high quality and more expensive
> turboprop.  The second was a pair of (also unknown manufacturer)
> "reasonable" quality piston engines.  Turns out that the company sold
> the benefits of "cheaper but redundant" to its audience.  The buyers
> who purchased them looked at the failure statistics and noted that even
> with the redundant pair, if one engine failed you were still in quite a
> bit of trouble.

MTBF is a statistical measure based on failure rates for a large number
of fresh units.  You may have a component with a 10 year MTBF whose
mechanical bits will wear out in 5 years.  Vendors have become very
adept at designing hardware that wears out a couple of days after the
warranty expires.

I hear horror stories about the 2nd disk failing while rebuilding a
RAID, but how many sites have a schedule to replace drives before they
actually fail in service, rather than waiting until the first one fails?
I've experienced too many cases where a number of identical parts
(disks, power supplies, fans) in workstations purchased at the same time
all fail at roughly the same time.  Sometimes there is a trigger event
(A/C failure) that stresses systems within limits they would have
handled when new, but after 2-3 years cooling fans are less effective
due to dust buildup, added components have increased heat production in
the machine room, etc., so you get a cluster of failures.  Rebuilding a
RAID is also a stressor.  (I've put some rough numbers on this further
down.)

> The point being, if you are going to bet your life, or your data, on
> something, it makes sense to go with hard data as compared to
> speculation.
>
> The cheapest drives around, Maxtors and their ilk, have seen failure
> rates higher than 3-4% in desktop and other apps.  Sure, you will save
> a buck or two on the front end (acquisition).  Unless you can tolerate
> data loss, do you want to deal with the impact on the back end?
> Without trying to FUD here, how much precisely is your data worth, how
> many thousands or millions of dollars (or euros, or ...) have been
> spent collecting it?  Once you frame the question in terms of how much
> risk you can afford, you start looking at how to ameliorate the risk.
>
> There are simple, (relatively) inexpensive methods.  N+1 power supplies
> add *marginal* additional cost to a unit.  Using better drives (notice
> I didn't say FC/SCSI/SATA) adds a minute cost to the unit.  Using
> intelligent redundancy (RAID6 with hot spares, mirroring, ...) reduces
> risk at an increase in cost.

So does a sensible schedule to replace older units before they fail.
For organizations where unscheduled downtime is expensive, the benefits
include being able to schedule replacements to minimize disruptions.

> We are not talking about EMC costs here.  Or NetApp.  If you are
> spending north of $2.5/GB of space you are probably overspending,
> though this is a function of what it is and what technology you are
> buying.
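Here are the rough numbers I mentioned above -- a quick back-of-envelope
Python sketch of the redundancy-versus-quality trade-off.  The AFR
figures and the two-day rebuild window are my own illustrative guesses,
not vendor data, and the constant-hazard (exponential) model assumes
failures are independent, which is exactly the assumption that
same-batch wear-out and rebuild stress break in practice.

import math

def p_fail(afr, years):
    """Probability a drive fails within 'years', given an annualized failure rate."""
    rate = -math.log(1.0 - afr)        # constant hazard rate implied by the AFR
    return 1.0 - math.exp(-rate * years)

afr_cheap, afr_better = 0.04, 0.01     # assumed AFRs: cheap vs. pricier drive
rebuild_days = 2.0                     # assumed RAID rebuild window

# Single better drive: data is lost whenever it fails.
p_single = p_fail(afr_better, 1.0)

# Mirrored pair of cheap drives: data is lost only when one drive fails
# and the survivor also dies during the rebuild window (this ignores
# controller faults, operator error, and correlated failures).
p_first  = 1.0 - (1.0 - p_fail(afr_cheap, 1.0)) ** 2
p_second = p_fail(afr_cheap, rebuild_days / 365.0)
p_mirror = p_first * p_second

print("single better drive, loss/year   : %.4f%%" % (100 * p_single))
print("cheap mirror, rebuild events/year: %.4f%%" % (100 * p_first))
print("cheap mirror, loss/year          : %.6f%%" % (100 * p_mirror))

Under the independence assumption the cheap mirror looks far safer,
which is why it wins in the spreadsheets; the catch is that the two
drives in the mirror were probably bought on the same day, so the
"independent" second failure during a rebuild is a good deal more likely
than this arithmetic suggests.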
> > separate machines with cheap components (cheapest SATA drives with a
> > single power supply) is better than one expensive machine (higher
> > quality hard drives, redundant power supply).  What do you Gurus say?

You have asked an ill-posed question.  The answer is very sensitive to
the I/O profile of your workload.  There can be a big performance hit
from the I/O it takes to replicate the data between the boxes.  Some
workloads have low-I/O windows where replication can be done.  How
robust is your processing if the separate machines get out of sync?  One
approach is to keep the filesystem metadata on a small, highly reliable
machine.

> I believe that you can save money at the most appropriate places to do
> so.  I'm not sure this is it.  It's your data, and you have to deal
> with/answer for what happens if a disk or machine demise makes it
> un-recoverable.  People who have not had a loss event usually don't get
> this (e.g. it hasn't bitten them personally).  If you have ever lost
> data due to a failure, and it cost you lots of time/energy/sweat/money
> to recover or replicate it, you quickly realize that the "added" cost
> is a steal, a bargain in comparison with your time.  Which you should
> value highly (your employer does, and rarely do they want you spending
> time on data recovery, unless that is your job, as compared to what you
> are paid to do).

There are usually people (who won't be around when the problems appear)
telling management "cheap, secure, and reliable? -- no problem!".  In
large organizations, the time/energy/sweat includes sitting on the
committees that make the recommendations to management.  Many large
organizations have people running spreadsheets to compare the cost of
data storage/processing at various sites.  The results are then used to
require every site to use the approach that looks cheapest -- often
without appropriate consideration of the risks or of differences in
workloads.

--
George N. White III <aa056 at chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia