6th Jun, 2007

A Recipe for Failure

Recently, I’ve seen numerous examples of what we call the “throw the data over the wall” model of data analysis, mostly in the form of “challenge problems.” Typically, participants are given data sets, without much background information, and asked to analyze them or develop models to make predictions on new data sets. Tests of this sort are of questionable value, as this mode of data analysis is really a recipe for failure. In some instances, I’ve wondered if the tests weren’t set up intentionally to produce failure. So it got me to thinking about how to set up a test to best meet this sinister objective.

If you want to ensure failure in data analysis, just follow this recipe:

1) Don’t tell the analyst anything about the chemistry or physics of the system that generated the data. Just present it as a table of numbers. This way the analyst won’t be able to use any physical reasoning to help make choices in data pre-treatment or interpretation of results.

2) Be sure not to include any information about the noise levels in the measurements or the reference values. The analyst will have a much harder time determining when models are over-fit or under-fit. And certainly don’t tell the analyst if the noise has an unusual distribution.
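
To see why withholding the noise level matters, here’s a minimal sketch (a hypothetical NumPy example, not from the original post): training error keeps shrinking as model complexity grows regardless of whether the extra flexibility is capturing signal or chasing noise, so without knowing the measurement noise the analyst has no yardstick for “good enough.”

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # the true noise level -- exactly the information being withheld

x = np.linspace(-1, 1, 15)
y = np.sin(np.pi * x) + rng.normal(0, sigma, size=x.size)

# Training error shrinks monotonically with polynomial degree, whether the
# added flexibility is fitting signal or just fitting noise.  Knowing sigma
# is what lets the analyst tell where to stop.
rmses = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    rmses[degree] = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"degree {degree}: training RMSE = {rmses[degree]:.3f}")
```

If the analyst knew sigma, a training error well below it would be an immediate red flag for over-fitting; without it, every extra factor or degree looks like an improvement.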

3) Leave the data in a form that makes it non-linear. Alternatively, preprocess it in a way that “adds to the rank.”
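
The rank point can be made concrete with a small sketch (a hypothetical NumPy example; the transmittance/absorbance analogy is mine, not from the post). A data set that is truly bilinear needs only one factor, but push it through a nonlinearity and the effective rank jumps:

```python
import numpy as np

rng = np.random.default_rng(0)

# A bilinear (rank-1) data set: think of a single spectral component whose
# concentration varies across 20 samples measured at 50 channels.
scores = rng.uniform(0.5, 2.0, size=20)
loading = rng.uniform(0.5, 2.0, size=50)
X = np.outer(scores, loading)
rank_lin = np.linalg.matrix_rank(X)
print(rank_lin)  # 1 -- a single factor explains everything

# Leave the data in a nonlinear form (e.g. raw transmittance instead of
# converting to absorbance) and extra factors are suddenly required to
# model the curvature, even though nothing new is in the data.
X_nl = np.exp(X)
rank_nl = np.linalg.matrix_rank(X_nl)
print(rank_nl)
```

The underlying chemistry hasn’t changed; only the form of the data has, and a factor-based model now has to spend several components describing the nonlinearity instead of one describing the system.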

4) Don’t tell the analyst anything about how the data was obtained, e.g. whether it is “happenstance” or designed data. If it’s designed, make sure that the data is taken in such a way that system drift and/or other environmental effects are confounded with the factors of interest.

5) Make sure the data sets are “short and fat,” i.e. with many times more variables than samples. This will make it much harder for the analyst to recognize issues related to 2), 3) and 4). And it will make it especially fun for the analyst if the problem has anything to do with variable selection.
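
Item 5 is easy to demonstrate. The sketch below (a hypothetical NumPy example, not from the original post) fits ordinary least squares to pure random noise; with far more variables than samples the training fit is exact, even though there is, by construction, no relationship at all to find:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 10, 200  # "short and fat": many more variables than samples

# X and y are independent random noise -- there is nothing to model.
X = rng.normal(size=(n_samples, n_vars))
y = rng.normal(size=n_samples)

# With more variables than samples the system is underdetermined, so least
# squares finds an exact fit to the training data anyway.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
train_rmse = np.sqrt(np.mean((X @ b - y) ** 2))

# Fresh data exposes the "model" as pure chance correlation.
X_new = rng.normal(size=(n_samples, n_vars))
y_new = rng.normal(size=n_samples)
test_rmse = np.sqrt(np.mean((X_new @ b - y_new) ** 2))

print(f"training RMSE: {train_rmse:.2e}")  # essentially zero
print(f"test RMSE:     {test_rmse:.2f}")
```

This is why short-and-fat data makes the earlier sins so hard to detect, and why variable selection on such data can so easily latch onto noise.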

6) Compare the results obtained to those from an expert who knows everything about the pedigree of the data. Additionally, it’s useful if the expert has a lot of preformed “opinions” about what the results of the analysis should be.

If you stick to this recipe, you can certainly improve the odds of making the analyst, and/or the software they are using, look bad. That said, it won’t guarantee it, as some analysts are very knowledgeable about the limitations of their methods and are very good with their tools. Obviously, if your goal is to make software and/or chemometric methods unsuccessful, choose a less knowledgeable analyst in addition to following 1) – 6) above.

Of course, if you actually want to succeed at data analysis, try not to follow the recipe. At EVRI we’re firm believers in understanding where the data comes from, and anything about it that will help us make intelligent modeling choices. Software can do a lot, and we’re making it better all the time, but you just can’t replace a well-educated and informed analyst.



I greatly enjoyed this. It is very funny. Sadly, this recipe seems to be followed especially by people who want to test software/algorithms.

When showing proof of capabilities on data supplied by prospective customers, we often find ourselves looking at data of incredibly poor quality. As for chemical and physical background information, this is often hidden behind a curtain of trade secrets. In fact, historical background of any kind is a rarity.

Obvious artifacts and systematic noise in the data sometimes jump out at you at a glance, making you wonder how any self-respecting analyst would even admit to having produced it… The only plausible explanation is that they are trying to test the limits of the software by presenting the absolute worst-case scenario. However, it is hard to escape the garbage in–garbage out rule. If the data is truly unusable, we usually end up asking them to rerun their samples following best practices.

The best data set to present for testing software is one typical of normal operations. If the software is being purchased to deal specifically with some thorny or non-routine problems, then the data should be representative of that. But in fairness, accompanying metadata that would reasonably be available to you should also be forwarded. Furthermore, allowance should be made for answering additional queries that the tester may have regarding the sample system.

Very good post, Barry! Sorry for the overlong comment: you hit a nerve.

