FAQ  Frequently Asked Questions
Issue:

Why do I get different results when I preprocess before doing my cross-validation vs. doing preprocessing inside crossval?
Possible Solutions:

In general, preprocessing should be done inside the cross-validation routine. If you preprocess outside of the cross-validation algorithm (before calling crossval), you will bias the cross-validation results and likely overfit your model. The reason is that the preprocessing will be based on the ENTIRE data set, but the validity of cross-validation REQUIRES that the preprocessing be based ONLY on specific subsets of the data. Why? Read on:
Cross-validation splits your data into "n" subsets (let's say 3 for simplicity). Let's say you have 12 samples and you're only doing mean-centering as your preprocessing (again, for simplicity). Cross-validation is going to take your 12 samples and split them into 3 groups (4 samples in each group).
In each cycle of the cross-validation, the algorithm leaves out one of those 3 groups (= 4 samples = "validation set") and does both preprocessing and model building from the remaining 8 samples (= "calibration set"). Recall that the preprocessing step here is to calculate the mean of the data and subtract it. The algorithm then applies the preprocessing and model to the 4-sample validation set and looks at the error, repeating this for each of the 3 groups. Here, applying the preprocessing means taking the mean calculated from the 8 calibration samples and subtracting it from the 4 validation samples.
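The cycle described above can be sketched in a few lines of Python. This is only an illustration and not the crossval routine itself: the helper names (contiguous_folds, cross_validate), the contiguous 4-sample groups, the made-up data, and the one-variable least-squares "model" are all assumptions chosen to keep the example short.

```python
from statistics import mean

def contiguous_folds(n_samples, n_splits):
    """Split indices 0..n_samples-1 into n_splits contiguous groups.
    Hypothetical helper; assumes n_samples is divisible by n_splits."""
    size = n_samples // n_splits
    return [list(range(i * size, (i + 1) * size)) for i in range(n_splits)]

def cross_validate(x, y, n_splits=3):
    """Return the RMS cross-validation error, mean-centering inside each cycle."""
    sq_errors = []
    for val_idx in contiguous_folds(len(x), n_splits):
        cal_idx = [i for i in range(len(x)) if i not in val_idx]
        # Preprocessing is calculated from the calibration samples ONLY.
        x_mean = mean(x[i] for i in cal_idx)
        y_mean = mean(y[i] for i in cal_idx)
        xc = [x[i] - x_mean for i in cal_idx]
        yc = [y[i] - y_mean for i in cal_idx]
        # Toy model: least-squares slope on the centered calibration data.
        slope = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
        # Apply the calibration means and the model to the left-out samples.
        for i in val_idx:
            prediction = slope * (x[i] - x_mean) + y_mean
            sq_errors.append((prediction - y[i]) ** 2)
    return (sum(sq_errors) / len(sq_errors)) ** 0.5

# 12 made-up samples with an exact linear relationship (illustrative data).
x = [float(i) for i in range(1, 13)]
y = [2.0 * v + 1.0 for v in x]
rmsecv = cross_validate(x, y)
```

Note that the mean is recalculated in every cycle, so each validation group is centered with a mean it did not contribute to.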
That last part is the key to why preprocessing BEFORE crossval is bad. When preprocessing is done INSIDE cross-validation (as it should be), the mean is calculated from the 8 samples that were left in and subtracted from them, and that same 8-sample mean is also subtracted from the 4 samples left out by cross-validation. However, if you mean-center BEFORE cross-validation, the mean is calculated from all 12 samples. The rules of cross-validation say that the preprocessing (mean) and model are supposed to be calculated from only the calibration set, but preprocessing outside of cross-validation means that all samples, including the ones later left out, influence the preprocessing (mean).
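To make the difference concrete, here is a small worked example with 12 made-up one-variable samples, assuming the first 4 form the validation set of one cycle. The values and variable names are purely illustrative.

```python
from statistics import mean

# 12 made-up samples (one variable each); the first 4 are left out in this cycle.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
validation = samples[:4]
calibration = samples[4:]

# Inside cross-validation: the mean comes from the 8 calibration samples only.
mean_inside = mean(calibration)   # 8.5

# Before cross-validation: the mean comes from all 12 samples.
mean_outside = mean(samples)      # 6.5

# The same validation samples end up centered differently in the two cases.
val_inside = [v - mean_inside for v in validation]    # [-7.5, -6.5, -5.5, -4.5]
val_outside = [v - mean_outside for v in validation]  # [-5.5, -4.5, -3.5, -2.5]

# How far the left-out samples pulled the mean toward themselves.
leak = mean_outside - mean_inside  # -2.0
```

The shift of 2.0 comes entirely from the 4 samples that were supposed to be unseen: the 12-sample mean already "knows" something about the validation set.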
With mean-centering, the effect isn't as bad as it is for something like GLSW or OSC. These "multivariate filters" are far stronger preprocessing methods, and operating on the entire data set can have a significant influence on the covariance (read: the "cheating" can have a much bigger effect and thus cause more overfitting).
The one time it doesn't matter is when the preprocessing methods being used are "row-wise" only; that is, methods that operate on each sample independently are not a problem. Methods like smoothing, derivatives, baselining, or normalization (other than MSC when based on the mean) operate on each sample independently, so adding or removing samples from the data set has no effect on the others. In fact, to save time, our cross-validation routine recognizes when row-wise operations come first in the preprocessing sequence and does them outside of the cross-validation loop. The only time you can't do these in advance is when a non-row-wise method comes before the row-wise method in the sequence.
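The row-wise distinction can be checked directly. In the sketch below, normalize_rows (scaling each sample to unit sum) stands in for a row-wise method and mean_center for a method that depends on the other samples. Both functions and the data are illustrative assumptions, not the toolbox's implementations.

```python
def normalize_rows(data):
    """Row-wise: scale each sample (row) to unit sum using only that row."""
    return [[v / sum(row) for v in row] for row in data]

def mean_center(data):
    """Not row-wise: subtract column means, which depend on every sample."""
    n = len(data)
    col_means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, col_means)] for row in data]

# Three made-up samples with two variables each.
data = [[1.0, 3.0], [2.0, 2.0], [6.0, 2.0]]
subset = data[:2]  # the same data with the last sample left out

# Row-wise result for the first two samples is identical with or without sample 3.
row_wise_unchanged = normalize_rows(data)[:2] == normalize_rows(subset)    # True

# Mean-centering changes for the first two samples once sample 3 is removed.
column_wise_unchanged = mean_center(data)[:2] == mean_center(subset)       # False
```

Because the row-wise result for a sample never changes when other samples are added or removed, it gives the same answer whether it runs inside or outside the cross-validation loop.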
Still having problems? Check our documentation Wiki or try writing our helpdesk at helpdesk@eigenvector.com