FAQ  Frequently Asked Questions
Issue:

How does PCA cross-validation work in PLS_Toolbox and Solo, and how do I set up the command-line options to best use it?
Possible Solutions:
Missing Data Approach to PCA Cross-Validation
PCA cross-validation in PLS_Toolbox and Solo is very different from that in other software packages. It uses a missing data approach: it tests how well the PCA model replaces columns (variables) of the left-out samples, so it leaves out both samples AND variables. This gives an estimate of how well the PCA model is fitting systematic information (which is needed to replace the missing variables) versus noise (which is not useful in replacing missing variables and is, in fact, detrimental). This is more diagnostic than the standard residuals test, in which the residuals decrease continuously and asymptotically toward zero as components are added.
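The idea can be sketched in base MATLAB alone. This is an illustration under stated assumptions, not the PLS_Toolbox implementation: the data are synthetic, centering uses the training samples only, and each left-out variable is estimated by least squares from the loadings of the remaining variables. All variable names here are made up for the example.

```matlab
% Sketch only: missing-data style PCA cross-validation (NOT the toolbox code).
m = 40; n = 8;                 % samples, variables
maxpc = 5; nsplits = 5;        % PCs to test, number of sample splits
t = (1:m)'; v = 1:n;
% Deterministic synthetic data: two systematic factors plus small "noise"
X = sin(t/3)*sin(v/2) + cos(t/5)*cos(v/3) + 0.05*sin(37*t*v);

sgrp  = ceil(t*nsplits/m);     % contiguous sample blocks, labeled 1..nsplits
press = zeros(1,maxpc);        % prediction error sum of squares per PC count

for s = 1:nsplits
    test = (sgrp == s);
    mu   = mean(X(~test,:),1);                 % center with training mean
    Xtr  = bsxfun(@minus, X(~test,:), mu);
    Xte  = bsxfun(@minus, X(test,:),  mu);
    [~,~,V] = svd(Xtr,'econ');                 % columns of V are loadings
    for k = 1:maxpc
        P = V(:,1:k);
        for j = 1:n                            % leave one variable out ('loo')
            keep = true(1,n); keep(j) = false;
            % Scores of test samples estimated from the kept variables only
            T    = (P(keep,:) \ Xte(:,keep)')';
            xhat = T * P(j,:)';                % replace the left-out variable
            press(k) = press(k) + sum((Xte(:,j) - xhat).^2);
        end
    end
end
disp(press)
```

Because noise components do not help predict a left-out variable from the others, a curve like press would be expected to stop improving once the systematic components are exhausted, unlike the plain fit residuals.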
For a description of how this method is done and how it compares to other PCA cross-validation methods, see:
Command-Line Settings for PCA Cross-Validation
The unusual approach to cross-validation means that, when calling the crossval function for PCA, you need to define both a pattern for leaving out samples (the "cvi" input to crossval) AND a pattern for leaving out variables (the "pcacvi" option in the options input). The I/O for crossval is:
    results = crossval(x,y,rm,cvi,ncomp,options);
By default, the pcacvi option is 'loo', meaning it leaves one variable out at a time. Thus, for each split of samples, crossval splits the data again into as many sets as there are variables.
In the Analysis window, the pcacvi setting is chosen from the number of included variables (n) using the following logic:
    if n>25
      cvopts.pcacvi = {'con' min(10,floor(sqrt(n)))};
    else
      cvopts.pcacvi = {'loo'};
    end
That is, if the number of included variables is 25 or fewer, it uses a "leave one out" pattern on the variables. Otherwise, it leaves out contiguous blocks of variables, with the number of splits equal to the square root of the number of variables (rounded down) or 10, whichever is less. For example, with 1500 variables, it would choose 10 splits (because sqrt(1500) is about 38, which is more than 10). This means that crossval would split the variables into 10 groups and leave out one group at a time.
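To see what this rule selects for a few variable counts, the same logic can be run in plain MATLAB without the toolbox. The list of counts (nvals) and the results cell array are illustrative names only.

```matlab
% Apply the Analysis-window rule to several example variable counts.
nvals   = [20 25 36 1500];          % hypothetical numbers of included variables
results = cell(size(nvals));
for i = 1:numel(nvals)
    n = nvals(i);
    if n > 25
        pcacvi = {'con' min(10,floor(sqrt(n)))};   % contiguous blocks
        fprintf('n = %4d -> ''con'' with %d splits\n', n, pcacvi{2});
    else
        pcacvi = {'loo'};                          % leave one variable out
        fprintf('n = %4d -> ''loo''\n', n);
    end
    results{i} = pcacvi;
end
% n = 20 and n = 25 give 'loo'; n = 36 gives 6 splits; n = 1500 gives 10 splits
```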
Using a contiguous block split of variables when many variables are present is both significantly faster and more accurate (with 1500 variables, you can often expect a lot of correlated noise between variables). Note that this assumes adjacent variables are correlated with each other, which is not always the case. When it is not, you can use other leave-out patterns on the variables, including custom sets where you choose the pattern.
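As a rough picture of what a contiguous block split of variables looks like, the sketch below assigns 1500 variables to 10 adjacent groups in plain MATLAB. The exact way crossval forms its blocks, especially when the count does not divide evenly, is not shown here and may differ; grp and counts are illustrative names.

```matlab
% Sketch only: label each variable with a contiguous block number.
n = 1500; nsplits = 10;
grp    = ceil((1:n)*nsplits/n);     % variables 1..150 -> 1, 151..300 -> 2, ...
counts = accumarray(grp', 1)';      % number of variables in each block
disp(counts)                        % ten blocks of 150 variables each

% Leaving out block 3 means treating these variables as missing:
leftout = find(grp == 3);           % variables 301 through 450
```

With this pattern, each sample split requires 10 variable-replacement passes instead of 1500, which is where the speed gain comes from.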
So, to maximize the accuracy and speed of PCA cross-validation, modify the pcacvi option when calling crossval. For example:
    opts = crossval('options');
    opts.pcacvi = {'con' 10};
    results = crossval(x,[],'pca',{'con' 5},15,opts);
The command-line function crossval does not "automatically" adjust pcacvi, because many command-line users want more control over such options, and switching from one method to another could cause unexpected results when doing very specific testing.
Still having problems? Check our documentation Wiki or try writing our helpdesk at helpdesk@eigenvector.com