New for this year, Batch Multivariate Statistical Process Control for PAT combines the technical aspects of developing chemometric models for monitoring batch processes with the practical aspects of implementing and deploying models, particularly in the pharmaceutical industries. Our DOE course, which debuted last year, has been updated and expanded to become Design of Experiments for QbD (Quality by Design). Also updated this year, Advanced Preprocessing for Spectral Applications has been refocused on spectroscopy.
The PLS_Toolbox/Solo User Poster Session returns with Apple iPod prizes for the two best posters. New and advanced features of our software will be highlighted in the PowerUser Tips & Tricks evening session. And of course our traditional group dinner will be held at Torchy’s in the WAC.
Our most popular classes usually fill up, so register early! Discount registration rates apply for registrations received with payment by April 11, 2012.
See you in Seattle!
BMW
]]>I would like to thank you for all your help with the Eigenvector products. With your help, I was able to successfully carry out detailed investigations using chemical imaging and chemometric evaluation in such a way that I could publish these results in relevant international journals. I would like to draw your attention to the following publications where (only) PLS_Toolbox was used for chemometric evaluation:
These may be considered as showcases of using PLS_Toolbox in Raman chemical imaging, and – which is maybe even more interesting in the light of your collaboration with Horiba Jobin Yvon – the joint use of PLS_Toolbox and LabSpec. The following studies have also been published where MCR-ALS and SMMA (Purity) were carried out with PLS_Toolbox and were tested along with other curve resolution techniques.
I just wanted to let you know that these publications exist, all using PLS_Toolbox in the evaluation of Raman images, and that I am very grateful for your help throughout. I hope you will find them interesting.
Best regards,
Balázs
–
Balázs Vajna
PhD student
Department of Organic Chemistry and Technology
Budapest University of Technology and Economics
8 Budafoki str., H-1111 Budapest, Hungary
Thanks, Balázs, your letter just made our day! We’re glad you found our tools useful!
BMW
]]>Although it was shown previously that PCA can be used to perfectly impute missing values in rank-deficient, noise-free data, it’s not hard to guess that PCA might be suboptimal with regard to imputing missing elements in real, noisy data. The goal of PCA, after all, is to estimate the data subspace, not to predict particular elements. Prediction is typically the goal of regression methods, such as Partial Least Squares. In fact, regression models can be used to construct estimates of any and all variables in a data set based on the remaining variables. In our 1989 AIChE paper we proposed comparing those estimates to actual values for the purpose of fault detection. Later this became known as regression adjusted variables, as in Hawkins, 1991.
There is a little-known function in PLS_Toolbox, plsrsgn (included since the first version in 1989 or 1990), that can be used to develop collections of PLS models, where each variable in a data set is predicted by the remaining variables. The regression vectors are mapped into a matrix that generates the residuals between the actual and predicted values in much the same way as the I-PP' matrix from PCA.
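The idea of a residual-generating matrix built from one regression per variable can be sketched in base MATLAB. This is only an illustration: it uses ordinary least squares in place of PLS so that it runs without the toolbox, the simulated data are made up, and it does not describe how plsrsgn is actually implemented. The name coeff is borrowed from the text; everything else is hypothetical.

```matlab
% Sketch: residual-generating matrix from a collection of regressions.
% ASSUMPTION: ordinary least squares stands in for the PLS submodels.
rng(0);
x = randn(200,3)*randn(3,6) + 0.1*randn(200,6);   % 6 correlated, noisy variables
x = x - repmat(mean(x,1), size(x,1), 1);          % mean-center each column

n = size(x,2);
coeff = eye(n);                    % ones on the diagonal keep the actual value
for j = 1:n
    others = setdiff(1:n, j);      % every variable except the jth
    b = x(:,others) \ x(:,j);      % regress variable j on the remaining ones
    coeff(others,j) = -b;          % minus the prediction of variable j
end

E = x*coeff;                       % column j = actual minus predicted, variable j
asym = norm(coeff - coeff', 'fro');  % nonzero: coeff is not symmetric
```

Each column of coeff holds one regression model, so x*coeff returns all of the residuals at once. Unlike I-PP', the matrix is generally not symmetric, which matters when it is used to replace missing variables.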

We can compare the results of using these collections of PLS models to using the PCA done previously. Here we created the coeff matrix using (a conservative) 3 LVs in each of the PLS submodels. Each submodel could of course be optimized individually, but for illustration purposes this will be adequate. The reconstruction error of the PLS models is compared with PCA in the figure shown at left, where the error for the collection of PLS models is shown in red, superimposed over the reconstruction error of the PCA model, in blue. The PLS models’ error is lower for each variable, in some cases substantially so, e.g. variables 3-5.
The second figure, at left, shows the estimate of variable 5 for both the PLS (green) and PCA (red) methods compared to the measured values (blue). It is clear that the PLS model tracks the actual value much better.
Because the estimation error is smaller, collections of PLS models can be much more sensitive to process faults than PCA models, particularly individual sensor faults.
It is also possible to replace missing variables based on these collections of PLS models in (nearly) exactly the same manner as in PCA. The difference is that, unlike in PCA, the matrix which generates the residuals is not symmetric, so the R12 term (see part one) does not equal R21'. The solution is to calculate b using their average, thus
$$b = 0.5\,(R_{21} + R_{12}')\,R_{11}^{-1}$$
Curiously, unlike the PCA case, the residuals on the replaced variables will not be zero except in the unlikely case that R12 = R21'.
In the case of an existing single PLS model, it is of course possible to use this methodology to estimate the values of missing variables based on the PLS loadings. (Or, if you insist, on the PLS weights. Given that residuals based on weights are larger than residuals based on loadings, I’d expect better luck reconstructing from the loadings but I offer that here without proof.)
In the next installment of this series, we will consider the more challenging problem of building models on incomplete data records.
BMW
B.M. Wise, N.L. Ricker, and D.J. Veltkamp, “Upset and Sensor Failure Detection in Multivariate Processes,” AIChE Annual Meeting, 1989.
D.M. Hawkins, “Multivariate Quality Control Based on Regression Adjusted Variables,” Technometrics, Vol. 33, No. 1, 1991.
]]>The session was organized by Katherine Bakeev of CAMO. Pictured below are Katherine, Tormod, Cary, Chuck, EAS President David Russell, and myself.

The session provided ample evidence of the intertwined evolution of chemometrics and NIR, with two primarily chemometric talks and two NIR talks with aspects of chemometrics.
I was also our representative at the session honoring Beata Walczak of the University of Silesia in Poland. Beata was the recipient of the EAS Award for Outstanding Achievements in Chemometrics, sponsored once again by Eigenvector Research. Beata and I are pictured below with the award.

The award session, organized by Peter D. Wentzell of Dalhousie University, had an “omics” theme with talks on metabolomics and proteomics. Speakers included Peter, Tobias Karakach of the Institute for Marine Biosciences, Sarah Rutan of Virginia Commonwealth University and Michal Daszykowski, also of Silesia. Beata presented “Chemometrics in Proteomics,” an overview of her work in the field highlighting methods for aligning samples from 2-D gel electrophoresis.
Congratulations to both Chuck and Beata on two very well-deserved awards!
BMW
]]>$$x_b = -x_g R_{21} R_{11}^{-1}$$
In our short courses, I advise students to be wary any time a matrix inverse is used, and this case is no exception. Inverses are defined only for matrices of full rank, and may be unstable for nearly rank-deficient matrices. So under what conditions might we expect R11 to be rank deficient? Recall that R11 is the part of I-PP' that applies to the variables which we want to replace. Problems arise when the variables to be replaced form a group whose members are perfectly correlated with each other but not with any of the remaining variables. When this happens the variables will either be 1) included as a group in the PCA model (if enough PCs are retained) or 2) excluded as a group (too few PCs retained). In case 1, R11 is rank deficient and the inverse isn’t defined. In case 2, R11 is just I, but the loadings of the correlated group are zero, so the R12 part of the solution is 0. In either case, it makes sense that a solution isn’t possible: what information would it be based on?
With real data, of course, it is highly unlikely that R11 will be rank deficient to within numerical precision (or that R12 will be zero). But it certainly may happen that R11 is near rank deficient, in which case the estimates of the missing variables will not be very good. Fortunately, in most systems the measured variables are somewhat correlated with each other and the method can be employed.

In their 1995 paper, Nomikos and MacGregor estimated the value of missing variables using a truncated Classical Least Squares (CLS) formulation. The PCA loadings are fit to the available data, leaving out the missing portions, to estimate scores which are then used to estimate missing values. This reduces to:
$$x_b = x_g P_g (P_g' P_g)^{-1} P_b'$$
where Pb and Pg refer to the part of the PCA model loadings for the missing (bad) and available (good) data, respectively. In 1996 Nelson, Taylor and MacGregor noted that this method was equivalent to the method in our 1991 paper but offered no proof. The proof can be found in “Refitting PCA, MPCA and PARAFAC Models to Incomplete Data Records” from FACSS, 2007.
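The equivalence of the two estimates is easy to check numerically. The sketch below is an illustration on simulated data, not the proof from the FACSS paper. It assumes Pb and Pg are the rows of the (n x k) loadings matrix for the missing and available variables, so the truncated CLS scores are xg*Pg*inv(Pg'*Pg). All variable names and the simulated data are hypothetical.

```matlab
% Sketch: compare the minimum-Q estimate with the truncated CLS estimate.
rng(0);
n = 20; k = 4;
x = randn(100,6)*randn(6,n) + 0.1*randn(100,n);   % noisy calibration data
[~,~,v] = svd(x, 'econ');
P = v(:,1:k);                        % PCA loadings, n x k
R = eye(n) - P*P';                   % residual-generating matrix

bad  = 1:3;                          % missing variables
good = 4:n;                          % available variables
Pb = P(bad,:);                       % loadings rows for the missing part
Pg = P(good,:);                      % loadings rows for the available part

xnew = randn(1,n);                   % any new sample will do
xg = xnew(good);

xb_minQ = -xg*R(good,bad)/R(bad,bad);     % xb = -xg*R21*inv(R11)
xb_cls  = (xg*Pg/(Pg'*Pg))*Pb';           % scores from xg, then predict xb
gap = max(abs(xb_minQ - xb_cls));         % agreement to rounding error
```

The agreement does not depend on the new sample lying in the model subspace, since it follows from P'*P = I alone.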

So how does this work in practice? The topmost figure shows the estimation error for each of the 20 variables in the melter data based on a 4 PC model with mean-centering. The model was estimated with every other sample and tested on the other samples. The estimation error is shown in units of Relative Standard Deviation (RSD) to the raw data. Thus, the variables with error near 1.0 aren’t being predicted any better than just using the mean value, while the variables with error below 0.2 are tracking quite well. An example is shown in the middle figure, which shows temperature sensor number 8 actual (blue line) and predicted (red x) for the test set as a function of sample number (time).

The reason for the large differences in ability to replace variables in this data set is, of course, directly related to how independent the variables are. A graphic illustration of this can be produced with the PLS_Toolbox corrmap function, which produced the third figure. The correlation matrix for the temperatures is colored red where there is high positive correlation, blue for negative correlation, and white for no correlation. It can be seen that variables with low estimation error (e.g. 7, 8, 17, 18) are strongly correlated with other variables, whereas variables with high estimation error (e.g. 2, 12) are not correlated strongly with any other variables.
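A rough version of this check can be made without the toolbox. The sketch below uses the base MATLAB corrcoef function on simulated data (the melter data are not reproduced here) and is a stand-in for corrmap, not a description of what that function does. The variable names are hypothetical.

```matlab
% Sketch: find which variables have a strongly correlated partner.
rng(2);
m = 300;
t = randn(m,2);                            % two hidden driving signals
x = [t(:,1) + 0.1*randn(m,1), ...          % variables 1 and 2 share signal 1
     t(:,1) + 0.1*randn(m,1), ...
     t(:,2) + 0.1*randn(m,1), ...          % variables 3 and 4 share signal 2
     t(:,2) + 0.1*randn(m,1), ...
     randn(m,1)];                          % variable 5 is independent

C = corrcoef(x);                           % 5 x 5 correlation matrix
offdiag = abs(C - diag(diag(C)));          % zero out the diagonal
maxcorr = max(offdiag);                    % strongest partner for each variable

% To view the map: imagesc(C, [-1 1]), colorbar
```

A variable with a small entry in maxcorr, like variable 5 here, has no strongly correlated partner, so it is likely to be estimated poorly when it goes missing.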
To summarize, we’ve shown that missing variables can be imputed based on an existing PCA model and the available measurements. The success of this approach depends upon the degree to which the missing variables are correlated with available variables, as might be expected. In the next installment of this Missing Data series, we’ll explore using regression models, particularly Partial Least Squares (PLS), to replace missing data.
BMW
P. Nomikos and J.F. MacGregor, “Multivariate SPC Charts for Monitoring Batch Processes,” Technometrics, 37(1), pp. 41-58, 1995.
P.R.C. Nelson, P.A. Taylor and J.F. MacGregor, “Missing data method in PCA and PLS: Score calculations with incomplete observations,” Chemometrics & Intell. Lab. Sys., 35(1), pp. 45-65, 1996.
B.M. Wise, “Re-fitting PCA, MPCA and PARAFAC Models to Incomplete Data Records,” FACSS, Memphis, TN, October, 2007.
]]>
When Eigenvector was working on a new website design in ~1998 we hired Chris Raines of Sun Graphic and now Cevado. Chris thought that we should first design a new logo and started by asking questions about what we do and how we got the name Eigenvector. I explained that we basically analyzed large tables of data, i.e. big matrices, and that Eigenvectors were central to the types of analysis we do. Besides, I’d always liked the idea that an eigenvector was a “proper” direction in a data analysis problem, and I like to think that we are moving our clients in the “proper” direction. I then wrote down the equation Ax = λx. Pointing to the Greek letter lambda, Chris asked, “What’s the swoopy thing?” I replied, “Generally, people use lambda to represent the eigenvalue in the eigenvector equation.” Chris said, “We have to use the swoopy thing!”
From that, Chris produced the logo that we use today, shown above. The four by four set of boxes represent a matrix, and the “swoopy thing” the matrix eigenvalue(s). Eigenvalues, more than any other parameters, describe the structure of matrices, and are important in our work. When we need a roughly square logo, we put “Eigenvector” on the bottom and “Research” up the side, like a matrix outer product. We use outer products all the time to analyze and approximate matrices as in Principal Components Analysis (PCA).
So what’s in a logo? If it’s a good one, quite a lot!
BMW
]]>I got interested in missing data while in graduate school in the late 1980s. I worked a lot with a prototype glass melter for the solidification of nuclear fuel reprocessing waste. The primary measurements were temperatures provided by thermocouple sensors. The very high temperatures in this system, nearing 1200C (~2200F), caused the thermocouples to fail frequently. Thus it was common for the data record to be incomplete.
Missing data is also common in batch process monitoring. There are several approaches for building models on complete, finished batches. However, it is most useful to know if batches are going wrong BEFORE they are complete. Thus, it is desirable to be able to apply the model to an incomplete data record.
Missing data problems can be divided into two classes: 1) those involving missing data when applying an existing model to new data records, and 2) those involving building a model on an incomplete data record. Of these, the first problem is by far the easiest to deal with, so we will start with it. It will, however, illustrate some approaches which can be modified for use in the second case. These approaches can also be used for other purposes, such as cross-validation of Principal Component Analysis (PCA) models.
Consider now the case where you have a process that periodically produces a new data vector xi (1 x n). With it you have a validated PCA model, with loadings Pk (n x k). The residual sum-of-squares, or Q statistic, can be calculated for the ith sample as Q = xi R xi' where R = I - Pk Pk'. For the sake of convenience, imagine that the first p variables in this model are no longer available, but the remaining n-p variables are available as usual. Thus, x can be partitioned into a group of bad variables xb and a group of good variables xg, x = [xb xg]. The calculation of Q can then be broken down into parts which do and do not involve missing variables:
$$Q = x_b R_{11} x_b' + x_g R_{21} x_b' + x_b R_{12} x_g' + x_g R_{22} x_g'$$
where R11 is the upper left (p x p) part of R, R21 = R12' is the lower left (n-p x p) section, and R22 is the lower right (n-p x n-p) section.
It is possible to solve for the values of the bad variables xb that minimize Q, as shown in our 1991 paper referenced below. The (incredibly simple) solution is
$$x_b = -x_g R_{21} R_{11}^{-1}$$
Unsurprisingly, the residuals on the replaced variables on the full model will be zero.
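For readers who want the intermediate steps, here is a short sketch of the minimization, written under the assumption that R is symmetric (true for R = I - PkPk') so that the two cross terms in Q are equal. It is a reconstruction for illustration, not a quotation from the 1991 paper.

```latex
\begin{aligned}
Q &= x_b R_{11} x_b' + 2\, x_g R_{21} x_b' + x_g R_{22} x_g'
   && \text{since } x_b R_{12} x_g' = x_g R_{21} x_b' \\
\frac{\partial Q}{\partial x_b} &= 2\, x_b R_{11} + 2\, x_g R_{21} = 0 \\
x_b &= -x_g R_{21} R_{11}^{-1} \\
e_b &= x_b R_{11} + x_g R_{21} = 0
   && \text{residuals on the replaced variables}
\end{aligned}
```

The last line is the stationarity condition itself, which is why the residuals on the replaced variables vanish.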
This method is the basis of the PLS_Toolbox function replace, which maps the solution above into a matrix so variables in arbitrary positions can be replaced.

It is easy to demonstrate that this method works perfectly in the rank deficient, no noise case. In MATLAB, you can create a rank 5 data set with 20 variables, then use the Singular Value Decomposition (SVD) to get a set of PCA loadings P, and from that, the R matrix.
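>> c = randn(100,5);        % scores: 100 samples, rank 5
>> p = randn(20,5);         % loadings for 20 variables
>> x = c*p';                % rank 5, noise-free data
>> [u,s,v] = svd(x);
>> P = v(:,1:5);            % PCA loadings
>> R = eye(20)-P*P';        % residual-generating matrix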
Now let’s say the sensor associated with variable 5 has failed. We can use the replace function to generate a matrix Rm which replaces it based on the values of the other variables.
>> Rm = replace(R,5,'matrix');
>> imagesc(sign(Rm)), colormap(rwb)
Rm has the somewhat curious structure shown in the figure above. The white area is zeros, the diagonal is ones, and $R_{21} R_{11}^{-1}$ for the appropriately rearranged R is mapped into the vertical section.

We can try Rm out on a new data set that spans the same space as the previous one, and plot up the results as follows:
>> newx = randn(100,5)*p';
>> var5 = newx(:,5);
>> newx(:,5) = 0;
>> newx_r = newx*Rm;
>> figure(2)
>> plot(var5,newx_r(:,5),'+b'), dp
The (not very interesting) figure at left shows that the replaced value of variable 5 agrees with the original value. This can be done for multiple variables.
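The same experiment can be repeated for several variables at once in base MATLAB. The sketch below builds a replacement matrix by hand. It does not use or describe the internals of the PLS_Toolbox replace function, and zeroing the rows that belong to the missing variables is a choice made for this sketch so that whatever sits in those columns is ignored. The indices and names are hypothetical.

```matlab
% Sketch: replace two missing variables in rank 5, noise-free data.
rng(1);
p = randn(20,5);
x = randn(100,5)*p';                 % calibration data, rank 5
[~,~,v] = svd(x, 'econ');
P = v(:,1:5);                        % PCA loadings
R = eye(20) - P*P';                  % residual-generating matrix

bad  = [5 12];                       % sensors assumed to have failed
good = setdiff(1:20, bad);

Rm = eye(20);                        % good variables pass through unchanged
Rm(bad,:) = 0;                       % ignore whatever is stored for bad ones
Rm(good,bad) = -R(good,bad)/R(bad,bad);   % -R21*inv(R11) fills the bad columns

newx  = randn(100,5)*p';             % new data spanning the same space
truth = newx(:,bad);
newx(:,bad) = 0;                     % values lost with the failed sensors
newx_r = newx*Rm;

maxerr = max(max(abs(newx_r(:,bad) - truth)));   % rounding-error level
```

Because the data are exactly rank 5 and noise-free, maxerr comes out at the level of rounding error. With real, noisy data the error depends on how well the missing variables are correlated with the rest.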
In the second installment of this Missing Data series I’ll give some examples of how this works in practice, discuss limitations, and show some alternate ways of estimating missing values. In the third installment we’ll get to the much more challenging issue of building models on incomplete data sets.
BMW
B.M. Wise and N.L. Ricker, “Recent advances in Multivariate Statistical Process Control, Improving Robustness and Sensitivity,” IFAC Symposium on Advanced Control of Chemical Processes, pp. 125-130, Toulouse, France, October 1991.
]]>With PLS_Toolbox and Solo Version 6.5 released last month, this is an opportune time to attend this course. Participants will learn how to take advantage of many of the recently added tools. It will also be a great time to ask “how to” type questions. Nobody knows our software more intimately than Jeremy, as he is responsible for its overall development. He’s constantly surprising the rest of us EigenGuys by showing us easier ways to accomplish our modeling tasks using features we didn’t know existed! Neal will be on hand to guide users through many of the methods, particularly the advanced preprocessing features. Neal has extensive experience in this area due to his work with remote sensing applications.
The course includes an optional second half day which covers our tools for Multivariate Image Analysis and Design of Experiments. There will also be time for one-on-one consulting with the software. Attendees are encouraged to bring their own data for this! Often all the methods and tools make a lot more sense when applied to data with which you are familiar.
If you have any questions about this course or our other course offerings, such as EigenU, please write to me.
BMW
]]>Many thanks to FOSS and especially Lars Nørgaard for inviting us here. FOSS took good care of us, providing ample coffee, great snacks for breaks, and great lunches in their cafeteria. We’ve also had a couple nice evenings out, including a very nice dinner at Ristorante La Perle. We can see why everybody in Hillerød goes there for their birthday.
Thanks to everyone who attended!
BMW
]]>The course will be held at FOSS World Headquarters. FOSS is very big in applications of spectroscopy to problems in food and beverages, grain, feed, meat etc. Chemometrics is a critical part of this and FOSS has a substantial chemometrics group. That group is headed up by Lars Nørgaard, the former Head of the Department of Food Science at University of Copenhagen.
I think of Lars more as a chemometrician than a manager, however. PLS_Toolbox owes a number of things to Lars, including Interval-PLS (iPLS) and ‘color-by.’ iPLS is a method for selecting variables and also for elucidating which part of the spectrum calibration models get their predictive information from. The ‘color-by’ feature uses the color of data points in a plot to indicate the value of another variable. It really helps spot trends. I first saw this feature in LatentiX, with which Lars was involved.
We have a nearly full class lined up and with Lars, Rasmus and myself it should make for a lively group. Plus, we’ll be teaching with the just released PLS_Toolbox 6.5. I’ll need to spend some time learning about 6.5’s new features myself. I’m looking forward to it!
BMW
]]>