18th Apr, 2007

DataSet Object Conflict

In 2001 the software developers at EvRI (at the time that was Neal, Rasmus and myself) started thinking about how we could improve the organization of the data sets users analyze with PLS_Toolbox. The problem was that a typical data set contains a fair number of pieces (the data matrix, wavelength and/or time axes, labels on variables and/or samples, etc.), and they were all floating around independently in the MATLAB workspace. After working for a while, you often wound up with mismatched pieces. For instance, after deleting a sample from the data matrix, you might forget to delete the corresponding label, or delete the wrong one.

We considered a number of options. One of them was to create a MATLAB structure array for data sets with a convention on field names for the typical parts. The problem here, though, is that there is no way to really control what gets stuck in the fields, or even that the conventions on names be strictly followed. After MUCH discussion we decided that the best way to assure data set integrity was to create a custom class object. With a custom class object you can control how MATLAB functions act on it, including both your own functions and MATLAB built-in functions. (This is known as “overloading”.) Thus, you can program the tools so that you can’t, for example, associate a wavelength axis with 400 entries with a set of spectra that have 401 channels.
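To make the idea concrete, here is a minimal sketch of a container that refuses mismatched pieces. It is written in Python rather than MATLAB only to keep it short, and it is not the DataSet Object's actual code or interface: the class name `DataSetSketch`, its fields, and `delete_sample` are all hypothetical names invented for this illustration.

```python
class DataSetSketch:
    """Hypothetical container that keeps data, axis and labels consistent.

    Illustration only; this is not the API of EvRI's DataSet Object.
    """

    def __init__(self, data, axis, labels):
        # data: one row per sample, one column per channel
        self.data = [list(row) for row in data]
        self.labels = list(labels)
        self._axis = []
        if len({len(row) for row in self.data}) > 1:
            raise ValueError("all samples must have the same number of channels")
        if len(self.labels) != len(self.data):
            raise ValueError("need exactly one label per sample")
        self.axis = axis  # assignment goes through the validating setter

    @property
    def axis(self):
        # Return a copy so callers cannot change the stored axis in place
        return list(self._axis)

    @axis.setter
    def axis(self, values):
        values = list(values)
        n_channels = len(self.data[0]) if self.data else 0
        if len(values) != n_channels:
            raise ValueError(
                f"axis has {len(values)} entries but data has {n_channels} channels"
            )
        self._axis = values

    def delete_sample(self, index):
        # Remove the data row and its label together so they cannot drift apart
        del self.data[index]
        del self.labels[index]

    def __len__(self):
        return len(self.data)


# Three spectra with 401 channels each
spectra = DataSetSketch(
    data=[[0.1] * 401, [0.2] * 401, [0.3] * 401],
    axis=range(401),
    labels=["a", "b", "c"],
)

try:
    spectra.axis = range(400)  # wrong length: rejected, stored axis unchanged
except ValueError as err:
    print(err)

spectra.delete_sample(1)  # row and label "b" are removed in one step
```

The point of the design is that every route to changing a piece goes through code that checks it against the other pieces, which is something a plain structure with agreed-upon field names cannot guarantee.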

The obvious name for this custom class object was “DATASET”. We commonly refer to it around here as the DataSet Object or DSO.

It seemed to us that maintaining the integrity of data sets was a fairly general problem, not just something specific to chemometrics. Maybe this should be a general tool in MATLAB. Because of this, we decided to share our ideas with representatives of The MathWorks. We met with them in June of 2001. The general subject of the meeting (which was attended by several software companies, instrument companies and end users) was using MATLAB and models developed in MATLAB in on-line applications. It would have made sense for TMW to maintain the DataSet object and take input from users so that it could become a general, well-supported tool. We shared our code with TMW, along with our ideas about how the evolution of the DataSet should be managed. After a few follow-up emails it became apparent that TMW wasn’t really interested, so we proceeded on our own.

In January of 2002 we made our DSO publicly available, and announced it in an issue of EigenNews. The DSO was then and still is a free download from Eigenvector. We’ve continued to develop it, carefully adding features and functionality. It now supports multi-way data, multivariate images, batch data of unequal lengths, “soft” deletion of data, meta-data, a history of changes, etc. Essentially all of the upper level functions in PLS_Toolbox use the DSO as input, output or internally. This has greatly improved the management of data in MATLAB, and we’ve built additional tools for working with DSOs such as the DataSet Editor, which is part of PLS_Toolbox.

This was all well and good until TMW released the “prerelease” version of R2007a. We started getting reports from some of our users that there was a problem. One of them in particular did a very good job of tracking down the problem and reporting it to us and to TMW. We first heard about it on March 1, 2007. The problem was that the Statistics Toolbox from TMW now included a custom class object called “DATASET”. Depending upon which one is first on the path, you get different behavior. If the Stats Toolbox is first on the path, almost all the upper level functions in PLS_Toolbox, including the interfaces, won’t work. With a quick patch, we were able to make PLS_Toolbox work normally provided that it is first on the path. But this causes problems with the functions in the Stats Toolbox that use TMW’s DATASET.

We started contacting TMW almost immediately in the hopes of resolving this issue, which had quickly become our biggest support problem. This included the letter we sent them on March 15, which reiterated many of the points we’d made previously. After about 3 weeks of persistent pestering, I finally received a response from the lead developer of the Stats Toolbox in which he “apologized for not getting back to us sooner.” However, in the interim, MATLAB R2007a had gone from prerelease to release. Thus, our early March suggestion that TMW rename their object was met with “As you know we’ve already shipped, so renaming is not an option for us.”

We responded to that by, once again, suggesting that TMW adopt our standard and that we share stewardship of the code. The Stats Toolbox DATASET is very limited compared to EvRI’s DSO, and in fact, we quickly added the few features that theirs had but ours didn’t in order to improve compatibility. It’s now been three weeks since we last suggested this and TMW hasn’t responded.

The good news for PLS_Toolbox users is, provided you have PLS_Toolbox first on the path, everything in it works normally. And because we’ve responded rapidly, the Stats Toolbox is almost unaffected. But we think TMW is missing a great opportunity here to develop a standard that would be of use across a wide variety of application areas. If you think so too, drop TMW a line and let them know.


