MINING is an arduous, time-consuming business. Sometimes, tons of material must be excavated to uncover ounces of precious metals or gems. The computational equivalent of old-fashioned, down-in-the-dirt mining is data mining. Whether the search is for metals or information, the task is similar. In data mining, trillions of bytes of data must be sifted to find a handful of precious numbers or images.
As computers grow in speed, number-crunching capabilities, and memory, scientific researchers are edging into data overload as they try to find meaningful ways to interpret data sets holding more information than the U.S. Library of Congress. "The problem has its roots in the many advances in technology that allow scientists to gather data from experiments, simulations, and observations in ever-increasing quantities," says Livermore computer scientist Chandrika Kamath. "In many scientific areas, the data sets are so enormous and complex that it is no longer practical for individual researchers to explore and analyze them by hand. When the sets get so large, useful information is easily overlooked, and the data cannot be fully utilized."
To address this problem, Kamath and a small team of Livermore researchers are developing Sapphire, a semiautomated, flexible data-mining software infrastructure. Sapphire shows great promise in helping scientific researchers plow through enormous data sets to turn up information that will help them better understand the world around us, from the makeup of the universe to atomic interactions. Sapphire is funded by the Laboratory Directed Research and Development program and the Department of Energy's Accelerated Strategic Computing Initiative (ASCI).
Data mining is not a new field. In the commercial world, it is used to detect credit card fraud and computer network intrusions; reveal consumer buying patterns; recognize faces, eyes, or fingerprints; and analyze optical characters. At Lawrence Livermore, the terascale computing environment created by ASCI as well as the prolific use of several different types of sensors have created great interest in large-scale, scientific data-mining efforts such as Sapphire. Kamath and her team envision that Sapphire will be applicable to a variety of scientific endeavors, including assuring the safety and reliability of the nation's nuclear weapons, nonproliferation and arms control, climate modeling, astrophysics, and the human genome effort.
Sampling: Selecting a subset of data items. Sampling is a widely accepted technique to reduce the size of the data set and make it easier to handle. However, in some cases, such as when looking for something that appears infrequently in the set, sampling may not be viable.
Multiresolution analysis: Another technique to reduce the size of the data set. With multiresolution analysis, data at a fine resolution can be "coarsened," which shrinks the data set by removing some of the detail. If the removed detail is saved, the transformation can be reversed.
Denoising: A technique that removes "noise" in images or data. It can be used to sharpen a fuzzy picture or aid in character recognition (differentiating a "6" from a "b" in text, for instance).
Feature extraction: A technique to extract relevant features from the raw data set. In credit card fraud detection, for instance, an important feature might be the location where a card is used. Thus, if a credit card is suddenly used in a country where it has never been used before, fraudulent use seems likely.
Dimension reduction: Reducing the number of features used to mine data, so that only the features best at discriminating among the data items are retained.
Normalization: A technique used to level the playing field when comparing features that vary widely in size as a result of the units selected to represent them.
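Two of the preprocessing ideas in the box can be illustrated with a short sketch. This is a simplified illustration, not the Sapphire code itself: z-score normalization levels features of different scales, and a Haar-style pairwise average coarsens a signal while keeping the "detail" values needed to reverse the step.

```python
import statistics

def normalize(values):
    """Scale a feature to zero mean and unit variance (z-score)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

def coarsen(signal):
    """One level of Haar-style coarsening: average adjacent pairs.
    Returning the paired differences (the "detail") as well makes
    the transformation reversible."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

# Two hypothetical features on very different scales,
# e.g. an area in pixels versus a peak intensity
area = [120.0, 450.0, 300.0, 90.0]
intensity = [0.02, 0.11, 0.07, 0.01]
print(normalize(area))
print(normalize(intensity))

approx, detail = coarsen([4.0, 2.0, 6.0, 8.0])
print(approx)   # half-size "coarse" signal: [3.0, 7.0]
print(detail)   # kept so the original can be reconstructed
```

Note that each coarsening step halves the signal; applied repeatedly, it yields the hierarchy of resolutions that multiresolution analysis exploits.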
Data Mining Step by Step
Data mining starts with the raw data, which usually takes the form of simulation data, observed signals, or images. These data are preprocessed using various techniques such as sampling, multiresolution analysis, denoising, feature extraction, and normalization. (See the box above.)
Once the data are preprocessed, or "transformed," pattern-recognition software is used to look for patterns. Patterns are defined as an ordering that contains some underlying structure. The results are processed back into a form, usually images or numbers, familiar to the scientific experts, who can then examine and interpret them.
To be truly useful, data-mining techniques must be scalable. "In other words," says Kamath, "when the problem increases in size, we don't want the mining time to increase proportionally. Making the end-to-end process scalable can be very challenging, because it's not just a matter of scaling each step but of scaling the process as a whole. For instance, the raw data set may be 100 terabytes, and as the data move through the data-mining process, the process decreases the data set size in ways we cannot predict. By the end of the process, we may have a resulting data set that's only a few megabytes in size."
To test and refine their algorithms, Sapphire researchers teamed up with Laboratory astrophysicists who were examining data from the FIRST (Faint Images of the Radio Sky at Twenty Centimeters) sky survey. This survey, which was conducted at the Very Large Array in New Mexico, seeks to locate a special type of quasar (radio-emitting stellar object) called a bent double. The FIRST survey has generated more than 22,000 images of the sky to date. Each image is 7.1 megabytes, yielding more than 100 gigabytes of image data in the entire data set. Searching for bent doubles in this mountain of images is as daunting as searching for the proverbial needle in a haystack.
Mining Bent Doubles
The first step in applying data mining to this astrophysical search was to identify what features are unique to radio-emitting bent doubles. "Extracting the key features is essential before applying pattern-recognition software," explains Kamath. "Although data exist at the pixel level (or at the grid level in mesh data), patterns usually appear at higher or coarser levels. The features, which can be any measurement, must be relevant to the problem, insensitive to small changes in the data, and invariant to scaling, rotation, and translation. Identifying the best features can be a time-intensive step, but it's a very important one."
Sapphire researchers worked with astrophysicists to draw up a list of features useful in identifying bent doubles. Such features included the number of "blobs," the spatial relationships of the blobs, and the peak intensity of the radio waves detected from each blob. "A parallel concern is to reduce the number of features to a relatively small set that will still provide accurate results," says Kamath. She notes that every additional feature used in pattern recognition on a terabyte data set adds enormously to the computational time and effort.
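Two of the features mentioned above, the number of "blobs" and each blob's peak intensity, can be sketched in a few lines of code. The names `find_blobs` and `peak_intensity` are hypothetical, and the tiny grid stands in for a real radio image; this only illustrates the kind of computation involved, not Sapphire's actual feature extractors.

```python
from collections import deque

def find_blobs(image, threshold):
    """Label connected regions ("blobs") of pixels above a threshold.
    Returns a list of blobs, each a list of (row, col) coordinates."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] > threshold and not seen[r][c]:
                blob, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:                      # flood-fill one blob
                    y, x = queue.popleft()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] > threshold
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs

def peak_intensity(image, blob):
    """Feature: the brightest pixel within a blob."""
    return max(image[y][x] for y, x in blob)

# Toy 4-by-5 "radio image" containing two bright blobs
image = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 0],
    [0, 0, 0, 7, 0],
    [0, 0, 0, 7, 8],
]
blobs = find_blobs(image, threshold=1)
print(len(blobs))                                  # 2 blobs
print([peak_intensity(image, b) for b in blobs])   # [9, 8]
```

From such blobs one could go on to compute the spatial relationships the article mentions, such as the angle between blob centers that makes a double "bent."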
Once preprocessing is complete, the transformed data are input to pattern-recognition software. Two types of general pattern-recognition techniques used in data mining are classification and clustering. In classification, the algorithms "learn" a function that allows a researcher to map a data item into one of several predefined classes. In clustering, the algorithms work to identify a finite set of categories or clusters to describe the data. There are several different algorithms for classification and clustering, and frequently, both types of pattern recognition can be used within an application.
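The two techniques can be contrasted with a minimal sketch, again illustrative rather than Sapphire's own algorithms: a nearest-neighbor classifier maps an item into a predefined class, while a simple k-means loop discovers clusters with no predefined labels. The feature names and training examples are invented for illustration.

```python
import math

def classify(item, labeled_examples):
    """Classification: assign the class of the closest labeled
    example in feature space (nearest-neighbor rule)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(item, ex[0]))
    return nearest[1]

def cluster(points, k, iterations=10):
    """Clustering: group points around k centroids (basic k-means)."""
    centroids = list(points[:k])          # naive initialization
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:                  # assign each point to nearest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        centroids = [                     # recompute centroids from members
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

# Classification with predefined classes; features might be
# (blob count, bend angle in degrees) -- hypothetical values
training = [((3, 40.0), "bent double"), ((1, 5.0), "not bent")]
print(classify((3, 35.0), training))      # -> bent double

# Clustering discovers the two groups without any labels
print(cluster([(0.0,), (0.1,), (5.0,), (5.1,)], k=2))
```

The "learning" the article describes corresponds to fitting such a function from labeled examples; in practice both kinds of algorithm are often applied within a single application, as noted above.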
Once patterns are identified and translated by the Sapphire software back into a usable format, the results are examined by an expert. "We consider data mining to be a semiautomatic process because a human is involved in each step of the entire discovery process," explains Kamath. "The process is both iterative and interactive."
Kamath and her team are pleased with how the data-mining algorithms tested out on the bent-double research, as are the astrophysicists. "Using our algorithms on the FIRST data, we identified a bent double previously overlooked by the astrophysicists in their manual search," says Kamath.
The data-mining algorithms in Sapphire are modular and easy to use in a variety of scientific applications and across diverse computer platforms. The beta release of this software to Lawrence Livermore users is scheduled for late 2000.
"We're also looking at what can be done to apply complex pattern-recognition algorithms to data as they are being gathered," says Kamath. "For example, if one is looking for transient events, such as asteroids in astrophysics data or fraud in business transactions, the processing must keep up with the rate at which new data are acquired."
Key Words: Accelerated Strategic Computing Initiative (ASCI), data mining, Faint Images of the Radio Sky at Twenty Centimeters (FIRST), pattern recognition, Sapphire.
For more information contact Chandrika Kamath (925) 423-3768 (firstname.lastname@example.org).
More information about Sapphire can be found at http://www.llnl.gov/casc/sapphire/sapphire_home.html.