Analyzing the Wisconsin Breast Cancer (WDBC) Data Set

In classification problems, the main goal is to derive an accurate, representative data model that can correctly classify new test instances. The accuracy of a classification model can be affected by the presence of outliers in the data set and by the inability to correctly classify records near the class boundary. Considering the first case, that of outliers, the question is whether critical nuggets differ from outliers and whether existing approaches to outlier detection can identify them. Critical nuggets may in certain cases involve outliers, but this is not always true. In the example of the previous section, cells in tumors may not show anomalous behavior individually, but collectively such cells may contain critical pieces of information. Outlier detection has several applications, including detecting fraud in business transactional data, identifying network intrusions, isolating abnormal trends in time-series data, and picking out suspicious criminal activity. A lot of work in data mining has been devoted to finding interesting patterns or rules in data sets. The mining of outliers, and in particular the concept of distance-based outliers, was proposed to identify records that differ from the rest of the data set. A commonly used definition states that an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was caused by a different mechanism. Critical nuggets of information may not always be detected by pattern mining methods or by distance-based outlier detection methods, because nuggets may not conform to a specific pattern and may not be outliers. A simple visual example is a data set with protrusions around a circular region, which might be considered more interesting than the simpler circular region alone. This motivates the notion of identifying subsets of critical data instances in data sets.
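To make the contrast with distance-based outliers concrete, the following is a minimal sketch under stated assumptions. It assumes one common formulation, in which a record is flagged when at least a fraction p of the other records lie farther than a distance d from it. The function name, the parameters p and d, and the toy records are all illustrative and do not come from the text.

```python
import math

def distance_based_outliers(points, p, d):
    """Return indices of points for which at least a fraction p of the
    other points lie farther away than distance d.
    Both p and d are illustrative parameters (assumptions)."""
    flagged = []
    for i, a in enumerate(points):
        # Count the other records that are farther than d from record i.
        far = sum(1 for j, b in enumerate(points)
                  if j != i and math.dist(a, b) > d)
        if far >= p * (len(points) - 1):
            flagged.append(i)
    return flagged

# Made-up toy records: two tight clusters, one record midway between
# them, and one isolated record.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # cluster A
    (1.0, 0.0), (1.1, 0.0), (1.0, 0.1), (1.1, 0.1),  # cluster B
    (0.55, 0.05),                                    # between the clusters
    (5.0, 5.0),                                      # isolated record
]

outliers = distance_based_outliers(points, p=0.9, d=1.5)
print(outliers)
```

In this toy setting only the isolated record is flagged. The record lying between the two clusters is not, which is consistent with the point that instances near a class boundary need not be outliers.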
Critical nuggets of information can take the following form during classification tasks: small subsets of data instances that lie very close to the class boundary and are sensitive to small changes in attribute values, such that these small changes result in the switching of classes. Such critical nuggets have an intrinsic worth that far outweighs that of other subsets of the same data set.
The WDBC data set has 569 data instances (357 Benign and 212 Malignant), 32 attributes (30 when the record locator and class label are excluded), and two class labels (Benign and Malignant). Using the FindBoundary algorithm, an approximate boundary set comprising 150 Benign and 150 Malignant data instances was selected. The main task was to apply the FindCriticalNuggets algorithm to identify critical nuggets. The data set was normalized with the standard normalization function available in the Weka library. The FindCriticalNuggets algorithm was then run for different values of R and for a given class.
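The idea of instances that switch class under small attribute changes can be sketched as follows. This is not the FindBoundary or FindCriticalNuggets algorithm, whose details are not given here. It assumes that normalization rescales each attribute to [0, 1], uses a nearest-centroid classifier as a hypothetical stand-in model, and perturbs one attribute at a time by a hypothetical amount r, which is not necessarily the R mentioned above. The records are made up and are not WDBC values.

```python
import math

def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def centroid(rows):
    """Mean of each attribute over the given records."""
    return tuple(sum(c) / len(c) for c in zip(*rows))

def predict(x, centroids):
    """Hypothetical stand-in classifier: label of the nearest class centroid."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def switches_class(x, centroids, r):
    """True if moving any single attribute by +r or -r changes the label."""
    base = predict(x, centroids)
    for k in range(len(x)):
        for delta in (-r, r):
            moved = x[:k] + (x[k] + delta,) + x[k + 1:]
            if predict(moved, centroids) != base:
                return True
    return False

# Made-up two-attribute records (illustrative only, not WDBC data).
benign_raw = [(10.0, 2.0), (11.0, 2.5), (12.0, 3.0), (14.5, 4.4)]
malignant_raw = [(18.0, 6.0), (19.0, 6.5), (20.0, 7.0), (15.5, 4.6)]
labels = ["Benign"] * len(benign_raw) + ["Malignant"] * len(malignant_raw)

normalized = min_max_normalize(benign_raw + malignant_raw)
centroids = {
    label: centroid([x for x, y in zip(normalized, labels) if y == label])
    for label in ("Benign", "Malignant")
}

# Indices of records whose predicted class flips under a perturbation of size r.
sensitive = [i for i, x in enumerate(normalized)
             if switches_class(x, centroids, r=0.1)]
print(sensitive)
```

With these toy records, only the last record of each class, the one closest to the other class, changes label under the perturbation. Repeating the check for several values of r gives a rough picture of how the set of sensitive instances grows with the perturbation size.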