- Motivation
- Approach
- Results
- Future work
05-10-2014
A model-based classifier is an abstraction of data used to make predictions
Better classifiers are beneficial
Where should overlapping classes be separated?
Should these outliers be accommodated?
Does capturing the minority class sacrifice accuracy?
We believe that selecting which instances to learn from can improve the accuracy of a classifier.
This is called instance selection!
\(\mathbf{Max} \ \ \text{Classifier Accuracy} \quad \mathbf{s.t.} \quad x_i \in \{0,1\} \ \forall\, i \in I\)
- This is a combinatorial optimization problem
- There are \(2^n\) possible solutions
- There is no closed form for the objective function
The vast majority of existing approaches rely on evolutionary algorithms to find a solution
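To make the formulation concrete, here is a minimal evolutionary search over the binary selection vector \(x\). This is an illustrative sketch only, not any of the cited implementations: the 1-NN surrogate classifier, the tiny 1-D dataset, and all parameter values are assumptions for demonstration (in practice the selection would be scored on a held-out validation set, not the test set).

```python
import random

def accuracy(selected, train, test):
    """1-NN accuracy on `test` using only the selected training instances."""
    subset = [t for t, keep in zip(train, selected) if keep]
    if not subset:
        return 0.0
    correct = 0
    for x, y in test:
        # Nearest neighbor in 1-D by absolute distance.
        nearest = min(subset, key=lambda t: abs(t[0] - x))
        correct += nearest[1] == y
    return correct / len(test)

def instance_selection(train, test, pop_size=20, generations=50, seed=0):
    """Evolutionary search over binary selection vectors x_i in {0,1}."""
    rng = random.Random(seed)
    n = len(train)
    # Seed the population with the full training set as a baseline,
    # so the search can never do worse than using all instances.
    pop = [[1] * n]
    pop += [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda s: accuracy(s, train, test), reverse=True)
        survivors = pop[: pop_size // 2]          # truncation selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)             # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)
            child[i] = 1 - child[i]               # point mutation
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda s: accuracy(s, train, test))
    return best, accuracy(best, train, test)
```

On a toy dataset with one mislabeled "outlier" instance, the search can discover a subset that drops it and improves test accuracy over training on everything.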
Other optimization problems look similar to instance selection once it is reformulated. This allows us to take advantage of optimization theory.
Number of test instances misclassified:
| | Damp Grey Soil | Total |
|---|---|---|
| Original Training Data | 9 | 51 |
| With Instance Selection | 14 | 28 |
The ability to classify "Damp Grey Soil" is likely sacrificed to make the remaining classes easier to separate.
A Population-based Assessment of Perioperative Mortality After Nephroureterectomy for Upper-tract Urothelial Carcinoma
(I'll be calling this NU for UTUC!!)
Data: SEER database
Attributes: age, gender, histopathology, extraglandular
                    involvement, tumor grade, tumor size, and
                    mortality
Patients: 2,328 (9% mortality)
Classification task: predict mortality
Classifier: logistic regression
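The setup above can be sketched with a minimal logistic-regression fit. The SEER records are not reproduced here, so the feature vectors below are synthetic stand-ins, and the learning rate and epoch count are illustrative assumptions; this is a sketch of the classifier type, not the study's actual model.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by stochastic gradient descent on log-loss."""
    n_feat = len(X[0])
    w = [0.0] * (n_feat + 1)                # w[0] is the intercept
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted mortality probability
            g = p - yi                      # gradient of log-loss w.r.t. z
            w[0] -= lr * g
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * g * xj
    return w

def predict(w, xi):
    """Class label: 1 (mortality) if the linear score is positive."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if z > 0 else 0
```

With a 9% mortality rate, the class imbalance shown earlier is exactly the setting where instance selection may help: a classifier trained on all instances can achieve high accuracy while largely ignoring the minority class.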
Walter Bennette
315-330-4957
walter.bennette.1@us.af.mil
C. Reeves, S. Taylor, Selection of training sets for neural networks by a genetic algorithm, Parallel Problem Solving from Nature - PPSN V, (1998) 633-642.
C. Reeves, D. Bush, Using genetic algorithms for training data selection in RBF networks, in: Instance Selection and Construction for Data Mining, H. Liu and H. Motoda (Eds), Kluwer, Norwell, MA, (2001) pp. 339-356.
T. Endou, Q. Zhao, Generation of comprehensible decision trees through evolution of training data, in: Proceedings of the 2002 Congress on Evolutionary Computation, (2002) 1221-1225.
J. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation, 7(6) (2003) 561-575.
J. Cano, F. Herrera, M. Lozano, Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability, Data & Knowledge Engineering, 60 (2006) 90-108.
N. Garcia-Pedrajas, Evolutionary computation for training set selection, WIREs Data Mining and Knowledge Discovery, 1 (2011) 512-523.
K.-J. Kim, Artificial neural networks with evolutionary instance selection for financial forecasting, Expert Systems with Applications, 30 (2006) 519-526.
S. Wu, Optimal instance selection for improved decision tree, Ph.D. dissertation (2007).
W. Bennette, Instance selection for simplified decision trees through the generation and selection of instance candidate subsets, Master's thesis (2009).
W. Bennette, S. Olafsson, Model based classifier improvement through the generation and selection of instance candidate subsets, Data & Knowledge Engineering (under revision).