09-10-2014

Overview

  1. About me
  2. Motivation
  3. Approach
  4. Results
  5. Future work

About me


Instance selection for model-based classifiers


We can build a model to distinguish between these classes.


Motivation

Accuracy counts


- Prediction
- Interpretation

Motivation

Certain aspects of a training dataset can make it difficult to learn an accurate classifier.

Motivation


Where should overlapping classes be separated?


Motivation


Should these outliers be accommodated?


Motivation


Does capturing the minority class sacrifice accuracy?


Motivation

We believe that selecting which instances to learn from can improve the accuracy of a classifier.

This is called instance selection!

Instance selection


Previous work

\(\begin{aligned}\max \quad & \text{Classifier Accuracy}\\ \text{s.t.} \quad & x_i \in \{0,1\} \quad \forall \, i \in I\end{aligned}\)


- This is a combinatorial optimization problem
- There are \(2^n\) possible solutions
- There is no closed form for the objective function
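To make the combinatorial nature concrete, here is a minimal sketch that enumerates all \(2^n\) subsets of a tiny toy training set and keeps the subset whose induced classifier scores best on a held-out set. The data, the nearest-centroid classifier, and every name here are invented for illustration; this is not the method from the talk, just a demonstration of the search space.

```python
from itertools import combinations

# Toy 1-D training data: (feature, label). The point (10.0, 0) is an
# outlier of class 0 that drags its centroid toward class 1's region.
train = [(0.0, 0), (1.0, 0), (10.0, 0), (4.0, 1), (5.5, 1), (6.0, 1)]
test = [(0.5, 0), (1.5, 0), (4.2, 1), (5.5, 1)]

def centroid_classifier(subset):
    """Fit a nearest-centroid model on the selected instances."""
    c = {}
    for label in (0, 1):
        pts = [x for x, y in subset if y == label]
        if not pts:
            return None  # a class vanished; subset is infeasible
        c[label] = sum(pts) / len(pts)
    return lambda x: min(c, key=lambda lab: abs(x - c[lab]))

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Enumerate all 2^n subsets (feasible only for tiny n) and keep the
# one whose induced classifier is most accurate on the test set.
best_acc, best_subset = -1.0, None
n = len(train)
for r in range(1, n + 1):
    for idx in combinations(range(n), r):
        subset = [train[i] for i in idx]
        model = centroid_classifier(subset)
        if model is None:
            continue
        acc = accuracy(model, test)
        if acc > best_acc:
            best_acc, best_subset = acc, subset

full_acc = accuracy(centroid_classifier(train), test)
print(f"all instances: {full_acc:.2f}, best subset: {best_acc:.2f}")
```

Dropping the outlier lets the subset-trained classifier separate the test set perfectly, while training on everything does not; brute force is only viable for tiny \(n\), which is why prior work reaches for evolutionary algorithms or other optimization machinery.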

Previous work

  • The vast majority of prior approaches rely on evolutionary algorithms to find a solution

  • When reformulated, instance selection resembles other well-studied optimization problems, which lets us take advantage of optimization theory

Experimental results


Landsat dataset

  • Classify land cover from satellite images
  • Logistic regression accuracy improved from 65% to 80%


Landsat dataset

Number of test instances misclassified:

                          Damp Grey Soil   Total
Original Training Data                 9      51
With Instance Selection               14      28



The ability to classify "Damp Grey Soil" is likely sacrificed to make the remaining classes easier to separate.

Case study

A Population-based Assessment of Perioperative Mortality After Nephroureterectomy for Upper-tract Urothelial Carcinoma

(I'll be calling this NU for UTUC!)

Case study


Case study

Data: SEER database

Attributes: age, gender, histopathology, extraglandular involvement, tumor grade, tumor size, and mortality

Patients: 2,328 (9% mortality)

Classification task: predict mortality

Classifier: logistic regression
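The "uninformative" 90% accuracy reported on the next slide is easy to reproduce: with only 9% mortality, a classifier that always predicts "survived" already scores about 91%. A sketch, with the cohort size taken from the slide but the labels simulated rather than drawn from SEER:

```python
import random

random.seed(0)

# Simulated stand-in for the cohort: 2,328 patients, 9% mortality.
n = 2328
deaths = round(0.09 * n)
labels = [1] * deaths + [0] * (n - deaths)
random.shuffle(labels)

# A "model" that always predicts the majority class (survived = 0).
majority_pred = [0] * n

accuracy = sum(p == y for p, y in zip(majority_pred, labels)) / n
print(f"majority-class accuracy: {accuracy:.2%}")  # about 91%
```

High accuracy alone says nothing about mortality here, which is why the slight drop to 88% after instance selection can still be a better model.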

Case study

Before

- 90% test accuracy
- Uninformative!

After

- 88% test accuracy
- Learn something about mortality


Future work

  • Find Air Force specific applications
  • Address overfitting
  • Scale for large datasets
  • Optimize for imbalanced data
    • Over/under sampling
    • Cost sensitive learning
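As one concrete direction for the imbalanced-data item, random under-sampling can be sketched in a few lines. The 9:1 class ratio and the toy records are illustrative only, not a proposal from this talk:

```python
import random

random.seed(42)

# Hypothetical imbalanced dataset: 90 majority (label 0), 10 minority (label 1).
data = [(i, 0) for i in range(90)] + [(100 + i, 1) for i in range(10)]

# Random under-sampling: drop majority instances until the classes balance.
minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]
balanced = minority + random.sample(majority, len(minority))

print(f"{len(balanced)} instances, {sum(y for _, y in balanced)} minority")
# 20 instances, 10 minority
```

Under-sampling can be seen as a crude, random form of instance selection; optimizing which majority instances to keep is exactly where this work could plug in.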

Previous work

C. Reeves, S. Taylor, Selection of training sets for neural networks by a genetic algorithm, Parallel Problem Solving from Nature (PPSN V), (1998) 633-642.

C. Reeves, D. Bush, Using genetic algorithms for training data selection in RBF networks, in: Instance Selection and Construction for Data Mining, H. Liu and H. Motoda (Eds), Kluwer, Norwell, MA, (2001) pp. 339-356.

T. Endou, Q. Zhao, Generation of Comprehensible Decision Trees Through Evolution of Training Data, in proceedings of the 2002 Congress on Evolutionary Computation, (2002) 1221-1225.

J. Cano, F. Herrera, M. Lozano, Using Evolutionary Algorithms as Instance Selection for Data Reduction in KDD: An Experimental Study, IEEE Transactions on Evolutionary Computation, 7(6) (2003) 561-575.

J. Cano, F. Herrera, M. Lozano, Evolutionary Stratified Training Set Selection for Extracting Classification Rules with Trade off Precision-Interpretability, Data & Knowledge Engineering, 60 (2006) 90-108.

N. Garcia-Pedrajas, Evolutionary computation for training set selection, WIREs Data Mining and Knowledge Discovery, 1 (2011) 512-523.

K-J. Kim, Artificial neural networks with evolutionary instance selection for financial forecasting, Expert Systems with Applications, 30 (2006) 519-526.

Previous work

S. Wu, Optimal instance selection for improved decision tree, Ph.D. dissertation (2007).

W. Bennette, Instance selection for simplified decision trees through the generation and selection of instance candidate subsets, Master's thesis (2009).

W. Bennette, S. Olafsson, Model-based classifier improvement through the generation and selection of instance candidate subsets, Data & Knowledge Engineering (under revision).


Take away message:

You can improve the accuracy of model-based classifiers by learning from carefully selected instances.