09-10-2014

Overview

  1. About me
  2. Motivation
  3. Approach
  4. Results
  5. Future work

About me


Instance selection for model-based classifiers


We can build a model to distinguish between these classes.


Motivation

Accuracy counts


- Prediction
- Interpretation

Motivation

Certain aspects of a training dataset can make it difficult to learn an accurate classifier.

Motivation


Where should overlapping classes be separated?


Motivation


Should these outliers be accommodated?


Motivation


Does capturing the minority class sacrifice accuracy?


Motivation

We believe that selecting which instances to learn from can improve the accuracy of a classifier.

This is called instance selection!

Instance selection


Previous work

\(\begin{aligned}\max \quad & \text{Classifier Accuracy}\\ \text{s.t.} \quad & x_i \in \{0,1\} \quad \forall \, i \in I\end{aligned}\)


- This is a combinatorial optimization problem
- There are \(2^n\) possible solutions
- There is no closed form for the objective function
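To make the combinatorial nature concrete, here is a minimal sketch that enumerates all \(2^n\) subsets of a tiny toy training set and keeps the subset whose induced classifier scores best on a held-out set. The data, the nearest-centroid classifier, and every name here are invented for illustration; this is not the method from the talk, just a demonstration of the search space.

```python
from itertools import combinations

# Toy 1-D training data: (feature, label). The point (10.0, 0) is an
# outlier of class 0 that drags its centroid toward class 1's region.
train = [(0.0, 0), (1.0, 0), (10.0, 0), (4.0, 1), (5.5, 1), (6.0, 1)]
test = [(0.5, 0), (1.5, 0), (4.2, 1), (5.5, 1)]

def centroid_classifier(subset):
    """Fit a nearest-centroid model on the selected instances."""
    c = {}
    for label in (0, 1):
        pts = [x for x, y in subset if y == label]
        if not pts:
            return None  # a class vanished; subset is infeasible
        c[label] = sum(pts) / len(pts)
    return lambda x: min(c, key=lambda lab: abs(x - c[lab]))

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Enumerate all 2^n subsets (feasible only for tiny n) and keep the
# one whose induced classifier is most accurate on the test set.
best_acc, best_subset = -1.0, None
n = len(train)
for r in range(1, n + 1):
    for idx in combinations(range(n), r):
        subset = [train[i] for i in idx]
        model = centroid_classifier(subset)
        if model is None:
            continue
        acc = accuracy(model, test)
        if acc > best_acc:
            best_acc, best_subset = acc, subset

full_acc = accuracy(centroid_classifier(train), test)
print(f"all instances: {full_acc:.2f}, best subset: {best_acc:.2f}")
```

Dropping the outlier lets the subset-trained classifier separate the test set perfectly, while training on everything does not; brute force is only viable for tiny \(n\), which is why prior work reaches for evolutionary algorithms or other optimization machinery.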

Previous work

  • The vast majority of prior approaches rely on evolutionary algorithms to find a solution

  • When reformulated, instance selection resembles other well-studied optimization problems, which lets us take advantage of optimization theory

Experimental results


Landsat dataset

  • Classify land cover from satellite images
  • Logistic regression accuracy improved from 65% to 80%


Landsat dataset

Number of test instances misclassified:

                          Damp Grey Soil   Total
Original Training Data                 9      51
With Instance Selection               14      28



The ability to classify "Damp Grey Soil" is likely sacrificed to make the remaining classes easier to separate.

Case study

A Population-based Assessment of Perioperative Mortality After Nephroureterectomy for Upper-tract Urothelial Carcinoma

(I'll be calling this NU for UTUC!)

Case study


Case study

Data: SEER database

Attributes: age, gender, histopathology, extraglandular involvement, tumor grade, tumor size, and mortality

Patients: 2,328 (9% mortality)

Classification task: predict mortality

Classifier: logistic regression
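The "uninformative" 90% accuracy reported on the next slide is easy to reproduce: with only 9% mortality, a classifier that always predicts "survived" already scores about 91%. A sketch, with the cohort size taken from the slide but the labels simulated rather than drawn from SEER:

```python
import random

random.seed(0)

# Simulated stand-in for the cohort: 2,328 patients, 9% mortality.
n = 2328
deaths = round(0.09 * n)
labels = [1] * deaths + [0] * (n - deaths)
random.shuffle(labels)

# A "model" that always predicts the majority class (survived = 0).
majority_pred = [0] * n

accuracy = sum(p == y for p, y in zip(majority_pred, labels)) / n
print(f"majority-class accuracy: {accuracy:.2%}")  # about 91%
```

High accuracy alone says nothing about mortality here, which is why the slight drop to 88% after instance selection can still be a better model.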

Case study

Before

- 90% test accuracy
- Uninformative!

After

- 88% test accuracy
- Learn something about mortality


Future work

  • Find Air Force specific applications
  • Address overfitting
  • Scale for large datasets
  • Optimize for imbalanced data
    • Over/under sampling
    • Cost sensitive learning
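As one concrete direction for the imbalanced-data item, random under-sampling can be sketched in a few lines. The 9:1 class ratio and the toy records are illustrative only, not a proposal from this talk:

```python
import random

random.seed(42)

# Hypothetical imbalanced dataset: 90 majority (label 0), 10 minority (label 1).
data = [(i, 0) for i in range(90)] + [(100 + i, 1) for i in range(10)]

# Random under-sampling: drop majority instances until the classes balance.
minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]
balanced = minority + random.sample(majority, len(minority))

print(f"{len(balanced)} instances, {sum(y for _, y in balanced)} minority")
# 20 instances, 10 minority
```

Under-sampling can be seen as a crude, random form of instance selection; optimizing which majority instances to keep is exactly where this work could plug in.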

Previous work

C. Reeves, S. Taylor, Selection of training sets for neural networks by a genetic algorithm, Parallel Problem Solving from Nature (PPSN V), (1998) 633-642.

C. Reeves, D. Bush, Using genetic algorithms for training data selection in RBF networks, in: Instance Selection and Construction for Data Mining, H. Liu and H. Motoda (Eds), Kluwer, Norwell, MA, (2001) pp. 339-356.

T. Endou, Q. Zhao, Generation of Comprehensible Decision Trees Through Evolution of Training Data, in proceedings of the 2002 Congress on Evolutionary Computation, (2002) 1221-1225.

J. Cano, F. Herrera, M. Lozano, Using Evolutionary Algorithms as Instance Selection for Data Reduction in KDD: An Experimental Study, IEEE Transactions on Evolutionary Computation, 7(6) (2003) 561-575.

J. Cano, F. Herrera, M. Lozano, Evolutionary Stratified Training Set Selection for Extracting Classification Rules with Trade off Precision-Interpretability, Data & Knowledge Engineering, 60 (2006) 90-108.

N. Garcia-Pedrajas, Evolutionary computation for training set selection, WIREs Data Mining and Knowledge Discovery, 1 (2011) 512-523.

K-J. Kim, Artificial neural networks with evolutionary instance selection for financial forecasting, Expert Systems with Applications, 30 (2006) 519-526.

Previous work

S. Wu, Optimal instance selection for improved decision tree, Ph.D. dissertation (2007).

W. Bennette, Instance selection for simplified decision trees through the generation and selection of instance candidate subsets, Master's thesis (2009).

W. Bennette, S. Olafsson, Model-based classifier improvement through the generation and selection of instance candidate subsets, Data & Knowledge Engineering (under revision).


Take away message:

You can improve the accuracy of model-based classifiers by learning from carefully selected instances.