Walter Bennette
04-15-2014
Model-based classifiers with improved accuracy can be created by learning from carefully selected instances.
A model-based classifier is an abstraction of data used to make predictions
Better classifiers are beneficial
Classes may overlap
Where should these classes be separated?
Classes may have outliers
Should these outliers be accommodated?
Minority class
Does capturing the minority class introduce unnecessary structure?
Instance selection addresses issues in the training data by creating a subset of the original data
The intention is that the classification algorithm will perform better when learning from the selected/reduced data set
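To make this concrete: instance selection here acts as a wrapper around the classifier, so a candidate subset is judged by the held-out accuracy of a classifier trained on that subset alone. A minimal sketch in Python, assuming scikit-learn and a decision tree as the model-based classifier (the dataset and the subsets shown are arbitrary placeholders):

```python
# Minimal sketch of wrapper-style instance selection (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def subset_accuracy(indices):
    """Train on the selected instances only; score on the held-out set."""
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[indices], y_train[indices])
    return clf.score(X_test, y_test)

all_instances = list(range(len(X_train)))
print("all instances:", subset_accuracy(all_instances))
print("first half   :", subset_accuracy(all_instances[: len(all_instances) // 2]))
```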
Instance selection started with k-NN classifiers
It also makes sense for model-based classifiers:
\( \mathbf{Max} \ \ \ \ \text{Classifier Accuracy} \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ x_i \in \{0,1\} \ \ \forall \ i \in I \)
The vast majority of existing approaches rely on evolutionary algorithms to find a solution
Reformulated, instance selection looks like other well-studied optimization problems, which allows us to take advantage of optimization theory
C. Reeves, S. Taylor, Selection of training sets for neural networks by a genetic algorithm, Parallel Problem Solving from Nature (PPSN V), (1998) 633-642.
C. Reeves, D. Bush, Using genetic algorithms for training data selection in RBF networks, in: H. Liu, H. Motoda (Eds.), Instance Selection and Construction for Data Mining, Kluwer, Norwell, MA, (2001) 339-356.
T. Endou, Q. Zhao, Generation of comprehensible decision trees through evolution of training data, in: Proceedings of the 2002 Congress on Evolutionary Computation, (2002) 1221-1225.
J. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation, 7(6) (2003) 561-575.
J. Cano, F. Herrera, M. Lozano, Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability, Data & Knowledge Engineering, 60 (2006) 90-108.
N. Garcia-Pedrajas, Evolutionary computation for training set selection, WIREs Data Mining and Knowledge Discovery, 1 (2011) 512-523.
K.-J. Kim, Artificial neural networks with evolutionary instance selection for financial forecasting, Expert Systems with Applications, 30 (2006) 519-526.
S. Wu, Optimal instance selection for improved decision tree, Ph.D. dissertation (2007).
W. Bennette, Instance selection for simplified decision trees through the generation and selection of instance candidate subsets, Master's thesis (2009).
W. Bennette, S. Olafsson, Model based classifier improvement through the generation and selection of instance candidate subsets, Data & Knowledge Engineering (under revision).
New decision variables
Formulation 1
\( \mathbf{Max} \ \ \ \sum_{j \in J}c_jx_j \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}a_{ij}x_j \le 1 \ \ \forall \ i \in I \)
\( \ \ \ \ \ \ \ \ \ \ \ x_j \in \{0,1\} \ \ \forall \ j \in J \)
A column is associated with each possible subset of instances; \( a_{ij} \) indicates whether instance \( i \) is in column \( j \), and \( c_j \) is the accuracy of a classifier built from the contents of column \( j \).
Example
\( Training= \{\tau_1, \tau_2\} \)
\( Column^{(1)}=\{\ \} \)
\( Column^{(2)}=\{\tau_1 \} \)
\( Column^{(3)}=\{\tau_2 \} \)
\( Column^{(4)}=\{\tau_1, \tau_2 \} \)
\( \mathbf {Max} \ \ \ \ 0.3x_1 + 0.7x_2 + 0.5x_3 + 0.6x_4 \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_1 + 1x_2 + 0x_3 + 1x_4 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_1 + 0x_2 + 1x_3 + 1x_4 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ x_1, x_2, x_3, x_4 \in \{0,1\} \)
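With only four columns, this tiny integer program can be checked by brute force. A plain-Python sketch, with the column data taken directly from the example above:

```python
from itertools import product

# (accuracy c_j, membership a_.j = (tau_1 in column?, tau_2 in column?))
columns = {
    1: (0.3, (0, 0)),  # {}
    2: (0.7, (1, 0)),  # {tau_1}
    3: (0.5, (0, 1)),  # {tau_2}
    4: (0.6, (1, 1)),  # {tau_1, tau_2}
}

best = None
for x in product((0, 1), repeat=4):            # all 2^4 choices of x_1..x_4
    chosen = [j for j, xj in zip(columns, x) if xj]
    # Formulation 1: each instance may appear in at most one chosen column
    for i in range(2):
        if sum(columns[j][1][i] for j in chosen) > 1:
            break
    else:
        value = sum(columns[j][0] for j in chosen)
        if best is None or value > best[0]:
            best = (value, chosen)

print(best)  # (1.5, [1, 2, 3]): the empty column, {tau_1}, and {tau_2}
```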
Formulation 2
\( \mathbf{Max} \ \ \ \sum_{j \in J}c_jx_j \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}x_j \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ x_j \in \{0,1\} \ \ \forall \ j \in J \)
Again, a column is associated with each possible subset of instances and \( c_j \) is the accuracy of a classifier built from the contents of column \( j \); here at most one column may be selected.
Example
\( Training= \{\tau_1, \tau_2\} \)
\( Column^{(1)}=\{\ \} \)
\( Column^{(2)}=\{\tau_1 \} \)
\( Column^{(3)}=\{\tau_2 \} \)
\( Column^{(4)}=\{\tau_1, \tau_2 \} \)
\( \mathbf {Max} \ \ \ \ 0.3x_1 + 0.7x_2 + 0.5x_3 + 0.6x_4 \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ \ x_1 + x_2 + x_3 + x_4 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ x_1, x_2, x_3, x_4 \in \{0,1\} \)
Issue
For Formulation 1 and Formulation 2, a dataset of size \( n \) has \( 2^n \) columns
Solution
1. Begin with a good set of initial columns
2. Use column generation to find improvements (sketched below)
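In outline this is standard column generation. A skeleton of the loop in Python, where `solve_restricted_lp` and `price_out` are placeholder callbacks standing in for the restricted LP solver and the price-out problem described next:

```python
def column_generation(initial_columns, solve_restricted_lp, price_out,
                      max_iters=100):
    """Skeleton of the column generation loop.

    solve_restricted_lp(columns) -> (solution, duals)
    price_out(duals)             -> (new_column, reduced_cost)
    """
    columns = list(initial_columns)
    for _ in range(max_iters):
        solution, duals = solve_restricted_lp(columns)
        new_column, reduced_cost = price_out(duals)
        if reduced_cost <= 1e-9:        # no improving column exists: optimal
            break
        columns.append(new_column)      # add the improving column and repeat
    return solution, columns
```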
\( Training \ \ = \{\tau_1, \tau_2, \tau_3 \} \)
\( Column^{(2)} = \{ \tau_1 \} \)
\( Column^{(6)} = \{ \tau_1, \tau_3 \} \)
\( \mathbf{Max} \ \ \ \ 0.6x_2 + 0.7x_6 \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 1x_2 + 1x_6 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_2 + 0x_6 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_2 + 1x_6 \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0 \le x_2, x_6 \)
Solve the linear relaxation and get the dual variables: \( \pi^*=\{0.6, 0, 0.1\} \)
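This restricted LP can be reproduced with an off-the-shelf solver. A sketch assuming the PuLP package, which exposes constraint duals through the `.pi` attribute after solving; note that the dual solution here is degenerate, so a solver may legitimately return a different optimal \( \pi^* \) than the slide:

```python
# Solve the restricted LP and read off dual variables (assumes PuLP).
# The tau_2 row has all-zero coefficients (no current column contains tau_2),
# so its dual is 0 and the row is omitted from the model below.
from pulp import LpMaximize, LpProblem, LpVariable, value

prob = LpProblem("restricted_master", LpMaximize)
x2 = LpVariable("x2", lowBound=0)
x6 = LpVariable("x6", lowBound=0)

prob += 0.6 * x2 + 0.7 * x6      # objective: column accuracies
prob += x2 + x6 <= 1, "tau1"     # tau_1 is in Column(2) and Column(6)
prob += x6 <= 1, "tau3"          # tau_3 is in Column(6) only

prob.solve()
print("objective:", value(prob.objective))    # 0.7, from x6 = 1
for name, con in prob.constraints.items():
    print(name, "dual =", con.pi)
# Degenerate dual: (0.6, 0, 0.1) as on the slide and (0.7, 0, 0) are both
# optimal, so the printed values may differ from pi*.
```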
Price out problem
\( \mathbf{Max} \ \ \ \ c^{new} - \pi^*a^{new} \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{i \in I} a_i^{new} \le G \)
\( \ \ \ \ \ \ \ \ \ \ \ a_i^{new} \in \{0,1\} \ \ \forall \ i \in I \)
(\( G \) caps the number of instances allowed in the new column)
Price out problem
\( \mathbf{Max} \ \ \ \ c^{new} - (0.6a_1+0a_2+0.1a_3) \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ \ a_1 + a_2 + a_3\le 3 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ a_1, a_2, a_3 \in \{0,1\} \)
Column | \( a_1 \) | \( a_2 \) | \( a_3 \) | \( c^{new} \) | Obj |
---|---|---|---|---|---|
\( Column^{(1)} \) | 0 | 0 | 0 | 0.30 | 0.30 |
\( Column^{(3)} \) | 0 | 1 | 0 | 0.60 | 0.60 |
\( Column^{(4)} \) | 0 | 0 | 1 | 0.60 | 0.50 |
\( Column^{(5)} \) | 1 | 1 | 0 | 0.50 | -0.10 |
\( Column^{(7)} \) | 0 | 1 | 1 | 0.70 | 0.60 |
\( Column^{(8)} \) | 1 | 1 | 1 | 0.70 | 0.00 |
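The Obj column is just the reduced cost \( c^{new} - \pi^*a^{new} \) of each candidate, so a few lines of Python reproduce the table:

```python
pi = (0.6, 0.0, 0.1)  # duals from the restricted LP

# (name, a_new over {tau_1, tau_2, tau_3}, c_new) for each candidate column
candidates = [
    ("Column(1)", (0, 0, 0), 0.30),
    ("Column(3)", (0, 1, 0), 0.60),
    ("Column(4)", (0, 0, 1), 0.60),
    ("Column(5)", (1, 1, 0), 0.50),
    ("Column(7)", (0, 1, 1), 0.70),
    ("Column(8)", (1, 1, 1), 0.70),
]

for name, a, c in candidates:
    reduced_cost = c - sum(p * ai for p, ai in zip(pi, a))
    print(f"{name}: {reduced_cost:+.2f}")
# Column(3) and Column(7) tie at +0.60; the slides add Column(3) next
```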
\( Training \ \ = \{\tau_1, \tau_2, \tau_3 \} \)
\( Column^{(2)} = \{ \tau_1 \} \)
\( Column^{(6)} = \{ \tau_1, \tau_3 \} \)
\( Column^{(3)} = \{ \tau_2 \} \)
\( \mathbf{Max} \ \ \ \ 0.6x_2 + 0.7x_6 + 0.6x_3 \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 1x_2 + 1x_6 + 0x_3\le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_2 + 0x_6 +1x_3\le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0x_2 + 1x_6 +0x_3\le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ \ 0 \le x_2, x_6, x_3 \)
Column generation can continue until the optimal solution is found
From the price out problem:
\( \mathbf{Max} \ \ \ \ c^{new} - \pi^*a^{new} \)
An approximation of \( c^{new} \), the accuracy of a classifier built from the training data indicated by \( a^{new} \), is necessary to implement column generation
Two approximations are developed. Each creates a ranking of the instances and assumes columns with high ranking instances will have high accuracy.
Frequency approximation:
High-ranking instances are those that appear frequently
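A sketch of the frequency idea in Python; the exact scoring rule is not spelled out on the slide, so counting appearances across the columns generated so far and averaging over a candidate column is an assumption:

```python
from collections import Counter

def instance_frequencies(columns):
    """Count how often each instance appears across the known columns."""
    return Counter(i for col in columns for i in col)

def approximate_accuracy(candidate, counts):
    """Assumed proxy: columns of frequently selected instances rank high."""
    if not candidate:
        return 0.0
    return sum(counts[i] for i in candidate) / len(candidate)

columns = [{"tau1"}, {"tau1", "tau3"}, {"tau2"}]
counts = instance_frequencies(columns)
print(approximate_accuracy({"tau1", "tau2"}, counts))  # (2 + 1) / 2 = 1.5
```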
Information approximation:
\( L_{\tau} = - \sum_{j=1}^{|y|} P \left( y_j|\tau \right) \log \left(P_E \left(y_j|\tau \right) \right) \)
\[ Bh_{\tau} = \begin{cases} \ \ 0.1^{L_\tau}, & \text{if the classifier correctly classifies $\tau$} \\ (-0.1)^{L_\tau}, & \text{otherwise} \end{cases} \]
Information approximation:
High-ranking instances are those that decrease the value of \( Bh \)
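In code, \( L_\tau \) is a cross-entropy between two class-probability estimates. A sketch of one reading of \( Bh_\tau \): since a negative base raised to a fractional power is not real-valued, \( (-0.1)^{L_\tau} \) is interpreted here as the signed magnitude \( -(0.1^{L_\tau}) \), which is an assumption:

```python
import math

def L_tau(p, p_e, eps=1e-12):
    """L_tau = -sum_j P(y_j|tau) * log(P_E(y_j|tau))."""
    return -sum(pj * math.log(max(pej, eps)) for pj, pej in zip(p, p_e))

def Bh_tau(p, p_e, correct):
    # Assumption: (-0.1)^L is read as -(0.1^L), since a negative base with
    # a fractional exponent has no real value.
    L = L_tau(p, p_e)
    return 0.1 ** L if correct else -(0.1 ** L)

print(Bh_tau([1.0, 0.0], [0.9, 0.1], correct=True))  # 0.1^0.105 ~ 0.78
```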
A greedy selection procedure is used to combine columns into a final selection of instances
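The slides do not spell out the greedy rule, so the sketch below assumes a simple hill climb: start from the best single column, then keep unioning in any column that improves validation accuracy (the `accuracy_of` callback is a placeholder):

```python
def greedy_combine(columns, accuracy_of):
    """Greedily union columns while validation accuracy improves.

    columns:     list of sets of training-instance ids
    accuracy_of: placeholder callback mapping a set of ids to accuracy
    """
    selected = max(columns, key=accuracy_of)   # best single column
    best = accuracy_of(selected)
    improved = True
    while improved:
        improved = False
        for col in columns:
            candidate = selected | col
            score = accuracy_of(candidate)
            if score > best:
                selected, best = candidate, score
                improved = True
    return selected, best
```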
Formulation 1
\( \mathbf{Max} \ \ \ \sum_{j \in J}c_jx_j \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}a_{ij}x_j \le 1 \ \ \forall \ i \in I \)
\( \ \ \ \ \ \ \ \ \ \ \ x_j \in \{0,1\} \ \ \forall \ j \in J \)
Encourages the discovery of columns that have similar instances but higher accuracy
Formulation 2
\( \mathbf{Max} \ \ \ \sum_{j \in J}c_jx_j \)
\( \mathbf{s.t.} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}x_j \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ x_j \in \{0,1\} \ \ \forall \ j \in J \)
Discovers columns that have higher accuracy than any previously discovered column
Name | Instances | Attributes | Numeric | Nominal | Classes |
---|---|---|---|---|---|
Balance | 625 | 4 | Yes | No | 3 |
Credit | 690 | 15 | Yes | Yes | 2 |
Diabetes | 768 | 8 | Yes | No | 2 |
Ecoli | 336 | 7 | Yes | No | 8 |
Glass | 214 | 9 | Yes | No | 6 |
Horse | 368 | 21 | Yes | Yes | 3 |
Ionosphere | 351 | 34 | Yes | No | 2 |
LandSat* | 444 | 36 | Yes | No | 6 |
Spect | 267 | 22 | No | Yes | 2 |
Tic-tac-toe | 958 | 9 | No | Yes | 2 |
Waveform | 800 | 21 | Yes | No | 3 |
Wisc Cancer | 699 | 9 | Yes | No | 2 |
Instance selection is performed for:
Approach:
The selected instances depend on the classifier being trained
Instance selection is used as a wrapper around the classifier
Instances with extreme values were removed
Number of test instances misclassified
 | Class1 | Class2 | Class3 | Class4 | Class5 | Class6 | Total |
---|---|---|---|---|---|---|---|
Original Training Data | 5 | 9 | 11 | 9 | 8 | 9 | 51 |
With Instance Selection | 1 | 3 | 3 | 14 | 5 | 2 | 28 |
A Population-based Assessment of Perioperative Mortality After Nephroureterectomy for Upper-tract Urothelial Carcinoma
(I'll be calling this NU for UTUC!!)
Data: SEER database
Attributes: age, gender, histopathology, extraglandular involvement, tumor grade, tumor size, and mortality
Patients: 2,328 (9% mortality)
Classification task: predict mortality
Classifier: logistic regression
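The modeling step itself is standard logistic regression. A minimal sketch assuming scikit-learn; the SEER extract is not reproduced here, so a synthetic stand-in with a rare positive class (about 9%, as in the data) takes its place:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2328                               # patient count from the slide
X = rng.normal(size=(n, 7))            # stand-in for the 7 attributes
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
# With ~9% positives the majority class dominates the fit unless the
# training data is rebalanced, which is where instance selection comes in.
```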
Confusion matrix of test data before instance selection
Confusion matrix of test data after instance selection
Reformulated instance selection as an integer program
Model-based classifiers with improved accuracy can be created by learning from carefully selected instances (because classifiers are imperfect)
Instance selection is useful for actual classification problems and is a competitive technique
Walter Bennette
bennette@iastate.edu
e1071 (R package)
train SVM: 98%
test SVM: 95%
train NB/IS: 98%
test NB/IS: 95%