Walter Bennette
02-28-2014
How Companies Learn Your Secrets
Target used data mining to determine when customers became pregnant.
A classifier is a scheme built from training data to classify unlabeled data.
The decision tree classifier builds a model that tries to split the data into homogeneous groups.
Test accuracy is found by making predictions on previously unseen data.
We say the decision tree has 80% test accuracy because four of the five predictions are correct.
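As an illustrative sketch (my own example with an arbitrary dataset and split, not the five-prediction example from the slides), the following trains a decision tree with scikit-learn and reports its test accuracy on previously unseen data.

```python
# Sketch: build a decision tree classifier from labeled training data and
# measure test accuracy on previously unseen data.
# Dataset and split are illustrative choices, not from the original slides.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learn from training data
test_accuracy = tree.score(X_test, y_test)                           # fraction of correct predictions
print(f"test accuracy: {test_accuracy:.2f}")
```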
Classification models can be useful in a variety of application areas
Better classifiers (those with higher test accuracy) are beneficial.
Aspects of the training data may make it difficult to learn an accurate classifier.
We believe that selecting which instances to learn from can improve the accuracy of a classifier. This is called instance selection!
Classes may overlap. Where should these classes be separated?
Classes may have outliers. Should these outliers be accommodated?
Minority class: does capturing the minority class introduce unnecessary structure?
Instance selection can address issues in the training data by creating a subset of the original data
The intention is that the classification algorithm will perform better when learning from the selected/reduced data set
Instance selection is a combinatorial optimization problem
Goal: maximize classifier accuracy
\( \mathbf {Max} \ \ \ \text{Classifier Accuracy} \)
\( \mathbf {s.t} \)
\( \ \ \ \ \ \ \ \ \ \ x_i \in \{0,1\} \ \ \forall \ i \in I \)
where \( x_i \) indicates whether instance \( i \) of the training data \( I \) is included in the selected subset.
A vast majority of existing approaches rely on evolutionary algorithms to solve this problem.
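As a hedged illustration of that style of approach (a generic genetic algorithm under my own assumptions: the wine dataset, a naive Bayes classifier, and simple truncation selection; not one of the specific algorithms the talk refers to), candidate selections can be encoded as 0/1 vectors and evolved toward higher classifier accuracy.

```python
# Sketch of a simple genetic algorithm for instance selection.
# Each individual is a 0/1 vector x over training instances (x[i] = 1 keeps
# instance i); fitness is the "Classifier Accuracy" objective above.
# Generic illustration only, not the algorithms surveyed in the talk.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
n = len(X_train)

def fitness(x):
    """Accuracy of a classifier trained only on the instances selected by x."""
    idx = np.flatnonzero(x)
    if idx.size == 0:
        return 0.0  # an empty selection cannot produce a classifier
    return GaussianNB().fit(X_train[idx], y_train[idx]).score(X_val, y_val)

population = rng.randint(0, 2, size=(20, n))               # random initial selections
for generation in range(30):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[::-1][:10]]    # truncation selection: keep the top half
    children = []
    for _ in range(10):
        a, b = parents[rng.randint(10, size=2)]
        cut = rng.randint(1, n)                             # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.rand(n) < 0.02                           # bit-flip mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    population = np.vstack([parents, np.array(children)])

best = max(population, key=fitness)
print(f"selected {best.sum()} of {n} instances, validation accuracy {fitness(best):.3f}")
```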
Other combinatorial optimization problems look similar to instance selection if the problem is reformulated. This allows us to take advantage of optimization theory.
\( \mathbf {Max} \ \ \ \sum_{j \in J}c_jx_j \)
\( \mathbf {s.t} \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}a_{ij}x_j \le 1 \ \ \forall \ i \in I \)
\( \ \ \ \ \ \ \ \ \ \ \sum_{j \in J}x_j \le 1 \)
\( \ \ \ \ \ \ \ \ \ \ \ x_j \in \{0,1\} \ \ \forall \ j \in J \)
A column is associated with each possible subset of instances.
The constraint \( \sum_{j \in J}x_j \le 1 \) selects at most one column.
\( a_{ij} \) indicates if instance \( i \) is in column \( j \).
\( c_j \) is the accuracy of a classifier built from the contents of column \( j \).
For a dataset of size \( n \) there are \( 2^n \) decision variables.
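To make the column view concrete, here is a small sketch I have added (the dataset, classifier, and 12-instance pool are illustrative assumptions, not from the talk): because at most one column may be selected, solving the integer program amounts to finding the single subset of instances whose classifier has the highest accuracy \( c_j \), and brute-force enumeration of all \( 2^n - 1 \) non-empty columns is only possible for a very small pool.

```python
# Sketch: enumerate every column (subset of instances) for a tiny pool and
# keep the one with the best accuracy c_j. Illustrates why the 2^n columns
# of the full formulation cannot be enumerated for realistic datasets.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Keep the pool tiny (n = 12) so the 2^n - 1 = 4095 columns stay manageable.
pool = np.random.RandomState(0).choice(len(X_train), size=12, replace=False)

best_c, best_column = -1.0, None
for size in range(1, len(pool) + 1):
    for column in combinations(pool, size):      # one column = one subset of instances
        idx = np.array(column)
        clf = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        c_j = clf.score(X_val, y_val)            # c_j: accuracy of the column's classifier
        if c_j > best_c:
            best_c, best_column = c_j, column

print(f"best of {2 ** len(pool) - 1} columns uses {len(best_column)} instances")
print(f"validation accuracy c_j = {best_c:.3f}")
```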
Column generation is a technique for solving linear programs with a huge number of decision variables.
It is commonly extended to integer programs.
It exploits a feature of the simplex algorithm to guarantee that the optimal solution of the master problem is found.
Let us have a quick reminder of how simplex works.
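The relevant piece of simplex theory, stated here for the formulation above (standard LP duality; the dual notation \( \pi_i \) and \( \sigma \) is mine, not from the slides), is the reduced cost of a column:

\( \bar{c}_j = c_j - \sum_{i \in I}\pi_i a_{ij} - \sigma \)

where \( \pi_i \) is the dual value of the constraint \( \sum_{j \in J}a_{ij}x_j \le 1 \) for instance \( i \), and \( \sigma \) is the dual value of \( \sum_{j \in J}x_j \le 1 \). In a maximization problem, simplex can only improve the objective by bringing in a column with \( \bar{c}_j > 0 \). Column generation therefore solves a restricted master problem over a small set of columns and uses a pricing subproblem to look for a new column with positive reduced cost; when no such column exists, the current solution is optimal for the linear relaxation of the full master problem.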
Process: the master problem above has \( 2^n \) decision variables for a dataset of size \( n \), so it is never written down in full. Instead, a restricted master problem containing only a small set of columns is solved, and promising columns are generated and added one at a time.
Results showing the effect of instance selection on a naive Bayes classifier across several benchmark datasets, including Statlog (Landsat Satellite). Average test accuracy is reported.
| Dataset | Original | Instance Selection |
|---|---|---|
| Balance | 90% | 91% |
| Chess | 84% | 90% |
| Credit Approval | 78% | 85% |
| E-coli | 83% | 83% |
| Statlog | 80% | 83% |
| Tic-Tac-Toe Endgame | 70% | 72% |
Integer programming techniques can help us solve the instance selection problem
Instance selection manipulates the training data so that classifiers can make better use of it.