Evaluating Classifiers Lecture 2 Instructor: Max Welling Read chapter 5 Evaluation Given: Hypothesis h(x): XC, in hypothesis space H, mapping attributes x to classes c=[1,2,3,...C] A data-sample S(n) of size n. Questions: What is the error of h on unseen data? If we have two competing hypotheses, which one is better on unseen data? How do we compare two learning algorithms in the face of limited data?

How certain are we about our answers? Sample and True Error We can define two errors: 1) Error(h|S) is the error on the sample S: 1 n error (h | S ) [h (xi ) yi ] n i 1 2) Error(h|P) is the true error on the unseen data sampled from the distribution P(x): error (h | P ) dx P (x ) [h(x ) f (x )] where f(x) is the true hypothesis.

Binomial Distributions Assume you toss a coin n times. And it has probability p of coming heads (which we will call success) What is the probability distribution governing the number of heads in n trials? Answer: the Binomial distribution. p(# heads r | p,n) n! pr (1 p)n r r !(n r )! Distribution over Errors Consider some hypothesis h(x) Draw n samples X~P(X).

Do this k times. Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk). {e1,...,ek} are samples from a Binomial distribution ! Why? imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then, He draws x~P(x) and observes if h(x) correctly predicts the label correctly. If it does, he makes sure the coin lands heads up... You have a single sample S, for which you observe e(S) errors. What would be a reasonable estimate for Error(h|P) you think? Binomial Moments mean(r ) E [r | n, p] np var

2 var (r ) E [(r E [r ]) ] np(1 p) mean If we match the mean, np, with the observed value n*error(h|S) we find: E [error (h | P )] E [r / n] p error (h | S ) If we match the variance we can obtain an estimate of the width: var[error (h | P )] var[r / n] error (h | S ) (1 error (h | S )) n Confidence Intervals

We would like to state: With N% confidence we believe that error(h|P) is contained in the interval: error (h | P ) error (h | S ) z N 80% error (h | S ) (1 error (h | S )) n z 0.8 1.28 Normal(0,1) 1.28

In principle z N is hard to compute exactly, but for np(1-p)>5 or n>30 it is safe to approximate a Binomial by a Gaussian for which we can easily compute z-values. Bias-Variance The estimator is unbiased if E [error (h | X )] p Imagine again you have infinitely many sample sets X1,X2,.. of size n. Use these to compute estimates E1,E2,... of p where Ei=error(h|Xi) If the average of E1,E2,.. converges to p, then error(h|X) is an unbiased estimator. Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer? p Eav

Flow of Thought Determine the property you want to know about the future data (e.g. error(h|P)) Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)) Determine the distribution P(E) of E under the assumption you have infinitely many sample sets X1,X2,...of some size n. (e.g. p(E)=Binomial(p,n), p=error(h|P)) Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S)) Compute mean and variance of P(E) and pray P(E) it is close to a Normal distribution. (sums of random variables converge to normal distributions central limit theorem) State you confidence interval as: with confidence N% error(h|P) is contained in the interval Y mean z N var Assumptions We only consider discrete valued hypotheses (i.e. classes)

Training data and test data are drawn IID from the same distribution P(x). (IID: independently & identically distributed) The hypothesis must be chosen independently from the data sample S! When you obtain a hypothesis from a learning algorithm, split the data into a training set and a testing set. Find the hypothesis using the training set and estimate error on the testing set. Homework: Consider training a classifier h(x) on data S. Argue why the third assumption is violated. Comparing Hypotheses Assume we like to compare 2 hypothesis h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2. I.e. we are interested in the quantity: d error (h1| P ) error (h2| P ) ? Define estimator for d: d error (h1| X 1) error (h2| X 2)

with X1,X2 sample sets of size n1,n2. Since error(h1|S1) and error(h2|S2) are both approximately Normal their difference is approximately Normal with: mean d error (h1| S 1) error (h2| S 2) var error (h1| S 1)(1 error (h1| S 1)) error (h2| S 2)(1 error (h2| S 2)) n1 n2 Hence, with N% confidence we believe that d is contained in the interval: d mean z N var Say, mean = -0.1, sqrt(var)=0.5. Z(0.8)=1.28. Do you want to conclude that h1 is significantly better than h2?

Paired Tests Consider the following data: error(h1|s1)=0.1 error(h1|s2)=0.2 error(h1|s3)=0.66 error(h1|s4)=0.45 and so on. error(h2|s1)=0.11 error(h2|s2)=0.21 error(h2|s3)=0.67 error(h2|s4)=0.46 We have var(error(h1)) = large, var(error(h2)) = large.

The total variance of error(h1)-error(h2) is their sum. However, h1 is consistently better than h2. We ignored the fact that we compare on the same data. We want a different estimator that compares data one by one. You can use a paired t-test (e.g. in matlab) to see if the two errors are significantly different, or if one error is significantly larger than the other. Paired t-test Chunk the data up in subsets T1,...,Tk with |Ti|>30 On each subset compute the error and compute: i error (h1|Ti ) error (h2|Ti ) Now compute: 1 k i k i 1

k 1 s ( ) (i )2 k (k 1) i 1 State: With N% confidence the difference in error between h1 and h2 is: t N ,k 1s ( ) t is the t-statistic which is related to the student-t distribution (table 5.6). Comparing Learning Algorithms In general it is a really bad idea to estimate error rates on the same data on which a learning algorithm is trained. WHY?

So just as in x-validation, we split the data into k subsets: S{T1,T2,...Tk}. Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset: {S-T1,S-T2,...) to produce hypotheses {L1(S-Ti), L2(S-Ti)} for all i. Compute for all i : i error (L1(S Ti ) |Ti ) error (L2 (S Ti ) |Ti ) Note: we train on S-Ti, but test on Ti. As in the last slide perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypothesis produced by L1 and L2. Homework: Are all assumptions for the statistical test respected. If not, find one that is violated. Do you think that this estimate is correct, optimistic or pessimistic? Evaluation: ROC curves moving threshold class 0 (negatives)

TP = true positive rate = # positives classified as positive divided by # positives FP = false positive rate = # negatives classified as positives divided by # negatives TN = true negative rate = # negatives classified as negatives divided by # negatives FN = false negatives = # positives classified as negative divided by # positives class 1 (positives)

Identify a threshold in your classifier that you can shift. Plot ROC curve while you shift that parameter. Conclusion Never (ever) draw error-curves without confidence intervals (The second most important sentence of this course) Appendix The following slide are optional:

Hypothesis Tests Consider you want to compare again two hypothesis h1 and h2 on sample sets S1 and S2 of size n1 and n2. Now we like to answer: will h2 have significantly less error on future data than h1? We can formulate this as: is the probability P(d>0) larger than 95% (say). Since d mean(d ) this means: What is the total probability that we observe d 0.1 when d>0. P (d ) This can be computed as:

0.1 d 0 dd P (d 0.1| mean d ) dd P (d | mean 0) (move Gauss to the right) mean(d ) d 0.1