Evaluating Classifiers

Lecture 2
Instructor: Max Welling
Read chapter 5

Evaluation

Given:
- A hypothesis h(x): X → C in hypothesis space H, mapping attributes x to classes c = 1, 2, ..., C.
- A data sample S of size n.

Questions:
- What is the error of h on unseen data?
- If we have two competing hypotheses, which one is better on unseen data?
- How do we compare two learning algorithms in the face of limited data?

How certain are we about our answers?

Sample and True Error

We can define two errors:

1) error(h|S) is the error on the sample S:

   error(h|S) = (1/n) Σ_{i=1..n} [h(x_i) ≠ y_i]

2) error(h|P) is the true error on unseen data sampled from the distribution P(x):

   error(h|P) = ∫ dx P(x) [h(x) ≠ f(x)]

where f(x) is the true hypothesis and [·] equals 1 when its argument is true and 0 otherwise.
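The sample error can be sketched in a few lines of Python; the classifier h, inputs xs, and labels ys below are hypothetical stand-ins, not anything from the lecture:

```python
def sample_error(h, xs, ys):
    """error(h|S): the fraction of the n sample points that h misclassifies."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)

# Toy example: a threshold classifier on 1-D attributes.
h = lambda x: 1 if x > 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                 # h gets x = 0.4 wrong
print(sample_error(h, xs, ys))    # 0.25
```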

Binomial Distributions

Assume you toss a coin n times, and that it has probability p of coming up heads (which we will call success). What is the probability distribution governing the number of heads in n trials? Answer: the Binomial distribution:

   P(# heads = r | p, n) = n! / (r! (n − r)!) · p^r · (1 − p)^(n − r)

Distribution over Errors

Consider some hypothesis h(x). Draw n samples X ~ P(X).
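As a quick aside, the Binomial pmf just stated can be sketched directly (the example values n = 10, p = 0.3 are arbitrary):

```python
from math import comb

def binom_pmf(r, n, p):
    """P(#heads = r | p, n) = n!/(r!(n-r)!) * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Sanity check: the probabilities over r = 0..n sum to one.
total = sum(binom_pmf(r, 10, 0.3) for r in range(11))
print(round(total, 6))
```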

Do this k times, and compute e1 = n·error(h|X1), e2 = n·error(h|X2), ..., ek = n·error(h|Xk).

{e1, ..., ek} are samples from a Binomial distribution! Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x ~ P(x) and observes whether h(x) predicts the label correctly. If it makes an error, He makes sure the coin lands heads up...

You have a single sample S, for which you observe e(S) errors. What would be a reasonable estimate for error(h|P), do you think?

Binomial Moments

   mean(r) = E[r | n, p] = n·p
   var(r)  = E[(r − E[r])²] = n·p·(1 − p)

If we match the mean, n·p, with the observed value n·error(h|S), we find:

   E[error(h|P)] = E[r/n] = p ≈ error(h|S)

If we match the variance, we can obtain an estimate of the width:

   var[error(h|P)] = var[r/n] ≈ error(h|S)·(1 − error(h|S)) / n

Confidence Intervals

We would like to state: with N% confidence we believe that error(h|P) is contained in the interval

   error(h|S) ± z_N · sqrt( error(h|S)·(1 − error(h|S)) / n )

For example, for N = 80% we have z_0.80 = 1.28: a Normal(0,1) variable falls within ±1.28 with probability 0.80.
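A minimal sketch of this interval in Python; the table of z_N values for other confidence levels is an assumption (standard two-sided z-values), only z_0.80 = 1.28 appears on the slide:

```python
from math import sqrt

Z = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}  # assumed two-sided z_N table

def error_interval(err_s, n, confidence=0.80):
    """N% confidence interval for error(h|P), given error(h|S) = err_s."""
    half = Z[confidence] * sqrt(err_s * (1 - err_s) / n)
    return err_s - half, err_s + half

# e.g. 20 errors on a test set of 100 points:
lo, hi = error_interval(0.20, 100)
print(lo, hi)   # roughly (0.149, 0.251)
```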

In principle z_N is hard to compute exactly, but for n·p·(1 − p) > 5 or n > 30 it is safe to approximate a Binomial by a Gaussian, for which we can easily compute z-values.

Bias-Variance

The estimator is unbiased if

   E[error(h|X)] = p

Imagine again you have infinitely many sample sets X1, X2, ... of size n. Use these to compute estimates E1, E2, ... of p, where Ei = error(h|Xi). If the average of E1, E2, ... converges to p, then error(h|X) is an unbiased estimator. Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer?
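Unbiasedness can be illustrated with a small simulation; the true error p = 0.3, sample size n = 50, and number of sample sets k = 2000 are arbitrary choices for the sketch:

```python
import random

random.seed(0)
p, n, k = 0.3, 50, 2000          # true error, sample size, number of sample sets
# Each estimate is error(h|Xi): the error count on a sample of size n, divided by n.
estimates = [sum(random.random() < p for _ in range(n)) / n for _ in range(k)]
avg = sum(estimates) / k
print(abs(avg - p) < 0.01)       # the average of the estimates is very close to p
```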

Flow of Thought

1. Determine the property you want to know about the future data (e.g. error(h|P)).
2. Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)).
3. Determine the distribution P(E) of E under the assumption you have infinitely many sample sets X1, X2, ... of some size n (e.g. P(E) = Binomial(p, n), p = error(h|P)).
4. Estimate the parameters of P(E) from an actual data sample S (e.g. p = error(h|S)).
5. Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to normal distributions: the central limit theorem).
6. State your confidence interval as: with confidence N%, error(h|P) is contained in the interval

   mean ± z_N · sqrt(var)

Assumptions

We only consider discrete-valued hypotheses (i.e. classes).

Training data and test data are drawn IID from the same distribution P(x). (IID: independently & identically distributed.)

The hypothesis must be chosen independently from the data sample S! When you obtain a hypothesis from a learning algorithm, split the data into a training set and a testing set. Find the hypothesis using the training set and estimate the error on the testing set.

Homework: Consider training a classifier h(x) on data S. Argue why the third assumption is violated.

Comparing Hypotheses

Assume we would like to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2. I.e. we are interested in the quantity

   d = error(h1|P) − error(h2|P)

Define an estimator for d:

   d̂ = error(h1|X1) − error(h2|X2)

with X1, X2 sample sets of size n1, n2. Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with

   mean ≈ error(h1|S1) − error(h2|S2)
   var  ≈ error(h1|S1)·(1 − error(h1|S1)) / n1 + error(h2|S2)·(1 − error(h2|S2)) / n2

Hence, with N% confidence we believe that d is contained in the interval

   mean ± z_N · sqrt(var)

Say mean = −0.1, sqrt(var) = 0.5, and z_0.80 = 1.28. Do you want to conclude that h1 is significantly better than h2?
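A sketch of this interval in Python, plugging in the slide's example numbers (mean = −0.1, sqrt(var) = 0.5, z = 1.28); the general function's arguments e1, n1, e2, n2 are the tested errors and sample sizes:

```python
from math import sqrt

def diff_interval(e1, n1, e2, n2, z):
    """N% confidence interval for d = error(h1|P) - error(h2|P)."""
    mean = e1 - e2
    sd = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return mean - z * sd, mean + z * sd

# The slide's example, with mean and sd given directly:
mean, sd, z = -0.1, 0.5, 1.28
lo, hi = mean - z * sd, mean + z * sd
print(lo, hi)   # roughly (-0.74, 0.54): the interval contains 0,
                # so the difference is not significant at 80% confidence
```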

Paired Tests

Consider the following data:

   error(h1|s1) = 0.10    error(h2|s1) = 0.11
   error(h1|s2) = 0.20    error(h2|s2) = 0.21
   error(h1|s3) = 0.66    error(h2|s3) = 0.67
   error(h1|s4) = 0.45    error(h2|s4) = 0.46

and so on. We have var(error(h1)) = large and var(error(h2)) = large.

The total variance of error(h1) − error(h2) is their sum. However, h1 is consistently better than h2. We ignored the fact that we compare on the same data. We want a different estimator that compares the errors sample by sample. You can use a paired t-test (e.g. in Matlab) to see if the two errors are significantly different, or if one error is significantly larger than the other.

Paired t-test

Chunk the data up into subsets T1, ..., Tk with |Ti| > 30. On each subset compute the error difference

   δi = error(h1|Ti) − error(h2|Ti)

Now compute:

   δ̄ = (1/k) Σ_{i=1..k} δi

   s(δ̄) = sqrt( 1/(k(k−1)) · Σ_{i=1..k} (δi − δ̄)² )

State: with N% confidence the difference in error between h1 and h2 is

   δ̄ ± t_{N,k−1} · s(δ̄)

where t is the t-statistic, which is related to the Student-t distribution (table 5.6).

Comparing Learning Algorithms

In general it is a really bad idea to estimate error rates on the same data on which a learning algorithm is trained. WHY?
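The paired t-test statistics δ̄ and s(δ̄) from the slide above can be sketched as follows, applied to the h1/h2 error pairs listed earlier (only k = 4 subsets here, purely for illustration; t_{N,k−1} must still be looked up in a t-table):

```python
from math import sqrt

def paired_stats(deltas):
    """Return the mean difference (delta-bar) and its standard error s(delta-bar)."""
    k = len(deltas)
    dbar = sum(deltas) / k
    s = sqrt(sum((d - dbar) ** 2 for d in deltas) / (k * (k - 1)))
    return dbar, s

e1 = [0.10, 0.20, 0.66, 0.45]
e2 = [0.11, 0.21, 0.67, 0.46]
dbar, s = paired_stats([a - b for a, b in zip(e1, e2)])
print(dbar, s)   # dbar is about -0.01 with s about 0:
                 # h1 is consistently (if slightly) better than h2
```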

So, just as in cross-validation, we split the data into k subsets: S → {T1, T2, ..., Tk}. Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S−T1, S−T2, ...}, to produce hypotheses {L1(S−Ti), L2(S−Ti)} for all i. Compute, for all i,

   δi = error(L1(S−Ti)|Ti) − error(L2(S−Ti)|Ti)

Note: we train on S−Ti, but test on Ti. As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2.

Homework: Are all assumptions for the statistical test respected? If not, find one that is violated. Do you think that this estimate is correct, optimistic, or pessimistic?

Evaluation: ROC Curves

[Figure: score distributions for class 0 (negatives) and class 1 (positives), with a moving decision threshold.]

TP = true positive rate  = # positives classified as positive, divided by # positives
FP = false positive rate = # negatives classified as positive, divided by # negatives
TN = true negative rate  = # negatives classified as negative, divided by # negatives
FN = false negative rate = # positives classified as negative, divided by # positives
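Computing one point of the ROC curve for a given threshold can be sketched as below; the scores and labels are hypothetical classifier outputs, not data from the lecture:

```python
def roc_point(scores, labels, threshold):
    """(FP rate, TP rate) when we classify score >= threshold as positive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    tp_rate = sum(1 for s in pos if s >= threshold) / len(pos)
    fp_rate = sum(1 for s in neg if s >= threshold) / len(neg)
    return fp_rate, tp_rate

scores = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(roc_point(scores, labels, 0.5))   # one (fp_rate, tp_rate) point
```

Sliding the threshold from above the highest score down to below the lowest traces the full ROC curve from (0, 0) to (1, 1).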

Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that parameter.

Conclusion

Never (ever) draw error curves without confidence intervals. (The second most important sentence of this course.)

Appendix

The following slides are optional:

Hypothesis Tests

Consider again comparing two hypotheses h1 and h2 on sample sets S1 and S2 of size n1 and n2. Now we would like to answer: will h2 have significantly less error on future data than h1? We can formulate this as: is the probability P(d > 0) larger than 95% (say)?

Suppose we observe d̂ = 0.1. Since mean(d̂) = d, asking for P(d > 0) amounts to asking: what is the total probability that d̂ did not overestimate d by more than 0.1? This can be computed as:

   P(d > 0) = ∫_{−∞}^{d+0.1} dd̂ P(d̂ | mean = d) = ∫_{−∞}^{0.1} dd̂ P(d̂ | mean = 0)   (move the Gaussian to the right)
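With the Gaussian shifted to mean 0, this probability is a Normal CDF evaluated at the observed d̂. A sketch, where the observed d̂ = 0.1 is from the slide but the two standard deviations are illustrative assumptions:

```python
from math import erf, sqrt

def prob_d_positive(d_hat, sd):
    """P(d > 0) = P(Normal(0, sd) < d_hat): the shifted-Gaussian integral."""
    return 0.5 * (1 + erf(d_hat / (sd * sqrt(2))))

print(prob_d_positive(0.1, 0.05))   # well above 0.95: significant
print(prob_d_positive(0.1, 0.5))    # barely above 0.5: not significant
```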
