Parameter Estimation: Maximum Likelihood (ML) Estimation Chapter 3 (Duda et al.) Sections 3.1-3.2 CS479/679 Pattern Recognition Dr. George Bebis Parameter Estimation Bayesian Decision Theory allows us to design an optimal classifier using the Bayes rule: P ( j / x) p (x / j ) P( j ) p ( x) Need to estimate P(i) and p(x/i) Estimating P(i) is usually not very hard. Estimating p(x/i) could be challenging: Dimensionality of feature space is large. Number of samples is often too small.

Problem Formulation one class Assumptions A set of training samples D ={x1, x2, ...., xn}. Samples are drawn according to p(x|j). p(x|j) has some known parametric form: e.g., p(x /j) ~ N(, ) also denoted as p(x/j ,) or p(x/) where =(, ) Objective of parameter estimation: Given D, find the best possible Problem Formulation c classes Assume c classes and c training data sets (i.e., one for each class). Given D1, D2, ...,Dc and a model p(x/j) ~ p(x/j), for each class j , j=1,2,,c, find the best: 1, 2,, c Assumption: if Dj provides no information about i ( i j ), then we need to solve c independent problems (i.e., one for each class).

Main Methods Maximum Likelihood (ML) Views the parameters fixed quantities whose values are unknown. Estimates by maximizing the likelihood of obtaining the samples observed. Bayesian Estimation (BE) Views the parameters as random variables having some known prior distribution p(. Given D, it converts the prior p(to a posterior density p( D) (i.e., the samples D revise our estimate of the distribution over the parameters ). ML Estimation The ML estimate for D={x1,x2,..,xn} is the value that maximizes the likelihood p(D/): arg max p ( D / ) (choose the value of that is most likely to give rise to the data) that is most likely to give rise to the data) Assuming that the samples in D are drawn independently

according to p(x/j): n p( D / ) p ( x1 , x 2 ,..., xn ) p( x k / ) k 1 ML Estimation (contd) For simplicity, take the log : n p( D / ) p (x k / ) k 1 n ln p ( D / ) ln p (x k / ) log-likelihood k 1 So, we need to maximize ln p(D/ ): arg max ln p( D / )

Example red dots (training data) Assume unknown mean, known variance n p ( D / ) p (x k / ) k 1 n ln p ( D / ) ln p( x k / ) k 1 = ML Estimation (contd) How can we find the maximum of ln p(D/) ? ln p( D / ) 0 or n

k 1 where (gradient) n ln p ( D / ) 0 or n ln p ( D / ) ln p(x k / ) k 1 ln p(x k 1

k / ) 0 ln p(x k / ML for Multivariate Gaussian Density: Case of Unknown = Assume p (x / ) ~ N (, ) 1 d 1 t 1 ln p(x / ) ( x ) ( x ) ln 2 ln | | 2 2 2 Compute the gradient n

ln p ( D / ) 0 or n ln k 1 n ln p( D ln/ p()D/0)or=0 or ln p(x kln/ p)(xk0/ ) 0 k 1 k 1 = 1 ln p ( D / ) ln p(x k / ) (x k ) k

k ML for Multivariate Gaussian Density: Case of Unknown = (contd) Set ln p( D: / ) 0 n 1 (xk ) 0 or k 1 The solution is given by: n x k 1

1 n The ML estimate is simply the sample xmean. k n k 1 k n 0 ML for Univariate Gaussian Density: Case of Unknown =(,2) Assume p ( x / ) ~ N ( , 2 ) 1 1 2 ln 2 2 (x

) or k 2 2 2 1 1 ln p(x k / ) ln 2 2 (x k 1 ) 2 =(1,2)=(,2) 2 2 2 ln p(x k / ) n Compute the gradient ln p ( D / ) 0 or n n ln

k 1 ln p( D /0)or ln/ p()D x k0/ ) 0 =0 or ln p(x kln/ p)( k 1 k 1 ML for Univariate Gaussian Density: 2 1 1 Caselnofp(xUnknown =(, / ) ln 2 (x )(contd) ) or 2

k 2 k 2 2 2 1 1 ln p (x k / ) ln 2 2 (x k 1 ) 2 2 2 2 p(xk/) p(xk/) p(xk/) ML for Univariate Gaussian Density: Case of Unknown =(,2) (contd) n

0 or Set ln p( D / ) =0 ln p ( x k / ) 0 k 1 or n ln p(x k / ) 0 k 1

=0 =0 The solutions are given by: sample mean sample variance ML for Multivariate Gaussian Density: Case of Unknown =(,) In the general case (i.e., multivariate Gaussian) the solutions are: 1 n x k n k 1 sample mean n 1 (x k )(x k )t n k 1 (see Problem 13, page 144)

HW #3 sample covariance Generalize ML Estimation: Maximum A-Posteriori Estimator (MAP) Sometimes, we have some knowledge about . Assume that follows some distribution p(). MAP maximizes p(D/)p() or n p(x k 1 k / ) p() ln [p(D/ )p()]: n ln p(x

k 1 k / ) ln p() Maximum A-Posteriori Estimator (MAP) (contd) What happens when p() is uniform? n ln p(x k / ) ln p() k 1 n ln p(x k

/ ) k 1 MAP becomes equivalent to ML (i.e., ML is a special case of MAP!) p() MAP for Multivariate Gaussian Density: Case of Unknown = Assume p(x / ) ~ N (, Diag ( )) and p () ~ N ( 0 , Diag ( 0 )) (both 0 and 0 are known) Compute ln [p(D/ )p()]: n n ln p(x

k / ) ln p() ln p(x k / ) ln p() k 1 k 1 Maximize ln [p(D/ )p()]: n ( ln p(x k / ) ln p()) 0 k 1 MAP for Multivariate Gaussian Density: Case of Unknown = (contd) n

1 1 (x k ) 2 ( 0 ) 0 2 0 k 1 If or 1 n xk 1 , then 2 n k 1 20 n 0 2 xk k 1

20 1 2 n 20 What happens when 0 0 ? 0 Bias and Variance How good are the ML estimates? Two measures of goodness are used for statistical estimates Bias: how close is the estimate to the true value? Variance: how much does it change for different datasets? Bias and Variance The bias-variance tradeoff: in most cases, you can only decrease one of them at the expense of the other Biased and Unbiased Estimates An estimate is unbiased when E[ ]

The ML estimate is unbiased, i.e., E[ ] The ML estimates and are biased: n 1 2 E[ ] n 2 n 1 E[] n Biased and Unbiased Estimates (contd) How bad is this bias? - The bias is only noticeable when n is small. The following are unbiased estimates of and 1 n 2

( x ) k n 1 k 1 n 1 t ( x )( x

) k k n 1 k 1 Comments ML estimation is simpler than alternative methods. ML provides more accurate estimates as the number of training samples increases. If the assumptions about the model p(x/ ) and independence of the samples are true, then ML works well.