# CS479/679 Pattern Recognition Spring 2006 - Prof. Bebis

Parameter Estimation: Maximum Likelihood (ML) Estimation
Chapter 3 (Duda et al.), Sections 3.1-3.2
CS479/679 Pattern Recognition, Dr. George Bebis

## Parameter Estimation

Bayesian Decision Theory allows us to design an optimal classifier using the Bayes rule:

$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}$$

To apply it, we need to estimate $P(\omega_i)$ and $p(\mathbf{x} \mid \omega_i)$:

- Estimating $P(\omega_i)$ is usually not very hard.
- Estimating $p(\mathbf{x} \mid \omega_i)$ can be challenging:
  - the dimensionality of the feature space is large;
  - the number of samples is often too small.

## Problem Formulation (one class)

Assumptions:

- A set of training samples $D = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$.
- The samples are drawn according to $p(\mathbf{x} \mid \omega_j)$.
- $p(\mathbf{x} \mid \omega_j)$ has some known parametric form, e.g., $p(\mathbf{x} \mid \omega_j) \sim N(\mu, \Sigma)$, also denoted $p(\mathbf{x} \mid \omega_j, \theta)$ or $p(\mathbf{x} \mid \theta)$ where $\theta = (\mu, \Sigma)$.

Objective of parameter estimation: given $D$, find the best possible estimate $\hat{\theta}$.

## Problem Formulation (c classes)

Assume $c$ classes and $c$ training data sets (i.e., one for each class). Given $D_1, D_2, \ldots, D_c$ and a model $p(\mathbf{x} \mid \omega_j) \sim p(\mathbf{x} \mid \theta_j)$ for each class $\omega_j$, $j = 1, 2, \ldots, c$, find the best estimates $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_c$.

Assumption: $D_j$ provides no information about $\theta_i$ ($i \neq j$), so we can solve $c$ independent problems (i.e., one for each class).

## Main Methods

**Maximum Likelihood (ML):**

- Views the parameters as fixed quantities whose values are unknown.
- Estimates $\theta$ by maximizing the likelihood of obtaining the samples observed.

**Bayesian Estimation (BE):**

- Views the parameters as random variables having some known prior distribution $p(\theta)$.
- Given $D$, it converts the prior $p(\theta)$ to a posterior density $p(\theta \mid D)$ (i.e., the samples $D$ revise our estimate of the distribution over the parameters $\theta$).

## ML Estimation

The ML estimate $\hat{\theta}$ for $D = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is the value of $\theta$ that maximizes the likelihood $p(D \mid \theta)$:

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta)$$

(i.e., choose the value of $\theta$ that is most likely to give rise to the data). Assuming that the samples in $D$ are drawn independently according to $p(\mathbf{x} \mid \theta)$:

$$p(D \mid \theta) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta)$$

For simplicity, take the log:

$$\ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) \quad \text{(log-likelihood)}$$

So, we need to maximize the log-likelihood:

$$\hat{\theta} = \arg\max_{\theta}\, \ln p(D \mid \theta)$$
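Besides simplifying the algebra, the log also matters numerically: a product of many densities underflows in floating point, while the sum of logs stays well-scaled. A minimal sketch with synthetic data (all names and values here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=2000)  # synthetic samples from N(0, 1)

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x | mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 0.0, 1.0

# Direct product of 2000 densities underflows to 0.0 in double precision...
likelihood = np.prod(gauss_pdf(x, mu, sigma))

# ...while the log-likelihood (sum of logs) remains a finite, usable number.
log_likelihood = np.sum(np.log(gauss_pdf(x, mu, sigma)))

print(likelihood, log_likelihood)
```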

## Example

The slide figure shows training data (red dots) drawn from a Gaussian with unknown mean and known variance, together with plots of the likelihood $p(D \mid \theta)$ and the log-likelihood $\ln p(D \mid \theta)$ as functions of the mean; the ML estimate $\hat{\theta}$ is the location of their common peak.

## ML Estimation (cont'd)

How can we find the maximum of $\ln p(D \mid \theta)$? Set the gradient to zero:

$$\nabla_{\theta} \ln p(D \mid \theta) = 0 \quad \text{or} \quad \sum_{k=1}^{n} \nabla_{\theta} \ln p(\mathbf{x}_k \mid \theta) = 0$$

where $\nabla_{\theta} = \left( \dfrac{\partial}{\partial \theta_1}, \ldots, \dfrac{\partial}{\partial \theta_p} \right)^t$ (gradient).

## ML for Multivariate Gaussian Density: Case of Unknown $\theta = \mu$

Assume $p(\mathbf{x} \mid \theta) \sim N(\mu, \Sigma)$ with $\Sigma$ known. Then

$$\ln p(\mathbf{x}_k \mid \mu) = -\frac{1}{2} (\mathbf{x}_k - \mu)^t \Sigma^{-1} (\mathbf{x}_k - \mu) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma|$$

Compute the gradient:

$$\nabla_{\mu} \ln p(\mathbf{x}_k \mid \mu) = \Sigma^{-1} (\mathbf{x}_k - \mu)$$

Set $\nabla_{\mu} \ln p(D \mid \mu) = \sum_{k=1}^{n} \nabla_{\mu} \ln p(\mathbf{x}_k \mid \mu) = 0$:

$$\sum_{k=1}^{n} \Sigma^{-1} (\mathbf{x}_k - \hat{\mu}) = 0$$

The solution is given by:

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$$

The ML estimate is simply the sample mean.
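The closed-form result can be checked numerically. A hedged sketch (synthetic one-dimensional data; names and values are illustrative) comparing the sample mean against a brute-force grid search over the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training data drawn from N(mu=2.0, sigma=1.0); sigma is known.
true_mu, sigma = 2.0, 1.0
D = rng.normal(true_mu, sigma, size=50)

def log_likelihood(mu, data, sigma):
    """Sum of ln p(x_k | mu) for a Gaussian with known sigma."""
    return np.sum(-0.5 * ((data - mu) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

# ML estimate in closed form: the sample mean.
mu_ml = D.mean()

# Brute-force check: scan a fine grid of candidate means.
grid = np.linspace(0.0, 4.0, 4001)
mu_grid = grid[np.argmax([log_likelihood(m, D, sigma) for m in grid])]

print(mu_ml, mu_grid)  # the two estimates agree to grid resolution
```

The grid search is only a sanity check; the whole point of the derivation is that the maximizer is available in closed form.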

## ML for Univariate Gaussian Density: Case of Unknown $\theta = (\mu, \sigma^2)$

Assume $p(x \mid \theta) \sim N(\mu, \sigma^2)$ with $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$. Then

$$\ln p(x_k \mid \theta) = -\frac{1}{2} \ln 2\pi\theta_2 - \frac{1}{2\theta_2} (x_k - \theta_1)^2$$

Compute the gradient:

$$\nabla_{\theta} \ln p(x_k \mid \theta) = \begin{pmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$$

Set $\nabla_{\theta} \ln p(D \mid \theta) = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k \mid \theta) = 0$:

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}(x_k - \hat{\theta}_1) = 0 \qquad -\sum_{k=1}^{n} \frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$$

The solutions are given by:

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \quad \text{(sample mean)} \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \quad \text{(sample variance)}$$

## ML for Multivariate Gaussian Density: Case of Unknown $\theta = (\mu, \Sigma)$

In the general case (i.e., multivariate Gaussian) the solutions are:

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \quad \text{(sample mean)}$$

$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\mu})(\mathbf{x}_k - \hat{\mu})^t \quad \text{(sample covariance)}$$

(see Problem 13, page 144; also HW #3)
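Both closed-form estimates are one line each in code. A minimal sketch on synthetic data (values and names are illustrative); note that NumPy's `var` with `ddof=0` divides by $n$ and therefore matches the ML estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from N(mu=5, sigma^2=4).
x = rng.normal(5.0, 2.0, size=1000)

n = x.size
mu_hat = x.sum() / n                        # sample mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # ML (biased) sample variance

# NumPy equivalents: np.var divides by n when ddof=0 (the default),
# which is exactly the ML estimate.
assert np.isclose(mu_hat, x.mean())
assert np.isclose(sigma2_hat, x.var(ddof=0))
print(mu_hat, sigma2_hat)
```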

## Generalizing ML Estimation: Maximum A-Posteriori (MAP) Estimator

Sometimes we have some knowledge about $\theta$. Assume that $\theta$ follows some known distribution $p(\theta)$. MAP maximizes $p(D \mid \theta) p(\theta)$:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) = \arg\max_{\theta} \left[ \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta) \right] p(\theta)$$

or, taking the log of $p(D \mid \theta) p(\theta)$:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \left[ \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) + \ln p(\theta) \right]$$

## Maximum A-Posteriori Estimator (MAP) (cont'd)

What happens when $p(\theta)$ is uniform? Then $\ln p(\theta)$ is a constant, so

$$\arg\max_{\theta} \left[ \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) + \ln p(\theta) \right] = \arg\max_{\theta} \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta)$$

and MAP becomes equivalent to ML (i.e., ML is a special case of MAP!).

## MAP for Multivariate Gaussian Density: Case of Unknown $\theta = \mu$

Assume

$$p(\mathbf{x} \mid \mu) \sim N(\mu, \sigma^2 I) \quad \text{and} \quad p(\mu) \sim N(\mu_0, \sigma_0^2 I)$$

(both $\mu_0$ and $\sigma_0$ are known). Maximize $\ln [p(D \mid \mu) p(\mu)]$:

$$\nabla_{\mu} \left[ \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \mu) + \ln p(\mu) \right] = 0$$

which gives

$$\frac{1}{\sigma^2} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\mu}) - \frac{1}{\sigma_0^2} (\hat{\mu} - \mu_0) = 0$$

Solving for $\hat{\mu}$:

$$\hat{\mu} = \frac{\mu_0 + \dfrac{\sigma_0^2}{\sigma^2} \displaystyle\sum_{k=1}^{n} \mathbf{x}_k}{1 + \dfrac{\sigma_0^2}{\sigma^2}\, n}$$

What happens when $\sigma_0 \gg \sigma$ (a very broad, nearly uninformative prior)? The prior term becomes negligible and $\hat{\mu} \to \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$, the ML estimate.

## Bias and Variance

How good are the ML estimates? Two measures of goodness are used for statistical estimates:

- **Bias:** how close is the estimate to the true value?
- **Variance:** how much does the estimate change for different data sets?

The bias-variance tradeoff: in most cases, you can only decrease one of them at the expense of the other.

## Biased and Unbiased Estimates

An estimate $\hat{\theta}$ is unbiased when $E[\hat{\theta}] = \theta$.
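The MAP formula for the Gaussian mean can be sketched directly. The example below (synthetic data; the function name and all values are illustrative assumptions) shows the two limiting behaviors: a tight prior pulls the estimate toward $\mu_0$, and a very broad prior recovers the ML estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: N(mu=3, sigma=1); Gaussian prior on mu: N(mu0, sigma0^2).
sigma, mu0 = 1.0, 0.0
x = rng.normal(3.0, sigma, size=20)

def map_mean(x, sigma, mu0, sigma0):
    """MAP estimate of a Gaussian mean under a N(mu0, sigma0^2) prior."""
    n = x.size
    r = sigma0 ** 2 / sigma ** 2
    return (mu0 + r * x.sum()) / (1 + r * n)

mu_ml = x.mean()
mu_map_tight = map_mean(x, sigma, mu0, sigma0=0.1)    # strong prior: pulled toward mu0
mu_map_broad = map_mean(x, sigma, mu0, sigma0=100.0)  # broad prior: approx. ML

print(mu_ml, mu_map_tight, mu_map_broad)
```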

The ML estimate $\hat{\mu}$ is unbiased, i.e., $E[\hat{\mu}] = \mu$. The ML estimates $\hat{\sigma}^2$ and $\hat{\Sigma}$ are biased:

$$E[\hat{\sigma}^2] = \frac{n-1}{n}\, \sigma^2 \qquad E[\hat{\Sigma}] = \frac{n-1}{n}\, \Sigma$$

## Biased and Unbiased Estimates (cont'd)

How bad is this bias? The bias is only noticeable when $n$ is small. The following are unbiased estimates of $\sigma^2$ and $\Sigma$:

$$s^2 = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$

$$S = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\mu})(\mathbf{x}_k - \hat{\mu})^t$$

## Comments

- ML estimation is simpler than alternative methods.
- ML provides more accurate estimates as the number of training samples increases.
- If the assumptions about the model $p(\mathbf{x} \mid \theta)$ and the independence of the samples are true, then ML works well.
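The bias $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ can be verified empirically by averaging the ML variance estimate over many small samples. A minimal sketch (synthetic data; sample sizes and counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirically check E[sigma_hat^2] = (n-1)/n * sigma^2 for the ML variance
# estimate, using many small samples from N(0, sigma^2 = 1).
n, trials, sigma2 = 5, 200_000, 1.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

ml_vars = samples.var(axis=1, ddof=0)        # divide by n (biased, ML)
unbiased_vars = samples.var(axis=1, ddof=1)  # divide by n-1 (unbiased)

print(ml_vars.mean())        # approx. (n-1)/n * sigma^2 = 0.8
print(unbiased_vars.mean())  # approx. sigma^2 = 1.0
```

With $n = 5$ the bias factor is $4/5$, which is clearly visible; for large $n$ the factor $(n-1)/n$ approaches 1 and the bias becomes negligible, consistent with the remark above.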
