Machine Learning and Data Mining
Linear regression
(adapted from slides by Prof. Alexander Ihler)

Supervised learning
Notation:
- Features x
- Targets y
- Predictions ŷ
- Parameters θ
A learning algorithm takes training data (examples: features plus feedback / target values) and produces a program (the learner) that maps features to predictions.

The learner is characterized by some parameters θ: it is a procedure (using θ) that outputs a prediction. We score performance with a cost function, then change θ to improve performance.

Linear regression
Predictor: evaluate a line, ŷ = θ0 + θ1 x
[Figure: a line fit to data points, with feature x on the horizontal axis and target y on the vertical axis.]

Define the form of the function f(x) explicitly, then find a good f(x) within that family.

More dimensions?
[Figure: two 3-D views of a plane ŷ = θ0 + θ1 x1 + θ2 x2 fit to data over features x1 and x2.]


Notation
Define a constant feature x0 = 1. Then the prediction is a single dot product:
ŷ = θ0 x0 + θ1 x1 + … + θn xn = θ · x

Measuring error
The error, or residual, is the difference between an observation and its prediction: e(i) = y(i) − ŷ(i).
[Figure: a fitted line with vertical segments marking the residual at each data point.]

Mean squared error
How can we quantify the error? Average the squared residuals:
J(θ) = (1/m) Σi ( y(i) − θ · x(i) )²
We could choose something else, of course, but MSE is computationally convenient (more later), measures the variance of the residuals, and corresponds to the likelihood under a Gaussian model of the noise.
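These slides use Matlab; as a NumPy sketch (with made-up numbers chosen for illustration), prepending the constant feature x0 = 1 folds the intercept θ0 into a single dot product:

```python
import numpy as np

# Made-up parameters and data for illustration
theta = np.array([2.0, 0.5])         # [theta0 (intercept), theta1 (slope)]
x = np.array([4.0, 10.0, 16.0])      # raw feature values for three examples

# Prepend the constant feature x0 = 1 to each example
X = np.column_stack([np.ones_like(x), x])   # shape (3, 2)

# Prediction is now a single dot product per example: yhat = theta . x
yhat = X @ theta
print(yhat)   # [ 4.  7. 10.]
```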

MSE cost function
Rewrite using matrix form (Matlab), with y a 1 × m row vector of targets and X an n × m matrix whose columns are the feature vectors:

>> e = y - th*X; J = e*e'/m;

Visualizing the cost function
[Figure: the cost surface J(θ) and its contours over the parameters θ0 and θ1.]
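The same one-line computation can be sketched in NumPy (made-up data; th is a row vector and the columns of X are examples, matching the slide's convention):

```python
import numpy as np

th = np.array([1.0, 2.0])             # parameters (theta0, theta1)
X = np.array([[1.0, 1.0, 1.0],        # constant feature x0 = 1
              [0.0, 1.0, 2.0]])       # feature x1; columns are examples
y = np.array([1.0, 4.0, 5.0])         # targets

e = y - th @ X            # residuals, e = y - th*X
J = (e @ e) / y.size      # mean squared error, J = e*e'/m
print(J)                  # 1/3 for this made-up data
```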




Finding good parameters
We want to find the parameters θ that minimize our error. Think of a cost surface: the error residual as a function of θ.

Machine Learning and Data Mining
Linear regression: direct minimization
(adapted from slides by Prof. Alexander Ihler)

MSE minimum
Consider a simple problem: one feature and two data points. Two unknowns (θ0, θ1) and two equations:
θ0 + θ1 x(1) = y(1)
θ0 + θ1 x(2) = y(2)
We can solve this system directly. However, most of the time m > n: there may be no linear function that hits all the data exactly. Instead, solve directly for the minimum of the MSE function by setting its gradient to zero:
∇θ J(θ) = 0  ⇒  θ (X Xᵀ) = y Xᵀ

SSE minimum
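With two data points and two unknowns, the system can be solved exactly; a small NumPy sketch with made-up points:

```python
import numpy as np

# Two made-up data points: (x, y) = (1, 3) and (3, 7)
A = np.array([[1.0, 1.0],    # rows are [1, x(i)]: theta0 + theta1*x(i) = y(i)
              [1.0, 3.0]])
y = np.array([3.0, 7.0])

theta = np.linalg.solve(A, y)   # exact solution of the 2x2 system
print(theta)   # [1. 2.] -> the line y = 1 + 2x passes through both points
```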

Reordering, we have
θ = y Xᵀ (X Xᵀ)⁻¹
Xᵀ (X Xᵀ)⁻¹ is called the pseudo-inverse of X. If X is square and invertible, it equals the ordinary inverse; if m > n, the system is overdetermined, and the pseudo-inverse gives the minimum-MSE fit.

Matlab SSE
This is easy to solve in Matlab:

% y = [y1 ... ym]                            % 1 x m row of targets
% X = [x0_1 ... x0_m ; x1_1 ... x1_m ; ...]  % n x m; columns are examples
% Solution 1: manual
th = y * X' * inv(X * X');
% Solution 2: mrdivide
th = y / X;   % least-squares solution of th*X = y

Effects of MSE choice
Sensitivity to outliers.
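The same computation sketched in NumPy with made-up, overdetermined data (m > n); np.linalg.lstsq plays the role of Matlab's mrdivide and avoids forming the inverse explicitly:

```python
import numpy as np

# Columns are examples (n x m), matching the Matlab convention above
X = np.array([[1.0, 1.0, 1.0, 1.0],          # constant feature x0 = 1
              [0.0, 1.0, 2.0, 3.0]])         # feature x1
y = np.array([1.1, 2.9, 5.2, 6.8])           # made-up noisy targets

# Normal equations: th = y X' (X X')^{-1}
th_manual = y @ X.T @ np.linalg.inv(X @ X.T)

# Preferred: solve the least-squares problem th X = y directly
th, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(th)   # [1.09 1.94] for this made-up data
```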

The squared (L2) cost places a heavy penalty on large errors, so a single outlying datum can contribute most of the total cost and drag the fit toward itself.
[Figure: left, the quadratic cost as a function of the residual; right, a fit where one outlier incurs a very large squared cost.]

L1 error
[Figure: fits using the L2 cost on the original data and the L1 cost on both the original and the outlier data; the L1 fit is much less affected by the outlier.]

Cost functions for regression
Mean squared error (MSE): J(θ) = (1/m) Σi ( y(i) − θ · x(i) )²
Mean absolute error (MAE): J(θ) = (1/m) Σi | y(i) − θ · x(i) |
Something else entirely? Arbitrary cost functions can't be minimized in closed form; use gradient descent instead.
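A minimal gradient-descent sketch for the MSE cost (NumPy; the data, step size, and iteration count are made up for illustration):

```python
import numpy as np

# Made-up data: columns are examples, first row is the constant feature
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])    # exactly y = 1 + 2x
m = y.size

th = np.zeros(2)      # initial parameters
alpha = 0.1           # step size, chosen by hand for this example

for _ in range(2000):
    e = th @ X - y                    # residuals
    grad = (2.0 / m) * (e @ X.T)      # gradient of J = (1/m) e e'
    th = th - alpha * grad

print(th)   # converges to roughly [1, 2]
```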

Machine Learning and Data Mining
Linear regression: nonlinear features
(adapted from slides by Prof. Alexander Ihler)

Nonlinear functions
What if our hypotheses are not lines? For example, higher-order polynomials.
[Figure: the same data fit with an order-1 polynomial (a line) and an order-3 polynomial.]


Nonlinear functions
Single feature x, predict target y: ŷ = θ0 + θ1 x
Add features: ŷ = θ0 + θ1 x + θ2 x² + θ3 x³
This is still linear regression, just in the new features. It is sometimes useful to think of it as a feature transform Φ(x) = [1, x, x², x³], with ŷ = θ · Φ(x).

Higher-order polynomials
Fit in the same way; there are simply more features.
[Figure: an order-1 polynomial fit.]
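The feature-transform view can be sketched in NumPy (made-up data; np.vander builds the polynomial features):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])     # single raw feature (made-up)
y = np.array([1.0, 2.0, 7.0, 22.0])    # made-up targets

# Feature transform Phi(x) = [1, x, x^2, x^3]: still *linear* in theta
Phi = np.vander(x, 4, increasing=True)   # shape (4, 4)

# Ordinary linear regression in the transformed features
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)   # ≈ [1, 1, -1, 1], i.e. y = 1 + x - x^2 + x^3
```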

[Figure: order-2 and order-3 polynomial fits to the same data.]

Features
In general, we can use any features we think are useful:
- other information about the problem (square footage, location, age, ...)
- polynomial functions (features [1, x, x², x³, ...])
- other functions (1/x, sqrt(x), x1 * x2, ...)

Linear regression is linear in the parameters; the features themselves can be made as complex as we want.

Higher-order polynomials
Are more features better? These hypotheses are nested: 2nd order is more general than 1st, 3rd order more general than 2nd, and so on, so each fits the observed data at least as well as the last.

Overfitting and complexity
More complex models will always fit the training data better, but they may overfit it, learning complex relationships that are not really present.
[Figure: a simple model and a complex model fit to the same (X, Y) data; the complex model passes near every training point but oscillates between them.]

Test data
After training the model, go out and get more data from the world: new observations (x, y). How well does the model perform on them?
[Figure: the fitted model shown against the training data and against new test data.]

Training versus test error
Plot the MSE as a function of model complexity (polynomial order).

Training error decreases: a more complex function fits the training data better.
What about new data? From 0th to 1st order the test error decreases (underfitting is being reduced); at higher orders the test error increases (overfitting).
[Figure: training and test MSE versus polynomial order; the training error falls monotonically while the test error is U-shaped.]
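The training-versus-test-error picture can be reproduced in a few lines (a NumPy sketch with synthetic data; the generating model, noise level, and sample sizes are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic world: y = 1 + 2x + Gaussian noise
def sample(m):
    x = rng.uniform(0, 4, m)
    y = 1 + 2 * x + rng.normal(0, 0.5, m)
    return x, y

x_train, y_train = sample(10)     # small training set
x_test, y_test = sample(100)      # fresh test set from the same world

def mse(order):
    # Fit a polynomial of the given order on the training data
    Phi = np.vander(x_train, order + 1, increasing=True)
    th, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train = np.mean((Phi @ th - y_train) ** 2)
    test = np.mean((np.vander(x_test, order + 1, increasing=True) @ th
                    - y_test) ** 2)
    return train, test

for order in range(6):
    train, test = mse(order)
    print(f"order {order}: train MSE {train:.3f}, test MSE {test:.3f}")
```

Training error can only go down as the order increases (the hypotheses are nested), while test error drops sharply from order 0 to order 1 and then stops improving.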