# 12 Simple Linear Regression and Correlation

Copyright Cengage Learning. All rights reserved.

## Correlation

There are many situations in which the objective in studying the joint behavior of two variables is to see whether they are related, rather than to use one to predict the value of the other. In this section, we first develop the sample correlation coefficient r as a measure of how strongly related two variables x and y are in a sample and then relate r to the correlation coefficient $\rho$ defined in Chapter 5.

## The Sample Correlation Coefficient r

Given n numerical pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, it is natural to speak of x and y as having a positive relationship if large x's are paired with large y's and small x's with small y's. Similarly, if large x's are paired with small y's and small x's with large y's, then a negative relationship between the variables is implied. Consider the quantity

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Then if the relationship is strongly positive, an $x_i$ above the mean $\bar{x}$ will tend to be paired with a $y_i$ above the mean $\bar{y}$, so that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and this product will also be positive whenever both $x_i$ and $y_i$ are below their respective means. Thus a positive relationship implies that $S_{xy}$ will be positive. An analogous argument shows that when the relationship is negative, $S_{xy}$ will be negative, since most of the products will be negative. This is illustrated in Figure 12.19.

Figure 12.19 (a) Scatter plot with $S_{xy}$ positive; (b) scatter plot with $S_{xy}$ negative [+ means $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and − means $(x_i - \bar{x})(y_i - \bar{y}) < 0$]

Although $S_{xy}$ seems a plausible measure of the strength of a relationship, we do not yet have any idea of how positive or negative it can be.
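The sign argument for $S_{xy}$ can be checked numerically. The following is a minimal sketch in Python; the function name `s_xy` and the data are hypothetical, chosen only to show one increasing and one decreasing pattern.

```python
def s_xy(xs, ys):
    """Sum of cross-deviations: sum of (x_i - x_bar) * (y_i - y_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Hypothetical data (not from the text), used only to illustrate the sign.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_increasing = [2.1, 2.9, 4.2, 4.8, 6.0]  # large x paired with large y
ys_decreasing = [6.0, 4.8, 4.2, 2.9, 2.1]  # large x paired with small y

print(s_xy(xs, ys_increasing))  # positive
print(s_xy(xs, ys_decreasing))  # negative
```

In the first data set most points fall in the "+" regions of a plot like Figure 12.19(a); in the second they fall in the "−" regions.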

Unfortunately, $S_{xy}$ has a serious defect: By changing the unit of measurement for either x or y, $S_{xy}$ can be made either arbitrarily large in magnitude or arbitrarily close to zero. For example, if $S_{xy} = 25$ when x is measured in meters, then $S_{xy} = 25{,}000$ when x is measured in millimeters and .025 when x is expressed in kilometers.

A reasonable condition to impose on any measure of how strongly x and y are related is that the calculated measure should not depend on the particular units used to measure them. This condition is achieved by modifying $S_{xy}$ to obtain the sample correlation coefficient.

Definition

The sample correlation coefficient for the n pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is

$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}$$
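To make the unit problem concrete, here is a short Python sketch that computes $S_{xy}$ and $r = S_{xy}/(\sqrt{S_{xx}}\sqrt{S_{yy}})$ for the same x values in meters and in millimeters. The function name `sample_corr` and the data are assumptions made for illustration; they do not come from the text.

```python
import math

def sample_corr(xs, ys):
    """Return (r, S_xy) for paired data, with r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy), s_xy

# Hypothetical measurements: x in meters, y in arbitrary units.
x_m = [1.2, 2.5, 3.1, 4.8, 5.0]
y = [10.0, 14.0, 13.0, 19.0, 22.0]
x_mm = [1000 * x for x in x_m]  # the same lengths expressed in millimeters

r_m, sxy_m = sample_corr(x_m, y)
r_mm, sxy_mm = sample_corr(x_mm, y)

print(sxy_m, sxy_mm)  # S_xy grows by a factor of 1000
print(r_m, r_mm)      # r is the same up to floating-point rounding
```

Dividing by $\sqrt{S_{xx}}\sqrt{S_{yy}}$ cancels the factor of 1000 that the unit change introduces into $S_{xy}$, which is why r is unaffected.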

### Example 12.15

An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article "Productivity Ratings Based on Soil Series" (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between the yield of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/Ha) for eight different types of soil.

[Data table and computation of r omitted.]

## Properties of r

The most important properties of r are as follows:

1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. $-1 \le r \le 1$
4. r = 1 if and only if (iff) all $(x_i, y_i)$ pairs lie on a straight line with positive slope, and $r = -1$ iff all $(x_i, y_i)$ pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model; in symbols, $(r)^2 = r^2$.

Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, $s^2$, etc.) depend on which of the two variables is treated as the dependent variable. However, Property 5 shows that the proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which variable plays this role.

Property 2 is equivalent to saying that r is unchanged if each $x_i$ is replaced by $cx_i$ and if each $y_i$ is replaced by $dy_i$ (a change in the scale of measurement, with c and d of the same sign), as well as if each $x_i$ is replaced by $x_i - a$ and $y_i$ by $y_i - b$ (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.
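Properties 1, 2, and 5 can be verified on a small data set. The sketch below is illustrative only: the helper names `corr` and `r_squared_from_fit` and the temperature and sales figures are hypothetical, and the coefficient of determination is computed as 1 − SSE/SST from a least squares line.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def r_squared_from_fit(xs, ys):
    """Coefficient of determination 1 - SSE/SST from regressing ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # least squares slope
    b0 = y_bar - b1 * x_bar    # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data: temperature in Celsius and some response.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
sales = [12.0, 18.0, 17.0, 25.0, 31.0, 30.0]
temps_f = [1.8 * c + 32 for c in temps_c]  # same temperatures in Fahrenheit

r = corr(temps_c, sales)
print(r, corr(sales, temps_c))   # Property 1: labels can be swapped
print(r, corr(temps_f, sales))   # Property 2: units do not matter
print(r ** 2, r_squared_from_fit(temps_c, sales),
      r_squared_from_fit(sales, temps_c))  # Property 5, either role for y
```

The Fahrenheit conversion combines a scale change (c = 1.8) with a shift of the zero point, so it exercises both parts of Property 2 at once.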

Property 3 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with $r = -1$. According to Property 4, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude.

Thus r measures the degree of linear relationship among variables. A value of r near 0 is not evidence of the lack of a strong relationship, but only the absence of a linear relation, so that such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
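A small numerical illustration of this caution, as a sketch with made-up data: y is an exact function of x in both cases below, yet r is 0 for a symmetric quadratic pattern and falls short of 1 for a cubic one. The function name `corr` is hypothetical.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Deterministic but nonlinear: y = x^2 on x values symmetric about 0.
xs_sym = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
r_quadratic = corr(xs_sym, [x ** 2 for x in xs_sym])

# Deterministic and increasing, but not a straight line: y = x^3.
xs_pos = [0.0, 1.0, 2.0, 3.0, 4.0]
r_cubic = corr(xs_pos, [x ** 3 for x in xs_pos])

print(r_quadratic)  # 0.0: no linear relationship despite exact dependence
print(r_cubic)      # roughly 0.906: strong, but still below 1
```

The quadratic case corresponds to a pattern like panel (d) of Figure 12.20: r near 0 with a clear nonlinear relationship.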

Figure 12.20 Data plots for different values of r: (a) r near +1; (b) r near −1; (c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship

A frequently asked question is, "When can it be said that there is a strong correlation between the variables, and when is the correlation weak?" Here is an informal rule of thumb for characterizing the value of r:

- Weak: $-.5 \le r \le .5$
- Moderate: either $-.8 < r < -.5$ or $.5 < r < .8$
- Strong: either $r \ge .8$ or $r \le -.8$

It may surprise you that an r as substantial as .5 or −.5 goes in the weak category.

The rationale is that if r = .5 or −.5, then $r^2 = .25$ in a regression with either variable playing the role of y. A regression model that explains at most 25% of observed variation is not in fact very impressive. In Example 12.15, the correlation between corn yield and peanut yield would be described as weak.
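The informal rule of thumb (weak for |r| ≤ .5, moderate for .5 < |r| < .8, strong for |r| ≥ .8) translates directly into a small helper. This is only a sketch; the function name `describe_strength` is hypothetical, and the cutoffs are the informal ones quoted above, not a formal test.

```python
def describe_strength(r):
    """Classify r as 'weak', 'moderate', or 'strong' by the informal rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude <= 0.5:
        return "weak"
    if magnitude < 0.8:
        return "moderate"
    return "strong"

for value in (0.5, -0.5, 0.65, 0.8, -0.83):
    # r**2 is the proportion of variation explained in a simple linear regression
    print(value, describe_strength(value), round(value ** 2, 4))
```

Printing $r^2$ alongside the label shows why the boundary sits at .5: at that point only 25% of the observed variation is explained.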

## Inferences About the Population Correlation Coefficient

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs $(x_i, y_i)$ as having been drawn from a bivariate population of pairs, with $(X_i, Y_i)$ having some joint pmf or pdf. In Chapter 5 we defined the correlation coefficient $\rho(X, Y)$ by

$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$

where

$$\mathrm{Cov}(X, Y) = \begin{cases} \displaystyle\sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (X, Y)\ \text{discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (X, Y)\ \text{continuous} \end{cases}$$

If we think of p(x, y) or f(x, y) as describing the distribution of pairs of values within the entire population, $\rho$ becomes a measure of how strongly related x and y are in that population. Properties of $\rho$ analogous to those for r were given in Chapter 5.
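For a discrete population, $\rho = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y)$ can be computed directly from a joint pmf. The sketch below uses a small hypothetical pmf on four (x, y) pairs; the table and variable names are assumptions made for illustration.

```python
import math

# Hypothetical joint pmf p(x, y); the probabilities sum to 1.
pmf = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

mu_x = sum(x * p for (x, _), p in pmf.items())
mu_y = sum(y * p for (_, y), p in pmf.items())

# Cov(X, Y) = sum over all pairs of (x - mu_X)(y - mu_Y) p(x, y)
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
var_x = sum((x - mu_x) ** 2 * p for (x, _), p in pmf.items())
var_y = sum((y - mu_y) ** 2 * p for (_, y), p in pmf.items())

rho = cov / (math.sqrt(var_x) * math.sqrt(var_y))
print(rho)  # about 0.408
```

Here $\rho$ is a fixed population characteristic; a sample drawn from this pmf would give an r that varies from sample to sample around it.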

The population correlation coefficient $\rho$ is a parameter or population characteristic, just as $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ are, so we can use the sample correlation coefficient to make various inferences about $\rho$. In particular, r is a point estimate for $\rho$, and the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$

### Example 12.16

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because such babies have higher mortality rates, numerous investigations have focused on the relationship between mother's age and birth weight. One such study is described in the article "Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?" (Maternal and Child Health J., 2009: 847–856).

The following data on x = maternal age (yr) and y = baby's birth weight (g) is consistent with summary quantities given in the cited article as well as with data published by the National Center for Health Statistics.

[Data table omitted.]

A scatterplot of the data shows a rather substantial increasing linear pattern. The sample correlation coefficient r is computed from the relevant summary quantities $S_{xx}$, $S_{yy}$, and $S_{xy}$. With $\rho$ denoting the correlation between mother's age and baby's weight in the entire population of adolescent mothers who gave birth, the point estimate of $\rho$ is $\hat{\rho} = r$.

The small-sample intervals and test procedures presented in Chapters 7–9 were based on an assumption of population normality. To test hypotheses about $\rho$, an analogous assumption about the distribution of pairs of (x, y) values in the population is required. We are now assuming that both X and Y are random, whereas much of our regression work focused on x fixed by the experimenter. Specifically, we assume that (X, Y) has a bivariate normal probability distribution as described in Section 5.2. Recall that in this case $\rho = 0$ implies that X and Y are independent rvs.

Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses about $\rho$ and to construct a CI. There is no completely satisfactory way to check the plausibility of the bivariate normality assumption. A partial check involves constructing two separate normal probability plots, one for the sample $x_i$'s and another for the sample $y_i$'s, since bivariate normality implies that the marginal distributions of both X and Y are normal. If either plot deviates substantially from a straight-line pattern, the following inferential procedures should not be used for small n.
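The partial check can be sketched as follows: for each variable separately, pair the ordered observations with standard normal percentiles and look for a straight-line pattern. The plotting positions $(i - .5)/n$ used here are one common convention and are an assumption, as are the function name `normal_plot_points` and the data; drawing the plot itself is left to whatever graphics tool is at hand.

```python
from statistics import NormalDist

def normal_plot_points(sample):
    """Return (z percentile, ordered observation) pairs for a normal probability plot."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Assumed convention: the i-th smallest observation is paired with the
    # 100(i - .5)/n-th percentile of the standard normal distribution.
    z_scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z_scores, ordered))

# Hypothetical samples of x and y values (not from the text).
x_sample = [64.0, 70.5, 66.2, 68.9, 71.3, 65.4, 69.0, 67.7]
y_sample = [63.1, 71.0, 65.8, 69.5, 72.4, 64.9, 68.2, 68.0]

x_points = normal_plot_points(x_sample)
y_points = normal_plot_points(y_sample)

for z, obs in x_points:
    print(round(z, 3), obs)
```

Each list of points would then be plotted separately; marked curvature in either plot casts doubt on the corresponding marginal normality and hence on bivariate normality.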

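Testing for the absence of correlation: when $H_0\colon \rho = 0$ is true and the pairs come from a bivariate normal distribution, the test statistic

$$T = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$

has a t distribution with n − 2 df.

### Example 12.17

Neurotoxic effects of manganese are well known and are usually caused by high occupational exposure over long periods of time.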

In the fields of occupational hygiene and environmental hygiene, the relationship between lipid peroxidation (which is responsible for deterioration of foods and damage to live tissue) and occupational exposure has not been previously reported. The article "Lipid Peroxidation in Workers Exposed to Manganese" (Scand. J. of Work and Environ. Health, 1996: 381–386) gives data on x = manganese concentration in blood (ppb) and y = concentration (mmol/L) of malondialdehyde, which is a stable product of lipid peroxidation, both for a sample of 22 workers exposed to manganese and for a control sample of 45 individuals. The value of r for the control sample is .29, from which

$$t = \frac{(.29)\sqrt{45-2}}{\sqrt{1-(.29)^2}} \approx 2.0$$

The corresponding P-value for a two-tailed t test based on 43 df is roughly .052 (the cited article reported only that P-value > .05). We would not want to reject the assertion that $\rho = 0$ at either significance level .01 or .05. For the sample of exposed workers, r = .83 and $t \approx 6.7$, clear evidence that there is a linear association in the entire population of exposed workers from which the sample was selected.

Because $\rho$ measures the extent to which there is a linear relationship between the two variables in the population, the null hypothesis $H_0\colon \rho = 0$ states that there is no such population relationship. In Section 12.3, we used the t ratio $\hat{\beta}_1 / s_{\hat{\beta}_1}$ to test for a linear relationship between the two variables in the context of regression analysis. It turns out that the two test procedures are completely equivalent because $r\sqrt{n-2}/\sqrt{1-r^2} = \hat{\beta}_1 / s_{\hat{\beta}_1}$. When interest lies only in assessing the strength of any linear relationship rather than in fitting a model and using it to estimate or predict, the test statistic formula just presented requires fewer computations than does the t ratio.
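The statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ is easy to evaluate, and its equivalence to the regression t ratio (estimated slope divided by its estimated standard error) can be checked numerically. In the sketch below, the r and n values for the two manganese samples come from Example 12.17; the function names and the small paired data set used for the equivalence check are hypothetical.

```python
import math

def corr_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), for testing H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def corr_and_slope_t(xs, ys):
    """Return (r, slope t ratio) from a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx                   # estimated slope
    sse = s_yy - b1 * s_xy             # error sum of squares
    s = math.sqrt(sse / (n - 2))       # estimate of sigma
    t_ratio = b1 / (s / math.sqrt(s_xx))
    return s_xy / math.sqrt(s_xx * s_yy), t_ratio

t_control = corr_t_statistic(0.29, 45)   # control sample
t_exposed = corr_t_statistic(0.83, 22)   # exposed workers
print(round(t_control, 2), round(t_exposed, 2))

# Equivalence check on hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 2.9, 4.5, 4.1, 6.2, 6.8]
r, t_ratio = corr_and_slope_t(xs, ys)
print(corr_t_statistic(r, len(xs)), t_ratio)  # the two values agree
```

The P-value would come from a t distribution with n − 2 df, which is not part of the Python standard library, so the sketch stops at the test statistic.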

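## Other Inferences Concerning $\rho$

The procedure for testing $H_0\colon \rho = \rho_0$ when $\rho_0 \ne 0$ is not equivalent to any procedure from regression analysis. The test statistic as well as confidence interval formula are based on a transformation of R developed by the famous statistician R. A. Fisher.

Proposition

When $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate normal distribution, the rv

$$V = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right)$$

has approximately a normal distribution with mean and variance

$$\mu_V = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \qquad \sigma_V^2 = \frac{1}{n-3}$$

The test statistic for $H_0\colon \rho = \rho_0$ is

$$Z = \frac{V - \frac{1}{2}\ln\left[(1+\rho_0)/(1-\rho_0)\right]}{1/\sqrt{n-3}}$$

The rationale for the transformation is to obtain a function of R that has a variance independent of $\rho$; this would not be the case with R itself. Also, the transformation should not be used if n is quite small, since the approximation will not be valid.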
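A minimal Python sketch of the test based on Fisher's transformation follows. It assumes the usual form of the statistic, $z = (v - \tfrac{1}{2}\ln[(1+\rho_0)/(1-\rho_0)])\sqrt{n-3}$ with $v = \tfrac{1}{2}\ln[(1+r)/(1-r)]$, and uses `math.atanh`, which computes exactly that logarithmic expression. The function name `fisher_z_test` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, rho0, n):
    """Return (v, z, upper-tailed P-value) for H0: rho = rho0 vs Ha: rho > rho0."""
    v = math.atanh(r)            # 0.5 * ln((1 + r) / (1 - r))
    mu0 = math.atanh(rho0)       # mean of V when H0 is true
    z = (v - mu0) * math.sqrt(n - 3)      # V has variance 1 / (n - 3)
    p_upper = 1 - NormalDist().cdf(z)     # area to the right of z
    return v, z, p_upper

# Hypothetical inputs, for illustration only.
v, z, p_value = fisher_z_test(r=0.6, rho0=0.3, n=28)
print(round(v, 4), round(z, 3), round(p_value, 4))
```

For a lower-tailed or two-tailed alternative the P-value line would change accordingly (left-tail area, or twice the smaller tail area).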

### Example 12.18

The article "Size Effect in Shear Strength of Large Beams: Behavior and Finite Element Modelling" (Mag. of Concrete Res., 2005: 497–509) reported on a study of various characteristics of large reinforced concrete deep and shallow beams tested until failure. Consider the following data on x = cube strength and y = cylinder strength (both in MPa):

[Data table omitted.]

From the summary quantities $S_{xx}$, $S_{yy}$, and $S_{xy}$, r = .761. Does this provide strong evidence for concluding that the two measures of strength are at least moderately positively correlated? Our previous interpretation of moderate positive correlation was $.5 < \rho < .8$, so we wish to test $H_0\colon \rho = .5$ versus $H_a\colon \rho > .5$.

The computed value of V is then

$$v = .5\ln\left(\frac{1+.761}{1-.761}\right) = .999 \quad\text{and}\quad .5\ln\left(\frac{1+.5}{1-.5}\right) = .549$$

Thus $z = (.999 - .549)\sqrt{n-3}$, where n is the sample size. The P-value for this upper-tailed test is $1 - \Phi(z)$. The null hypothesis can therefore be rejected at significance level .05 but not at level .01. This latter result is somewhat surprising in light of the magnitude of r, but when n is small, a reasonably large r may result even when $\rho$ is not all that substantial. At significance level .01, the evidence for a moderately positive correlation is not compelling.

To obtain a CI for $\rho$, we first derive an interval for $\mu_V = \frac{1}{2}\ln[(1+\rho)/(1-\rho)]$. Standardizing V, writing a probability statement, and manipulating the resulting inequalities yields

$$\left(v - \frac{z_{\alpha/2}}{\sqrt{n-3}},\; v + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) \tag{12.10}$$

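as a $100(1-\alpha)\%$ interval for $\mu_V$, where $v = \frac{1}{2}\ln[(1+r)/(1-r)]$. This interval can then be manipulated to yield the desired CI: a $100(1-\alpha)\%$ confidence interval for $\rho$ is

$$\left(\frac{e^{2c_1}-1}{e^{2c_1}+1},\; \frac{e^{2c_2}-1}{e^{2c_2}+1}\right)$$

where $c_1$ and $c_2$ are the left and right endpoints, respectively, of the interval (12.10).

### Example 12.19

As far back as Leonardo da Vinci, it was known that x = height and y = wingspan (measured fingertip to fingertip while arms are outstretched side to side) are closely related. Here are measurements from a random sample of students taking a statistics course:

[Data table omitted.]

A scatterplot shows an approximate linear pattern, and so do normal probability plots of x and y. The sample correlation coefficient is computed to be r = .9422. Its Fisher transformation is

$$v = .5\ln\left(\frac{1+.9422}{1-.9422}\right) = 1.757$$

A 95% CI for $\mu_V$ is $1.757 \pm 1.96/\sqrt{n-3}$, with endpoints $c_1$ and $c_2$. The CI for $\rho$ with a confidence level of approximately 95% is therefore obtained by transforming these endpoints:

$$\left(\frac{e^{2c_1}-1}{e^{2c_1}+1},\; \frac{e^{2c_2}-1}{e^{2c_2}+1}\right)$$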

Notice that the interval includes only values exceeding .8, so it appears that there is a strong linear association between the two variables in the sampled population.

In Chapter 5, we cautioned that a large value of the correlation coefficient (near 1 or −1) implies only association and not causation. This applies to both $\rho$ and r.
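The confidence interval for $\rho$ based on Fisher's transformation can be sketched in a few lines: compute $v \pm z_{\alpha/2}/\sqrt{n-3}$ and map each endpoint c back through $(e^{2c}-1)/(e^{2c}+1)$, which is `math.tanh(c)`. In the example call, r = .9422 is the value from Example 12.19, but n = 16 is an assumed sample size used purely for illustration, since the sample size is not stated in the text above. The function name `fisher_ci` is also hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf=0.95):
    """Approximate CI for rho via Fisher's transformation."""
    v = math.atanh(r)                                  # 0.5 * ln((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    half_width = z_crit / math.sqrt(n - 3)
    c1, c2 = v - half_width, v + half_width            # interval for mu_V
    # tanh(c) = (e^{2c} - 1) / (e^{2c} + 1) maps the endpoints back to rho
    return math.tanh(c1), math.tanh(c2)

# r from Example 12.19; n = 16 is an ASSUMED sample size, for illustration only.
lower, upper = fisher_ci(r=0.9422, n=16)
print(round(lower, 3), round(upper, 3))
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric about r; it is narrower on the side closer to 1.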