Regression analysis is a statistical technique used to obtain an equation relating 2 variables. A regression line lets us estimate one of the variables when the corresponding value of the other variable is known.
In this section, we are going to learn how to draw regression lines (lines of best fit). There are actually 3 methods that I know of:
1. By eye method
You look at the bunch of dots, estimate using your eye, and start drawing the line. Not a good idea though. You probably used this method for your STPM Physics paper 3.
2. L & R method
We first find the average values of x and y, which give the mid-point (x̅, y̅). We draw a horizontal and a vertical line through this mid-point. Then we find the mid-point of the data on the left and on the right of the vertical line, and connect these 3 mid-points to obtain a line.
3. Least squares regression line
This is probably the best method of all, and we will be learning how to do it below.
METHOD OF LEAST SQUARES
The term ‘least squares’ tells us that the sum of the squares of the distances between the points and the line is minimized. For a least squares regression line of y on x, the distance taken into account is the vertical distance. This line always passes through the mean point of the data, (x̅, y̅). Take a look at the graph below.
The red dots are the data points, while the blue line is the least squares regression line. The line is drawn in such a way that the sum of the squares of the vertical distances between the red dots and the blue line (the green lines) is minimized. For any set of data, we can form 2 least squares regression lines, namely
y = a + bx
x = c + dy
The line y = a + bx is known as the regression line of y on x, while the line x = c + dy is the regression line of x on y. Note that they are 2 different lines: you cannot get one simply by rearranging the formula of the other. The line of y on x is used when x is the independent variable and y is the dependent one. The line of x on y, however, is used only under 2 conditions:
1. when neither variable is controlled and you want to estimate x for a given value of y.
2. when y is the independent variable, and x the dependent variable.
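To see why the two lines are different and what "least squares" means in practice, here is a minimal sketch. The data are made up purely for illustration, and the helper names `fit` and `sum_sq_vertical` are hypothetical. The gradient is computed with the standard least squares result (sum of products of deviations divided by sum of squared deviations), which is assumed here rather than derived.

```python
# Illustrative data only (made up, not taken from any exam question).
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [3.0, 4.0, 7.0, 8.0, 12.0]

def mean(values):
    return sum(values) / len(values)

def fit(u, v):
    """Least squares regression line of v on u: v = intercept + slope * u."""
    u_bar, v_bar = mean(u), mean(v)
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    slope = s_uv / s_uu
    intercept = v_bar - slope * u_bar  # forces the line through the mean point
    return intercept, slope

def sum_sq_vertical(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = fit(xs, ys)  # y on x: y = a + b*x
c, d = fit(ys, xs)  # x on y: x = c + d*y

best = sum_sq_vertical(a, b, xs, ys)
shifted = sum_sq_vertical(a + 0.5, b, xs, ys)  # same gradient, moved up
tilted = sum_sq_vertical(a, b + 0.1, xs, ys)   # same intercept, steeper

print("y on x beats nearby lines:", best < shifted and best < tilted)
# Rearranging x = c + d*y into y = ... would give gradient 1/d, not b.
print("gradient of y on x:", b, "| gradient implied by x on y:", 1 / d)
```

Any other line you try should give a larger sum of squared vertical distances than the y on x line. The two printed gradients differ unless the points lie exactly on a straight line.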
Notice another thing. In this chapter, the lines are not written as y = mx + c. The gradient is b, and by the usual convention it is written after the constant a, so we write y = a + bx, not y = bx + a. The constant b is known as the regression coefficient of y on x, and d is the regression coefficient of x on y. They are both calculated using the formulas
where r is the product-moment correlation coefficient you learned in the previous section. The term r² has a name too: the coefficient of determination.
You don’t really need to understand what it means, but memorize it in case they ask you to define it in exams. Take note that 0 ≤ r² ≤ 1.
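For reference, the standard textbook forms of these coefficients can be written as below. This is a sketch in the usual notation, assumed to match what is intended here; S_xx, S_xy and S_yy are the same sums you are asked to show in your workings.

```latex
S_{xx} = \sum (x - \bar{x})^2, \qquad
S_{yy} = \sum (y - \bar{y})^2, \qquad
S_{xy} = \sum (x - \bar{x})(y - \bar{y})

b = \frac{S_{xy}}{S_{xx}}, \qquad
d = \frac{S_{xy}}{S_{yy}}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
\quad\Longrightarrow\quad r^2 = bd
```

Since S_xx and S_yy are never negative, b, d and r all take their sign from S_xy.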
Coming back to the relationship between the correlation coefficient and the regression coefficients, we can see that if
* b and d are positive, then r is positive.
* b and d are negative, then r is negative.
and from there, a can be found. Note that x̅ and y̅ can be replaced with any point (x, y) that is known to lie on the line, and you get the same line.
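One standard way to get the constants, using the fact that each regression line passes through (x̅, y̅), can be sketched as follows. The same steps are assumed to apply to c and d for the line of x on y.

```latex
y - \bar{y} = b\,(x - \bar{x})
\;\Longrightarrow\; y = (\bar{y} - b\,\bar{x}) + b\,x
\;\Longrightarrow\; a = \bar{y} - b\,\bar{x}

x - \bar{x} = d\,(y - \bar{y})
\;\Longrightarrow\; c = \bar{x} - d\,\bar{y}
```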
By the way, sometimes the relationships are not that straightforward. You might be asked to use coding, in the form Y = a + bX, to transform variables which are not linearly related into a linear form that can be analysed using regression lines. Common examples are
Most questions on this chapter ask you to do these few things:
1. Plot scatter diagrams, and draw a regression line on it
All you need to do is use the table of data given, plot the scatter diagram (on graph paper), and find the respective values using your calculator to get the values of a and b.
2. Make predictions and estimations
Sometimes you are asked to extrapolate the line to find a particular value of y given x, and to say whether the result is sensible. Remember: extrapolation of a regression line is unreliable. You are expected to understand that there are uncertainties in such predictions. In the case of a graph of age against running speed, you know that it doesn’t mean the older you are, the faster you run!
3. Calculator estimation
Sometimes you are given a value of x within the range of the data and asked to find the value of y, using the regression line you formulated. The estimated value of y is denoted by ŷ. It is not hard: with your regression line in hand, just substitute the value of x into it, and you get the estimated value of y. On your calculator, you can press
[number] [x̂] [=] to find x̂, and
[number] [ŷ] [=] to find ŷ.
However, do take note that you find x̂ using the equation x = c + dy, and you find ŷ using the equation y = a + bx. Remember which variable is the dependent one and which is the independent one, because it makes a lot of difference.
4. Find the correlation / regression coefficient or the coefficient of determination
This is quite obvious. That was why we learned them in the first place.
Before I end this chapter, let us take a look at an example, and we will learn how to use your calculator to find the regression line too.
The following table shows the marks (x) obtained in a mid-year examination and the marks (y) obtained in the year-end exam by a group of 9 students.
a) Plot the scatter diagram.
b) Find the equation of the estimated least squares regression line of y on x, and x on y, and plot them.
c) A 10th student obtained a mark of 70 in the mid-year exam but was absent from the year-end exam. Estimate the mark that this student would have obtained in the year-end exam.
So now, we are to find the regression lines. Firstly, key all your data into your calculator. Remember to clear your previous data by pressing SHIFT + CLR, then ‘1’, then ‘=’ (refer to the previous post on how to key in data in REG mode). Now press SHIFT + S-VAR. Press the right button until you see A B r. Guess what, the A and B given are the coefficients a and b of the line that you wanted. So you have immediately found the regression line of y on x,
y = 15.83 + 0.72x
Remember to show your workings though. You need to show how you calculate sxx, sxy, syy, x̅ and y̅. For the equation x = c + dy, there’s no shortcut, so you have to calculate it yourself, which gives you
As for the estimate, you can use your calculator again. From the SHIFT + S-VAR function, and typing the formula I posted above, you should get 66.38.
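As a quick check of part (c), you can substitute x = 70 into the rounded line y = 15.83 + 0.72x by hand or with a few lines of code. The small gap between this and the calculator's 66.38 is presumably because the calculator keeps the unrounded values of a and b. That explanation is an assumption, not something worked out from the data.

```python
# Rounded coefficients of the y-on-x line quoted in the text.
a, b = 15.83, 0.72

mid_year_mark = 70                  # the 10th student's mid-year mark (x)
y_hat = a + b * mid_year_mark       # estimated year-end mark

print(round(y_hat, 2))  # 66.23
```

In an exam, it is safer to carry more decimal places for a and b until the final step.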
Regression analysis will be very useful in the future, especially when you collect a lot of data for your company, and you want to see the relationships between variables. Master it, and it will help you. ☺