**Regression analysis** is a statistical technique used to obtain an equation relating 2 variables. A regression line lets us estimate the value of one variable when the corresponding value of the other variable is known.

In this section, we are going to learn how to draw regression lines (lines of best fit). There are actually 3 methods that I know of:

**1. By eye method**

You look at the bunch of dots, estimate using your eye, and start drawing the line. Not a good idea though. You probably used this method for your STPM Physics paper 3.

**2. L & R method**

We first find the average values of **x** and **y**. We draw a horizontal and a vertical line through this mid-point. Then, we proceed to find the mid-point of the data on the left and on the right of the vertical line, and we connect these 3 mid-points to obtain a line.

**3. Least squares regression line**

This is probably the best method of all, and we will be learning how to do it below.

__METHOD OF LEAST SQUARES__

The term ‘least squares’ tells us that the sum of the squares of the distances between the points and the line is minimized. For a **least squares regression line of y on x**, the distance taken into account is the vertical distance. This line always passes through the **mid-point** of the data, **(x̅, y̅)**. Take a look at the graph below.

The red dots are the scatters, while the blue line is the least squares regression line. The line is drawn in such a way that the sum of squares of the vertical distances between the red dots and the blue line (green lines) is minimized. So to form a least squares regression line, we have 2 equations of lines, namely

**y = a + bx**

**x = c + dy**
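To make the ‘least squares’ idea concrete, here is a minimal Python sketch. The data points are made up for illustration (they are not from any exam question), and the function name `sum_squared_vertical_distances` is my own. It compares the least squares line of y on x for that data with two arbitrarily chosen lines.

```python
def sum_squared_vertical_distances(xs, ys, a, b):
    """Sum of the squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustration data (an assumption, not taken from the post).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# For this data the least squares line of y on x works out to y = 2.2 + 0.6x.
sse_best = sum_squared_vertical_distances(xs, ys, 2.2, 0.6)

# Two other candidate lines, picked arbitrarily for comparison.
sse_steeper = sum_squared_vertical_distances(xs, ys, 1.0, 1.0)
sse_flat = sum_squared_vertical_distances(xs, ys, 4.0, 0.0)

print(sse_best, sse_steeper, sse_flat)
```

The first value printed is the smallest of the three. No other straight line gives a smaller total for this data, which is exactly what ‘least squares’ means.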

The line **y = a + bx** is known as the regression line of **y on x**, while the line **x = c + dy** is the regression line of **x on y**. Note that they are 2 different lines: in general, one is not just a rearrangement of the other. The line of **y on x** is used when **x** is the independent variable and **y** is the dependent one. However, the line of **x on y** is used only under 2 conditions:

**1.** when neither variable is controlled and you want to estimate **x** for a given value of **y**.

**2.** when **y** is the independent variable, and **x** the dependent variable.

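The line of **x on y**, according to its equation, can be rearranged as **y = −c/d + (1/d)x**, so when it is drawn on the usual axes its gradient and y-intercept are as follows:

**gradient = 1/d, y-intercept = −c/d**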

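Notice another thing. In this chapter, the lines are not written as **y = mx + c**. The gradient is **b**, and by the usual convention it is written after the constant **a**, so we write **y = a + bx**, not **y = bx + a**. The constant **b** is known as the **regression coefficient of y on x**, and **d** is the **regression coefficient of x on y**. They are calculated using the formulas

**b = s_{xy} / s_{xx}** and **d = s_{xy} / s_{yy}**,

where **s_{xy} = Σxy − (Σx)(Σy)/n**, **s_{xx} = Σx^{2} − (Σx)^{2}/n** and **s_{yy} = Σy^{2} − (Σy)^{2}/n**, which in the end, you find **b** to be

**b = (nΣxy − ΣxΣy) / (nΣx^{2} − (Σx)^{2})**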

If you look closely, you will see that

**bd = r^{2}**

where **r** is the product-moment correlation coefficient you learned in the previous section. The term **r^{2}** has a name too: it is called the **coefficient of determination of regression lines**.
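Here is a short Python sketch of how the two regression coefficients relate to **r**. It assumes the usual definitions s_xy = Σxy − (Σx)(Σy)/n, s_xx = Σx² − (Σx)²/n and s_yy = Σy² − (Σy)²/n, with b = s_xy / s_xx, d = s_xy / s_yy and r = s_xy / √(s_xx · s_yy). The data and the helper name `corrected_sums` are made up for illustration.

```python
import math


def corrected_sums(xs, ys):
    """Return (s_xx, s_yy, s_xy) computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


# Made-up illustration data (an assumption, not taken from the post).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xx, s_yy, s_xy = corrected_sums(xs, ys)
b = s_xy / s_xx                      # regression coefficient of y on x
d = s_xy / s_yy                      # regression coefficient of x on y
r = s_xy / math.sqrt(s_xx * s_yy)    # product-moment correlation coefficient

print(b, d, b * d, r ** 2)
```

The last two numbers printed agree (up to rounding), and **b**, **d** and **r** all take the sign of s_xy.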

**r^{2}** tells us the proportion (often quoted as a percentage) of the variation in **y** that can be explained by **x**. Or in other words,

**r^{2} = (explained variation in y) / (total variation in y)**

You don’t really need to understand what it means, but memorize it in case they ask you to define it in exams. Take note that **0 ≤ r^{2} ≤ 1**.
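If you are curious what ‘explained’ means here, this is a minimal sketch under my own assumptions: the ‘total variation’ is taken as Σ(y − y̅)², and the ‘explained variation’ as Σ(ŷ − y̅)², where ŷ is the value on the regression line of y on x. The data and the function name are made up for illustration.

```python
def coefficient_of_determination(xs, ys):
    """r^2 computed as explained variation / total variation for the line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum(((a + b * x) - y_bar) ** 2 for x in xs)
    return explained / total


# Made-up illustration data (an assumption, not taken from the post).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_squared = coefficient_of_determination(xs, ys)
print(r_squared)
```

For this data the result is 0.6 up to rounding, so about 60% of the variation in **y** is accounted for by the line.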

Coming back to the relationship between the correlation coefficient and the regression coefficients, we can see that if

* **b** and **d** are positive, then **r** is positive.
* **b** and **d** are negative, then **r** is negative.

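Finding **b** is not enough to plot the regression line of **y on x**. Since the line passes through **(x̅, y̅)**, its equation, in the end, will be

**y − y̅ = b(x − x̅)**

and from there, **a** can be found: **a = y̅ − bx̅**. Note that **x̅** and **y̅** can be replaced by the coordinates of any other point that lies on the regression line, and you get the same line. An arbitrary data point **(x, y)** from the table will generally not do, because the data points need not lie on the line.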
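As a quick sketch of that last step, the Python below finds **b** first and then gets **a** from the fact that the line passes through (x̅, y̅). The data and the function name `regression_line_y_on_x` are made up for illustration.

```python
def regression_line_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar  # from y - y_bar = b * (x - x_bar), evaluated at x = 0
    return a, b


# Made-up illustration data (an assumption, not taken from the post).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line_y_on_x(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# The fitted line reproduces y_bar when we substitute x_bar.
print(a, b, a + b * x_bar, y_bar)
```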

By the way, sometimes the relationship is not that straightforward. You might be asked to make use of **coding**, in the form of **Y = a + bX**, to transform a relationship between variables that is not linear into a linear one that can be analysed using regression lines. Common examples are
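As one illustration of coding (my own choice of example, which may not be one of those originally listed), take a power law y = k·xⁿ, where `k` and `n_power` are hypothetical names. Taking logarithms gives lg y = lg k + n·lg x, so with **Y = lg y** and **X = lg x** we are back to **Y = a + bX**, where a = lg k and b = n. A minimal sketch with made-up data:

```python
import math


def fit_line(X, Y):
    """Least squares line Y = a + b*X; returns (a, b)."""
    n = len(X)
    X_bar = sum(X) / n
    Y_bar = sum(Y) / n
    s_XX = sum((x - X_bar) ** 2 for x in X)
    s_XY = sum((x - X_bar) * (y - Y_bar) for x, y in zip(X, Y))
    b = s_XY / s_XX
    return Y_bar - b * X_bar, b


# Made-up data generated exactly from y = 2 * x**3 (an assumption for illustration).
xs = [1, 2, 3, 4]
ys = [2, 16, 54, 128]

# Code the variables: X = lg x, Y = lg y.
X = [math.log10(x) for x in xs]
Y = [math.log10(y) for y in ys]

a, b = fit_line(X, Y)
k = 10 ** a   # undo the coding on the intercept: a = lg k
n_power = b   # the gradient of the coded line is the power n

print(k, n_power)
```

Because the made-up data follows the power law exactly, the coded points lie on a straight line and the fit recovers k ≈ 2 and n ≈ 3.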

Most statistical questions on this chapter mainly ask you to do these few things:

**1. Plot scatter diagrams, and draw a regression line on it**

All you need to do is use the table of data given, plot the scatter diagram (on **graph paper**), and use your calculator to find the values of **a** and **b**.

**2. Make predictions and estimations**

Sometimes you are asked to extrapolate the line to find a particular value of **y** given **x**, and to say whether the result is sensible. Remember: extrapolation of a regression line is **unreliable**. You are expected to understand that there are uncertainties in such predictions. In the case of a graph of age against running speed, you know that it doesn’t mean the older you are, the faster you run!
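One simple way to keep yourself honest about extrapolation is to check whether the given **x** lies inside the range of the data before trusting the estimate. A minimal sketch, with a made-up line, made-up data and a hypothetical function name:

```python
def estimate_y(a, b, x, observed_xs):
    """Return (y_hat, is_extrapolation) for the line y = a + b*x."""
    y_hat = a + b * x
    # Anything outside the observed range of x counts as extrapolation here.
    is_extrapolation = x < min(observed_xs) or x > max(observed_xs)
    return y_hat, is_extrapolation


# Made-up illustration data and its least squares line y = 2.2 + 0.6x.
observed_xs = [1, 2, 3, 4, 5]

inside = estimate_y(2.2, 0.6, 3, observed_xs)    # x within the data: interpolation
outside = estimate_y(2.2, 0.6, 10, observed_xs)  # x beyond the data: extrapolation

print(inside, outside)
```

The second call still returns a number, but the flag warns that it comes from outside the data, where the linear pattern may no longer hold.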

**3. Calculator estimation**

Sometimes you are given a value of **x** within the range of the data, and asked to find the corresponding value of **y** using the regression line you formulated. The estimated value of **y** is denoted by **ŷ**. It is not hard: with your regression line in hand, just substitute the value of **x** into it, and you get the value of **ŷ**. On your calculator, you can press **[number]** **[x̂]** **[=]** to find **x̂**, and **[number]** **[ŷ]** **[=]** to find **ŷ**.

However, do take note that you find **x̂** using the equation **x = c + dy**, and you find **ŷ** by using the equation **y = a + bx**. Remember which variable is dependent and which is independent: it makes a lot of difference.
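To see why the choice of line matters, here is a small sketch that estimates **x** for a given **y** in two ways: properly, with the line of x on y, and improperly, by rearranging the line of y on x. The data and the function name `fit_both_lines` are made up for illustration.

```python
def fit_both_lines(xs, ys):
    """Return ((a, b), (c, d)) for y = a + b*x and x = c + d*y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    d = s_xy / s_yy
    return (y_bar - b * x_bar, b), (x_bar - d * y_bar, d)


# Made-up illustration data (an assumption, not taken from the post).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

(a, b), (c, d) = fit_both_lines(xs, ys)

y_given = 5
x_hat = c + d * y_given           # estimate from the line of x on y
x_rearranged = (y_given - a) / b  # rearranging y on x: a different answer

print(x_hat, x_rearranged)
```

The two answers differ for this data (about 4 against about 4.67), which is why the two lines are not interchangeable unless the points are perfectly correlated.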

**4. Find the correlation / regression coefficient or the coefficient of determination**

This is quite obvious. That was why we learned them in the first place.

Before I end this chapter, let us take a look at an example, and we will learn how to use your calculator to find the regression line too.

*The following table shows the marks (x) obtained in a mid-year examination and the marks (y) obtained in the year-end exam by a group of 9 students.*

*a) Plot the scatter diagram.*

*b) Find the equation of the estimated least squares regression line of y on x, and x on y, and plot them.*

*c) A 10th student obtained a mark of 70 in the mid-year exam but was absent from the year-end exam. Estimate the mark that this student would have obtained in the year-end exam.*

I think you shouldn’t have a problem plotting the diagram, right? It looks something like this:

So now, we are to find the regression lines. Firstly, key in all your data into your calculator. Remember to clear your previous data by pressing **SHIFT + CLR**, then ‘**1**’, then ‘**=**’ (refer to the previous post on how to key data in **REG mode**). Now press **SHIFT + S-VAR**. Press the right button until you see **A B r**. Guess what, the values of **A** and **B** given are the coefficients **a** and **b** of the line that you wanted. So you immediately find the regression line of **y on x**,

**y = 15.83 + 0.72x**

Remember to show your workings though. You need to show how you calculate **s_{xx}**, **s_{xy}**, **s_{yy}**, **x̅** and **y̅**. For the equation **x = c + dy**, there’s no shortcut, so you have to calculate it yourself, which gives you

**x = 22.63 + 0.66y**

We shall plot them on the graph:

with the red line being **y = a + bx**, and the green line being **x = c + dy**. Remember to label them in exams though.
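A handy check on your two lines: both regression lines pass through (x̅, y̅), so they should cross at that point. The sketch below solves the two equations of this example simultaneously. The function name `intersection` is my own, and because the coefficients are rounded to 2 decimal places the result is only approximate.

```python
def intersection(a, b, c, d):
    """Point where y = a + b*x meets x = c + d*y (requires b*d != 1)."""
    # Substitute y = a + b*x into x = c + d*y and solve for x.
    x = (c + d * a) / (1 - b * d)
    y = a + b * x
    return x, y


# Rounded coefficients of the two regression lines in this example.
x_cross, y_cross = intersection(15.83, 0.72, 22.63, 0.66)

print(round(x_cross, 2), round(y_cross, 2))
```

The lines cross at roughly (63.03, 61.21), which should agree closely with the **x̅** and **y̅** you calculated in your workings.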

As for the estimate, you can use your calculator again. Using the **SHIFT + S-VAR** function and the key sequence described above (**[70]** **[ŷ]** **[=]**), you should get **66.38**.
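If you substitute by hand instead, you use the rounded line **y = 15.83 + 0.72x**. A tiny sketch of that substitution (the function name is my own):

```python
def estimate_y(a, b, x):
    """Substitute x into the regression line y = a + b*x."""
    return a + b * x


# Rounded coefficients of the line of y on x from this example, with x = 70.
y_hat = estimate_y(15.83, 0.72, 70)

print(round(y_hat, 2))
```

This gives about 66.23, slightly below the calculator's 66.38. The small gap is presumably because the calculator keeps more decimal places for **a** and **b** than the 2 shown here; I can't confirm this without the original data table.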

Regression analysis will be very useful in the future, especially when you collect a lot of data for your company, and you want to see the relationships between variables. Master it, and it will help you. ☺