## Tuesday, August 9, 2011

### 16.3 – Linear Regression Lines

Regression analysis is a statistical technique which can be used to obtain the equation relating 2 variables. A regression line makes estimations on one of the variables when the corresponding value of another variable is known.

In this section, we are going to learn how to draw regression lines (lines of best fit). There are actually 3 methods that I know of:

1. By eye method
You look at the bunch of dots, estimate using your eye, and start drawing the line. Not a good idea though. You probably used this method for your STPM Physics paper 3.
2. L & R method
We first find the mean values of x and y, then draw a horizontal and a vertical line through that mean point. Next, we find the mean point of the data on the left and on the right of the vertical line, and we connect these 3 points to obtain a line.
3. Least squares regression line
This is probably the best method of all, and we will be learning how to do it below.

METHOD OF LEAST SQUARES

The term ‘least squares’ tells us that the sum of the squares of the distances between the points and the line is minimized. For a least squares regression line of y on x, the distance taken into account is the vertical distance. This line always passes through the mean point of the data, (x̅, y̅). Take a look at the graph below.

The red dots are the scatters, while the blue line is the least squares regression line. The line is drawn in such a way that the sum of squares of the vertical distances between the red dots and the blue line (green lines) is minimized. So to form a least squares regression line, we have 2 equations of lines, namely

y = a + bx
x = c + dy

The line y = a + bx is known as the regression line of y on x, while the line x = c + dy is the regression line of x on y. Note that they are 2 different lines, and one is not simply a rearrangement of the other. The line of y on x is used when x is the independent variable and y is the dependent one. However, the line of x on y is used only under 2 conditions:

1. when neither variable is controlled and you want to estimate x for a given value of y.
2. when y is the independent variable, and x the dependent variable.

The line of x on y, when drawn on the usual axes, has gradient 1/d and y-intercept –c/d, because x = c + dy rearranges to y = –c/d + (1/d)x.

Notice another thing. In this chapter, the lines are not written as y = mx + c. The gradient is b, and by usual convention is put behind the constant a, so y = a + bx, but not y = bx + a. The constant b is known as the regression coefficient of y on x, and d is the regression coefficient of x on y. They are both calculated using the formulas

which in the end, you find b to be

If you look closely, you will find that bd = r²,

where r is the product-moment correlation coefficient you learned in the previous section. The term r² has a name too: the coefficient of determination.
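For reference, here is a sketch of the standard textbook forms of these results, assuming (as in section 16.2) that s_xx and s_yy denote the variances of x and y and s_xy their covariance:

```latex
\[
b = \frac{s_{xy}}{s_{xx}}, \qquad
d = \frac{s_{xy}}{s_{yy}}, \qquad
r = \frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}}
\quad\Longrightarrow\quad
bd = \frac{s_{xy}^{2}}{s_{xx}\, s_{yy}} = r^{2}.
\]
```

Since s_xx and s_yy are never negative, b, d and r all take the sign of s_xy.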

r² tells us the proportion of the variation in y that can be explained by x. Or in other words,

Or mathematically,

You don’t really need to understand what it means, but memorize it in case you are asked to define it in exams. Take note that 0 ≤ r² ≤ 1.

Coming back to the relationship between the correlation coefficient and the regression coefficients, we can see that if
* b and d are positive, then r is positive.
* b and d are negative, then r is negative.

Finding b is not enough to plot the regression line of y on x. The equation of the line, in the end, will be

y – y̅ = b(x – x̅)

and from there, a can be found, since a = y̅ – bx̅. Note that x̅ and y̅ can be replaced only by another point that is known to lie on the line. An arbitrary data pair (x, y) will generally not lie on it, so it would give a different line.
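To see how the pieces fit together, here is a minimal Python sketch that computes both regression lines from raw data. The data set and the function name `regression_lines` are my own inventions for illustration, and the formulas b = s_xy / s_xx and d = s_xy / s_yy are the standard textbook ones.

```python
def regression_lines(xs, ys):
    """Return (a, b, c, d, r) for the lines y = a + bx and x = c + dy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Variances and covariance in the "small s" notation (divide by n).
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    s_yy = sum((y - y_bar) ** 2 for y in ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    b = s_xy / s_xx          # regression coefficient of y on x
    d = s_xy / s_yy          # regression coefficient of x on y
    a = y_bar - b * x_bar    # the y-on-x line passes through (x_bar, y_bar)
    c = x_bar - d * y_bar    # and so does the x-on-y line
    r = s_xy / (s_xx * s_yy) ** 0.5
    return a, b, c, d, r


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, c, d, r = regression_lines(xs, ys)

print(f"y on x: y = {a:.2f} + {b:.2f}x")
print(f"x on y: x = {c:.2f} + {d:.2f}y")
print(f"b * d = {b * d:.4f}, r^2 = {r * r:.4f}")

y_hat = a + b * 3.5   # estimate y from x using the y-on-x line
x_hat = c + d * 4.5   # estimate x from y using the x-on-y line
```

Notice that the two lines are genuinely different (rearranging y = a + bx does not give x = c + dy), yet the product of their regression coefficients equals r².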

By the way, sometimes the relationships are not that straightforward. You might be asked to make use of coding, in the form of Y = a + bX, to transform variables which are not linearly related into a linear form that can be analysed using regression lines. Common examples are
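One such coding, offered here as an illustrative assumption rather than an example taken from this post, is the exponential law y = A·Bˣ. Taking logarithms gives ln y = ln A + x ln B, which has the form Y = a + bX with Y = ln y and X = x. A small Python sketch (the helper `fit_line` and the data are hypothetical):

```python
import math


def fit_line(xs, ys):
    """Least squares line of ys on xs, returned as (intercept, gradient)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    return y_bar - gradient * x_bar, gradient


# Hypothetical data generated exactly from y = 2 * 3**x.
xs = [0, 1, 2, 3, 4]
ys = [2 * 3 ** x for x in xs]

# Coding: Y = ln y, X = x, so that Y = ln A + X ln B is a straight line.
Y = [math.log(y) for y in ys]
intercept, gradient = fit_line(xs, Y)

A = math.exp(intercept)   # undo the coding to recover A
B = math.exp(gradient)    # and B
print(f"A = {A:.3f}, B = {B:.3f}")
```

With real, noisy data the recovered A and B would only be estimates, but the idea is the same: fit the straight line on the coded variables, then convert back.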

Most statistical questions on this chapter mainly ask you to do these few things:

1. Plot scatter diagrams, and draw a regression line on it
All you need to do is use the table of data given, plot the scatter diagram (on graph paper), and find the respective values using your calculator to get the values of a and b.

2. Make predictions and estimations
Sometimes you are asked to extrapolate the line, to find a particular value of y, given x, and tell whether the result is sensible. Remember: extrapolation of a regression line is unreliable. You are to understand that there are uncertainties in such predictions. In the case of a graph of age against running speed, you know that it doesn’t mean the older you are, the faster you run!

3. Calculator estimation
Within the scatter data, sometimes you are given a value of x and asked to find the value of y using the regression line you formulated. The estimated value of y is denoted by ŷ. It is not hard: with your regression line in hand, just substitute the value of x into it, and you get the value of ŷ. On the calculator, you can press
[number] [x̂] [=] to find x̂, and
[number] [ŷ] [=] to find ŷ.

However, do take note that you find x̂ using the equation x = c + dy, and you find ŷ by using the equation y = a + bx. Remember which is the dependent and which is the independent variable; it makes a lot of difference.

4. Find the correlation / regression coefficient or the coefficient of determination
This is quite obvious. That was why we learned them in the first place.

Before I end this chapter, let us take a look at an example, and we will learn how to use your calculator to find the regression line too.

The following table shows the marks (x) obtained in a mid-year examination and the marks (y) obtained in the year-end exam by a group of 9 students.

a) Plot the scatter diagram.
b) Find the equation of the estimated least squares regression line of y on x, and x on y, and plot them.
c) A 10th student obtained a mark of 70 in the mid-year exam but was absent from the year-end exam. Estimate the mark that this student would have obtained in the year-end exam.

I think you shouldn’t have a problem plotting the diagram, right? It looks something like this:

So now, we are to find the regression lines. Firstly, key in all your data into your calculator. Remember to clear your previous data by pressing SHIFT + CLR, press ‘1’, then ‘=’ (refer to the previous post on how to key in data in REG mode). Now press SHIFT + S-VAR. Press the right button until you see A B r. Guess what, the A and B given are the coefficients a and b of the line that you wanted. So you immediately find the regression line of y on x,

y = 15.83 + 0.72x
Remember to show your workings though. You need to show how you calculate sxx, sxy and syy. For the equation x = c + dy, there’s no shortcut, so you have to calculate it yourself, which gives you

x = 22.63 + 0.66y
We shall plot them on the graph:

with the red line being y = a + bx, and green line being x = c + dy. Remember to label them in exams though.

As for the estimate, you can use your calculator again. From the SHIFT + S-VAR function, and typing the formula I posted above, you should get 66.38.

Regression analysis will be very useful in the future, especially when you collect a lot of data for your company, and you want to see the relationships between variables. Master it, and it will help you.

## Monday, August 8, 2011

### 16.2 – Pearson Correlation Coefficient

Before we start, let us revise a little bit on standard deviation. We all know that the standard deviation s is given by the formula

In this chapter, we will be dealing with 2 variables, and thus we need to specify whether the standard deviation is for the values of x or y. To tell them apart, we put a subscript x or y to indicate which variable it refers to. So over here, we have

the standard deviations for x and y respectively. We denote the variances of variables x and y as

Note that sxx and sx² mean the same thing; it is just a different notation used in some books. With this information in mind, we shall now introduce the covariance, which is defined by the formula

PEARSON’S PRODUCT-MOMENT CORRELATION COEFFICIENT

The correlation coefficient is a statistic which provides the information on how strong the relationship of 2 variables is. Pearson’s product-moment correlation coefficient, also known as Pearson correlation coefficient or product-moment correlation coefficient, is a numerical value between –1 and 1 inclusive, which indicates the linear degree of scatter. It is represented by the formula

where, –1 ≤ r ≤ 1.
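As a rough sketch of the definition in code, the function below computes r as the covariance divided by the product of the two standard deviations, all in the ‘small s’ form (dividing by n). The function name `pearson_r` and the data are hypothetical, made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation coefficient r = s_xy / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    return s_xy / (s_x * s_y)


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)

# Changing the units (say metres to centimetres for x, or shifting y)
# leaves r unchanged.
r_rescaled = pearson_r([100 * x for x in xs], [y + 7 for y in ys])
print(f"r = {r:.4f}, after rescaling r = {r_rescaled:.4f}")
```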

When r → 1, it indicates strong positive correlation, which means the regression line has a positive gradient, or y increases as x increases. Similarly, as r → –1, it indicates the presence of strong negative correlation. If r = 1 or r = –1, the points lie exactly on a straight line, and we say that they have perfect positive / negative correlation.

However, when r = 0, it does not necessarily mean that there is no relationship. It might indicate that the variables x and y are independent of each other, but it might also indicate that x and y have a non-linear relationship. Take a look at the diagram below:

Sorry but the dots are ugly. This diagram represents a quadratic function. The variables do have a quadratic relationship, but the correlation coefficient is r = 0. This is just an example of how r = 0 fails to explain anything. On the other hand, having r close to 1 only suggests that the data are positively linearly correlated. Take a look at the diagram below.

This diagram has a very high r, about 0.7 to 0.8. However, it doesn’t necessarily mean that the data are highly positively linearly correlated. It might mean that there isn’t a relationship after all.
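The quadratic case mentioned earlier is easy to check numerically. In this small sketch (the data points are my own choice), y is completely determined by x, yet r comes out as zero because the relationship is not linear:

```python
import math

# Points lying exactly on y = x^2, placed symmetrically about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

r = s_xy / (s_x * s_y)
print(f"r = {r:.4f}")
```

The positive and negative contributions to the covariance cancel out exactly, which is why r measures only linear association.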

r is independent of the units used in the relation, and is very useful in determining the correlation of 2 variables. Evaluating r can be tedious if you make use of the definitions of sx and sy. So here is the best way to calculate r:

Some other common formulas to find r are:

Besides, there is also this Big S format, whereby

and using this convention, the formula for r is

I would suggest that you keep to the ‘small s format’. In order to teach you how to find r efficiently using the calculator, consider the example below.

Calculate the value of the p-m correlation coefficient for the data in the following table. Comment on your answers.

Let’s make use of the calculator’s functions. Using your CASIO fx-570MS, press the mode button, and select REG mode. There will be many kinds of REG mode, so you press ‘1’ for Lin mode (which means ‘linear’).

Now, to input the data, you press [x-value] [, button] [y-value] [DT button]. So you should type in 5, 4.3 and the DT button for the first readings. Now the screen should display

[n=                     ]
[                       1]

Continue typing in every pair of data, and press the AC button when you are done. Now you press SHIFT + S-SUM. You will be able to get lots of data from here: Σx², Σx, n, Σy², Σy and Σxy. This is the useful information you need for your r (you need these to show your workings). But there’s a better one: press SHIFT + S-VAR. You get to find the values of x̅, xσn (sx), y̅, yσn (sy), and in fact, r itself! The only thing you can’t get is sxy (what a pity). So using your calculator, you find that the answer is

r = 0.93, it is a strong positive correlation.
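If you want to check your workings away from the calculator, the same sums that S-SUM gives you (n, Σx, Σy, Σx², Σy², Σxy) are enough to compute r. Here is a sketch using the standard computational formula; the function name `r_from_sums` and the data are hypothetical, not the table from this example.

```python
from math import sqrt


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """r computed from the six totals a calculator's S-SUM menu provides."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = r_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_x2=sum(x * x for x in xs),
    sum_y2=sum(y * y for y in ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
)
print(f"r = {r:.4f}")
```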

That’s all for this section. With enough knowledge, we will go into the next and very last section, which will be on Regression Lines.

### 16.1 – Scatter Diagrams

A scatter diagram is a diagram produced when pairs of values are plotted, to determine the relationship between 2 variables. Usually a scatter diagram contains bivariate data, which is data connecting 2 variables, x and y. Using the usual convention, x is the independent variable (explanatory variable), which is controlled by the user who is analysing the situation. y, on the other hand, is the dependent variable (response variable): it is the variable that is influenced by the previous one. I believe you learned this in your Form 1 Science already.

In a scatter diagram, the independent variable is represented by the x-axis, while the dependent variable is on the y-axis. Basically, a scatter diagram is just a normal graph, with lots of dots on it. Suppose you want to analyse the relationship between the temperature of a chemical mixture and its yield of a new compound. You start the experiment at various temperatures, and after a fixed time, you measure the yield of the new compound (precipitate). Then you plot them in a graph like the one below.

Having drawn a scatter diagram, you can then look for a mathematical relationship between the variables x and y. This relation of y = f(x) is known as the regression function. The scatter diagram above shows a positive linear relationship between the data, but with a large dispersion. You can also find a line of best fit, or regression line, to make things clearer. Other kinds of relationship between 2 variables are:

For the data in diagrams (a) and (b), we say that there is linear correlation between the data. Diagram (d) shows that there is no correlation between the data, meaning that x and y are independent of one another.

Mathematically, there may appear to be a relationship between two sets of data, but sometimes in reality, there isn’t any relationship. For example, you want to prove that the ears of a spider are on its legs. So you experiment by putting it on the table, shouting at it and measuring its reaction time. Then you repeat your experiment, cutting off its legs one by one. When all the legs are cut off, it can’t hear your shout and therefore doesn’t move, so you have wrongly concluded that its ears grow on its legs!

The appearance of a mathematical relationship doesn’t imply that there is a causal relationship. An increase in one variable does not necessarily cause an increase, or decrease, in the other variable.

Now that you understand scatter diagrams, we shall proceed to learn the relationship of a correlation coefficient with a scatter diagram. We will learn more about the regression lines in the last post.

## Sunday, August 7, 2011

### 15.3 – Tests for Independence

Sometimes situations arise when data are displayed in a contingency table, which is a table displaying data classified according to 2 different factors / attributes. For example, the table below

This is a 2 by 3 table, which shows the different schools and their different performance in an exam. We use a χ² test to determine whether the two factors are independent, or whether there is an association between them. According to the table above, we want to know whether the school affects exam performance. In other words, since the numbers of students in school A and school B are different (80 and 70 respectively), if both schools have the same ratios of credit, pass and fail, then the school does not affect the grades.

This kind of test is known as the test for independence. As usual, we shall find the expected frequency, find the degree of freedom ν and find the test statistic X², which has the same formula as in the previous section.

Let’s take the above example. The degree of freedom for an h × k contingency table can be found using the formula
ν = (h – 1)(k – 1)

and so, the above table has the value of ν = 2. The expected frequency E can be found through the formula

E = (row total × column total) ÷ grand total

To find this, we first need to find the total of each row and column. We modify the table above, colour it a little, then we get

The black numbers in the middle are known as the observed frequency. To proceed to find the expected frequencies, we construct another table, but clearing off all the data in the middle.

Next, we use the above formula to fill in the expected frequencies. For the top left cell, we have 90 × 80 ÷ 150 = 48.0
We proceed to fill up the rest:
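The same fill-in can be sketched in a few lines of Python. The observed counts below are hypothetical: I only kept the row totals of 80 and 70 and the first column total of 90 from the example above, and invented the rest for illustration.

```python
# Hypothetical observed frequencies (rows: school A, school B;
# columns: credit, pass, fail). Only the totals 80, 70 and 90 come
# from the example above; the individual cells are invented.
observed = [
    [50, 20, 10],
    [40, 20, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E = (row total x column total) / grand total, cell by cell.
expected = [
    [row_total * col_total / grand_total for col_total in col_totals]
    for row_total in row_totals
]

# Degrees of freedom for an h x k table.
dof = (len(observed) - 1) * (len(observed[0]) - 1)

for row in expected:
    print([round(e, 2) for e in row])
print("degrees of freedom:", dof)
```

Note that the expected frequencies depend only on the row and column totals, not on the individual observed cells.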

From here, we proceed to find X² by making use of the 6 values of O and E that we just calculated. Now let me give you an example:

A research worker studying the ages of adults and the number of credit cards they possess obtained the results shown in the table.

Use the χ² statistic and a significance test at the 5% level to decide whether or not there’s an association between age and number of credit cards possessed.

H0: There’s no association between age and number of credit cards possessed.
H1: There’s an association between age and number of credit cards possessed.
Expected frequency,

ν = (2 – 1)(2 – 1) = 1, so Yates’ correction is used.
Use the χ² (1) distribution, perform the test at 5% level.
Since χ² (5%) (1) = 3.841, reject H0 if X² > 3.841.

Since X² > 3.841, H0 is rejected. There’s an association between age and number of credit cards possessed, at 5% level.
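For a 2 × 2 table, the whole test can be sketched as below. The observed counts are hypothetical (they are not the table from this example), the function name `yates_chi_squared` is my own, and the statistic is the usual Yates-corrected form, Σ (|O – E| – 0.5)² / E.

```python
def yates_chi_squared(observed):
    """X^2 with Yates' continuity correction for a 2 x 2 contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            x2 += (abs(o - e) - 0.5) ** 2 / e
    return x2


# Hypothetical 2 x 2 table, invented purely for illustration.
observed = [
    [30, 20],
    [10, 40],
]

CRITICAL_VALUE = 3.841  # chi-squared, 5% level, 1 degree of freedom
x2 = yates_chi_squared(observed)
reject_h0 = x2 > CRITICAL_VALUE
print(f"X^2 = {x2:.3f}, reject H0: {reject_h0}")
```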

Easy? That’s all for this chapter.

### 15.2 – Tests for Goodness of Fit

A χ² Goodness-of-Fit test is used when you have some practical data and you want to know how well a particular statistical distribution, such as a binomial or a normal, models that data. The null hypothesis H0 is that the particular distribution does provide a model for the data; the alternative hypothesis H1 is that it doesn’t.

Just like Hypothesis Tests, Goodness-of-fit Tests also follow a general guideline. You need to write all these 6 steps in your answer sheet:

1. State the null & alternate hypothesis
H0: x is uniformly / B / Po / N distributed, or distributed in a ratio of ?
H1: x is not distributed this way

2. Calculate the expected frequency E in the table

3. State the degree of freedom
There are ? classes and ? restrictions
Consider a χ² (n – ?) distribution

4. State the significance level
Perform at ?% level
From the tables, χ² (?%) (ν) = ?, so reject H0 if X² > ?

5. Calculate X² using the tables

6. State the conclusion
Since X² > / < ?, H0 is rejected in favour of H1 / not rejected. There is evidence, at ?% level, that __________ .

Now we shall proceed to learn how to solve 5 kinds of χ² tests through examples. Questions are in blue and answers are in red:

1. Uniform Distribution (Random)
A tetrahedral die is thrown 120 times and the number on which it lands is noted.

Test at the 5% level whether the die is fair.

H0: The die is fair [I can also write, “the die follows a uniform distribution”. But this is better.]
H1: The die is not fair

There are 4 classes and 1 restriction (ΣE = 120) [Remember that ΣE = ? is always one of the restrictions]
Consider a χ² (3) distribution, perform at 5% level.
From the tables, χ² (5%) (3) = 7.815, so reject H0 if X² > 7.815.

Since X² < 7.815, H0 is not rejected. There is no evidence, at 5% level, that the die is not fair.

2. Distributed in Given Ratio
The outcomes A, B & C of a certain experiment are thought to occur in the ratio 1 : 2 : 1. The experiment is performed 200 times and the observed frequencies of A, B & C are 36, 115 & 49 respectively. Is the difference in the observed and expected results significant? Test at the 5% level.

H0: The outcomes A, B & C are in the ratio 1 : 2 : 1
H1: The outcomes A, B & C are not in the ratio 1 : 2 : 1
There are 3 classes and 1 restriction (ΣE = 200).
Consider a χ² (2) distribution, perform at 5% level.
From the tables, χ² (5%) (2) = 5.991, so reject H0 if X² > 5.991.

[To save time, you could just construct one table instead of 2. You find the E and the test statistic in one table.]
Since X² > 5.991, H0 is rejected in favour of H1. The difference between the observed & expected results is significant, at 5% level.
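The arithmetic for this ratio example can be checked with a short Python sketch. The observed frequencies, the ratio and the critical value 5.991 are from the question above; the function name `chi_squared_statistic` is just my own label for the usual Σ (O – E)² / E.

```python
def chi_squared_statistic(observed, expected):
    """Goodness-of-fit test statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [36, 115, 49]   # outcomes A, B, C
ratio = [1, 2, 1]          # hypothesised ratio under H0

total = sum(observed)
expected = [total * part / sum(ratio) for part in ratio]

x2 = chi_squared_statistic(observed, expected)
CRITICAL_VALUE = 5.991     # chi-squared, 5% level, 2 degrees of freedom
reject_h0 = x2 > CRITICAL_VALUE

print("expected:", expected)
print(f"X^2 = {x2:.2f}, reject H0: {reject_h0}")
```

The expected frequencies work out to 50, 100 and 50, and the statistic to 3.92 + 2.25 + 0.02 = 6.19, which is indeed above 5.991.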

3. Binomial Distribution
Nothing much is different here from the above two, just that you need more rigorous calculations to find your E. Once again, remember your binomial and Poisson formulas, and combine classes with expected frequencies less than 5. You do that because it reduces the error of the χ² approximation, and of course, a different degree of freedom will then be used.

Perform a χ² test to investigate whether the following is drawn from a binomial distribution with p = 0.3. Use a 5% level of significance.

H0: X ~ B(5, 0.3) [Writing the short form is good enough.]
H1: X is not distributed this way.
The expected frequency for a Binomial distribution,
E = P(X = x) × 100 = ⁵Cₓ (0.3)ˣ (0.7)⁵⁻ˣ × 100
where ΣO = 100. We tabulate the table below:

Since the expected frequencies for x = 4 and x = 5 are < 5, the last 3 classes are combined. [Please take note of this piece of information.]
There are now 4 classes and 1 restriction (ΣE = 100).
Consider a χ² (3) distribution, perform at 5% level.
From the tables, χ² (5%) (3) = 7.815, so reject H0 if X² > 7.815.

Since X² < 7.815, H0 is not rejected. ∴ X can be modelled by the binomial distribution B(5, 0.3).
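Here is a sketch of how the expected frequencies for B(5, 0.3) with a total of 100 can be generated, and how the small classes at the tail get merged. The helper `combine_small_tail` is a hypothetical name of mine, and it only merges from the upper end, which is all this example needs.

```python
from math import comb  # available from Python 3.8

n_trials, p, total = 5, 0.3, 100

# E = total x P(X = x) for x = 0, 1, ..., 5
expected = [
    total * comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
    for x in range(n_trials + 1)
]


def combine_small_tail(values, minimum=5):
    """Merge classes from the upper end until the last class is >= minimum."""
    merged = list(values)
    while len(merged) > 1 and merged[-1] < minimum:
        tail = merged.pop()
        merged[-1] += tail
    return merged


combined = combine_small_tail(expected)
dof = len(combined) - 1  # 1 restriction: the expected frequencies sum to 100

print([round(e, 3) for e in expected])
print([round(e, 3) for e in combined])
print("degrees of freedom:", dof)
```

The classes x = 4 and x = 5 have expected frequencies of about 2.835 and 0.243, so they are absorbed into the x = 3 class, leaving 4 classes and 3 degrees of freedom, as stated above.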

Notice that the number of restrictions can increase if the population proportion is not known. You use x̅ = np to find the value of p. For example, a random sample of size 50 is taken, and you are given this table

You don’t know the mean, but you know that

You can find the value of p by using the equation x̅ = np, where n = 50. That gives the question 2 restrictions, so the degrees of freedom become (number of classes) – 2.

4. Poisson Distribution
This one is very similar to the Binomial one. If the Poisson population mean λ is unknown, the number of restrictions increases by 1, and you use the sample mean x̅ as an estimate of λ. Just take a look at the example.

A local council has records of the number of children and the number of households in its area. It is therefore known that the average number of children per household is 1.4. It’s suggested that the number of children per household can be modelled by a Poisson distribution with parameter 1.40. In order to test this, a random sample of 1000 households is taken, giving the following data.

Carry out a χ² test, at the 5% level of significance, to determine whether or not the proposed model should be accepted.

Let X be the number of children per household.
[Notice that in this case, I define X properly. You should do it when you know what X is.]
H0: X ~ Po(1.4)
H1: X is not distributed this way.
There are 6 classes and 1 restriction (ΣE = 1000).
Consider a χ² (5) distribution, perform at 5% level.
From the tables, χ² (5%) (5) = 11.070, so reject H0 if X² > 11.070.

I suppose you can relate that

Since X² > 11.070, H0 is rejected in favour of H1. The proposed model shouldn’t be accepted; X doesn’t follow a Poisson distribution with parameter 1.4.
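The expected frequencies for Po(1.4) with 1000 households can be sketched as follows. I am assuming that the 6 classes are 0, 1, 2, 3, 4 and ‘5 or more’ children; that grouping is my assumption, not something stated in the question.

```python
from math import exp, factorial

lam, total = 1.4, 1000

# P(X = x) for x = 0..4, then the open-ended class "5 or more".
probabilities = [exp(-lam) * lam ** x / factorial(x) for x in range(5)]
probabilities.append(1 - sum(probabilities))

expected = [total * prob for prob in probabilities]
dof = len(expected) - 1  # 1 restriction, because lambda was given, not estimated

labels = ["0", "1", "2", "3", "4", "5+"]
for label, e in zip(labels, expected):
    print(f"{label:>2}: {e:8.2f}")
print("degrees of freedom:", dof)
```

Making the last class open-ended is what guarantees the expected frequencies add up to exactly 1000, which is the one restriction counted above.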

5. Normal Distribution
As for the normal distribution, either you know both the population mean μ and the population variance σ², or you know neither of them. Correspondingly, you have degrees of freedom n – 1 or n – 3. See the example below:

The following data gives the heights in cm of 100 male students.

Find the expected frequencies of a normal distribution having the same mean and variance as the data given, and test the goodness of fit, using a 5% level of significance.

To start, we need to find the values of μ and σ2 first.

Let X be the height (cm) of a male student.
H0: X ~ N(171.54, 50.56)
H1: X is not distributed this way.

Now this one needs a lot of calculations. The expected frequency of each class can be found by using

E = 100 × P(a < X < b)

where a and b are the lower and upper boundaries of each class (remember to ±0.5). The work for a continuous variable takes some time. Remember that the bell curve goes all the way to infinity in both directions, so the first and last classes should be treated as open-ended. I believe you know that your calculator can help you do tricks, right?

Remember to combine the small classes.
There are 5 classes and 3 restrictions (ΣE = 100, μ and σ² estimated from the sample).
Consider a χ² (2) distribution, perform at 5% level.
From the tables, χ² (5%) (2) = 5.991, so reject H0 if X² > 5.991.

Since X² < 5.991, H0 is not rejected. X can be taken to be normally distributed, X ~ N(171.54, 50.56).
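If you are curious how the expected frequencies could be computed without tables, here is a sketch using the error function from Python’s math module. The mean and variance are the ones from this example, but the class boundaries are hypothetical, since I am only illustrating the method.

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


mu, variance, total = 171.54, 50.56, 100
sigma = sqrt(variance)

# Hypothetical class boundaries in cm (already adjusted by 0.5).
boundaries = [159.5, 164.5, 169.5, 174.5, 179.5]

# The first and last classes are open-ended, since the bell curve
# extends to infinity in both directions.
edges = [float("-inf")] + boundaries + [float("inf")]

expected = [
    total * (normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma))
    for lower, upper in zip(edges, edges[1:])
]

for lower, upper, e in zip(edges, edges[1:], expected):
    print(f"{lower:>7} to {upper:>7}: {e:6.2f}")
```

Any class with an expected frequency below 5 would then be merged with its neighbour before counting the classes for the degrees of freedom.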

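Before I end this section, let me give you a summary of the degrees of freedom used throughout this post, where n is the number of classes after combining:

* Uniform distribution or given ratio: n – 1
* Binomial distribution: n – 1 if p is known, n – 2 if p is estimated from the data
* Poisson distribution: n – 1 if λ is known, n – 2 if λ is estimated from the data
* Normal distribution: n – 1 if μ and σ² are known, n – 3 if both are estimated from the data

This section is really not hard, but a lot of rigorous calculations are required. Be very careful not to make mistakes, and score!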