## Monday, August 1, 2011

### 13.1 – Random Samples

In statistics, we are always interested in getting information about a particular group, be it people, animals, or even non-living things. This group of interest is what we call a population: the particular group that we need information about in a statistical enquiry. A population can be very big, for example, the number of hairs growing on one’s head, or the number of people in a country. So sometimes we can only gather information from a sample, a part of the population. A simple random sample of size n is a sample chosen in such a way that all possible samples of size n are equally likely to be selected. So here we differentiate the terms population and sample: the sample is a subset of the population.

A parameter is a numerical characteristic of a population, known or unknown, such as the mean μ and the standard deviation σ. A statistic is a value computed from a sample, such as the sample mean x̄ and the sample standard deviation s. Notice that the symbols for the two cases are different, and we will make use of this convention. So here we can conclude that a parameter is the actual value for a population, while a statistic is a value obtained from a sample, which is supposed to be quite close in value to the parameter.

In order to get the information required, we need to do surveys. There are 2 main kinds of surveys:

1. Census
A census surveys every single member of a population. A country needs a census to count how many people live in it. Or in a class, we need everyone to submit their health report in order to know which blood type each student belongs to. However, there are situations where a census can’t be used. One is an infinite (or practically infinite) population: there are far too many stars for us to measure the brightness of every one in order to find the mean brightness or distance from the earth. Another example is testing the durability of light bulbs. To find the average lifespan of light bulbs, you can’t test every light bulb, otherwise you’ll destroy the population!

2. Sample Survey
A sample survey is done by interviewing or collecting data from only a small group of members within the population, which is the sample. A sample is always less than 100% of the population. For example, we survey 100 residents in Petaling Jaya to see whether they would like it if we replaced the McDonald’s outlet in SS2 with an A&W outlet.

Both a census and a sample survey have their advantages and disadvantages. To sum up, a census is good for a small population, and a sample survey is more suitable for a big population. Look at the table below:

Before you start sampling, you need to do a few things. First, you need to identify the target population, that is, where and whom you want to survey. Next, you determine the sampling units, the people or items to be sampled. If your population is all the primary schools in Malaysia, is your sampling unit the student, the teacher, or the canteen waiter? You have to make it clear. Then, you need a sampling frame: a list in which the sampling units within the population are individually named or numbered. Of course, the list may not be complete, and sometimes it can’t be generated at all, since the units may change or move in and out, or, if they are fish in a pond, they simply can’t be listed!

Once you are done, you can start your survey.

Now that we can start surveying, we need to know the possible sampling methods. We shall not focus on the census in this chapter (the title says it). We shall look into a few types of sampling methods:

1. Random Sampling
I believe you are familiar with the term ‘random’. It means that you do not choose a sample on purpose; you simply pick one by chance. There are 3 kinds of random sampling:

Simple Random Sample
As its name suggests, it is ‘simple’: you don’t need to do any homework to get the sample. You could draw lots, or use random numbers to choose which units take the survey. You can make use of a random number table to choose your units. It acts like a large die, and looks something like the one below:

You can read the numbers from left to right, in the order given. Or you could close your eyes and use a pencil to point at a number on the table as a starting point. For example, in a group of students numbered 1 to 100, you want to choose 5 random students. You can take 2-digit numbers starting from the left of the table, namely 82, 03, 14, 58 and 21, as the students you want (taking 00 to stand for student 100).

You can also use your calculator as a random number generator. On your CASIO fx-570MS, press SHIFT then Ran#, and you will get a random number with 3 decimal places, between 0.000 and 0.999. You can use multiplication or division to scale the random number to the range you want.
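The scaling step can be sketched in Python. This is only an illustration: `scale_to_range` is a hypothetical helper name, `random.random()` stands in for the calculator's Ran# key, and the seed is an arbitrary choice so that the run is repeatable.

```python
import random

def scale_to_range(r, low, high):
    """Map a random number r in [0, 1) to a whole number from low to high inclusive."""
    return low + int(r * (high - low + 1))

rng = random.Random(2011)  # arbitrary seed, only so the example is repeatable

# Pick 5 student numbers between 1 and 100 (repeats are possible here).
students = [scale_to_range(rng.random(), 1, 100) for _ in range(5)]
print(students)
```

Multiplying by 100 and adding 1 turns a value such as 0.573 into student 58. A number can come up twice this way, in which case you would normally just draw again.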

Note that there are 2 kinds of simple random samples: one with replacement, where a unit can be chosen more than once, and one without replacement, where it can’t.
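As a rough sketch of the difference, Python's standard `random` module offers `sample` (without replacement) and `choices` (with replacement). The population of students numbered 1 to 100 follows the example above; the seed is an arbitrary assumption.

```python
import random

rng = random.Random(1)            # arbitrary seed for a repeatable run
population = list(range(1, 101))  # students numbered 1 to 100

# Without replacement: no student can be picked twice.
without_replacement = rng.sample(population, 5)

# With replacement: the same student may be picked again.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```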

Systematic Random Sample
In systematic sampling, you make use of a certain pattern or sequence to pick your sample. For example, from a list of 1000 people, you take every kth person for the survey, where k depends on your sample size: for a sample of 50, k = 1000 ÷ 50 = 20.
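A minimal sketch of systematic sampling, assuming the list length is a multiple of the sample size and that the starting point is chosen at random within the first k units (a common convention, not something stated above). `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(units, n, rng):
    """Take every k-th unit, starting from a random position among the first k."""
    k = len(units) // n        # sampling interval
    start = rng.randrange(k)   # random start: 0, 1, ..., k - 1
    return units[start::k][:n]

rng = random.Random(5)            # arbitrary seed for a repeatable run
people = list(range(1, 1001))     # a list of 1000 people, numbered 1 to 1000
chosen = systematic_sample(people, 50, rng)
print(chosen[:5])
```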

Stratified Random Sample
In a stratified sample, the population is divided into distinguishable layers (strata). For example, a population of people has different age groups, different occupations, etc. We take a few units from each age group, and combine them into one sample in the end.
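One way to sketch this in Python is to sample the same fraction from every stratum (proportional allocation, which is an assumption here; other allocations are possible). The age groups, their sizes, and the helper name `stratified_sample` are all made up for illustration.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum, then combine."""
    combined = []
    for units in strata.values():
        n = round(len(units) * fraction)
        combined.extend(rng.sample(units, n))
    return combined

# Hypothetical population of 1000 people split into three age groups.
strata = {
    "under 18": [f"U{i}" for i in range(200)],
    "18 to 59": [f"A{i}" for i in range(600)],
    "60 and over": [f"S{i}" for i in range(200)],
}

rng = random.Random(3)  # arbitrary seed for a repeatable run
sample = stratified_sample(strata, 0.05, rng)
print(len(sample))
```

Each age group ends up represented in the sample in the same proportion as in the population.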

2. Non-Random Sampling
I don’t think I need to elaborate much on this. It is not random, so you choose each unit for a solid and particular reason. There are 2 kinds here:

Clusters
Clusters are like natural sub-groups of a population. For example, in a primary school, there are 6 classes in standard 1, with all the kids having the same status. Note that this differs from a stratified random sample, since strata are different from one another, while clusters are alike. You choose to study one cluster, which means that you don’t randomly pick students from every class in the school. You save a lot of effort, time and money, as you don’t need to collect survey forms from every class.

Quotas
Quota sampling is widely used in market research, where the population is divided into groups in terms of age, sex, income level, etc. When you are about to survey, you already have your plan in mind: I want to survey one person who has a high income and a big family, another with a low income and a small family, and so on. You set specific requirements in advance for the members of the population that you are about to interview or collect data from.

All these sampling methods have their pros and cons. I summarize them in the table below:

In every survey, there are bound to be some sources of bias. Obviously, when you are collecting data from a population, you want it to be as accurate as possible, and thus should eliminate as much bias as you can in the process of sampling. Bias makes the survey or data collection inaccurate, and gives a wrong picture of what the population really is. Examples of sources of bias are:

1. lack of good sampling frame
It’s like using a list of friends generated from your Twitter account. You will miss out those friends who don’t use Twitter. You need a good sampling frame so that everyone has an equal chance of being sampled.

2. wrong choice of sampling unit
In a survey on who has a car at home, choosing ‘people’ as the sampling unit is wrong; a better sampling unit would be ‘household’, since children don’t drive.

3. no response by some chosen units
Some people just choose not to answer your survey questions, for God-knows-what reason. Also, your questionnaire might have some questions for which none of the choices suits them. For example, they won’t respond to the question “Do you like Subway sandwiches? Yes / No” when they don’t even know that such an outlet exists.

4. introduced by the person conducting the survey
The person conducting the survey might already have a conclusion in mind, and try to make the survey results suit his mindset. For example, on the question “Which party will do a better job in the next General Elections?”, if the surveyor is a Pakatan Rakyat supporter, he might influence the person taking the survey to agree with his stand.

SIMULATING RANDOM SAMPLES

There are many ways to get random samples, just like what we did above: we used a random number table, or the random number generator on the calculator. But now, we want to simulate random samples from a given distribution. There are 2 kinds of distributions from which we can obtain a simulated random sample:

1. Frequency Distribution
A frequency distribution looks something like this:

It has a value x and a frequency. Let’s say I would like to generate a sample of size 6 from this population. For data like this, we can’t simply use a calculator to pick the numbers 1 to 4 at random as our sample. Each value has a frequency, which acts as a weight on how often it should be chosen. So what we can do is draw up a table making use of the cumulative frequency.

Using this table, we can finally generate the random sample. For example, suppose we have the random digits 04938581365399. Taking them two at a time, we get the numbers 4, 93, 85, 81, 36, 53, which correspond to the values of x being 1, 4, 3, 3, 2, 3 respectively. We have finally got our random sample from the frequency distribution.
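The lookup can be sketched in Python. The frequency table itself is not reproduced in this text, so the frequencies below (10, 30, 50, 10, totalling 100) are hypothetical values chosen only to be consistent with the worked example; the function name `value_for` is also an assumption.

```python
from bisect import bisect_left
from itertools import accumulate

values = [1, 2, 3, 4]
freqs = [10, 30, 50, 10]                      # hypothetical frequencies, total 100
cumulative = list(accumulate(freqs))          # [10, 40, 90, 100]

def value_for(number):
    """Map a two-digit random number (00 is treated as 100) to a value of x."""
    if number == 0:
        number = 100
    # First row of the table whose cumulative frequency reaches the number.
    return values[bisect_left(cumulative, number)]

digits = "04938581365399"
# Read the digits two at a time and keep the first 6 numbers.
numbers = [int(digits[i:i + 2]) for i in range(0, 12, 2)]
sample = [value_for(n) for n in numbers]

print(numbers)  # [4, 93, 85, 81, 36, 53]
print(sample)   # [1, 4, 3, 3, 2, 3]
```

With these assumed frequencies, the numbers 01 to 10 give x = 1, 11 to 40 give x = 2, 41 to 90 give x = 3, and 91 to 00 give x = 4.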

2. Probability Distribution
The method is the same as above: we work out the cumulative probabilities, which run up to 1 rather than up to the total frequency, then use the generated random numbers to find the random sample. There are a few kinds of probability distributions:

probability distribution

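This one is not hard. We find the cumulative probability for each value of x, then match the generated random numbers against it, just as with the frequency distribution.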

Binomial distribution X ~ B (n, p)
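Hope you still remember the formula, $P(X = x) = {}^{n}C_{x}\, p^{x} q^{n-x}$, where q = 1 − p. For example, we take
X ~ B (3, 0.4), then we have P(X = 0) = 0.216, P(X = 1) = 0.432, P(X = 2) = 0.288 and P(X = 3) = 0.064.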
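A short sketch of how the B(3, 0.4) table and the lookup could be done in Python. The helper names `binomial_pmf` and `simulate` are hypothetical, and the random number r is assumed to lie between 0 and 1.

```python
from bisect import bisect_left
from itertools import accumulate
from math import comb

def binomial_pmf(n, p):
    """List of P(X = x) for x = 0, 1, ..., n."""
    q = 1 - p
    return [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

def simulate(cdf, r):
    """Smallest x whose cumulative probability is at least r."""
    # The min() guards against rounding leaving the last cumulative value just under 1.
    return min(bisect_left(cdf, r), len(cdf) - 1)

pmf = binomial_pmf(3, 0.4)
cdf = list(accumulate(pmf))

print([round(p, 3) for p in pmf])  # [0.216, 0.432, 0.288, 0.064]
print([round(c, 3) for c in cdf])  # [0.216, 0.648, 0.936, 1.0]
print(simulate(cdf, 0.5))          # 1, since 0.216 < 0.5 <= 0.648
```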

Poisson distribution X ~ Po (λ)
The formula is $P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$, for x = 0, 1, 2, …

We tabulate the table for X ~ Po (4)
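The same tabulation can be sketched in Python for a mean of 4. Since a Poisson variable has no upper limit, the table is cut off at x = 15 here, which is an arbitrary assumption; the helper names are hypothetical too.

```python
from math import exp

def poisson_table(lam, max_x):
    """Rows (x, P(X = x), P(X <= x)) for x = 0, 1, ..., max_x."""
    rows = []
    p = exp(-lam)              # P(X = 0)
    cumulative = p
    for x in range(max_x + 1):
        rows.append((x, p, cumulative))
        p *= lam / (x + 1)     # P(X = x + 1) obtained from P(X = x)
        cumulative += p
    return rows

def poisson_from_uniform(rows, r):
    """Smallest tabulated x with P(X <= x) >= r; falls back to the last row."""
    for x, _, cumulative in rows:
        if cumulative >= r:
            return x
    return rows[-1][0]

rows = poisson_table(4, 15)
for x, p, c in rows[:6]:
    print(x, round(p, 4), round(c, 4))

print(poisson_from_uniform(rows, 0.5))  # 4
```

Each probability is built from the previous one by multiplying by λ/x, which avoids working out large factorials directly.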

Probability density function
It can be something like

We should find its cumulative distribution function,

From here, we set the randomly generated number r, 0 ≤ r ≤ 1, equal to that function, F(x) = r, and solve for x by inverting it.
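To make the inversion concrete, here is a small sketch. The density used in the original example is not reproduced in this text, so a hypothetical one is assumed instead: f(x) = 2x for 0 ≤ x ≤ 1, which gives F(x) = x² and therefore x = √r.

```python
import random
from math import sqrt

def inverse_cdf(r):
    """Solve F(x) = x**2 = r for x, where 0 <= x <= 1 (assumed density f(x) = 2x)."""
    return sqrt(r)

rng = random.Random(7)  # arbitrary seed for a repeatable run
sample = [inverse_cdf(rng.random()) for _ in range(6)]
print([round(x, 3) for x in sample])
```

For a different density, only `inverse_cdf` changes: find F(x), set it equal to r, and rearrange for x.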

Normal distribution X ~ N (μ, σ²)
Making reference to the standardisation formula $z = \dfrac{x - \mu}{\sigma}$,

We set the randomly generated number r, 0 ≤ r ≤ 1, equal to the cumulative probability of the normal distribution, Φ(z) = r. Then, by using normal tables (or your calculator), you can find z, and therefore x = μ + σz.
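In Python, the table lookup can be replaced by `statistics.NormalDist().inv_cdf` (available from Python 3.8). The sketch below assumes a hypothetical distribution with μ = 50 and σ = 10; `normal_from_uniform` is a made-up helper name.

```python
import random
from statistics import NormalDist

def normal_from_uniform(r, mu, sigma):
    """Turn a random number r (strictly between 0 and 1) into a value from N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(r)   # the z with cumulative probability r
    return mu + sigma * z

rng = random.Random(11)  # arbitrary seed for a repeatable run
sample = [normal_from_uniform(rng.random(), 50, 10) for _ in range(6)]
print([round(x, 1) for x in sample])
```

Note that `inv_cdf` rejects r = 0 and r = 1 exactly, since those would correspond to z = −∞ and z = +∞.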

No new equations and no new formulas, yet a lot to read and remember. I can’t be sure whether anything from here will come out in exams, but I urge you to at least remember the definitions of the few important terms in this section. By the way, revise your distributions!