Lab 11

Estimation in Sampling

 

A random sample is a set of values drawn independently from a larger population. A (uniform) random sample has the characteristic that all members of the population have an equal chance of being drawn. In the previous lab, we looked at the implications of drawing repeated random samples from a population, where the probability for a value is proportional to the normal density. We will now expand this by introducing the use of standard error to assess estimation accuracy. Confidence intervals and tests of hypotheses formalize this approach.

Estimated time to complete this lab: 90 minutes

 

A basic distinction can be made between point estimation and interval estimation. The concept of point estimation is straightforward; a statistic is calculated from a sample and then used to estimate the corresponding population parameter. With probability sampling, the “best” (unbiased) point estimate for the population mean is the sample mean.

How precise are sample point estimators? Because probability sampling involves some uncertainty, it is unlikely that a sample statistic will exactly equal the true population parameter. What can be determined is the likelihood that a sample statistic is within a certain range or interval of the population parameter. This confidence interval represents the levels of precision associated with the population estimate. Its width is determined by (1) the sample size, (2) the amount of variability in the population, and (3) the probability level or level of confidence selected for the problem.

Suppose a random sample of size n is drawn from a population, and the mean of that sample is calculated. Now suppose a second sample of size n is drawn and its mean is calculated. If the process is repeated for many independent samples of the same size from the population, the frequency distribution of this set of sample means can be graphed. This curve is referred to as the sampling distribution of sample means. Now let’s do this in R.

 

Step 1         Start R and change your working directory to U:/GTECH201/R (or another folder of your choice)

 

Step 2         Read a population and create a random sample

            In the following, it is important that you copy and paste the history of your R commands into your web submission page (or the ASCII-based email message that you send Jing Li if you prefer not to publish your steps on the web).

            We work with random samples here, and the exact results can differ from student to student. The ‘how’ is therefore as important as, if not more important than, the ‘what’ of your answers.

Question 1a   Read the precipitation data (precip.txt) and create a population consisting of all the readings for BWI.

       b   Now create from that population ten independent samples of size n=5, each to be assigned to a new variable.

       c   Calculate the means of these ten samples and assign them to a new vector.

       d   Draw a frequency distribution of the whole population and then draw the curve of the sampling distribution of sample means over it, without erasing the underlying population curve.

(Hint: in order to prevent one graphics output from erasing the previous one, you have to set a drawing parameter. par(new=TRUE) instructs R to draw to the graphics screen as if it were a new device. Remember to reset this parameter when you want to stop drawing one figure on top of another.)
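One possible sketch of this workflow (the variable name bwi and the stand-in population generated with rnorm() are placeholders for illustration; your actual population comes from precip.txt):

```r
# Stand-in for the BWI precipitation population; replace with the
# values you read from precip.txt.
bwi <- rnorm(60, mean = 40, sd = 4)

# b: ten independent samples of size n = 5
samples <- lapply(1:10, function(i) sample(bwi, size = 5))

# c: the ten sample means collected into one vector
means <- sapply(samples, mean)

# d: population curve first, then the sample-means curve on top
plot(density(bwi), main = "Population vs. sampling distribution")
par(new = TRUE)                 # draw over the existing figure
plot(density(means), axes = FALSE, xlab = "", ylab = "", main = "")
par(new = FALSE)                # reset when done
```

The axes=FALSE and empty labels in the second plot() call keep the overlaid curve from printing a second set of axis annotations on top of the first.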

What you will hopefully observe (it would be more convincing if you created 100 samples, but that would be a bit tedious as a lab exercise) is generally known as the Central Limit Theorem. The likelihood that a sample mean differs only slightly from the population mean is higher than the likelihood that a sample mean differs greatly from the population mean.

 

Step 3         Observe the effect of sample size

In the previous step, you worked with a sample size under 10% of the whole population. How would the sampling distribution of the sample means react if our samples were relatively large?

Question 2     Repeat the above exercise but now with sample size n=30. How does the result compare with that of Question 1?

Step 4         Standard error of the mean

The central limit theorem also provides insight into the variability of the sample means. According to this theorem, the standard deviation of the sampling distribution of the means is equal to the population standard deviation divided by the square root of the sample size.

σ_x̄ = σ / √n

This measure is called the standard error of the mean. Notice the similar logic between standard deviation and standard error. The standard deviation indicates how much a typical value is likely to differ from the mean of a set of values. In a similar way, the standard error of the mean indicates how much a typical sample mean is likely to differ from the true population mean. Quite simply, standard error is a basic measure of the amount of sampling error in a problem.
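A quick way to convince yourself of this relationship, using an artificial population purely for illustration:

```r
set.seed(1)                               # reproducible illustration
pop <- rnorm(10000, mean = 40, sd = 8)    # artificial population
n <- 25

# Theoretical standard error of the mean: sigma / sqrt(n), about 1.6 here
sem <- sd(pop) / sqrt(n)

# Empirically: the standard deviation of many sample means approaches sem
means <- replicate(2000, mean(sample(pop, n)))
sd(means)                                 # close to sem
</imports>
```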

Question 3     Calculate the standard errors of the mean for your results from Questions 1 and 2.

 

Step 5         Confidence intervals

Suppose a geographer wants to place a confidence interval about a sample mean with 95% certainty that the interval range contains the actual population mean. The general formula for a confidence interval is

x̄ + SEM × P%      (P% taken at 0.025 for the lower and at 0.975 for the upper bound of a 95% interval)

where P% stands for the percentile of whatever probability distribution P we assume (usually the normal distribution). If our sample size is n=5, our sample mean is 83, and our standard deviation is 12, then we can compute the relevant quantities as follows:

> xbar = 83

> stdev = 12

> n = 5

> sem = stdev/sqrt(n)     # sem = standard error of the mean

> lower = xbar + sem * qnorm(0.025)

> upper = xbar + sem * qnorm(0.975)

We thus find a 95% confidence interval for the population mean to lie between lower (72.48) and upper (93.52).
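The two qnorm() calls can be wrapped in a small helper that works for any confidence level; ci() is our own name here, not a built-in R function:

```r
# Normal-theory confidence interval about a sample mean.
# level = 0.95 reproduces the calculation above.
ci <- function(xbar, stdev, n, level = 0.95) {
  sem <- stdev / sqrt(n)        # standard error of the mean
  alpha <- 1 - level            # total probability outside the interval
  xbar + sem * qnorm(c(alpha / 2, 1 - alpha / 2))
}

ci(83, 12, 5)          # the 95% interval: roughly 72.48  93.52
ci(83, 12, 5, 0.90)    # a narrower 90% interval
```

Note that a lower confidence level produces a narrower interval: we are less certain that the shorter range still contains the population mean.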

This procedure avoids the use of so-called Z values, which I referred to in my lecture and with which unfortunate folks without R will have to contend. Z values are read from tables, which tabulate, for the normal distribution, within how many (fractions of) standard deviations of the sample mean the population mean will lie at various confidence levels.

Several terms are used when making interval estimates in sampling. The confidence level refers to the probability that the interval surrounding the sample mean encompasses the true population mean. The significance level, on the other hand, refers to the probability that the interval surrounding the sample mean fails to encompass the true population mean; the confidence level is therefore defined as 1 minus the significance level. A significance level of .05 means that if our observed mean falls outside the interval containing 95% of all normally distributed sample means around the population mean, our sample is extraordinary and deserves further scrutiny (it is significant).

Which leads us to the next topic: hypothesis testing.

 

Step 6         t test

Let’s assume that our sample data come from a normally distributed population. We thus have x1, …, xn, assumed to be independent realizations of random variables with distribution N(μ, σ²), which denotes a normal distribution with mean μ and variance σ², and we wish to test the null hypothesis that μ = μ0. We can estimate the parameters μ and σ by the empirical mean and standard deviation, although we must realize that we could never pinpoint their values exactly. For normally distributed data, the rule of thumb is that there is a 95% probability of staying within plus or minus two standard deviations. Formally, you calculate

t = (x̄ − μ0) / SEM

where SEM stands for the standard error of the mean (s/√n) that we discussed in step 4. The t test in R is very straightforward: t.test(sample, mu=population_mean). Now let’s apply our knowledge in R by testing whether the precipitation records at BWI airport are in any way special compared with the others from eastern Maryland.

Question 4     Read the precipitation data from the previous two labs and test whether the readings for BWI airport are within the ordinary range of precipitation for eastern Maryland.

Let’s have a closer look at what the various outputs of the t.test() function mean. The t value tells us within how many standard errors of the hypothesized population mean our sample mean lies. This is related to the p value: the probability of observing a sample mean at least this far from the population mean if the null hypothesis were true. Typically, p < 0.05 is used as the threshold for a significance level of 5%. Finally, we are given a 95% confidence interval, which is exactly what we calculated in step 5.

One aspect that we have not covered yet, and that the t test statistic gives us as well, is the notion of degrees of freedom. If a particular problem has a sample size of n, the problem can be thought of as starting with n degrees of freedom. Whenever a parameter must be estimated to calculate a test statistic, one degree of freedom is lost. For example, in the t test above, one degree of freedom is lost because only one population parameter is estimated. We will revisit the concept of degrees of freedom in the following weeks, when we discuss other inferential techniques (ANOVA, chi-square, correlation).
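To see where the t value and the degrees of freedom come from, you can compute the t statistic by hand from the formula in this step and compare it with the t.test() output (the sample below is arbitrary illustration data):

```r
set.seed(42)                    # reproducible illustration data
x <- rnorm(10, mean = 1, sd = 2)
mu0 <- 0                        # null hypothesis: population mean is 0

# The t statistic by hand: (sample mean - mu0) / SEM
sem <- sd(x) / sqrt(length(x))
t_manual <- (mean(x) - mu0) / sem

fit <- t.test(x, mu = mu0)
fit$statistic                   # matches t_manual
fit$parameter                   # degrees of freedom: n - 1 = 9
```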

Question 5a   Generate a random sample of 10 numbers from a normal distribution with mean 0 and standard deviation 2.  Use t.test() to test the null hypothesis that the mean is 0.

       b   Generate a random sample of 10 numbers from a normal distribution with a mean of 1.5 and a standard deviation of 2. Again use t.test() to test the null hypothesis that the mean is 0.

 

Copy all the R instructions and the output that lead to your successful completion of your lab answers to a text document (ASCII format) and send that file to Jing Li.