Lab 11
Estimation in Sampling
A random
sample is a set of values drawn independently from a larger population. A (uniform)
random sample has the characteristic that all members of the population have an
equal chance of being drawn. In the previous lab, we looked at the implications
of drawing repeated random samples from a population where the probability of
a value is proportional to the normal density. We will now expand on this by
introducing the use of standard error to assess estimation accuracy. Confidence
intervals and tests of hypotheses formalize this approach.
Estimated
time to complete this lab: 90 minutes
A basic
distinction can be made between point
estimation and interval estimation.
The concept of point estimation is straightforward; a statistic is calculated
from a sample and then used to estimate the corresponding population parameter.
With probability sampling, the “best” (unbiased) point estimate for the
population mean is the sample mean.
How
precise are sample point estimators? Because probability sampling involves some
uncertainty, it is unlikely that a sample statistic will exactly equal the true
population parameter. What can be determined is the likelihood that a sample statistic is within a certain range or
interval of the population parameter. This confidence
interval represents the levels of precision associated with the population
estimate. Its width is determined by (1) the sample size, (2) the amount of
variability in the population, and (3) the probability level or level of
confidence selected for the problem.
Suppose a
random sample of size n is drawn from a population, and the mean of that sample
is calculated. Now suppose a second sample of size n is drawn and its mean is
calculated. If the process is repeated for many similar-sized independent
samples in a population, the frequency distribution of this set of sample means
can be graphed. This curve is referred to as the sampling distribution of the sample means. Now let’s do this in R.
Step 1 Start R and change your working directory to U:/GTECH201/R (or another folder of your choice)
Step 2 Read a population and create a
random sample
In the following, it
is important that you copy and paste the history of your commands into your
web submission page (or the ASCII-based email message that you send to Jing Li if you prefer not to publish your steps on the
web).
We work with random samples here, so the exact results can differ from
student to student. The ‘how’ of your answers is therefore as important as,
if not more important than, the ‘what’.
1a Read
the precipitation data (precip.txt) and create a population consisting of all
the readings for BWI.
b Now create from that population ten independent
samples of size n=5, each to be assigned to a new variable.
c Calculate the means of these ten samples and
assign them to a new vector.
d Draw a frequency distribution of the whole
population and then a curve of the sampling distribution of the sample means
over it without erasing the underlying population curve.
(Hint: in order to prevent one graphics output from erasing the previous one,
you have to set a drawing parameter. par(new=TRUE) instructs R to draw to the
graphics screen as if it were a new device. Remember to reset this parameter
when you want to stop drawing one figure on top of another.)
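The steps above can be sketched in R as follows. The layout of precip.txt and its column names are assumptions (adjust the commented read.table() call to your actual file); so that the sketch runs on its own, a simulated stand-in population is used here instead of the real BWI readings:

```r
# Stand-in for the BWI precipitation population. In the lab you would build it
# from the file, e.g. something along these lines (column name assumed):
#   precip <- read.table("precip.txt", header = TRUE)
#   population <- precip$BWI
set.seed(42)                                   # reproducible sketch
population <- rnorm(100, mean = 40, sd = 5)    # hypothetical readings

# b: ten independent samples of size n = 5
samples <- lapply(1:10, function(i) sample(population, size = 5))

# c: the means of the ten samples, collected in one vector
sample.means <- sapply(samples, mean)

# d: density curve of the population, with the density of the sample means
#    drawn on top of it via par(new = TRUE)
plot(density(population), xlim = range(population),
     main = "Population and sampling distribution of the mean")
par(new = TRUE)
plot(density(sample.means), xlim = range(population),
     axes = FALSE, xlab = "", ylab = "", main = "", lty = 2)
par(new = FALSE)                               # reset: stop overlaying figures
```

The dashed curve of the sample means should be visibly narrower than the population curve and centered near the population mean.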
What you will hopefully observe (it would be more convincing if you created
100 samples, but that would be a bit tedious as a lab exercise) is generally
known as the Central Limit Theorem. The likelihood
that a sample mean differs only slightly from the population mean is higher
than the likelihood that a sample mean differs greatly from the population
mean.
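If you do want the more convincing picture with 100 samples, replicate() can draw them without the tedium. This is a side demonstration with a made-up population, not part of the required answer:

```r
set.seed(1)
population <- runif(1000, min = 0, max = 100)  # hypothetical population

# 100 sample means of size 5; their spread is much narrower than that of
# the individual values, roughly sd(population)/sqrt(5)
means100 <- replicate(100, mean(sample(population, size = 5)))

sd(population)   # spread of individual readings
sd(means100)     # spread of the sample means
```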
Step 3 Observe the effect of sample size
In the
previous step, you worked with a sample size under 10% of the whole population.
How would the sampling distribution of the sample means react if our samples
were relatively large?
2 Repeat
the above exercise but now with sample size n=30. How does the result compare
with that of Q 1?
Step 4 Standard error of the mean
The
central limit theorem also provides insight into the variability of the sample
means. According to this theorem, the standard deviation of the sampling
distribution of the means is equal to the population standard deviation divided
by the square root of the sample size.
This
measure is called the standard error of
the mean. Notice the similar logic between standard deviation and standard error.
The standard deviation indicates how much a typical value is likely to differ
from the mean of a set of values. In a similar way, the standard error of the mean
indicates how much a typical sample mean is likely to differ from the true
population mean. Quite simply, standard error is a basic measure of the amount
of sampling error in a problem.
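The theorem’s claim can be checked directly in R. In this sketch (again with a made-up population), the formula sd(population)/sqrt(n) is compared against the empirically observed spread of many sample means; the two should come out close:

```r
set.seed(7)
population <- rnorm(500, mean = 40, sd = 8)    # hypothetical population
n <- 30

# standard error of the mean, per the central limit theorem
sem <- sd(population) / sqrt(n)

# empirical spread of 2000 sample means; should be close to sem
empirical <- sd(replicate(2000, mean(sample(population, n))))

c(sem = sem, empirical = empirical)
```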
3 Calculate
the standard errors of the mean for your results from Q1 and Q2.
Step 5 Confidence intervals
Suppose a
geographer wants to place a confidence interval about a sample mean with 95%
certainty that the interval range contains the actual population mean. The
general formula for a confidence interval is

    sample mean ± q(P%) × SEM

where q(P%) stands for the percentile of whatever probability distribution P
we assume (usually the normal distribution) and SEM is the standard error of
the mean from step 4. If our sample size is n=5, our sample mean 83, and our
standard deviation 12, then we can compute the relevant quantities as follows:
> xbar = 83
> stdev = 12
> n = 5
> sem = stdev/sqrt(n)              # sem = standard error of the mean
> lower = xbar + sem * qnorm(0.025)
> upper = xbar + sem * qnorm(0.975)
And thus
find a 95% confidence interval for the population mean to be within lower
(72.48) and upper (93.52).
This procedure avoids the use of so-called Z values, which I referred to in my
lecture and which unfortunate folks without R will have to contend with. Z
values are read from tables, which tabulate, for the standard normal
distribution, within how many (fractions of) standard deviations of the sample
mean the population mean is expected to lie at various confidence levels.
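In R, qnorm() plays the role of the Z table. For instance, the familiar Z value of about 1.96 for a 95% interval comes straight out of:

```r
qnorm(0.975)           # upper 2.5% cut-off of the standard normal, about 1.96
qnorm(c(0.05, 0.95))   # cut-offs for a 90% interval, about -1.645 and 1.645
pnorm(1.96)            # the table lookup in the other direction, about 0.975
```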
Several
terms are used when making interval estimates in sampling. The confidence level refers to the
probability that the interval surrounding the sample mean encompasses the true
population mean. This probability is defined as 1 minus the significance level.
The significance level, on the other hand, refers to the probability that the
interval surrounding the sample mean fails to encompass the true population
mean. If our observed mean falls outside the range that covers 95% of all
normally distributed sample means around the hypothesized population mean, it
is significant at the .05 level; in other words, our sample is extraordinary
and deserves further scrutiny.
Which leads us to the next topic: hypothesis testing
Step 6 t test
Let’s
assume that our sample data come from a normally distributed population. We
thus have x1, …, xn, assumed to be independent realizations of random
variables with distribution N(μ, σ²), which denotes a normal distribution with
mean μ and variance σ², and we wish to test the null hypothesis that μ = μ0.
We can estimate the parameters μ and σ by the empirical mean and standard
deviation, although we must realize that we could never pinpoint their values
exactly. For normally distributed data, the rule of thumb is that there is 95%
probability of staying within plus or minus two standard deviations. Formally,
you calculate

    t = (sample mean − μ0) / SEM

where SEM stands for the standard error of the mean (s/√n) that we discussed
in step 4. The t test in R is very straightforward:
t.test(sample, mu=population_mean). Now let’s apply our knowledge in R by
testing whether the precipitation records at BWI airport are in any way
special compared with those others from eastern
4 Read
the precipitation data from the previous two labs and test whether the readings
for BWI airport are within the ordinary range of precipitation for eastern
Let’s have a closer look at what the various outputs of the t.test() function
mean. The t value tells us within how many standard errors of the hypothesized
population mean our sample mean lies. This is related to the p value, which
gives the probability of observing a sample mean at least this far from the
population mean if the null hypothesis were true. Typically, p < 0.05 is used
as the threshold for a significance level of 5%.
Finally, we are given a 95% confidence interval, which is exactly what we
calculated in step 5.
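These pieces of the output can also be pulled out of the result individually. The sketch below runs t.test() on a made-up sample and checks the reported t value against the hand calculation from the formula above:

```r
set.seed(3)
x <- rnorm(12, mean = 50, sd = 6)   # hypothetical sample
res <- t.test(x, mu = 48)           # H0: the population mean is 48

res$statistic                       # the t value
res$p.value                         # the p value
res$conf.int                        # the 95% confidence interval

# the t value by hand: (sample mean - mu0) / SEM
t.by.hand <- (mean(x) - 48) / (sd(x) / sqrt(length(x)))
```

The hand-computed value and res$statistic agree, which is a good way to convince yourself of what the t value measures.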
One aspect that we have not covered yet, and that the t.test() output also
reports, is the notion of degrees of freedom. If a
particular problem has a sample size of n,
the problem can be thought of as starting with n degrees of freedom. Whenever a parameter must be estimated to
calculate a test statistic, one degree of freedom is lost. For example, in the
t test above, one degree of freedom is lost because only one population
parameter is estimated. We will revisit the concept of degrees of freedom in
the following weeks, when we discuss other inferential techniques (ANOVA,
chi-square, correlation).
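You can see the lost degree of freedom directly in the output: for a one-sample t test on n observations, t.test() reports n − 1 degrees of freedom.

```r
x <- rnorm(10)          # any sample of size 10
res <- t.test(x, mu = 0)
res$parameter           # the degrees of freedom: df = 9, i.e. n - 1
```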
5a Generate
a random sample of 10 numbers from a normal distribution with mean 0 and
standard deviation 2. Use t.test() to test
the null hypothesis that the mean is 0.
b Generate a random sample of 10 numbers from a
normal distribution with a mean of 1.5 and a standard deviation of 2. Again use
t.test() to test the
null hypothesis that the mean is 0.
Copy all the instructions and the
output that lead to your successful completion of your lab answers to a text
document (ASCII format) and send that file to Jing
Li.