Lab 09
Descriptive Statistics
This lab exercise has two main goals. The first is to introduce you to so-called
exploratory data analysis (EDA). Here, we use graphics to gain insight into the
characteristics of a data set, which then form the basis for selecting
subsequent analysis methods. The other goal is to familiarize yourself with the
descriptive statistics functions provided by R.
Since R is very comfortable to use, there is not really much
to learn – the concepts were already introduced in the lecture.
Estimated time to complete this lab: 100 minutes
Recall that the overall goal of descriptive statistics is to provide a concise, easily
understood summary of the characteristics of a particular data set. There are two
ways of doing that. One is to look at the distribution of our data points
graphically: we try to let the data speak for themselves before or as part of a
formal analysis. An effective EDA display presents data in a way that makes
effective use of the human brain’s ability to recognize patterns. There is a risk,
however, that you will see patterns that are merely a result of looking too hard.
That is where the other way comes in: the summary statistics discussed in the
lecture help to verify what we believe we are seeing.
The histogram is a basic EDA tool. It gives a graphical representation of the
frequency distribution of a data set. The area of each rectangle of a histogram
is proportional to the number of observations whose values lie within the
width (also known as the bin) of the rectangle.
Step 1 Start R and change your working directory to U:/GTECH201/R (or another folder of your choice)
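The working directory can also be changed from the R prompt. The lines below are only a sketch: lab.dir and old.dir are names made up for this illustration, and tempdir() is a stand-in so that the example runs on any machine. In the lab you would put "U:/GTECH201/R" (or your own folder) in its place.

```r
lab.dir = tempdir()          # stand-in folder (assumption); in the lab use "U:/GTECH201/R"
old.dir = setwd(lab.dir)     # setwd() returns the previous working directory
getwd()                      # confirm where R now reads and writes files
list.files()                 # files that can be opened without typing a path
setwd(old.dir)               # go back; only needed for this demonstration
```

Files such as precip.txt are found by read.table() without a path only if they sit in the working directory.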
Step 2 Create a random data set and plot a histogram of it
> x = rnorm(50)
> hist(x)
By specifying breaks=n in the hist call, you get only approximately n bars in the
histogram, since the algorithm tries to create “pretty” cutpoints. You can have full control over
the interval divisions by specifying breaks as a vector, rather than a number.
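The difference between the two forms of breaks can be inspected without drawing anything. The following is a minimal sketch, assuming a fixed seed and a bin width of 0.5; my.breaks, h1, and h2 are names chosen for this illustration.

```r
set.seed(1)                      # assumption: fixed seed so the sketch is repeatable
x = rnorm(50)

# breaks as a number is only a suggestion; R picks "pretty" cutpoints
h1 = hist(x, breaks=12, plot=FALSE)
length(h1$counts)                # number of bars R actually uses, may differ from 12
h1$breaks                        # the cutpoints R chose

# breaks as a vector is taken literally; it must span the whole range of x
my.breaks = seq(floor(min(x)), ceiling(max(x)), by=0.5)
h2 = hist(x, breaks=my.breaks, plot=FALSE)
all.equal(h2$breaks, my.breaks)  # the cutpoints are exactly the ones supplied
sum(h2$counts)                   # all 50 observations fall into some bin
```

Dropping plot=FALSE draws the corresponding histogram instead of only returning its numbers.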
Step 3 Play with histogram break points
Altman
(1991, pp. 25-26) contains an example of accident rates by age group. These are
given as a count in age groups 0-4, 5-9, 10-15, 16, 17, 18-19, 20-24, 25-59,
and 60-79 years of age. The data can be entered as follows:
> mid.age = c(2.5, 7.5, 13, 16.5, 17.5, 19, 22.5, 44.5, 70.5)
> acc.count = c(28, 46, 58, 20, 31, 64, 149, 316, 103)
> age.acc = rep(mid.age, acc.count)
> hist(age.acc)
1 Interpret the histogram displayed. Why are there gaps? Which age
group is the most threatened by car accidents?
Now let’s see whether introducing break points that reflect the way the data
were aggregated changes our picture.
> brk = c(0, 5, 10, 16, 17, 18, 20, 25, 60, 80)
> hist(age.acc, breaks=brk)
2 What is the difference? Which age group is
now the most threatened by car accidents? Which of the two histograms
explains the data set better?
You can place the two outputs next to each other by redefining the graphics parameters
for R.
The par() function is extremely complex, i.e., it has an enormous number of parameters.
Don’t let this frighten you though; for now all we need is the mfrow parameter, which stands for multiframe,
row-wise.
> par(mfrow=c(1,2))
> hist(age.acc)
> hist(age.acc, breaks=brk)
> par(mfrow=c(1,1)) # you need to reset the multiframe command to (1,1) if you don’t want to
                    # continue plotting everything to alternate graphics frames
As you
might have guessed, there is also a mfcol parameter to plot columnwise…
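To see the column-wise filling order, and a way to restore the previous layout without typing it by hand, consider the sketch below. It uses made-up random data and a 2 x 2 grid; old.par is a name chosen for this illustration. It relies on par() returning the old values of the parameters it changes.

```r
set.seed(2)                           # assumption: made-up data for illustration
x = rnorm(50)

old.par = par(mfcol=c(2,2))           # 2 x 2 grid, filled column by column
hist(x)                               # top left
boxplot(x)                            # bottom left (mfcol fills downwards first)
plot(sort(x), type="l")               # top right
plot(x)                               # bottom right
par(old.par)                          # restore the layout that was active before

par("mfrow")                          # the layout is a single frame again
```

With mfrow=c(2,2) instead, the second plot would land in the top right frame.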
Step 4 Cumulative frequency polygon
The cumulative frequency polygon, or ogive, shows for each value x the fraction of the data
that is smaller than or equal to x. That is, if x is the kth smallest
observation, then the proportion k/n
of the data is smaller than or equal to x.
For the precipitation readings at Baltimore-Washington International (BWI)
airport, the ogive can be constructed as follows:
> precip = read.table(file="precip.txt", header=TRUE)
> year = precip[,1]   # extract the first field
> bwi = precip[,2]    # extract the readings for BWI (there are other station readings in that file as well)
> hist(bwi)           # draws a histogram of annual precipitation values at BWI
> n = length(bwi)
> plot(sort(bwi), (1:n)/n, type="l", ylim=c(0,1))   # type = l as in line, not the number 1
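R also offers ecdf(), which computes the same proportions k/n. The sketch below compares it with the manual construction. Since precip.txt is not reproduced here, synthetic values (an assumption) stand in for the BWI readings; y, manual, and Fn are names chosen for this illustration.

```r
set.seed(3)                        # assumption: synthetic values stand in for the BWI readings
y = rnorm(40, mean=40, sd=8)
n = length(y)

manual = (1:n)/n                   # k/n for the kth smallest observation
Fn = ecdf(y)                       # built-in empirical cumulative distribution function
all.equal(Fn(sort(y)), manual)     # TRUE as long as there are no tied values

plot(sort(y), manual, type="l", ylim=c(0,1))   # the ogive as constructed in Step 4
plot(Fn, add=TRUE, col="grey")                 # ecdf() overlaid as a step function
```

Fn is itself a function, so Fn(45) returns the fraction of values smaller than or equal to 45.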
Step 5 Boxplots
A boxplot, also known as a box-and-whiskers plot, is a graphical
summary of a distribution. The box in the middle indicates the “hinges” (nearly
quartiles, see the help page on boxplot.stats) and the
median. The lines (whiskers) show the largest and smallest observations that fall
within a distance of 1.5 times the box size from the nearest hinge. If any
observations fall further away, the additional points are considered extreme
and are shown separately.
> boxplot(bwi)
As you can
see, there are two years (1979, 2003) that were exceptionally wet.
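The numbers behind a boxplot can be read off with boxplot.stats(). The following sketch uses ten made-up values (an assumption, not the BWI readings) with one obviously extreme observation; z, b, and box.size are names chosen for this illustration.

```r
z = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)   # assumption: made-up values, not the BWI readings

fivenum(z)                 # minimum, lower hinge, median, upper hinge, maximum
b = boxplot.stats(z)       # uses the factor 1.5 (argument coef) by default
b$stats                    # whisker ends, hinges, and median as drawn by boxplot()
b$out                      # points beyond the whiskers, shown separately in the plot

box.size = b$stats[4] - b$stats[2]     # distance between the two hinges
b$stats[4] + 1.5 * box.size            # furthest the upper whisker may reach
```

Applied to the lab data, boxplot.stats(bwi)$out would list the readings that boxplot(bwi) draws as separate points.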
While visualizing data in the form of one graph or another may be a good way to get
started with asking questions about a data set, the interpretation is usually
somewhat ambiguous. Hard numbers, especially when they are derived in a standardized
way, often make it easier to communicate the characteristics of a particular
data set.
A data set
can be summarized in several different ways:
· Measures of central tendency – numbers that represent the
center or typical value of a frequency distribution, such as mode, median, and
mean.
· Measures of dispersion – numbers that depict the amount
of spread or variability in a data set, such as range, interquartile
range, standard deviation, variance, and coefficient of variation.
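Most of these measures are a single function call in R. The sketch below runs through both groups on a small made-up sample (an assumption). Note that R has no built-in function for the statistical mode (the function mode() reports the storage type of an object instead), so the most frequent value is found with table(); v, counts, mode.v, and cv are names chosen for this illustration.

```r
v = c(2, 4, 4, 5, 7, 9)            # assumption: small made-up sample

# central tendency
mean(v)
median(v)
counts = table(v)                  # how often each distinct value occurs
mode.v = as.numeric(names(counts)[counts == max(counts)])   # most frequent value(s)

# dispersion
diff(range(v))                     # range as a single number (maximum minus minimum)
IQR(v)                             # interquartile range
sd(v)                              # standard deviation
var(v)                             # variance
cv = sd(v) / mean(v)               # coefficient of variation
```

If several values tie for the highest count, mode.v contains all of them.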
Geographers
must be cautious when applying descriptive statistics to spatial or locational data. The way in which a geographic problem is
structured can affect the resulting descriptive statistics. We will be talking
about the effects of boundary delineation and different levels of spatial
aggregation or different scales at the end of the statistics section of this
course.
It is very easy to calculate simple summary statistics with R.
Here is how to calculate the mean, standard deviation, variance, and median.
Step 6 Individual descriptive measures
> x = rnorm(50)
> mean(x)
> sd(x)
> var(x)
> median(x)
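Because x is random, these four calls return different numbers every time x is regenerated. If you want repeatable numbers, one option is set.seed(), which is not part of the lab steps; the seed value 201 below is an arbitrary assumption. The sketch also checks how var() and sd() relate to each other.

```r
set.seed(201)             # assumption: any fixed seed makes rnorm() repeat its draws
x = rnorm(50)
m1 = mean(x)

set.seed(201)             # same seed, same "random" numbers
x2 = rnorm(50)
identical(x, x2)          # TRUE

all.equal(var(x), sd(x)^2)                             # variance is the squared standard deviation
all.equal(var(x), sum((x - m1)^2) / (length(x) - 1))   # R divides by n - 1, not n
```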
Notice
that the example starts with the generation of an artificial data vector x of 50 normally distributed
observations. It is used in examples throughout this section of the lab. When
reproducing the examples, you will not get exactly the same results since your
random numbers will differ. Empirical quantiles may
be obtained with the function quantile, like this:
> quantile(x)
As you see, by default you get the minimum, the maximum, and the three quartiles – the 0.25, 0.50, and 0.75 quantiles, so named because they correspond to a division
into four parts. Similarly, we have deciles
for 0.1, 0.2, …, 0.9, and centiles or percentiles. To get quantiles other than the customary quartiles, you have to
provide a second argument telling quantile() which cut points (as fractions between 0 and 1) you want:
> dec = seq(0,1,0.1)
> quantile(x, dec)
The difference
between the first and third quartiles is called the interquartile range (IQR) and is sometimes used as a robust alternative to the
standard deviation. Finally, if we want everything in one go, then there is the
summary() function:
> summary(bwi)
What’s even better is that the summary() function can be applied not just
to a single numerical variable but to a whole data frame. Try:
> summary(precip)
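The quantile(), IQR, and summary() ideas above fit together as sketched below. Since precip.txt is not reproduced here, the example builds a small made-up data frame (an assumption); d, st.a, st.b, and q are names chosen for this illustration.

```r
set.seed(4)                                   # assumption: made-up data frame, not precip.txt
d = data.frame(year = 1950:1999,
               st.a = rnorm(50, mean=40, sd=8),
               st.b = rnorm(50, mean=35, sd=6))

q = quantile(d$st.a, c(0.25, 0.75))           # first and third quartile
unname(q[2] - q[1])                           # interquartile range by hand
IQR(d$st.a)                                   # same value from the built-in function

summary(d$st.a)                               # one variable
summary(d)                                    # every column of the data frame at once
```

For a data frame, summary() reports the same six numbers for each numeric column side by side.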
With this, we have come to the end of this lab exercise. The six steps above provide you with
all the information you need to perform the following five tasks, which,
together with your answers to Q1 and Q2 above, make up your lab submission.
Start Frontpage Express to create a new web page.
3 Examine
the help for function mean(), and use it to learn
about the trimmed mean. For the precipitation readings at BWI, calculate the
mean, the median, and the 10% trimmed mean. How does the 10% trimmed mean
differ from the mean for these data? Under what circumstances will the trimmed
mean differ substantially from the mean?
Write your answer clearly marked as answers to Q3 in your new web page.
4 Plot a line graph of the BWI data. Can you identify any trend? Create a new vector of ten-year averages of
the BWI data (use a one-line command that concatenates the means of vector
indices) and plot the new vector. How has the graph changed? Now do the same
thing again but start your ten-year periods in 1955 instead of 1950. Compare
the two curves side-by-side using a two-plot layout. Both plots use the
same data set, yet the results look so different. Copy the two-plot graph to
your web page and explain how one can trust statistical results if the
application of the same measure (here the mean) to the same data set has such different outcomes. Hint: these are annual
averages of daily precipitation measures.
5 Import the anonymized results of this semester’s 201 midterm from
midterm.txt.
Pick a column (student) of your choice and write that vector to an external
variable identified by the column header in midterm.txt. Is that student above
or below average?
6 Each row represents one question. Create a summary of all correct answers for each
question. Which question was not correctly answered by anyone? You know the
answer from the midterm results page on the 201 website, but what are the
necessary steps in R to arrive at that answer? Write
the steps (or copy them from your R window) in your web page.
7 In our accident data from Step 3, what is the average number
of accidents per year of age? You have seen from the first histogram that the
age groups vary greatly in width. Create a
weighted average for each age group by dividing the number of accidents
for a group by the range of years that fall into this group (e.g., five for the
1st group and one for the 4th group). Now plot a
histogram where the height of the bars represents the weighted average and the
width of each bar the bandwidth of years in each age group. Copy the graph to
your web page.
Rename
your web page lab09.answers.html and set a link to this page from your home
page. Then send an email to Jing Li announcing your
lab submission and providing him with the URL to your lab answers.