Lab 09: Descriptive Statistics

This lab exercise has two main goals. The first is to introduce you to so-called exploratory data analysis (EDA). Here, we use graphics to gain insight into the characteristics of a data set, which then form the basis for selecting subsequent analysis methods. The other goal is to familiarize yourself with the descriptive statistics functions provided by R. Since R is very convenient to use, there is not really much to learn; the concepts were already introduced in the lecture.

Estimated time to complete this lab: 100 minutes

Recall that the overall goal of descriptive statistics is to provide a concise, easily understood summary of the characteristics of a particular data set. There are two ways of doing that: one is to look at the distribution of our data points graphically, the other is to compute summary statistics. With graphics, we try to let the data speak for themselves before or as part of a formal analysis. An effective EDA display presents data in a way that makes good use of the human brain’s ability to recognize patterns. There is, however, a risk that you will see patterns that are merely a result of looking too hard. That’s when the summary statistics discussed in the lecture help to verify what we believe we are seeing.

The histogram is a basic EDA tool. It gives a graphical representation of the frequency distribution of a data set. The area of each rectangle of a histogram is proportional to the number of observations whose values lie within the width (also known as the bin) of the rectangle.
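A small sketch can make the area property concrete. It assumes a made-up vector v and deliberately unequal bin edges (both are illustrative and not part of the lab data). With unequal bins, hist() works with densities, so the area of each bar (density times bin width) should equal the share of observations falling into that bin.

```r
# Hypothetical toy data and unequal bin edges, chosen only for illustration
v = c(1, 2, 2, 3, 3, 3, 8, 9)
edges = c(0, 2, 4, 10)

# plot=FALSE returns the histogram object without drawing it
h = hist(v, breaks=edges, plot=FALSE)

widths = diff(h$breaks)         # bin widths
areas = h$density * widths      # area of each rectangle
shares = h$counts / length(v)   # fraction of observations per bin

print(rbind(areas, shares))     # the two rows agree
```

Dropping plot=FALSE draws the histogram itself, in which the wide last bin is low even though it still holds a quarter of the observations.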

Step 1 Start R and change your working directory to U:/GTECH201/R (or another folder of your choice)
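If you prefer to change the working directory from the console rather than through the menu, something along the following lines should work. The variable name lab.dir is only an illustration, and the check with dir.exists() is an added safeguard so that the sketch does not fail on a machine without the U: drive.

```r
lab.dir = "U:/GTECH201/R"   # folder named in Step 1; replace with your own path

print(getwd())              # show the current working directory

# Only switch if the folder actually exists on this machine
if (dir.exists(lab.dir)) {
  setwd(lab.dir)
}

print(getwd())              # confirm where R will now read and write files
```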

Step 2 Create a random data set and plot a histogram of it

> x = rnorm(50)

> hist(x)
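Because rnorm() draws fresh random numbers on every call, your histogram will look different each time you repeat Step 2. If you want repeatable results, one option is to fix the random seed first. This is a minimal sketch; the seed value 201 and the names a and b are arbitrary choices made for illustration.

```r
set.seed(201)        # arbitrary seed value, chosen only for illustration
a = rnorm(50)

set.seed(201)        # setting the same seed again ...
b = rnorm(50)        # ... reproduces the same 50 numbers

print(identical(a, b))
hist(a)              # this histogram looks the same on every run
```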

By specifying breaks=n in the hist call, you get approximately n bars in the histogram since the algorithm tries to create “pretty” cutpoints. You can have full control over the interval divisions by specifying breaks as a vector, rather than a number.
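The two ways of specifying breaks can be compared without drawing anything by inspecting the object that hist() returns. The sketch below is only an illustration: the seed, the names z, cuts, h.num, and h.vec, and the bin width of 0.5 are all assumptions. Note that a vector of breaks has to span the full range of the data, which is why the end points are derived from min() and max().

```r
set.seed(1)                      # arbitrary seed, only so the sketch is repeatable
z = rnorm(50)

# A single number is only a suggestion; R picks "pretty" cutpoints near it
h.num = hist(z, breaks=5, plot=FALSE)

# A vector is used exactly as given; it must cover the whole range of the data
cuts = seq(floor(min(z)), ceiling(max(z)), by=0.5)
h.vec = hist(z, breaks=cuts, plot=FALSE)

print(h.num$breaks)              # cutpoints chosen by R
print(h.vec$breaks)              # cutpoints we supplied
```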

Step 3 Play with histogram break points

Altman (1991, pp. 25-26) contains an example of accident rates by age group. These are given as a count in age groups 0-4, 5-9, 10-15, 16, 17, 18-19, 20-24, 25-59, and 60-79 years of age. The data can be entered as follows:

> mid.age = c(2.5, 7.5, 13, 16.5, 17.5, 19, 22.5, 44.5, 70.5)

> acc.count = c(28, 46, 58, 20, 31, 64, 149, 316, 103)

> age.acc = rep(mid.age, acc.count)

> hist(age.acc)
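The rep() call above turns the grouped counts back into one value per accident, each placed at the midpoint of its age group. A tiny sketch with made-up numbers (the names mids, counts, and expanded are illustrative only) shows the mechanics:

```r
# Hypothetical two-group example: midpoints 2.5 and 7.5 with counts 3 and 2
mids = c(2.5, 7.5)
counts = c(3, 2)

expanded = rep(mids, counts)   # each midpoint repeated as often as its count
print(expanded)                # 2.5 2.5 2.5 7.5 7.5
print(table(expanded))         # tabulating recovers the original counts
```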

1 Interpret the histogram displayed. Why are there gaps? Which age group is the most threatened by car accidents?

Now let’s see whether introducing break points that represent the way the data was aggregated changes our picture.

> brk = c(0, 5, 10, 16, 17, 18, 20, 25, 60, 80)

> hist(age.acc, breaks=brk)

2 What is the difference? Which age group is now the most threatened by car accidents? Which of the two histograms explains the data set better?

You can place the two outputs next to each other by redefining the graphics parameters for R. The par() function is extremely complex, i.e., it has an enormous number of parameters. Don’t let this frighten you though; for now all we need is the mfrow parameter, which stands for multiframe rowwise.

> par(mfrow=c(1,2))

> hist(age.acc)

> hist(age.acc, breaks=brk)

> par(mfrow=c(1,1)) # reset the layout to (1,1); otherwise all later plots keep being drawn in the two-panel layout

As you might have guessed, there is also a mfcol parameter to plot columnwise…
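The difference between the two parameters only shows once the layout has more than one row and more than one column. The following sketch assumes a 2 x 2 grid and four throwaway random samples; saving the return value of par() in a variable (here called old, an arbitrary name) is one common way to restore the previous layout afterwards.

```r
set.seed(9)                         # arbitrary seed for repeatability
old = par(mfcol=c(2, 2))            # 2 x 2 grid, filled column by column

for (i in 1:4) {
  # panel order: top-left, bottom-left, top-right, bottom-right
  hist(rnorm(50), main=paste("Plot", i))
}

par(old)                            # restore the previous layout
```

Replacing mfcol with mfrow in the same sketch fills the grid row by row instead, so that Plot 2 ends up in the top-right panel.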

Step 4 Cumulative frequency polygon

The cumulative frequency polygon or ogive is defined as the fraction of data that is smaller than or equal to x. That is, if x is the kth smallest observation, then the proportion k/n of the data is smaller than or equal to x. For the precipitation readings at Baltimore-Washington International (BWI) airport, the ogive can be constructed as follows:

> precip = read.table(file="precip.txt", header=T)

> year = precip[,1] # extract the first field

> bwi = precip[,2] # extract the readings for BWI (there are other station readings in that file as well)

> hist(bwi) # draws a histogram of annual precipitation values at BWI

> n = length(bwi)

> plot(sort(bwi), (1:n)/n, type="l", ylim=c(0,1)) # type = l as in line, not the number 1
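The k/n construction can be cross-checked against R's built-in ecdf() function, which returns the empirical cumulative distribution as a function. The sketch below does not use the BWI file; it assumes a hypothetical vector rain of 30 simulated values so that it runs on its own, and the names manual and Fn are illustrative.

```r
set.seed(42)                              # arbitrary seed
rain = rnorm(30, mean=40, sd=8)           # hypothetical values, not the BWI data
n = length(rain)

manual = (1:n)/n                          # k/n for the k-th smallest value
Fn = ecdf(rain)                           # function returning the same fractions

print(all.equal(Fn(sort(rain)), manual))  # both constructions agree

plot(sort(rain), manual, type="l", ylim=c(0,1))  # ogive as in Step 4
plot(Fn)                                  # step-function version of the same curve
```

The agreement is exact only when there are no tied values; with ties, ecdf() assigns every tied observation the same (highest) fraction.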

Step 5 Boxplots

A boxplot, also known as a box-and-whiskers plot, is a graphical summary of a distribution. The box in the middle indicates the “hinges” (nearly quartiles, see the help page on boxplot.stats) and the median. The lines (whiskers) show the largest/smallest observation that falls within a distance of 1.5 times the box size from the nearest hinge. If any observations fall further away, the additional points are considered extreme and are shown separately.

> boxplot(bwi)

As you can see, there are two years (1979, 2003) that were exceptionally wet.
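The numbers behind a boxplot, and the years that belong to the extreme points, can be read off with boxplot.stats(). The following sketch uses a made-up series (the vectors yr and y are hypothetical and are not the BWI readings) so that it runs without the data file; with the lab data you would use year and bwi instead.

```r
# Hypothetical readings for eight consecutive years (not the BWI data)
yr = 2001:2008
y = c(2, 3, 4, 5, 6, 7, 8, 30)

s = boxplot.stats(y)     # coef=1.5 by default, matching the 1.5 rule above

print(s$stats)           # lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$out)             # values that would be drawn as separate points

print(yr[y %in% s$out])  # years in which the extreme values occurred
```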

While visualizing data in the form of one graph or another may be a good way to get started with asking questions about a data set, the interpretation is usually somewhat ambiguous. Hard numbers, especially when they are derived in a standardized way, often make it easier to communicate the characteristics of a particular data set.

A data set can be summarized in several different ways:

· Measures of central tendency – numbers that represent the center or typical value of a frequency distribution, such as mode, median, and mean.

· Measures of dispersion – numbers that depict the amount of spread or variability in a data set, such as range, interquartile range, standard deviation, variance, and coefficient of variation.
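Some of the measures listed above have no dedicated function in base R, but they take only a line or two. The sketch below is one possible way to obtain the statistical mode, the range as a single number, and the coefficient of variation; the sample v and the names stat.mode, spread, and cv are assumptions made for illustration.

```r
v = c(2, 4, 4, 5, 7, 9, 11)   # hypothetical sample

# R's built-in mode() reports the storage type of an object, not the most
# frequent value, so we tabulate instead (with ties, the first one is returned)
tab = table(v)
stat.mode = as.numeric(names(tab)[which.max(tab)])

spread = diff(range(v))       # range as a single number: max minus min
cv = sd(v) / mean(v)          # coefficient of variation: sd relative to the mean

print(c(mode=stat.mode, range=spread, cv=cv))
```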

Geographers must be cautious when applying descriptive statistics to spatial or locational data. The way in which a geographic problem is structured can affect the resulting descriptive statistics. We will be talking about the effects of boundary delineation and different levels of spatial aggregation or different scales at the end of the statistics section of this course.

It is very easy to calculate simple summary statistics with R. Here is how to calculate the mean, standard deviation, variance, and median.

Step 6 Individual descriptive measures

> x = rnorm(50)

> mean(x)

> sd(x)

> var(x)

> median(x)

Notice that the example starts with the generation of an artificial data vector x of 50 normally distributed observations. It is used in examples throughout this section of the lab. When reproducing the examples, you will not get exactly the same results since your random numbers will differ. Empirical quantiles may be obtained with the function quantile, like this:

> quantile(x)

As you see, by default, you get the minimum, the maximum, and the three quartiles, i.e., the 0.25, 0.50, and 0.75 quantiles, so named because they correspond to a division into four parts. Similarly, we have deciles for 0.1, 0.2, …, 0.9, and centiles or percentiles. To get quantiles other than the customary quartiles, you have to provide a second parameter telling quantile() which divisions you want:

> dec = seq(0,1,0.1)

> quantile(x, dec)

The difference between the first and third quartiles is called the interquartile range (IQR) and is sometimes used as a robust alternative to the standard deviation. Finally, if we want everything in one go, then there is the summary() function:

> summary(bwi)

What’s even better is that the summary() function can be applied not just to a single numerical variable but to a whole data frame. Try

> summary(precip)
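Returning to the interquartile range mentioned in Step 6, it can be computed either from the output of quantile() or directly with the built-in IQR() function. The sketch below assumes a fresh artificial vector w (the name and the seed are arbitrary) so that it does not depend on the precipitation file.

```r
set.seed(7)                        # arbitrary seed
w = rnorm(50)                      # same kind of artificial data as in Step 6

q = quantile(w, c(0.25, 0.75))     # first and third quartiles
iqr.manual = unname(q[2] - q[1])   # third minus first quartile

print(iqr.manual)
print(IQR(w))                      # built-in function gives the same number
```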

With this, we have come to the end of this lab exercise. The six steps above provide you with all the information you need to perform the following five tasks, which, together with your answers to Q1 and Q2 above, make up your lab submission. Start Frontpage Express to create a new web page.

3 Examine the help for function mean(), and use it to learn about the trimmed mean. For the precipitation readings at BWI, calculate the mean, the median, and the 10% trimmed mean. How does the 10% trimmed mean differ from the mean for these data? Under what circumstances will the trimmed mean differ substantially from the mean?
Write your answer clearly marked as answers to Q3 in your new web page.

4 Plot a line graph of the BWI data. Can you identify any trend? Create a new vector of five-year averages of the BWI data (use a one-line command that concatenates the means of vector indices) and plot the new vector. How has the graph changed? Now do the same thing again but start your ten-year periods in 1955 instead of 1950. Compare the two curves side-by-side using a two-plot layout. Both plots use the same data set, yet the results look so different. Copy the two-plot graph to your web page and explain how one can trust statistical results if applying the same measure (here the mean) to the same data set has such different outcomes. Hint: these are annual averages of daily precipitation measures.

5 Import the anonymized results from this semester’s 201 midterm.txt.
Pick a column (student) of your choice and write that vector to an external variable identified by the column header in midterm.txt. Is that student above or below average?

6 Each row represents one question. Create a summary of all correct answers for each question. Which question was not correctly answered by anyone? You know the answer from the midterm results page on the 201 website, but what are the necessary steps in R to arrive at that answer? Write the steps (or copy them from your window) in your web page.

7 In our accident data from Step 3, what is the average number of accidents per year of age? You have seen from the first histogram that the age groups are very skewed (varied in width). Create a weighted average for each age group by dividing the number of accidents for a group by the range of years that fall into this group (e.g., five for the 1st group and one for the 4th group). Now plot a histogram where the height of the bars represents the weighted average and the width of each bar the bandwidth of years in each age group. Copy the graph to your web page.

Rename your web page lab09.answers.html and set a link to this page from your home page. Then send an email to Jing Li announcing your lab submission and providing him with the URL to your lab answers.