Lab 12

Bivariate Statistics

 

This lab exercise corresponds directly with the lecture by the same name. By now, you are familiar with the format of [R logo] exercises. After a brief repeat of the concept, a short example of its application in [R logo] is given and where applicable particular idiosyncrasies (oddities) of [R logo] reported. This is followed up with a lab question, where you are to apply to concepts learned. In this lab, we have four steps covering four related topics. There is a brief question incorporated into each of these four steps.

 

Step 1  Two-sample difference of means test

Geographers use two primary methods to compare and test means from two independent samples for significance of difference. If the sample data and population distributions meet the requirements the appropriate independent sample difference of means procedure would be the t  test. In this case, the sample data would have to be measured on the interval or ratio scale and the samples must be drawn from normally distributed populations. Conversely, if the sample data are measured at the ordinal or nominal scale, a non-parametric procedure is required. The Wilcoxon rank sum W test is most commonly used. We discussed the t  test in the previous lab and will therefore concentrate here on the non-parametric tests. Be it just mentioned here that instead of comparing a sample to a population or an expected value, we now compare it with a second sample.

Rather than using parameters like mean and variance, non-parametric techniques use the ranks of sample observations to measure the magnitude of the differences in ranked positions between the two sets of sample data. The Wilcoxon test uses a variation of the t  test to see if the sum of sample ranks is significantly different from what it should be if the two samples are actually drawn from the same population. The test statistic W for the two-sample Wilcoxon procedure is

 

where  is the mean rank  with nS being the small and nB the bigger sample.

 

The similarity in philosophy between the t and the Wilcoxon W tests is mirrored by their use in [R logo]. The airquality data set built into [R logo] lists among other things ozone readings for the months May through September. List the contents of the airquality data set, plot it by month, and then run the t and W statistics.

>  airquality

> boxplot (Ozone ~ Month, data = airquality)   # notice the use of the ~ tilde operator to specify that Ozone is described by Month

>  t.test(Ozone ~ Month, data = airquality, subset = Month %in% c(5, 8))

> wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(5, 8))

Now you will conduct a similar analysis on urban microclimates. It is widely recognized that climate characteristics in cities differ considerably from those in the surrounding country-side. In “The Climate of Cities” (Scientific American, 217:15-23), William Lowry (1967) discusses the basic influences that set a city’s climate apart from that of the surrounding area. The washclim dataset has been derived from this article and will serve you to determine whether temperature and precipitation differ significantly between the city and countryside in the Washington, DC area. The maps beneath will help you to visualize the problem. The city is signified as U (urban) whereas the countryside is signified as N (non-urban).

Question 1   Have a look at the washclim data set and
(a) determine what are the dependent and what are the independent variables,
(b) describe in plain English the two null hypotheses H0.

Question 2   Use the washclim data set to
(a) conduct a difference of means t  test with the interval/ratio based data, and
(b) apply the Wilcoxon rank sum test to the same dataset after you downgraded it to an ordinal scale.

Both the t  and the Wilcoxon test require the two samples to be independent from each other. If this requirement is not met, a matched-pairs difference test is in order.

 

Step 2  Dependent sample difference test

Geographers often want to determine if two sets of values defined for one group of individuals or spatial locations differ. For examples, suppose one wishes to compare the number of migrants moving into a set of counties with the number of migrants leaving the same set of counties. While at first sight, a difference of means t  or Wilcoxon rank sum W  may seem appropriate, these tests do not qualify because they require independent samples. The dependent sample difference test measures the differences within matched pairs of observations. As before, we separate a matched-pair t  test for ration or interval data from the Wilcoxon matched-pairs signed ranks test for ordinal or non-parametric data.

The null hypothesis in the matched-pairs problem states that the mean difference for all matched pairs, , in the population equals zero. The test statistic is calculated analogous to the regular t  test (mp stands for matched-pair and d for difference). If the difference is non-zero but small, we will not reject this null hypothesis. The challenge now lies in figuring out what difference is big enough to suggest that the two samples do indeed come from different sub populations rather than one homogenous one. As usual, this can only be answered by the domain scientist and not the statistician. The Wilcoxon matched-pairs signed ranks test statistic looks frightening but is actually quite simple and again analogous to the one based on independent samples:

 

Let’s look at a geographic example. Global warming has been recognized as one of the most severe threats to life on this planet as we know it. A way to test for global warming would be to record temperatures at a single set of locations over a time period. Because this problem uses only one variable collected at two different time periods, a matched-pair or dependent sample difference test is most appropriate. The corntemp dataset consists of a sample of 28 weather stations located within the American Corn Belt with the average annual temperature calculated for a two-year period in the mid-1970’s and in the mid-1990’s. Of the 28 stations, 18 showed warmer average temperatures over the 20-yeaer period, 7 showed colder averages, and 3 stations showed no change. Over all 28 stations, the temperature rose on average 0.2 degrees Celsius.

Question 3   Use the corntemp dataset to calculate the likelihood of committing a Type I error. How (statistically) confident are you in your judgment?
You would want to calculate the average difference, the standard error of that difference, the tmp statistic and a (one-tailed) p value.

 

Step 3  Chi-square goodness-of-fit test

Depending on the circumstances, a set of data can be tested for “fit” against a variety of expected, sometimes theoretical frequency distributions. The expected frequency distribution would for example be the uniform one, if we were to examine nutrient runoff levels from a series of adjacent agricultural land use tributaries running into the same bay. If, on the other hand, we know that the distribution of farmland among the five tributaries varies widely, then we might want to hypothesize a distribution that is proportional to the amount of farmland in each watershed. Again, we distinguish between cases of interval/ratio data on one side and ordinal or nominal data on the other. This time, we start with the latter, and the appropriate test is known as the Chi-square (c2) Goodness-of-Fit test.

When applied as a goodness-of-fit test, the chi-square statistic compares observed frequency counts of a single variable with an expected distribution of frequency counts. The null hypothesis states that the population from which the sample is drawn fits an expected distribution; thus there is no difference between observed and expected frequency counts. The formula for the chi-square test statistic is

 

With k = number of nominal or ordinal categories.

Example:

A newly appointed school district superintendent wants to conduct a geographic study that compares SAT test scores among graduating seniors from five comparably sized high schools in different parts of New York City. Are students from all schools performing equally (that is uniform) well? Here is the table of the number of SAT scores above national median for each of the five schools:

John Muir

Gifford Pinchot

Rachel Carson

Garrett Hardin

Aldo Leopold

42

45

51

47

60

Ei = SOi / k = 245/5 = 49

c2 = (42-49)2/49  +  (45-49)2/49  +  (51-49)2/49  +  (47-49)2/49  +  (60-49)2/49 = 3.959, and hence p = .412

This is not a particularly conclusive result – but then, this is what we often get with real data.

Now it is your turn to apply the chisq.test() function in [R logo].

Question 4   In 747 cases of “Rocky Mountain spotted fever” from the western United States, 210 patients died. Out of 661 cases from the eastern United States, 122 died. Is the difference statistically significant?

 

Step 4  Kolmogorov-Smirnov goodness-of-fit test

For nominal or ordinal data, we can just list the expected values and compare them with the observed ones. For interval or ratio data, we rather compare our observed sample values with those of an expected frequency distribution such as Poisson or Normal distribution. The Kolmogorov-Smirnov Goodness-of-Fit test is commonly used by geographers (and others) to test for normality. More specifically, the Kolmogorov-Smirnov test compares the cumulative relative frequencies of the observed sample data with the cumulative frequencies expected for a perfectly normal distribution. To calculate an observed cumulative frequency distribution, individual data are ranked from low to high and then the absolute maximum difference is calculated between the observed and the expected value based on one distribution or another. For instance, in a previous lab exercise, you were asked to visually check whether the BWI precipitation data is normally distributed. You can (and will) do this now with the Kolmogorov-Smirnov test.

The Kolmogorov-Smirnov test is implemented in [R logo] as ks.test(). The first parameter is always the sample data set. If the second parameter is numeric, a two-sample test of the null hypothesis that both samples were drawn from the same continuous distribution is performed. Alternatively, the second parameter can be a character string naming a continuous distribution function.  In this case, a one-sample test is carried out of the null that the distribution function which generated the sample is distribution 'xyz'.

Question 5   Read the BWI data that you know so well by now and use the K-S test to check whether the data is indeed normally distributed.

With this, we covered all the material for this lab exercise. Let’s now apply the c2 test in a real geographic problem setting. An intermediate-sized community contains four city parks and nine identifiable neighborhoods (see map below). Members of the city park and recreation department have been collecting information on park usage by neighborhood. The following visitation data have been collected:

 

Neighborhood

Park ID and Name

1

2

3

4

5

6

7

8

9

A: Greenwood Park

120

312

110

76

195

62

58

62

52

B: Hochwald Park

70

110

410

53

82

101

80

60

78

C: Lathrop Park

60

62

61

94

114

85

74

280

96

D: Watson Park

35

40

110

38

55

98

512

42

82

Question 6a For each of the four parks use a chi-square goodness-of-fit test to determine whether the number of visitors to the park from the nine neighborhoods is uniformly distributed. Record the p-values for visitations to each of the four parks.

On further consideration, the community recreation planners suspect that some form of spatial interaction model might provide a more valid prediction of attendance at the four city parks. To estimate the number of visitors from each neighborhood, a simple potential model is applied that hypothesizes park attendance as being directly related to neighborhood population size and inversely related to distance from the centroid of the neighborhood to the park.

 

Using this model, the expected number of visitors to each park from each neighborhood is calculated:

 

Neighborhood

Park ID and Name

1

2

3

4

5

6

7

8

9

A: Greenwood Park

106

370

98

71

179

62

50

57

54

B: Hochwald Park

68

122

414

47

71

96

81

55

90

C: Lathrop Park

52

55

58

80

126

91

71

296

97

D: Watson Park

34

38

116

44

50

97

507

48

78

Question 6b For each of the four parks use the data from the table of expected number of visitors from each neighborhood and apply chi-square as a goodness-of-fit test to determine whether the number of visitors to the park from the nine neighborhoods is proportionally distributed as predicted by the spatial interaction model. Record the p-values for visitations to each of the four parks.
Hint: You would make your life easier if you read the observed vs. the expected values into a matrix of nine values to a row describing a neighborhood.

Record all your answers to the above six questions and send them in an ASCII email message to Jing Li.