Chapter 1 Introduction to spatial data in R

Learning Objectives

  • Understand the structure of sf objects
  • Read spatial data into sf objects
  • Create sf objects from coordinate columns
  • Map sf data with static and interactive visuals
  • Read single- and multi-band GeoTIFF files into raster objects
  • Examine raster objects

1.1 Spatial vector data model in R

In the vector model of spatial data, we have points (Points, MultiPoints), lines (LineStrings, MultiLineStrings), polygons (Polygons, MultiPolygons), and combinations of them (GeometryCollections).

But these are pure geometries defined in Euclidean space with Cartesian coordinates. We need to refer such data to physical locations on the Earth's surface, i.e., make them “geographically referenced.” To do that, the software needs extra information about the shape of the 3D Earth surface and about how locations on that 3D surface can be projected or transformed onto 2D surfaces like our paper maps and computer screens. Spatial reference systems (SRS), or coordinate reference systems (CRS), are the solution; they have been fully developed in Geographic Information Science and Geographic Information Systems.

  • In this session, we start to learn how a CRS is defined and used; in the next session, we will examine more of its details.
  • Geographic data, by definition, are spatial data with reference to the Earth surface. Spatial data, by contrast, could refer to any surfaces or spaces. In other words, all geographic data are spatial data but some spatial data are not geographic data. While mostly interchangeable, they could be usefully differentiated in some contexts.

1.1.1 Handling spatial data in R

Spatial data analysis has always been an important component in the R ecosystem. For vector spatial data in R, there are two families of data structures based on two packages.

  • sp family: used to be de facto standard and still popular
    • sp means spatial and defines a family of spatial data classes.
    • some of the top-level classes are SpatialPointsDataFrame, SpatialMultiPointsDataFrame, SpatialLinesDataFrame, and SpatialPolygonsDataFrame.
    • The ESRI Shapefile is closest to sp in terms of data organization.
  • sf family: the newer and “modern” standard
    • sf stands for Simple Features, a specification defined by the Open Geospatial Consortium (OGC).
    • It is faster, more efficient, and consistent with other software like PostGIS.
    • sf is supported by the R Consortium and therefore “official.”
    • It is almost the same as the table format in spatial databases like PostGIS and SpatiaLite.
    • sf is tidyverse compatible and works well with tidy packages like dplyr, ggplot2, and tidycensus

One must understand that sp and sf are not two rivals in the R ecosystem. Instead, sf is more like a natural evolution of sp. First, there is a demand to unify the data models across the entire geospatial industry so that GIS software and spatial databases are compatible and interoperable. OGC is leading the development of such geospatial standards, and most organizations and products, like QGIS, PostGIS, Oracle, and ArcGIS, are increasingly adopting them. sf was the response from the R spatial community to these standards.

Second, R now has more advanced data processing features and better fundamental packages, such as those in the tidyverse. For geospatial data, more efficient open source spatial libraries have also become available, such as GEOS, GDAL, PROJ, and Boost.Geometry. The new sf package can better take advantage of these developments.

Practically, it is very easy to convert between sp and sf structures as their underlying spatial models are compatible. And some of the key developers are actually working on both packages. Many R spatial packages can use both. But in case a package can only use one of them, we must make explicit conversions, mostly from sf to sp because most classic R spatial packages are based on sp.

That being said, when and where possible, we should give the sf-family packages higher priority, as most packages are being reworked to be tidyverse compatible and some packages tied to the sp family are retiring. For this course, we focus on the sf package.

1.1.2 The sf package

The sf1 package was first released on CRAN in late October 2016. It implements a formal standard called “Simple Features” that specifies a storage and access model for spatial geometries such as points, lines, and polygons. A feature geometry is called simple when it meets certain criteria; for example, simple polygons cannot self-intersect and cannot have spikes or dangling vertices. Simple features are independent and carry no explicit information about their neighbors or other spatially connected features. This standard has been adopted widely, not only by spatial databases such as PostGIS, but also by web standards such as GeoJSON.

If you work with PostGIS or GeoJSON you may have come across the WKT (well-known text) format (Fig 1.1 and 1.2). Note that these texts are not what is actually saved or encoded in the data. They are there for users’ convenience, as they are human-readable, unlike the binary encoding.


FIGURE 1.1: Well-Known-Text Geometry primitives (wikipedia)


FIGURE 1.2: Well-Known-Text Multipart geometries (wikipedia)
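
As a small illustration, a geometry created with sf (introduced in this section) prints in exactly this WKT form; here is a minimal sketch with a made-up coordinate pair:

pt_sfg <- st_point(c(-73.97, 40.78))  # a made-up longitude/latitude pair
st_as_text(pt_sfg)                    # the WKT representation of the geometry
#> [1] "POINT (-73.97 40.78)"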

sf implements the Simple Features standard natively in R. sf stores spatial objects as a simple data frame with a special column that contains the information for the geometry coordinates. That special column is a list with the same length as the number of rows in the data frame. Each of the individual list elements then can be of any length needed to hold the coordinates that correspond to an individual feature.

How simple features in R are organized

The three classes used to represent simple features are:

  • sf, the table (data.frame) with feature attributes and feature geometries, which contains
  • sfc, the list-column with the geometries for each feature (record), which is composed of
  • sfg, the feature geometry of an individual simple feature.

FIGURE 1.3: Relationship of sf, sfc, and sfg

1.1.3 Create a spatial sf object manually

The basic steps of creating sf objects are bottom-up, that is from sfg, to sfc, then to sf.

  1. Create geometric objects.

Geometric objects (simple features) can be created from a numeric vector, a matrix, or a list with the coordinates. They are called sfg objects, for Simple Feature Geometry. In the sf package, there are functions that help create simple feature geometries, such as st_point(), st_linestring(), st_polygon(), and more.

  2. Combine all individual simple feature geometries for the special column.

The feature geometries are then combined into a simple feature collection with st_sfc(), which is nothing other than a simple feature geometry list-column. The sfc object also holds the bounding box and the projection information.

  3. Add attributes.

Lastly, we add the attributes to the simple feature collection with the st_sf() function. This function extends the well-known data frame in R with a column that holds the simple feature collection.

Here we try to create some highway objects. First, we would generate LINESTRINGs as simple feature geometries sfg out of a matrix with coordinates:

set.seed(1006) # for replicability
# runif(6) generates 6 random numbers in uniform distribution
lnstr_sfg1 <- st_linestring(matrix(runif(6), ncol=2)) 
set.seed(1043)
lnstr_sfg2 <- st_linestring(matrix(runif(6), ncol=2)) 
class(lnstr_sfg1)
#> [1] "XY"         "LINESTRING" "sfg"

We would then combine this into a simple feature collection sfc:

lnstr_sfc <- st_sfc(lnstr_sfg1, lnstr_sfg2) # combine the two geometries into one sfc
class(lnstr_sfc)
#> [1] "sfc_LINESTRING" "sfc"

And lastly, we create a small data frame with attributes and use it to generate the sf object:

dfr <- data.frame(id = c("hwy1", "hwy2"), cars_per_hour = c(78, 22)) 
lnstr_sf <- st_sf(dfr , lnstr_sfc)
class(lnstr_sf)
#> [1] "sf"         "data.frame"
plot(lnstr_sf, col='black')

1.1.4 sf methods

There are many methods available in the sf package. Most function names, particularly those related to spatial data processing and analysis, start with st_, which means Spatial Type. To find out more, use

methods(class="sf")
#>   [1] $<-                          [                           
#>   [3] [[<-                         [<-                         
#>   [5] aggregate                    anti_join                   
#>   [7] arrange                      as.data.frame               
#>   [9] cbind                        coerce                      
#>  [11] crs                          dbDataType                  
#>  [13] dbWriteTable                 distance                    
#>  [15] distinct                     dplyr_reconstruct           
#>  [17] drop_na                      duplicated                  
#>  [19] extent                       extract                     
#>  [21] filter                       full_join                   
#>  [23] gather                       group_by                    
#>  [25] group_split                  identify                    
#>  [27] initialize                   inner_join                  
#>  [29] left_join                    lines                       
#>  [31] mapView                      mask                        
#>  [33] merge                        mutate                      
#>  [35] nest                         pivot_longer                
#>  [37] pivot_wider                  plot                        
#>  [39] points                       print                       
#>  [41] raster                       rasterize                   
#>  [43] rbind                        rename                      
#>  [45] rename_with                  right_join                  
#>  [47] rowwise                      sample_frac                 
#>  [49] sample_n                     select                      
#>  [51] semi_join                    separate                    
#>  [53] separate_rows                show                        
#>  [55] slice                        slotsFromS3                 
#>  [57] spread                       st_agr                      
#>  [59] st_agr<-                     st_area                     
#>  [61] st_as_s2                     st_as_sf                    
#>  [63] st_as_sfc                    st_bbox                     
#>  [65] st_boundary                  st_break_antimeridian       
#>  [67] st_buffer                    st_cast                     
#>  [69] st_centroid                  st_collection_extract       
#>  [71] st_concave_hull              st_convex_hull              
#>  [73] st_coordinates               st_crop                     
#>  [75] st_crs                       st_crs<-                    
#>  [77] st_difference                st_drop_geometry            
#>  [79] st_exterior_ring             st_filter                   
#>  [81] st_geometry                  st_geometry<-               
#>  [83] st_inscribed_circle          st_interpolate_aw           
#>  [85] st_intersection              st_intersects               
#>  [87] st_is                        st_is_full                  
#>  [89] st_is_valid                  st_join                     
#>  [91] st_line_merge                st_m_range                  
#>  [93] st_make_valid                st_minimum_rotated_rectangle
#>  [95] st_nearest_points            st_node                     
#>  [97] st_normalize                 st_point_on_surface         
#>  [99] st_polygonize                st_precision                
#> [101] st_reverse                   st_sample                   
#> [103] st_segmentize                st_set_precision            
#> [105] st_shift_longitude           st_simplify                 
#> [107] st_snap                      st_sym_difference           
#> [109] st_transform                 st_triangulate              
#> [111] st_triangulate_constrained   st_union                    
#> [113] st_voronoi                   st_wrap_dateline            
#> [115] st_write                     st_z_range                  
#> [117] st_zm                        summarise                   
#> [119] text                         transform                   
#> [121] transmute                    ungroup                     
#> [123] unite                        unnest                      
#> see '?methods' for accessing help and source code

Here are some of the other highlights of sf that you might be interested in:

  • provides fast I/O, particularly relevant for large files

  • spatial functions that rely on the external GEOS, GDAL, and PROJ libraries are linked directly into the package, so there is no need to load additional packages (as with sp)

  • sf objects can be plotted directly with ggplot

  • sf directly reads from and writes to spatial databases such as PostGIS

  • sf is compatible with the tidyverse (but be aware of some pitfalls)

Note that sp and sf are not the only way spatial objects are conceptualized in R. Other spatial packages may use their own class definitions for spatial data (for example spatstat).

There are also packages designed specifically for particular formats such as GeoJSON, which makes them more lightweight.

Usually you can find functions that convert objects to and from these formats.

Self-Exercise

Similar to the example above, generate a Point object in R.

  1. Create a matrix pts of random numbers with two columns and as many rows as you like. These are your points.
  2. Create a dataframe attrib_df with the same number of rows as your pts matrix and a column that holds an attribute. You can make up any attribute.
  3. Use the appropriate commands and pts to create an sf object with a geometry column of class sfc_POINT.
  4. Try to subset/filter your spatial object using the attribute you have added and the way you are used to from regular data frames.
  5. How do you determine the bounding box of your spatial object?
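
If you get stuck, here is one possible solution sketch following the bottom-up approach from above (the count attribute and the number of points are made up):

pts <- matrix(runif(10), ncol = 2)                    # 5 points with x/y coordinates
attrib_df <- data.frame(count = sample(1:100, 5))     # a made-up attribute
pts_sfg <- lapply(seq_len(nrow(pts)), function(i) st_point(pts[i, ]))
pts_sfc <- do.call(st_sfc, pts_sfg)                   # combine into an sfc_POINT list-column
pts_sf  <- st_sf(attrib_df, geometry = pts_sfc)
class(st_geometry(pts_sf))                            # "sfc_POINT" "sfc"
pts_sf[pts_sf$count > 50, ]                           # subset by attribute as usual
st_bbox(pts_sf)                                       # bounding box of the spatial object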

1.2 Load and Display Data with sf

With a good understanding of the sf data structure, we are ready to load and display some real data.

1.2.1 Reading Shapefiles into R

sf utilizes the powerful GDAL library to conduct data I/O (input/output); GDAL is automatically loaded when sf is loaded. GDAL can handle numerous vector and raster data formats. The st_read() method in sf is the interface for reading vector spatial data. In this section, we mostly use the Shapefile format.

# read in a Shapefile. With the 'shp' file extension, sf knows it is a Shapefile.
nyc_sf <- st_read("data/nyc/nyc_acs_tracts.shp")
#> Reading layer `nyc_acs_tracts' from data source 
#>   `D:\Cloud_Drive\Hunter College Dropbox\Shipeng Sun\Workspace\RSpace\R-Spatial_Book\data\nyc\nyc_acs_tracts.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 2166 features and 113 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
#> Geodetic CRS:  NAD83
# take a look at what we've got
str(nyc_sf) # note again the geometry column at the end of the output
#> Classes 'sf' and 'data.frame':   2166 obs. of  114 variables:
#>  $ UNEMP_RATE: num  0 0.0817 0.1706 0 0.088 ...
#>  $ cartodb_id: num  1 2 3 4 5 6 7 8 9 10 ...
#>  $ withssi   : int  0 228 658 0 736 0 261 0 39 638 ...
#>  $ withsocial: int  0 353 1577 0 1382 99 1122 10 264 895 ...
#>  $ withpubass: int  0 47 198 0 194 0 145 0 5 334 ...
#>  $ struggling: int  0 694 2589 0 2953 337 3085 17 131 1938 ...
#>  $ profession: int  0 0 36 0 19 745 18 49 35 0 ...
#>  $ popunemplo: int  0 92 549 0 379 321 432 26 106 494 ...
#>  $ poptot    : int  0 2773 8339 0 10760 7024 10955 849 1701 5923 ...
#>  $ popover18 : int  0 2351 6878 0 8867 6637 8932 711 1241 4755 ...
#>  $ popinlabou: int  0 1126 3218 0 4305 5702 5114 657 735 2283 ...
#>  $ poororstru: int  0 2026 4833 0 7044 1041 6376 112 232 4115 ...
#>  $ poor      : int  0 1332 2244 0 4091 704 3291 95 101 2177 ...
#>  $ pacificune: int  0 0 0 0 13 0 0 0 0 0 ...
#>  $ pacificinl: int  0 0 0 0 13 0 0 0 0 0 ...
#>  $ pacific   : int  0 0 0 0 13 0 0 0 0 0 ...
#>  $ otherunemp: int  0 0 103 0 46 0 0 0 17 151 ...
#>  $ otherinlab: int  0 144 348 0 609 112 184 48 90 525 ...
#>  $ otherethni: int  0 598 1175 0 1799 112 377 48 183 1664 ...
#>  $ onlyprofes: int  0 0 102 0 20 890 116 78 77 0 ...
#>  $ onlymaster: int  0 77 412 0 292 2162 408 233 328 16 ...
#>  $ onlylessth: int  0 878 2164 0 3793 121 4384 34 115 1846 ...
#>  $ onlyhighsc: int  0 1088 4068 0 3868 5508 3882 631 1093 2343 ...
#>  $ onlydoctor: int  0 0 66 0 1 145 98 29 42 0 ...
#>  $ onlycolleg: int  0 471 2355 0 2290 5371 2223 585 884 1111 ...
#>  $ onlybachel: int  0 271 1269 0 1322 4685 1262 481 687 200 ...
#>  $ okay      : int  0 742 3474 0 3499 5982 4579 733 1469 1798 ...
#>  $ mixedunemp: int  0 16 21 0 14 24 0 0 51 31 ...
#>  $ mixedinlab: int  0 72 134 0 100 136 91 10 62 140 ...
#>  $ mixed     : int  0 175 234 0 251 224 115 16 102 334 ...
#>  $ master    : int  0 77 310 0 272 1272 292 155 251 16 ...
#>  $ maleunempl: int  0 76 349 0 179 204 197 8 72 277 ...
#>  $ maleover18: int  0 1101 3134 0 4068 3557 3992 454 613 1935 ...
#>  $ malepro   : int  0 0 36 0 19 473 48 51 61 0 ...
#>  $ malemastr : int  0 10 143 0 126 1034 181 123 139 0 ...
#>  $ male_lesHS: int  0 302 1063 0 1693 29 2013 17 103 714 ...
#>  $ male_HS   : int  0 607 1893 0 1651 2985 1702 414 485 985 ...
#>  $ male_doctr: int  0 0 24 0 0 77 30 29 38 0 ...
#>  $ male_collg: int  0 227 1225 0 997 2924 871 381 368 462 ...
#>  $ male_BA   : int  0 132 668 0 562 2411 499 288 292 43 ...
#>  $ maleinlabo: int  0 612 1640 0 2059 3299 2466 423 392 991 ...
#>  $ maledrop  : int  0 16 8 0 0 0 0 0 0 0 ...
#>  $ male16to19: int  0 65 253 0 162 37 236 20 32 187 ...
#>  $ male      : int  0 1318 3850 0 4796 3840 5244 533 897 2514 ...
#>  $ lessthanhi: int  0 878 2164 0 3793 121 4384 34 115 1846 ...
#>  $ lessthan10: int  0 212 760 0 1100 289 625 2 51 463 ...
#>  $ households: int  0 986 3382 0 3838 4104 3950 367 739 2224 ...
#>  $ hispanicun: int  0 30 269 0 115 27 0 0 68 335 ...
#>  $ hispanicin: int  0 357 1171 0 819 297 236 77 201 1145 ...
#>  $ hispanic  : int  0 1187 3503 0 2608 374 318 108 360 3478 ...
#>  $ highschool: int  0 617 1713 0 1578 137 1659 46 209 1232 ...
#>  $ geo_state : int  36 36 36 36 36 36 36 36 36 36 ...
#>  $ geo_place : int  51000 51000 51000 51000 51000 51000 51000 51000 51000 51000 ...
#>  $ geo_county: int  61 61 61 61 61 61 61 61 61 61 ...
#>  $ field_1   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ femaleunem: int  0 16 200 0 200 117 235 18 34 217 ...
#>  $ femaleover: int  0 1250 3744 0 4799 3080 4940 257 628 2820 ...
#>  $ fem_profes: int  0 0 66 0 1 417 68 27 16 0 ...
#>  $ fem_master: int  0 67 269 0 166 1128 227 110 189 16 ...
#>  $ fem_lessHS: int  0 576 1101 0 2100 92 2371 17 12 1132 ...
#>  $ fem_HS    : int  0 481 2175 0 2217 2523 2180 217 608 1358 ...
#>  $ fem_doctor: int  0 0 42 0 1 68 68 0 4 0 ...
#>  $ fem_colleg: int  0 244 1130 0 1293 2447 1352 204 516 649 ...
#>  $ fem_BA    : int  0 139 601 0 760 2274 763 193 395 157 ...
#>  $ femaleinla: int  0 514 1578 0 2246 2403 2648 234 343 1292 ...
#>  $ femaledrop: int  0 0 1 0 0 0 0 0 0 14 ...
#>  $ femal16_19: int  0 84 124 0 271 2 126 2 32 242 ...
#>  $ female    : int  0 1455 4489 0 5964 3184 5711 316 804 3409 ...
#>  $ europeanun: int  0 14 328 0 56 188 71 26 22 132 ...
#>  $ europeanin: int  0 303 1641 0 525 4058 492 540 543 510 ...
#>  $ european  : int  0 540 4091 0 1181 4975 929 717 1327 1645 ...
#>  $ doctorate : int  0 0 66 0 1 145 98 29 42 0 ...
#>  $ com_90plus: int  0 49 101 0 36 61 82 17 27 178 ...
#>  $ comm_5less: int  0 99 0 0 1 220 46 26 0 0 ...
#>  $ comm_60_89: int  0 40 215 0 370 136 687 32 41 331 ...
#>  $ comm_5_14 : int  0 121 226 0 331 913 718 87 58 99 ...
#>  $ comm_45_59: int  0 142 171 0 284 287 409 46 84 351 ...
#>  $ comm_30_44: int  0 217 923 0 1442 1514 1118 198 210 493 ...
#>  $ comm_15_29: int  0 352 970 0 1250 1840 1363 192 134 304 ...
#>  $ college   : int  0 200 1086 0 968 686 961 104 197 911 ...
#>  $ bachelor  : int  0 194 857 0 1030 2523 854 248 359 184 ...
#>  $ asianunemp: int  0 62 38 0 207 108 352 0 16 61 ...
#>  $ asianinlab: int  0 559 615 0 2736 1286 4283 42 35 491 ...
#>  $ asian     : int  0 1249 1724 0 6549 1598 9448 51 75 922 ...
#>  $ americanun: int  0 0 0 0 0 1 0 0 0 24 ...
#>  $ americanin: int  0 0 0 0 57 1 0 0 0 24 ...
#>  $ american  : int  0 43 0 0 57 1 0 0 0 51 ...
#>  $ africanune: int  0 0 59 0 43 0 9 0 0 95 ...
#>  $ africaninl: int  0 48 434 0 265 109 64 17 5 593 ...
#>  $ african   : int  0 168 1115 0 910 114 86 17 14 1307 ...
#>  $ puma      : chr  "3810" "3809" "3809" "3810" ...
#>  $ ntaname   : chr  "park-cemetery-etc-Manhattan" "Lower East Side" "Lower East Side" "park-cemetery-etc-Manhattan" ...
#>  $ ntacode   : chr  "MN99" "MN28" "MN28" "MN99" ...
#>  $ ctlabel   : chr  "1" "2.01" "2.02" "5" ...
#>  $ cdeligibil: chr  "I" "E" "E" "I" ...
#>  $ boroname  : chr  "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
#>  $ medianinco: num  NA 17282 24371 NA 18832 ...
#>  $ medianagem: num  NA 39.3 44.9 NA 43.4 29.7 42 33.2 41.5 31.2 ...
#>  $ medianagef: num  NA 43.8 47.9 NA 43 28.3 43 31.9 45.7 47.5 ...
#>   [list output truncated]
#>  - attr(*, "sf_column")= chr "geometry"
#>  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "names")= chr [1:113] "UNEMP_RATE" "cartodb_id" "withssi" "withsocial" ...

Two more notes about the geometry column: first, it does not have to be named “geometry”. You can name this column any way you wish; “geom”, for example, is another popular name for the geometry column. Secondly, you can remove this column and revert to a regular, non-spatial data frame at any time with st_drop_geometry().
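
For instance, dropping the geometry returns a plain data frame (a quick sketch using the nyc_sf object loaded above):

nyc_df <- st_drop_geometry(nyc_sf)   # remove the geometry list-column
class(nyc_df)                        # back to a plain "data.frame", no longer "sf"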

1.2.2 Simple sf plotting

The default plot of an sf object is a multi-plot of the first attributes, with a warning if not all can be plotted. By default, it plots the first 9 columns:
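
# For example, the following call would produce the default multi-panel plot
# (a warning notes that only the first 9 attributes are drawn):
plot(nyc_sf)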

In order to only plot the polygon boundaries we need to directly use the geometry column. We use the st_geometry() function to extract it. We can also select a single column or a few columns to draw.

# pure geometry
plot(st_geometry(nyc_sf), main='Pure geometry with st_geometry function')

# Only the college column, with geometry
plot(nyc_sf['college'], main='One column with ["college"]')

# Two columns, with geometry
plot(nyc_sf[c('college', 'fem_colleg')], main='Two Columns with ["college", "fem_colleg"]')

Let’s add a subset of polygons with only the census tracts where the median household income is more than $80,000. We can extract elements from an sf object based on attributes using our preferred method of filtering data frames.

# subset the familiar way
nyc_sf_rich <- nyc_sf[nyc_sf$medianinco > 80000, ]    
# Make a plot using all tracts, then add the "rich" in red
plot(st_geometry(nyc_sf))
plot(st_geometry(nyc_sf_rich), add=T, col="red")

tidyverse and piping work as well!

plot(nyc_sf %>% st_geometry(), col = 'NA', border='grey', lwd=0.5)

nyc_sf %>% 
  filter(medianinco > 80000) %>% 
  st_geometry() %>% 
  plot(col="red", add=T)

1.2.3 Simple interactive mapping with mapview

One of the most useful, yet simplest, methods of visualizing an sf object or other spatial data is to utilize the mapview package for interactive mapping. mapview utilizes the leaflet package in R and the underlying JavaScript Leaflet library for web mapping. While we will learn the details of leaflet later, here is a quick application of the mapview package.

# We use all the default options, but select a few columns. 
# If we use all columns, the popup window will be too big. 
nyc_sf %>% dplyr::select(boroname, poptot, female, doctorate) %>%
  mapview::mapview()

Further configuration could be applied.

nyc_sf %>% dplyr::select(boroname, poptot, female, doctorate) -> tmp_sf

mapview(tmp_sf, zcol='boroname', layer.name = 'Borough', stroke=FALSE) + 
  mapview(tmp_sf, zcol='poptot', 
          layer.name = 'Total Population', 
          homebutton=FALSE, 
          legend=FALSE,
          hide = TRUE) +
  mapview(tmp_sf, zcol="doctorate", 
          layer.name = 'Population with Doctorate', 
          homebutton=FALSE, 
          legend=FALSE,
          hide=TRUE,
          label="doctorate",
          popup = 'doctorate')

Another way of combining multiple mapview maps is to save a map to a variable. Then we can keep adding new layers to this variable.

# This variable has one layer
tmpMapObj_1 <- mapView(tmp_sf, zcol='boroname', layer.name = 'Borough', stroke=FALSE)

# This is another layer
tmpMapObj_2 <- mapView(x = tmp_sf, zcol='poptot', 
          layer.name = 'Total Population', 
          homebutton=FALSE, 
          legend=FALSE,
          hide = TRUE) 

# We can add a new layer to an existing layer
tmpMapObj_3 <- tmpMapObj_1 + mapView(x = tmp_sf, zcol="doctorate", 
          layer.name = 'Population with Doctorate', 
          homebutton=FALSE, 
          legend=FALSE,
          hide=TRUE,
          label="doctorate",
          popup = 'doctorate')


# We can choose to show any of the map objects by themselves or combine them with "+"

# This shows two layers
# tmpMapObj_3

# This shows three layers
tmpMapObj_2 + tmpMapObj_3

mapview is particularly suitable for quick and simple interactive mapping. However, the cost of convenience and simplicity is the loss of control over many details. With leaflet, by contrast, we can calibrate many details of the interactive maps.

1.3 Creating a spatial object from a lat/lon table

Increasingly, we are using GPS or GPS-compatible devices to collect spatial data, or geographically referenced data, in the field. Such data are often organized in a spreadsheet that contains latitude, longitude, and some attribute values. We have learned many different ways of reading such spreadsheets into a data frame. We can then very easily convert the table into a spatial object in R.

1.3.1 With sf

An sf object can be created from a data frame in the following way. We take advantage of the st_as_sf() function which converts any foreign object into an sf object. Similarly to above, it requires an argument coords, which in the case of point data needs to be a vector that specifies the dataframe’s columns for the longitude and latitude (x,y) coordinates.

`my_sf_object <- st_as_sf(myDataframe, coords)`

st_as_sf() creates a new object and leaves the original data frame untouched.
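
For a tiny, self-contained illustration (the toy_df data frame, its column names, and its coordinate values are made up for demonstration):

toy_df <- data.frame(name = c("A", "B"),
                     lon  = c(-73.99, -73.95),
                     lat  = c(40.73, 40.81))
toy_sf <- st_as_sf(toy_df, coords = c("lon", "lat"), crs = 4326)
toy_sf   # two POINT features; the lon/lat columns have become the geometry column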

We use read_csv() in the readr package to read manhattan_noise.csv into a data frame in R and name it manhattan_noise_df. Alternatively, we can use read.csv() in base R.

# read_csv has better capabilities for data types like datetime.
manhattan_noise_df <- read_csv("data/nyc/manhattan_noise.csv", 
                               show_col_types = FALSE, 
                               lazy = FALSE) %>%
  dplyr::mutate(datetime = lubridate::force_tz(datetime, 'America/New_York'))

#manhattan_noise_df <- read.csv("data/nyc/manhattan_noise.csv")
str(manhattan_noise_df)
#> tibble [68,582 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ datetime              : POSIXct[1:68582], format: "2020-07-25 15:02:08" "2020-07-25 02:58:27" ...
#>  $ descriptor            : chr [1:68582] "Loud Music/Party" "Loud Music/Party" "Loud Music/Party" "Loud Music/Party" ...
#>  $ incident_zip          : num [1:68582] 10040 10031 10034 10034 10003 ...
#>  $ incident_address      : chr [1:68582] "281 WADSWORTH AVENUE" "600 WEST  142 STREET" "33 INDIAN ROAD" "65 VERMILYEA AVENUE" ...
#>  $ open_data_channel_type: chr [1:68582] "MOBILE" "ONLINE" "ONLINE" "ONLINE" ...
#>  $ latitude              : num [1:68582] 40.9 40.8 40.9 40.9 40.7 ...
#>  $ longitude             : num [1:68582] -73.9 -74 -73.9 -73.9 -74 ...

In the real world, things are not always so smooth, as many data providers do not have appropriate knowledge of geography and GPS. Quite often, the coordinate columns are not ready for direct conversion. Common problems include:

  • Coordinates are in degree-minute-second format instead of decimal degrees.
  • The +/- signs in coordinates are replaced by N(orth), S(outh), W(est), E(ast).
  • Longitude and latitude are switched or labeled the wrong way.
  • Negative signs are missing from coordinates where they should be present.
  • Coordinates are mixed with text values.

With R's general data processing power, these issues can be fixed relatively easily, as sketched below.
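
As one illustration, here is a minimal sketch for converting degree-minute-second text with hemisphere letters into signed decimal degrees (the helper function and the input strings are made up for demonstration; real data may need more robust parsing):

dms_to_dd <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")      # split "degrees minutes seconds hemisphere"
  sapply(parts, function(p) {
    dd <- as.numeric(p[1]) + as.numeric(p[2]) / 60 + as.numeric(p[3]) / 3600
    if (p[4] %in% c("S", "W")) dd <- -dd    # southern and western hemispheres are negative
    dd
  })
}
dms_to_dd(c("40 44 54.3 N", "73 59 8.3 W"))
#> [1]  40.74842 -73.98564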

We now convert the manhattan_noise_df data frame into an sf object with st_as_sf():

# Note the two columns for coords must exist in the dataframe.
# Put another way, you must choose two columns from the dataframe for coords.
manhattan_noise_sf <- st_as_sf(manhattan_noise_df, 
                               coords = c("longitude", "latitude"))
str(manhattan_noise_sf)
#> sf [68,582 × 6] (S3: sf/tbl_df/tbl/data.frame)
#>  $ datetime              : POSIXct[1:68582], format: "2020-07-25 15:02:08" "2020-07-25 02:58:27" ...
#>  $ descriptor            : chr [1:68582] "Loud Music/Party" "Loud Music/Party" "Loud Music/Party" "Loud Music/Party" ...
#>  $ incident_zip          : num [1:68582] 10040 10031 10034 10034 10003 ...
#>  $ incident_address      : chr [1:68582] "281 WADSWORTH AVENUE" "600 WEST  142 STREET" "33 INDIAN ROAD" "65 VERMILYEA AVENUE" ...
#>  $ open_data_channel_type: chr [1:68582] "MOBILE" "ONLINE" "ONLINE" "ONLINE" ...
#>  $ geometry              :sfc_POINT of length 68582; first list element:  'XY' num [1:2] -73.9 40.9
#>  - attr(*, "sf_column")= chr "geometry"
#>  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
#>   ..- attr(*, "names")= chr [1:5] "datetime" "descriptor" "incident_zip" "incident_address" ...

Note the additional geometry list-column which now holds the simple feature collection with the coordinates of all the points.

To make it a complete geographical object, we assign the WGS84 coordinate reference system, which has the EPSG code 4326:

st_crs(manhattan_noise_sf)
#> Coordinate Reference System: NA
st_crs(manhattan_noise_sf) <- 4326 # we can use EPSG as numeric here
# Alternatively, we can use the sf::st_set_crs method
# st_set_crs(manhattan_noise_sf, 4326)
st_crs(manhattan_noise_sf)
#> Coordinate Reference System:
#>   User input: EPSG:4326 
#>   wkt:
#> GEOGCRS["WGS 84",
#>     ENSEMBLE["World Geodetic System 1984 ensemble",
#>         MEMBER["World Geodetic System 1984 (Transit)"],
#>         MEMBER["World Geodetic System 1984 (G730)"],
#>         MEMBER["World Geodetic System 1984 (G873)"],
#>         MEMBER["World Geodetic System 1984 (G1150)"],
#>         MEMBER["World Geodetic System 1984 (G1674)"],
#>         MEMBER["World Geodetic System 1984 (G1762)"],
#>         MEMBER["World Geodetic System 1984 (G2139)"],
#>         MEMBER["World Geodetic System 1984 (G2296)"],
#>         ELLIPSOID["WGS 84",6378137,298.257223563,
#>             LENGTHUNIT["metre",1]],
#>         ENSEMBLEACCURACY[2.0]],
#>     PRIMEM["Greenwich",0,
#>         ANGLEUNIT["degree",0.0174532925199433]],
#>     CS[ellipsoidal,2],
#>         AXIS["geodetic latitude (Lat)",north,
#>             ORDER[1],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>         AXIS["geodetic longitude (Lon)",east,
#>             ORDER[2],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>     USAGE[
#>         SCOPE["Horizontal component of 3D system."],
#>         AREA["World."],
#>         BBOX[-90,-180,90,180]],
#>     ID["EPSG",4326]]
  • Geographic coordinates from GPS always use the WGS84 reference system, which has an EPSG code of 4326.
  • EPSG stands for European Petroleum Survey Group. Map projections are too complicated for many engineers in the petroleum industry, so they started using numeric codes, SRIDs, to denote spatial reference systems (SRS). Such simplicity has also been favored by software engineers, who use EPSG codes extensively. By contrast, GIS professionals and users are supposed to have good knowledge of map projections and reference systems; as a result, classic GIS software like ArcGIS only recently started using EPSG codes.
  • There are many websites that allow us to query EPSG codes. The best one, in my opinion, is Spatial Reference, which provides the projection and spatial reference information in many different formats and for many applications.
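
We can also resolve an EPSG code directly in R with st_crs(), which is a convenient way to double-check a code found online (a small sketch using EPSG:2263, one of the NYC State Plane codes):

st_crs(2263)              # prints the full WKT definition (NAD83 / New York Long Island, ftUS)
st_crs(2263)$proj4string  # the legacy proj4 string for the same CRS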

Now, the sf object looks as follows. Note how the geodetic CRS is now filled in.

manhattan_noise_sf
#> Simple feature collection with 68582 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -74.018 ymin: 40.69889 xmax: -73.90809 ymax: 40.87783
#> Geodetic CRS:  WGS 84
#> # A tibble: 68,582 × 6
#>    datetime            descriptor       incident_zip incident_address    
#>  * <dttm>              <chr>                   <dbl> <chr>               
#>  1 2020-07-25 15:02:08 Loud Music/Party        10040 281 WADSWORTH AVENUE
#>  2 2020-07-25 02:58:27 Loud Music/Party        10031 600 WEST  142 STREET
#>  3 2020-07-25 21:35:59 Loud Music/Party        10034 33 INDIAN ROAD      
#>  4 2020-07-26 01:33:28 Loud Music/Party        10034 65 VERMILYEA AVENUE 
#>  5 2020-07-25 02:07:27 Loud Music/Party        10003 317 2 AVENUE        
#>  6 2020-07-25 20:34:37 Loud Talking            10009 131 AVENUE B        
#>  7 2020-07-26 00:30:54 Loud Talking            10026 358 WEST  118 STREET
#>  8 2020-07-26 01:36:19 Loud Music/Party        10032 502 WEST  170 STREET
#>  9 2020-07-25 23:21:11 Loud Music/Party        10033 600 WEST  183 STREET
#> 10 2020-07-25 21:12:56 Loud Music/Party        10016 134 EAST   28 STREET
#> # ℹ 68,572 more rows
#> # ℹ 2 more variables: open_data_channel_type <chr>, geometry <POINT [°]>

Here is a quick, simple plot of the data to verify everything is good. First, we select a subset (one month, and then a single day) from the dataset to simplify the plot. As sf is tidyverse compatible, we can use dplyr::filter to do that, and the lubridate package in the tidyverse helps us process the datetime data. Then, we use ggplot and geom_sf to plot the data; later, we plot it again on top of a free basemap. If the dataset is wrong, particularly its coordinate reference system, the geometries will be plotted way off or will disappear completely. We will learn more about mapping in future sessions.

ggplot(data = manhattan_noise_sf  %>% 
         dplyr::filter(month(datetime) == 4 & year(datetime) > 2019) ) +
  geom_sf(aes(color=descriptor)) +
  coord_sf(xlim = c(-74.06, -73.85), default_crs = sf::st_crs(4326) )

man_day_noise_sf <- manhattan_noise_sf  %>% 
         dplyr::filter(day(datetime) == 19 & month(datetime) == 4 & year(datetime) > 2019); 

ggplot(data = man_day_noise_sf) +
  geom_sf(aes(color=descriptor)) +
  coord_sf(xlim =  c(-74.06, -73.85), crs = sf::st_crs(2831), default_crs = sf::st_crs(4326)) +
  labs(title = "Residential Noise 311 Calls in Manhattan", subtitle = 'April 19, 2020')

Now, we use a base map to plot it again.

The ggmap package functions for basemap retrieval now require API keys, so we will not run the code or display the results here. It is more convenient to use mapview, which shows the OpenStreetMap basemap by default.

# We can get a simple basemap. Adjust zoom with the map extent. 
# Higher zoom means more details but needs more time to load.
# Start from small and increase to an appropriate level. Will learn more on this.

# In most cases we can use the extent of the data to retrieve the basemap

manhattan_noise_sf %>% st_bbox() %>% as.vector() %>%
  ggmap::get_stamenmap(zoom = 11, messaging = FALSE) -> baseMap;

# This is the ggplot style plotting. Again, will come back on this for more.
ggmap(baseMap) +
  geom_point(aes(x=X, y=Y), 
             data = man_day_noise_sf %>% st_coordinates() %>% tibble::as_tibble(),
             color = 'brown',
             size = 1,
             alpha = .5
             )
 
# but for the long shape of Manhattan, it might be better to manually choose the basemap
baseMapExt <- c(-74.06, 40.691, -73.85, 40.879)

ggmap::get_stamenmap(baseMapExt, zoom = 11, messaging = FALSE) -> baseMap;

ggmap(baseMap) +
  geom_point(aes(x=X, y=Y), 
             data = man_day_noise_sf %>% st_coordinates() %>% tibble::as_tibble(),
             color = 'brown',
             size = 1,
             alpha = .5
             ) + 
  labs(title = "Residential Noise 311 Calls in Manhattan", subtitle = 'April 19, 2020')

Of course, mapview comes in handy for creating simple interactive maps quickly.

mapview(man_day_noise_sf, zcol='descriptor', layer.name='Manhattan Noise')

1.3.2 sp and sf conversion

While sf is the new standard, many classic packages are still using sp. When needed, we can convert sf objects to corresponding sp objects or vice versa. Using sf::st_as_sf and sf::as_Spatial, sp objects and sf objects can be converted to each other.

# convert sf to sp
class(manhattan_noise_sf)
manhattan_noise_sp_ <- sf::as_Spatial(manhattan_noise_sf)
class(manhattan_noise_sp_)

# Convert sp to sf
manhattan_noise_sf_ <- sf::st_as_sf(manhattan_noise_sp_) 
class(manhattan_noise_sf_)

1.3.3 Save sf objects

We will save this object as a Shapefile on our hard drive for later use. (Note that by default st_write checks if the file already exists, and if so it will not overwrite it. If you need to force it to overwrite use the option delete_layer = TRUE.)

# Save the full dataset as a Shapefile (st_write will not overwrite an existing file by default)
st_write(manhattan_noise_sf, "data/nyc", "ManhattanNoise", driver = "ESRI Shapefile")
# to force the save: 
st_write(manhattan_noise_sf, "data/nyc/ManhattanNoise.shp", delete_layer = TRUE)

# Save the one-day subset as a Shapefile as well
st_write(man_day_noise_sf, './data/nyc/manhattan_day_noise.shp')

The Shapefile is an outdated file format with poor support for, for example, very large integers and datetime data types. To save the data completely in R format, we can also save them directly as R data files; of course, these files can only be used by R. Alternatively, we can use a database-compatible format like GeoPackage, which is an OGC standard and increasingly popular, particularly with QGIS.

# Save data to RData file
save(manhattan_noise_sf, man_day_noise_sf, 
     file = './data/nyc/manhattan_noise_sf.RData')

# To get the data back into R
# load(file='./data/nyc/manhattan_noise_sf.RData')

# Save data to a GeoPackage file/database
st_write(manhattan_noise_sf, 
         dsn = './data/nyc/man_data.gpkg', 
         layer='manhattan_noise',
         delete_layer = TRUE)

st_write(man_day_noise_sf,          
         dsn = './data/nyc/man_data.gpkg', 
         layer='manhattan_noise_one_day',
         delete_layer = TRUE)

# Read in from geopackage
man_one_day_noise <- st_read(dsn = './data/nyc/man_data.gpkg', 
        layer='manhattan_noise_one_day')

1.4 Reprojecting or Projection Transformation

So far, all our data use geographic coordinates in degrees. Sometimes, we may have to change the coordinates into a new Coordinate Reference System (CRS). For example, suppose we retrieved census tract data for the entire country, which normally use geographic coordinates. For a local application in NYC, we need to reproject the data to a CRS that minimizes distortion locally. With a local map projection, we can more accurately measure distances, areas, and angles.

Functions to transform, or reproject, spatial objects typically take the following two arguments:

  • the spatial object to transform
  • a CRS object with the new projection definition

You can transform

  • an sf object with st_transform()
  • a Spatial* object with spTransform()
  • a raster object with projectRaster()

Perhaps the trickiest part here is to determine the definition of the projection, which needs to be a character string in proj4 format or a numeric SRID (spatial reference ID). You can look it up online. For example, for UTM zone 33N (EPSG:32633), the SRID is 32633 and the proj4 string would be:

+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

You can retrieve the CRS:

  • from an sf object with st_crs()
  • from an existing Spatial* object with proj4string()
  • from a raster object with crs()

1.4.1 Transform or reproject sf objects

Let us check the CRS of the man_day_noise_sf object and reproject it to a local projection, State Plane Long Island, which is the map projection commonly used for NYC.

st_crs(man_day_noise_sf)
#> Coordinate Reference System:
#>   User input: EPSG:4326 
#>   wkt:
#> GEOGCRS["WGS 84",
#>     ENSEMBLE["World Geodetic System 1984 ensemble",
#>         MEMBER["World Geodetic System 1984 (Transit)"],
#>         MEMBER["World Geodetic System 1984 (G730)"],
#>         MEMBER["World Geodetic System 1984 (G873)"],
#>         MEMBER["World Geodetic System 1984 (G1150)"],
#>         MEMBER["World Geodetic System 1984 (G1674)"],
#>         MEMBER["World Geodetic System 1984 (G1762)"],
#>         MEMBER["World Geodetic System 1984 (G2139)"],
#>         MEMBER["World Geodetic System 1984 (G2296)"],
#>         ELLIPSOID["WGS 84",6378137,298.257223563,
#>             LENGTHUNIT["metre",1]],
#>         ENSEMBLEACCURACY[2.0]],
#>     PRIMEM["Greenwich",0,
#>         ANGLEUNIT["degree",0.0174532925199433]],
#>     CS[ellipsoidal,2],
#>         AXIS["geodetic latitude (Lat)",north,
#>             ORDER[1],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>         AXIS["geodetic longitude (Lon)",east,
#>             ORDER[2],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>     USAGE[
#>         SCOPE["Horizontal component of 3D system."],
#>         AREA["World."],
#>         BBOX[-90,-180,90,180]],
#>     ID["EPSG",4326]]
# then we transform/reproject it to SPCS Long Island, 2831
man_day_noise_2831 <- st_transform(man_day_noise_sf, 2831)

st_crs(man_day_noise_2831)
#> Coordinate Reference System:
#>   User input: EPSG:2831 
#>   wkt:
#> PROJCRS["NAD83(HARN) / New York Long Island",
#>     BASEGEOGCRS["NAD83(HARN)",
#>         DATUM["NAD83 (High Accuracy Reference Network)",
#>             ELLIPSOID["GRS 1980",6378137,298.257222101,
#>                 LENGTHUNIT["metre",1]]],
#>         PRIMEM["Greenwich",0,
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>         ID["EPSG",4152]],
#>     CONVERSION["SPCS83 New York Long Island zone (meter)",
#>         METHOD["Lambert Conic Conformal (2SP)",
#>             ID["EPSG",9802]],
#>         PARAMETER["Latitude of false origin",40.1666666666667,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8821]],
#>         PARAMETER["Longitude of false origin",-74,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8822]],
#>         PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8823]],
#>         PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8824]],
#>         PARAMETER["Easting at false origin",300000,
#>             LENGTHUNIT["metre",1],
#>             ID["EPSG",8826]],
#>         PARAMETER["Northing at false origin",0,
#>             LENGTHUNIT["metre",1],
#>             ID["EPSG",8827]]],
#>     CS[Cartesian,2],
#>         AXIS["easting (X)",east,
#>             ORDER[1],
#>             LENGTHUNIT["metre",1]],
#>         AXIS["northing (Y)",north,
#>             ORDER[2],
#>             LENGTHUNIT["metre",1]],
#>     USAGE[
#>         SCOPE["Engineering survey, topographic mapping."],
#>         AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
#>         BBOX[40.47,-74.26,41.3,-71.8]],
#>     ID["EPSG",2831]]

We see that the CRSs of the two objects are different. One is a geographic coordinate system (longitude, latitude) based on WGS84 with a unit of decimal degrees, and the other is New York State Plane Long Island with a unit of meters.

We can also see the proj4 strings of the two.

st_crs(man_day_noise_sf)$proj4string
st_crs(man_day_noise_2831)$proj4string

Note that we have +proj=longlat... for the geographic data and +proj=lcc... for the projected data. LCC refers to Lambert Conic Conformal, a projected coordinate system with linear (meter) units.

We can use the st_bbox() method from the sf package to compare the coordinates before and after reprojection and confirm that we have actually transformed them. st_bbox() returns the min and max values of the two dimensions of an sf spatial object.

sf::st_bbox(man_day_noise_sf) # bounding box
#>      xmin      ymin      xmax      ymax 
#> -74.01208  40.70558 -73.90975  40.87673
st_bbox(man_day_noise_2831)
#>      xmin      ymin      xmax      ymax 
#> 298979.03  59842.73 307607.32  78853.14

We can also visually compare the two map projections. While the differences are a little bit hard to see, the locally projected map does stretch more along the horizontal direction.

#par(mfrow=c(1,2))
plot(man_day_noise_sf %>% st_geometry(),pch=19,col='blue', axes=TRUE,xaxs='r', graticule=st_crs(4326))

plot(man_day_noise_2831 %>% st_geometry(),pch=19,col='red', axes=TRUE,xaxs='i', graticule=st_crs(2831))

We will see later that using a local map projection with a unit of meter is necessary for spatial operations like buffer or other distance-based analyses.
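
As a small preview (a hedged sketch, not part of the analyses yet), a 500-meter buffer can be computed directly on the reprojected points because their coordinates are now in meters; the 500 m distance is arbitrary:

noise_buf <- st_buffer(man_day_noise_2831, dist = 500)   # 500-meter buffers around each point
plot(st_geometry(noise_buf), border = 'grey')
plot(st_geometry(man_day_noise_2831), pch = 19, cex = 0.3, col = 'red', add = TRUE)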

Set CRS and Reproject (transform)

  • Assigning a new CRS to a spatial object does not change its coordinates, but the CRS determines how the software interprets the coordinates. So, we must choose and set a CRS that is consistent with the coordinates; otherwise, the coordinates will be misinterpreted.
  • Reprojecting spatial data with st_transform() (or spTransform() for sp objects) really changes the underlying coordinates, which we can see from the bbox values and from the short sketch after this list.
  • In either case, the CRS must be consistent with the coordinates. It is a good practice to check projected spatial data against a basemap or a correctly referenced map to verify its CRS.
  • SpatialReference.org is a very good source for CRS information.
  • Some commonly used reference systems
    • US 48 Contiguous States: Albers Equal Area (ESRI:102003), Lambert Conformal (ESRI:102004)
    • New York State: UTM 18N (EPSG 3725, 3748, 26918)
    • New York City: State Plane Long Island (EPSG 2263, 2831, 3627)
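
Here is the short sketch referred to above, contrasting setting a CRS with transforming to it (st_set_crs() will warn that replacing a CRS does not reproject the data; the bounding boxes tell the difference):

st_bbox(st_set_crs(man_day_noise_sf, 2831))    # coordinates unchanged (still degrees), only relabeled
st_bbox(st_transform(man_day_noise_sf, 2831))  # coordinates actually recomputed into meters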

1.5 Raster data in R

Raster files, as you might know, have a much more compact data structure than vectors. Because of their regular structure the coordinates do not need to be recorded for each pixel or cell in the rectangular extent. A raster is defined by:

  • a CRS
  • coordinates of its origin
  • a distance or cell size in each direction
  • a dimension or numbers of cells in each direction
  • an array of cell values

Given this structure, coordinates for any cell can be computed and don’t need to be stored.

The raster package2 is a major extension of spatial data classes for accessing large rasters and, in particular, for processing very large files. It includes object classes for RasterLayer, RasterStack, and RasterBrick, functions for converting among these classes, and operators for computations on the raster data. Conversion from sp-type objects into raster-type objects is possible.

If we wanted to create a raster object from scratch, we would do the following:

# specify the RasterLayer with the following parameters:
# - minimum x coordinate (left border)
# - minimum y coordinate (bottom border)
# - maximum x coordinate (right border)
# - maximum y coordinate (top border)
# - resolution (cell size) in each dimension
r <- raster(xmn=-0.5, ymn=-0.5, xmx=4.5, ymx=4.5, resolution=c(1,1))
r
#> class      : RasterLayer 
#> dimensions : 5, 5, 25  (nrow, ncol, ncell)
#> resolution : 1, 1  (x, y)
#> extent     : -0.5, 4.5, -0.5, 4.5  (xmin, xmax, ymin, ymax)
#> crs        : +proj=longlat +datum=WGS84 +no_defs

Note that this raster object has a CRS defined! If the crs argument is missing when creating the Raster object, and the x coordinates are within -360 and 360 and the y coordinates are within -90 and 90, then the WGS84 projection is used by default.

Good to know.
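
Related to the earlier point that cell coordinates are not stored: the raster package can compute them on demand from the origin, resolution, and dimensions. A quick check on the raster r created above (the values in the comment are what the call should return):

xyFromCell(r, c(1, 13, 25))  # cell centers of the first, middle, and last cells: (0, 4), (2, 2), (4, 0)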

To add some values to the cells, we could do the following.

class(r)
#> [1] "RasterLayer"
#> attr(,"package")
#> [1] "raster"
r <- setValues(r, runif(25))
class(r)
#> [1] "RasterLayer"
#> attr(,"package")
#> [1] "raster"
plot(r); points(coordinates(r), pch=3)

(See the rasterVis package for more advanced plotting of Raster* objects.)

RasterLayer objects can also be created from a matrix.

class(volcano)
#> [1] "matrix" "array"
volcano.r <- raster(volcano)
class(volcano.r)
#> [1] "RasterLayer"
#> attr(,"package")
#> [1] "raster"

And to read in a raster file, we can use the raster() function. This raster was generated as part of the NEON Harvard Forest field site.

library(raster)
HARV <- raster("data/HARV_RGB_Ortho.tif")

Typing the name of the object will give us what’s in there:

HARV
#> class      : RasterLayer 
#> band       : 1  (of  3  bands)
#> dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
#> resolution : 0.25, 0.25  (x, y)
#> extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
#> source     : HARV_RGB_Ortho.tif 
#> names      : HARV_RGB_Ortho_1 
#> values     : 0, 255  (min, max)

We can plot it like this:

plot(HARV)

We can find out about the Coordinate Reference System with this:

crs(HARV)
#> Coordinate Reference System:
#> Deprecated Proj.4 representation:
#>  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
#> WKT2 2019 representation:
#> PROJCRS["unknown",
#>     BASEGEOGCRS["unknown",
#>         DATUM["World Geodetic System 1984",
#>             ELLIPSOID["WGS 84",6378137,298.257223563,
#>                 LENGTHUNIT["metre",1]],
#>             ID["EPSG",6326]],
#>         PRIMEM["Greenwich",0,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8901]]],
#>     CONVERSION["UTM zone 18N",
#>         METHOD["Transverse Mercator",
#>             ID["EPSG",9807]],
#>         PARAMETER["Latitude of natural origin",0,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8801]],
#>         PARAMETER["Longitude of natural origin",-75,
#>             ANGLEUNIT["degree",0.0174532925199433],
#>             ID["EPSG",8802]],
#>         PARAMETER["Scale factor at natural origin",0.9996,
#>             SCALEUNIT["unity",1],
#>             ID["EPSG",8805]],
#>         PARAMETER["False easting",500000,
#>             LENGTHUNIT["metre",1],
#>             ID["EPSG",8806]],
#>         PARAMETER["False northing",0,
#>             LENGTHUNIT["metre",1],
#>             ID["EPSG",8807]],
#>         ID["EPSG",16018]],
#>     CS[Cartesian,2],
#>         AXIS["(E)",east,
#>             ORDER[1],
#>             LENGTHUNIT["metre",1,
#>                 ID["EPSG",9001]]],
#>         AXIS["(N)",north,
#>             ORDER[2],
#>             LENGTHUNIT["metre",1,
#>                 ID["EPSG",9001]]]]

See what you can do with such an object:

methods(class=class(HARV))
#>   [1] !                     !=                    $                    
#>   [4] $<-                   %in%                  [                    
#>   [7] [[                    [[<-                  [<-                  
#>  [10] ==                    addLayer              adjacent             
#>  [13] aggregate             all.equal             area                 
#>  [16] Arith                 as.array              as.character         
#>  [19] as.data.frame         as.factor             as.integer           
#>  [22] as.list               as.logical            as.matrix            
#>  [25] as.raster             as.vector             asFactor             
#>  [28] atan2                 bandnr                barplot              
#>  [31] bbox                  blockSize             boundaries           
#>  [34] boxplot               brick                 buffer               
#>  [37] calc                  cellFromRowCol        cellFromRowColCombine
#>  [40] cellFromXY            cellStats             clamp                
#>  [43] click                 clump                 coerce               
#>  [46] colFromCell           colFromX              colSums              
#>  [49] Compare               contour               coordinates          
#>  [52] corLocal              couldBeLonLat         cover                
#>  [55] crop                  crosstab              crs<-                
#>  [58] cut                   cv                    density              
#>  [61] dim                   dim<-                 direction            
#>  [64] disaggregate          distance              extend               
#>  [67] extent                extract               flip                 
#>  [70] focal                 freq                  getValues            
#>  [73] getValuesBlock        getValuesFocal        gridDistance         
#>  [76] hasValues             head                  hist                 
#>  [79] image                 init                  initialize           
#>  [82] inMemory              interpolate           intersect            
#>  [85] is.factor             is.finite             is.infinite          
#>  [88] is.na                 is.nan                isLonLat             
#>  [91] KML                   labels                layerize             
#>  [94] length                levels                levels<-             
#>  [97] lines                 localFun              log                  
#> [100] Logic                 mapView               mask                 
#> [103] match                 Math                  Math2                
#> [106] maxValue              mean                  merge                
#> [109] metadata              minValue              modal                
#> [112] mosaic                names                 names<-              
#> [115] ncell                 ncol                  ncol<-               
#> [118] nlayers               nrow                  nrow<-               
#> [121] origin                origin<-              overlay              
#> [124] persp                 plot                  predict              
#> [127] print                 proj4string           proj4string<-        
#> [130] quantile              raster                rasterize            
#> [133] ratify                readAll               readStart            
#> [136] readStop              reclassify            rectify              
#> [139] res                   res<-                 resample             
#> [142] RGB                   rotate                rowColFromCell       
#> [145] rowFromCell           rowFromY              rowSums              
#> [148] sampleRandom          sampleRegular         sampleStratified     
#> [151] scale                 select                setMinMax            
#> [154] setValues             shift                 show                 
#> [157] spplot                stack                 stackSelect          
#> [160] stretch               subs                  subset               
#> [163] Summary               summary               t                    
#> [166] tail                  terrain               text                 
#> [169] trim                  unique                update               
#> [172] values                values<-              Which                
#> [175] which.max             which.min             wkt                  
#> [178] writeRaster           writeStart            writeStop            
#> [181] writeValues           xFromCell             xFromCol             
#> [184] xmax                  xmax<-                xmin                 
#> [187] xmin<-                xres                  xyFromCell           
#> [190] yFromCell             yFromRow              ymax                 
#> [193] ymax<-                ymin                  ymin<-               
#> [196] yres                  zonal                 zoom                 
#> see '?methods' for accessing help and source code

We can explore the distribution of values contained within our raster using the hist() function which produces a histogram. Histograms are often useful in identifying outliers and bad data values in our raster data.

hist(HARV)
#> Warning in .hist1(x, maxpixels = maxpixels, main = main, plot = plot, ...): 1%
#> of the raster cells were used. 100000 values used.

Notice that a warning message is produced when R creates the histogram.

This warning is caused by the default maximum pixels value of 100,000 associated with the hist function. This maximum value is to ensure processing efficiency as our data become larger! We can force the hist function to use all cell values.

ncell(HARV)
#> [1] 7120141
hist(HARV, maxpixels = ncell(HARV))

At times it may be useful to explore raster metadata before loading a file into R. This can be done with the GDALinfo() function (from the rgdal package):

GDALinfo("path-to-raster-here") 

A raster dataset can contain one or more bands. We can view the number of bands in a raster using the nlayers() function.

nlayers(HARV)
#> [1] 1

We can use the raster() function to import a single band from either a single-band or a multi-band raster. For a multi-band raster, we can specify which band we want to read in.

HARV_Band2 <- raster("data/HARV_RGB_Ortho.tif", band = 2)
plot(HARV_Band2)

To bring in all bands of a multi-band raster, we use the stack() function.

HARV_stack <- stack("data/HARV_RGB_Ortho.tif")

# how many layers?
nlayers(HARV_stack)
#> [1] 3
# view attributes of stack object
HARV_stack
#> class      : RasterStack 
#> dimensions : 2317, 3073, 7120141, 3  (nrow, ncol, ncell, nlayers)
#> resolution : 0.25, 0.25  (x, y)
#> extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
#> names      : HARV_RGB_Ortho_1, HARV_RGB_Ortho_2, HARV_RGB_Ortho_3 
#> min values :                0,                0,                0 
#> max values :              255,              255,              255

What happens when we plot?

plot(HARV_stack)

If we know that it is an RGB multi-band raster, we can plot all bands in one image:

plotRGB(HARV_stack)

1.5.1 RasterStack vs RasterBrick

The R RasterStack and RasterBrick object types can both store multiple bands. However, how they store each band is different. The bands in a RasterStack are stored as links to raster data located somewhere on our computer, while a RasterBrick stores all of the band data within the R object itself. Since all of the bands are stored within the actual object, a RasterBrick's object size is much larger than that of a RasterStack.

In most cases, we can work with a RasterBrick in the same way we might work with a RasterStack. However, a RasterBrick is often more efficient and faster to process - which is important when working with larger files.

We can turn a RasterStack into a RasterBrick in R by using brick(StackName). Use the object.size() function to compare stack and brick R objects.

object.size(HARV_stack)
#> 51096 bytes
HARV_brick <- brick(HARV_stack)
object.size(HARV_brick)
#> 170898520 bytes

Going back to the sp package, a simple grid can be built like this:

# specify the grid topology with the following parameters:
# - the smallest coordinates for each dimension, here: 0,0
# - cell size in each dimension, here: 1,1 
# - number of cells in each dimension, here: 5,5
gtopo <- GridTopology(c(0,0), c(1,1), c(5,5)) # create the grid
datafr <- data.frame(runif(25)) # make up some data
SpGdf <- SpatialGridDataFrame(gtopo, datafr) # create the grid data frame
summary(SpGdf)
#> Object of class SpatialGridDataFrame
#> Coordinates:
#>       min max
#> [1,] -0.5 4.5
#> [2,] -0.5 4.5
#> Is projected: NA 
#> proj4string : [NA]
#> Grid attributes:
#>   cellcentre.offset cellsize cells.dim
#> 1                 0        1         5
#> 2                 0        1         5
#> Data attributes:
#>    runif.25.     
#>  Min.   :0.0982  
#>  1st Qu.:0.4461  
#>  Median :0.6745  
#>  Mean   :0.6100  
#>  3rd Qu.:0.7878  
#>  Max.   :0.9915

1.6 Lab Assignment

For the R-Spatial Section labs, we will do some spatial data visualization on COVID-19 in NYC. More specifically, we will explore the distribution of confirmed cases across the city and their relationships with some demographic variables and essential services related to retail food stores and public health services.

The first lab is rather simple and straightforward but will be the foundation for the next steps.

Tasks for the first lab are:

  1. Set up an R project for the R-Spatial section.
  2. Read the NYC postal areas in Shapefiles into sf objects. As NYC DOH publishes COVID-19 data by zip code, we will utilize the postal area data later.
  3. Read and process the NYS health facilities spreadsheet data. Create sf objects from geographic coordinates.
  4. Read and process the NYS retail food stores data. Create sf objects from geographic coordinates for NYC.
  5. Use a simple mapping method such as mapview with a basemap to verify the above datasets in terms of their geometry locations.
  6. Save the three sf objects in a RData file or in a single GeoPackage file/database.

The assignment and data are available on Blackboard. The data are also available for download at the Dropbox site.


  1. E. Pebesma & R. Bivand (2016) Spatial data in R: simple features and future perspectives↩︎

  2. Note that sp also allows to work with raster structures. The GridTopology class is the key element of raster representations. It contains: (a) the center coordinate pair of the south-west raster cell, (b) the two cell sizes in the metric of the coordinates, giving the step to successive centers, and (c) the numbers of cells for each dimension. There is also a SpatialPixels object which stores grid topology and coordinates of the actual points.↩︎