Lance A. Waller
University of Minnesota
[Editor's Note: The Epidemiology Section is sponsoring an invited session at the 1996 JSM entitled "GIS: New approaches to disease surveillance and analysis" on Tuesday August 6 at 10:30.]
Epidemiology Monitor recently listed the "emerging use of geographic information systems in epidemiologic studies" as one of the top ten most notable developments in epidemiology in 1995 [1]. The purpose of this article is to briefly outline geographic information systems and describe how they impact epidemiologic analysis. Terms common to GIS users but possibly unfamiliar to statisticians and epidemiologists are placed in italics.
A geographic information system (GIS) is a collection of software routines for managing and displaying spatially referenced data. "Spatially referenced data" refers to data containing measured or observed values, termed attributes, associated with specific locations. Examples include annual rainfall at a measuring station, monitored air pollutant levels, or population size in a small region. GIS data management capabilities center around a relational database enabling the linking of multiple attribute values by common locations. For instance, the number of persons residing in a given region can be linked with the number of incident cases of a particular disease reported for the same region. These values can be linked with detailed digitized boundary files enabling the GIS display routines to produce high-quality maps. The recent rise in availability and power of GIS's raises the possibility of their use as tools in epidemiologic research.
A GIS treats data in one of two ways. A raster based GIS digitizes information into a grid of very small pixels, much the same way images and photographs are digitized. Attribute values for each pixel are stored in a large array. Raster systems are common in the handling of remote sensing data, including satellite images. On the other hand, a vector based GIS stores attributes associated with points, lines and areas. Point data include coordinates of location and an attribute value. Line data link segments connecting points with an attribute value and represent features such as streets, rivers, and political boundaries. Area data include polygonal regions and associated attribute values, e.g. state, county or census tract population sizes. Vector data may be more economically stored than raster data. Many of the more popular GIS packages (e.g. ARC/INFO and MapInfo, see below) are vector based, although raster data can be incorporated. Here we limit discussion to vector based GIS's.
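The two data models can be contrasted with a small sketch. The Python below is illustrative only and does not reproduce the internal format of any GIS package; the class names, coordinates, and attribute values are all hypothetical.

```python
from dataclasses import dataclass

# Raster model: one attribute value per grid cell (pixel), held in a 2-D array.
# The values here are made-up land-cover codes.
raster = [
    [0, 0, 1],
    [0, 2, 1],
    [3, 2, 2],
]


# Vector model: geometry (points, lines, areas) plus an attached attribute.
@dataclass
class PointFeature:
    x: float
    y: float
    attribute: float  # e.g. annual rainfall at a measuring station


@dataclass
class LineFeature:
    vertices: list  # ordered (x, y) pairs joined by straight segments
    attribute: str  # e.g. "river" or "street"


@dataclass
class AreaFeature:
    boundary: list  # polygon vertices as (x, y) pairs, in order
    attribute: int  # e.g. census tract population size


def polygon_area(boundary):
    """Area of a simple polygon by the shoelace formula."""
    total = 0.0
    n = len(boundary)
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


station = PointFeature(2.5, 1.0, attribute=812.0)
river = LineFeature([(0, 0), (1, 2), (3, 3)], attribute="river")
tract = AreaFeature([(0, 0), (4, 0), (4, 3), (0, 3)], attribute=5200)

tract_area = polygon_area(tract.boundary)
print(tract_area)
```

The rectangular tract is described by four vertices and one attribute value, whereas a raster covering the same area would store a value for every pixel, which is one reason vector storage can be more economical.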
The internal relational database for a GIS enables one to link attribute values, possibly from different data sources, by associating values from a common point, area or line. The GIS effectively takes many maps of attribute values over the same study area and layers them electronically.
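As a rough illustration of this linking step, the Python sketch below joins a population layer and a case-count layer on a shared region identifier. The region names and numbers are invented, and treating a region missing from the case layer as having zero cases is an assumption made for the example, not a rule of any particular GIS.

```python
# Two hypothetical attribute layers keyed by a common region identifier.
population = {"tract_01": 4200, "tract_02": 1850, "tract_03": 6100}
cases = {"tract_01": 3, "tract_03": 7}  # tract_02 does not appear in this layer


def link_layers(population, cases):
    """Join two attribute layers on region ID, as a relational join would."""
    linked = {}
    for region, persons in population.items():
        # Assumption: a region absent from the case layer reported zero cases.
        count = cases.get(region, 0)
        linked[region] = {
            "population": persons,
            "cases": count,
            "rate_per_1000": 1000.0 * count / persons,
        }
    return linked


linked = link_layers(population, cases)
for region in sorted(linked):
    print(region, linked[region])
```

In practice the join key would be whatever identifier the two sources share (for example a census tract code), and mismatched or missing identifiers deserve a check before any rates are computed.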
In addition, a GIS can use location data to provide summary measures of proximity. One common feature is the ability to create a new area, or buffer, around an existing point, line or area. This tool quickly counts the points of a given type (e.g. residences of cases) that fall within the new area.
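A circular buffer around a single point can be sketched in a few lines of Python. The example assumes planar coordinates in a common unit (say, kilometers) and straight-line distance; the site location, residence coordinates, and radius are hypothetical, and a real GIS would also handle buffers around lines and areas.

```python
import math


def count_in_buffer(site, points, radius):
    """Count points lying within `radius` of `site` (planar coordinates)."""
    sx, sy = site
    return sum(1 for (x, y) in points if math.hypot(x - sx, y - sy) <= radius)


# Hypothetical site and case residences, in kilometers.
site = (0.0, 0.0)
residences = [(0.5, 0.5), (0.9, 0.0), (2.0, 2.0), (-0.3, 0.9), (3.0, -1.0)]

n_inside = count_in_buffer(site, residences, radius=1.0)
print(n_inside)
```

Here three of the five residences fall inside the 1 km buffer.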
Consider the following environmental epidemiology example similar to the study by Stallones et al. [2]. Suppose we are interested in the incidence of congenital malformations with respect to proximity of residence to hazardous waste sites. Many states maintain registries of congenital malformations which may store events as point locations through geocoding (assigning location values based on street addresses). Often, confidentiality requirements limit data to counts from census districts.
Such data define the first data layer for the study area. The decennial census provides a second data layer summarizing demographic information on the population at risk in census districts. Linking the census layer to the outcome layer allows indirect rate standardization for regions. Finally, the locations of hazardous waste sites appear in the National Priority List or are otherwise available from organizations such as the U.S. Environmental Protection Agency. This example includes three distinct data layers, each collected and maintained by a different organization.
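The indirect standardization made possible by linking the census and outcome layers amounts to computing an expected count for each district from reference rates and the district's population structure, then comparing observed with expected. The Python below is a minimal sketch under invented numbers; the stratum labels, reference rates, populations, and observed counts are all hypothetical.

```python
# Hypothetical reference (standard) rates per person, by stratum.
reference_rates = {"stratum_1": 0.001, "stratum_2": 0.002, "stratum_3": 0.010}

# Hypothetical census layer: population at risk by stratum in each district.
district_population = {
    "district_A": {"stratum_1": 1000, "stratum_2": 2000, "stratum_3": 500},
    "district_B": {"stratum_1": 3000, "stratum_2": 1000, "stratum_3": 100},
}

# Hypothetical outcome layer: observed cases in each district.
observed = {"district_A": 15, "district_B": 6}


def expected_cases(pop_by_stratum, rates):
    """Cases expected if the district experienced the reference rates."""
    return sum(pop_by_stratum[s] * rates[s] for s in rates)


def standardized_ratio(obs, pop_by_stratum, rates):
    """Observed-to-expected ratio (indirectly standardized)."""
    return obs / expected_cases(pop_by_stratum, rates)


ratios = {
    d: standardized_ratio(observed[d], district_population[d], reference_rates)
    for d in observed
}
for d in sorted(ratios):
    print(d, round(ratios[d], 3))
```

With these numbers district_A has 10 expected cases and a ratio of 1.5, while district_B has 6 expected cases and a ratio of 1.0. Ratios based on small expected counts are unstable, which bears on the data quality concerns discussed later.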
For epidemiologic applications, GIS offers the promise of linking outcome, exposure, and confounding data by location. In addition, existing publicly available data may be included easily. For instance, registry data on incident outcomes may be linked with census data providing a mechanism for indirect age-adjustment of rates. Similarly, monitored exposure values may be linked with meteorologic data to model dispersion of a suspected environmental risk factor.
The ability of GIS to quickly link existing data on possible confounding factors suggests that GIS can act as a helpful tool for proactive public health surveillance [3]. Often data regarding confounders would be difficult and expensive to collect, but may be available through the census. A GIS serves as a tool to efficiently bring together many pieces of information relevant in monitoring for trends in incidence.
At the same time, one must carefully explore the weaknesses of GIS applications in epidemiologic research. Many such issues are outlined in Marbury [4] and we briefly review them here. First, most GIS-based epidemiologic studies will be ecologic in nature. The potential biases associated with attempting to infer individual-level effects from group-level data are well-known as the "ecologic fallacy". Recent statistical research [5] explores the limits of ecologic studies, and the analyst must understand the assumptions implicit in such analyses. A related geographic issue is the modifiable areal unit problem where analyses on regional data yield differing results, depending on how the regions are defined. Monmonier [6, p.142] illustrates this issue using John Snow's cholera data. The spatial scale or level of aggregation will also impact the analysis of regional data.
Second, using GIS to link past exposures to chronic outcomes with potentially long latent periods goes beyond simple spatial proximity. Such studies require detailed past exposure and location data to account for mobility and other time-varying effects. Such data may not be directly available in census data.
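Returning to the modifiable areal unit problem mentioned above, a toy calculation shows how the same underlying data can look quite different under two zonings. The Python sketch below uses four hypothetical cells along a line, with invented populations and case counts; it is not drawn from Snow's data or any real study.

```python
# Four hypothetical small cells along a line, each with population and cases.
cell_population = [100, 100, 100, 100]
cell_cases = [0, 5, 5, 0]


def regional_rates(zoning, cases, population):
    """Aggregate cells into regions and return the rate in each region."""
    rates = []
    for cells in zoning:
        total_cases = sum(cases[i] for i in cells)
        total_pop = sum(population[i] for i in cells)
        rates.append(total_cases / total_pop)
    return rates


# Zoning A: two regions, each pairing an unaffected cell with an affected one.
zoning_a = [(0, 1), (2, 3)]
# Zoning B: three regions, with the two affected cells grouped together.
zoning_b = [(0,), (1, 2), (3,)]

rates_a = regional_rates(zoning_a, cell_cases, cell_population)
rates_b = regional_rates(zoning_b, cell_cases, cell_population)
print(rates_a)
print(rates_b)
```

Under zoning A both regions show the same rate (0.025) and the map looks uniform; under zoning B the middle region shows a rate of 0.05 flanked by regions with no cases, suggesting a cluster. The cell-level data are identical in both cases.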
Another complicating issue is the availability of reliable exposure data. It is tempting to use the proximity measures given by a GIS as a surrogate for exposure, but analysts must realize that this is, at best, only an approximation of ambient exposure values. The use of monitored values improves the accuracy of ambient exposure models. However, the relationship between ambient exposure and personal exposure is multifaceted and complex. The largest predictors of benzene exposure, for example, involve smoking history and the amount of time spent in a car [7]. Ambient benzene measures are a far less potent predictor than either of these non-spatial behavioral issues.
Finally, the issue of data quality must be addressed. A GIS provides a convenient means to link spatially-referenced data from different sources. However, the accuracy of inference drawn from such data depends critically on the accuracy of each component. One can achieve a false sense of accuracy when a GIS layers a boundary map with minute detail and regional disease rates based on a very limited sample. The viewer of the map sees the detail in a river boundary and can easily suppose that a similar high level of accuracy is associated with the estimated disease rates defining the shades in a choropleth map. Currently, GIS's do not provide maps of uncertainties associated with attribute values, although MacEachren [8] outlines some proposed methods. In short, as Waller [9] points out, a good map of bad data appears more "accurate" than a bad map of good data (see Rejeski [10] for a more detailed discussion). Frisch, Shaw and Harris [11] and Twigg [12] survey the accuracy of currently available health data for GIS's and find some areas for concern.
Often the above limitations are addressed by suggesting that GIS-based studies are merely hypothesis-generating, and intended to provide guidance on allocating limited resources for full-scale epidemiologic studies. Researchers must be careful to present the results of such studies in the proper context. As pointed out by Marbury [4, p.89]: "Labeling a study as hypothesis-generating does not give investigators license to use poor quality data and poor research designs."
The GIS ARC/INFO from Environmental Systems Research Institute (ESRI) plays a role in the GIS world similar to the one SAS plays in the world of statistical packages. That is, ARC/INFO is a comprehensive (large) system with extensive documentation and a rather steep learning curve. ARC/INFO programmers are often hired by those wishing to make the most of the system. The file format for ARC/INFO data is also becoming a standard in that few new GIS's will succeed unless they can access ARC/INFO formatted data (similar in spirit to the role the SAS data set plays in statistical computing).
While ARC/INFO is comprehensive in its data manipulation and display routines, it offers little in the way of statistical analysis beyond proximity calculations and simple summary statistics. StatSci has recently released S-Plus for ARC/INFO to address this issue [13]. S-Plus for ARC/INFO provides a data "bridge" between the statistics package S-Plus and ARC/INFO whereby S-Plus data frames are recognized by ARC/INFO and ARC/INFO data are decoded by S-Plus. Such bridges link established GIS packages with established statistical routines without either package attempting to emulate the strengths of the other to compensate for its own weaknesses.
A slightly different approach is taken by SAS/GIS currently under development by the SAS Institute. Here vector GIS capabilities are being built into an add-on to SAS. Attribute data will be stored in SAS data sets, but GIS-type graphical and logical selection will be possible. In addition, GIS-type displays will map the data. SAS/GIS may provide the first integrated GIS and statistical analysis package.
ESRI also offers ArcView, a menu-driven, pared-down version of ARC/INFO. While ArcView will not do all that ARC/INFO can, ArcView still does many, many things. ArcView will read, layer, modify and display ARC/INFO data. ArcView 1.0 is available for free from ESRI's World Wide Web pages (http://www.esri.com). Although version 3.0 of ArcView has many more features, ArcView 1.0 can accomplish a lot.
MapInfo (http://www.mapinfo.com) is a menu-driven GIS similar to ArcView. In fact, MapInfo and ArcView are close competitors addressing a similar niche among GIS users. Most of my personal GIS experience is with MapInfo. I find MapInfo fairly straightforward to use with enough flexibility for most applications and excellent output. As with most GIS's, the statistical analysis routines are limited to simple summary statistics and any modeling or spatial statistical analyses must be done on data transferred from the GIS.
Another GIS worth mention is the Geographic Resource Analysis Support System (GRASS), a public domain "open GIS" developed by the U.S. Army. GRASS has a wide user base and updates include code contributed by users. The Army will no longer continue development of GRASS, and its future maintenance and support are somewhat in question. The Unix version of GRASS 4.1 is freely available through the Internet (http://www.cecer.army.mil/grass/GRASS.main.html). A commercial version, GRASSLAND, is available for Windows 95 and Windows NT.
The above is a very limited summary of available GIS's. For a more complete list (with summary reviews) see Oliver Weatherbee's GIS listing at http://triton.cms.udel.edu/~oliver/gis_gip/gis_gip_list.html.
GIS implementations develop rapidly and many applications appear in the proceedings of national and international GIS meetings. The University of Maine maintains electronic access to 1995 proceedings from many of these conferences through the World Wide Web site "Spatial Odyssey" (http://www.odyssey.maine.edu/gisweb/). There are an increasing number of health care and epidemiologic applications appearing in the proceedings.
In addition to the above information, the Centers for Disease Control and Prevention and the Agency for Toxic Substances and Disease Registry (CDC/ATSDR) maintain a GIS users group. The primary purpose of the group is to promote GIS and public health science among CDC/ATSDR staff, as well as public health professionals in state and local governments, and academe. CDC/ATSDR GIS users receive the free bimonthly electronic newsletter, GIS News and Information, which contains information on GIS projects, user correspondence, technical assistance and consultation, GIS public health training opportunities, and other GIS related topics. To subscribe, please notify the editor, Charles Croner, PhD, Geographer and Survey Statistician, National Center for Health Statistics at CMC2@NCH09A.EM.CDC.GOV.
Maps of disease and exposure have been part of epidemiology at least as long as John Snow's famous map of cholera and water pumps from 1854 London. Modern software packages bring vast amounts of data together and allow relatively quick construction of high quality maps. GIS software provides a powerful tool for linking and visualizing data relevant to public health surveillance. However, appropriate methods for the statistical analysis are lacking in most GIS's. This gap is being addressed through the release of software linking a GIS to a current statistical package (S-Plus for ARC/INFO), or by building GIS capabilities into statistical packages (SAS/GIS).
Much work remains, much of it interdisciplinary, to define appropriate methods for the analysis of GIS data. Statistical work is needed on what to map and how best to interpret models from data with multiple layers of uncertainty. Researchers need new developments in statistical and geographical graphics on how to accurately reflect variability associated with mapped values.
The concerns outlined above illustrate that GIS-based investigations will not replace the well-designed epidemiologic study. Epidemiologic researchers and medical geographers need a healthy respect for limitations of GIS in order to make the best use of this new tool. As statisticians, we should work with medical geographers, epidemiologists, and GIS experts to develop accurate and appropriate methodologies for applying GIS technology to public health issues.