'Scrubbing Source Data': The EPA Response

by EPA Office of Information

Thank you for the opportunity to comment on the article being prepared for publication in the DomPrep Journal by Mr. Jacoby. That article, “Scrubbing Source Data at the Local Level,” raises a number of crucial points about data quality in EPA databases and the vital need for high-quality data for emergency responses and other issues affecting human health and safety. In emergency response situations, minutes count: dispatching responders to incorrect locations with insufficient information about potential environmental hazards increases both response time and the potential for loss of life and property.

EPA’s Facility Registry Service (FRS) provides a comprehensive database of locations of interest for environmental issues, including some facilities that may pose a risk to life and/or property in certain disaster situations. The FRS is not, however, a primary data collection system. It is, instead, a data aggregator and, as such, integrates the data received from a variety of sources, including information reported by industry, as well as information reported and/or collected by state and federal governments. FRS provides a master record for a place of interest, to which are attached the individual source records that contain the data reported from other sources. The source records are typically from a system of record and for legal purposes must remain unchanged. The data contained in these source records is aggregated upward to compile a master record, stored in FRS, that attempts to draw on the best available information contained within the source records.
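The relationship between unchanged source records and a compiled master record can be sketched in a few lines of Python. This is an illustration only, not a description of how FRS is implemented: the `SourceRecord` fields, the `SOURCE_PRIORITY` ranking, and the `build_master_record` function are all hypothetical names chosen for the example, and the rule "take each field from the highest-ranked source that supplies it" is an assumed stand-in for whatever "best available information" means in practice.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ranking of source systems; a lower number is preferred.
SOURCE_PRIORITY = {"state_permit": 0, "federal_program": 1, "industry_report": 2}


@dataclass(frozen=True)  # frozen: a source record cannot be modified after creation
class SourceRecord:
    system: str
    name: Optional[str] = None
    address: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None


def build_master_record(sources):
    """Compile a master record, taking each field from the
    highest-priority source that actually supplies a value."""
    ranked = sorted(sources, key=lambda s: SOURCE_PRIORITY.get(s.system, 99))
    master = {}
    for field_name in ("name", "address", "latitude", "longitude"):
        master[field_name] = next(
            (getattr(s, field_name) for s in ranked
             if getattr(s, field_name) is not None),
            None,
        )
    # Keep a pointer back to every contributing source system.
    master["sources"] = [s.system for s in sources]
    return master


# Made-up example data for one place of interest.
sources = [
    SourceRecord("industry_report", name="Acme Plating", address="PO Box 12"),
    SourceRecord("state_permit", name="Acme Plating Inc.",
                 latitude=40.2, longitude=-76.9),
]
master = build_master_record(sources)
```

In this sketch the master record takes its name and coordinates from the state record and its address from the industry report, while both source records are left exactly as submitted.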

The service also attempts to improve the data quality of the master records in FRS through algorithmic validation and processing – for example, by checking for valid street address/city/state/ZIP code combinations; by comparing latitude/longitude values to given locations; and in various other ways. Validity, however, is not always the same as accuracy. For example, an address that has been provided might be valid but may be that of the corporate office rather than the actual facility location. Additionally, there are many other data challenges, such as incomplete addresses or P.O. Box locations, which by themselves cannot be used to derive a latitude/longitude value.
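The kinds of checks described above might look something like the following Python sketch. It is not the FRS validation logic: the function name `check_record`, the regular expressions, and the bounding-box table are assumptions made for illustration, and the Pennsylvania bounds are rough, padded values rather than an authoritative boundary. A record that passes these checks is only valid in form; as noted above, it may still point to the wrong place.

```python
import re

# Matches common spellings such as "PO Box", "P.O. Box", "P O BOX".
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
# Five-digit ZIP code or ZIP+4.
ZIP = re.compile(r"^\d{5}(-\d{4})?$")

# Approximate, illustrative bounding boxes:
# (min_lat, max_lat, min_lon, max_lon). Not authoritative.
STATE_BOUNDS = {"PA": (39.5, 42.6, -80.7, -74.5)}


def check_record(address, state, zip_code, lat=None, lon=None):
    """Return a list of human-readable problems found in one record."""
    issues = []
    if not address or not address.strip():
        issues.append("address is missing")
    elif PO_BOX.search(address):
        issues.append("P.O. Box cannot be used to derive a facility location")
    if not ZIP.match(zip_code or ""):
        issues.append("ZIP code is not in 5-digit or ZIP+4 format")
    bounds = STATE_BOUNDS.get(state)
    if lat is not None and lon is not None and bounds:
        min_lat, max_lat, min_lon, max_lon = bounds
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            issues.append("coordinates fall outside the stated state")
    return issues


# Made-up records: one clean, one with a P.O. Box, one with two problems.
clean = check_record("100 Main St", "PA", "17101", 40.27, -76.88)
po_box = check_record("P.O. Box 12", "PA", "17101")
bad = check_record("100 Main St", "PA", "1710", 34.0, -118.2)
```

Returning a list of messages rather than a single pass/fail flag makes it possible to report every problem with a record at once, which matters when the person who can fix the data is not the person running the check.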

The FRS team also performs some data curation whereby incomplete, invalid, and/or unresolvable or ambiguous locations are researched and the master record for such data is corrected. However, in many instances the FRS stewards do not possess the local knowledge needed to make fully and properly informed decisions about certain locations. Additionally, the sheer volume of records held in the FRS presents a significant stewardship challenge in and of itself.

In terms of technical approaches, a more ideal data stewardship paradigm would shift data validation and correction closer to its source – for instance, by providing instant feedback if invalid, incomplete, or ambiguous data is entered and/or, for example, by providing an aerial photograph for visual confirmation of the geographic location entered. This process would increase the likelihood of corrections being made by those reporting or entering data.

EPA is, in fact, beginning to pursue this process with some data collection systems within its purview. In addition, it is recognized that the greater engagement of local officials such as emergency response personnel, and others who are more intimately familiar with their own communities, could also improve data quality. As Mr. Jacoby notes, many corrections have in fact been provided by GIS staff in local governments in south central Pennsylvania. Ultimately, though, it also may become necessary to consider stronger mandates and to more actively promote “best practices” and more useful guidelines for the collection of high-quality locational data, as part of the basic facility lifecycle.

In closing, we broadly agree with Mr. Jacoby’s assessment of the data-quality issues and the points that he raises. EPA’s own FRS team: (a) is currently working with several EPA program offices to expand its front-end facility data “lookup and validation” processes as data is collected; (b) is working with state agencies to improve facility data flows; and (c) has recently established an FRS workgroup with monthly teleconferences, broadcast through the Exchange Network – these have typically been attended by 20 to 30 participants from state agencies as well as EPA program offices and regions.

In these ways, and others, the FRS team is seeking both to expand its current network of stewards and to enhance overall capabilities for facility data reconciliation and stewardship. These efforts are expected to improve tools for identifying invalid, duplicative, or incomplete data, facilitate the prioritization of data correction efforts, and help in various other ways to close the loop in terms of reconciliation of data in FRS vs. source systems. We are also evaluating ways to provide FRS information back to source systems, such as facilities researched and updated by stewards, or identified as incomplete, invalid, or indicating other possible problems that can be corrected in the source systems as well.