Earthquakes – Reverse Geocoding Coming Soon

One of the most vexing datasets for a data analyst to make usable is the earthquake dataset available via the NCEDC search site.  While the site returns results quickly enough to an anonymous FTP site, the results do not contain any columns representing the country, state, county, or city.  These columns are some of the most useful for analyzing questions such as: “Is it true that Oklahoma now rivals and even exceeds California in the number of significant earthquakes?”

Believe it or not, the answer to the preceding question is “True,” and it becomes easy to check once the earthquakes are reverse-geocoded and can be analyzed with relatively simple SQL queries.
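
As a rough sketch of what such a query could look like, the snippet below loads a handful of rows into an in-memory SQLite table and counts events per state above a magnitude threshold.  The table name, column names, threshold, and rows are all assumptions made up for illustration; the states are deliberately anonymous placeholders and the counts say nothing about real earthquakes.

```python
import sqlite3

# Placeholder rows only: (event_id, magnitude, country, state).
# These are invented for illustration and are not real earthquake records.
ROWS = [
    (1, 3.2, "US", "State A"),
    (2, 4.1, "US", "State A"),
    (3, 3.6, "US", "State B"),
    (4, 2.1, "US", "State B"),
    (5, 3.0, "US", "State A"),
]


def count_by_state(rows, min_magnitude=3.0):
    """Count events at or above min_magnitude per state, largest count first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: the reverse-geocoded columns are country and state.
        conn.execute(
            "CREATE TABLE earthquakes ("
            "event_id INTEGER PRIMARY KEY, magnitude REAL, country TEXT, state TEXT)"
        )
        conn.executemany("INSERT INTO earthquakes VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT state, COUNT(*) AS events FROM earthquakes "
            "WHERE country = 'US' AND magnitude >= ? "
            "GROUP BY state ORDER BY events DESC, state ASC",
            (min_magnitude,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for state, events in count_by_state(ROWS):
        print(state, events)
```

With real data, the same GROUP BY would be run against the reverse-geocoded table, filtering on whatever magnitude one considers “significant.”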

The difficulty was in reverse-geocoding the latitudes and longitudes to their respective countries, states, counties, and cities.  Originally, I had authored a Java program that used various ESRI shapefiles to discern which administrative units a lat/long belonged to.  It worked, if you were willing to wait 12 hours for it to run.

Given the long run time and the inconvenience of obtaining the shapefiles, I declined to publish it except as source code with little, if any, explanation of its operation and use.  I just didn’t think it was suitable for public consumption yet, as the reverse-geocoding was only tediously repeatable.  I knew there was a better way, and a few weeks ago, after some research, I built a better mousetrap:

  • a Python 3.4 script
  • using the reverse-geocoder package
  • which uses K-D trees
  • and datasets from GeoNames
  • reverse-geocoding 2.8 million rows in approximately 210 seconds
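
To give a feel for why a K-D tree makes this fast, here is a minimal, hand-rolled sketch of nearest-place lookup.  It does not use the reverse-geocoder package or GeoNames data; the `PLACES` list, its approximate coordinates, and the function names are all assumptions for illustration.  Points are first mapped onto the unit sphere so that straight-line distance ranks neighbors the same way great-circle distance would.

```python
import math

# Hypothetical reference points: (name, latitude, longitude) in decimal degrees.
# Coordinates are approximate and for illustration only.
PLACES = [
    ("Oklahoma City", 35.47, -97.52),
    ("Tulsa", 36.15, -95.99),
    ("Los Angeles", 34.05, -118.24),
    ("San Francisco", 37.77, -122.42),
    ("Dallas", 32.78, -96.80),
]


def to_xyz(lat, lon):
    """Map a lat/long onto the unit sphere as (x, y, z)."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (
        math.cos(lat_r) * math.cos(lon_r),
        math.cos(lat_r) * math.sin(lon_r),
        math.sin(lat_r),
    )


def build(points, depth=0):
    """Build a K-D tree from (xyz, name) pairs as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (
        points[mid],
        build(points[:mid], depth + 1),
        build(points[mid + 1:], depth + 1),
    )


def nearest(node, target, depth=0, best=None):
    """Return (squared_distance, name) of the tree point closest to target."""
    if node is None:
        return best
    (xyz, name), left, right = node
    dist = sum((a - b) ** 2 for a, b in zip(xyz, target))
    if best is None or dist < best[0]:
        best = (dist, name)
    axis = depth % 3
    diff = target[axis] - xyz[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best


TREE = build([(to_xyz(lat, lon), name) for name, lat, lon in PLACES])


def reverse_geocode(lat, lon):
    """Name of the reference place nearest to the given coordinates."""
    return nearest(TREE, to_xyz(lat, lon))[1]


if __name__ == "__main__":
    print(reverse_geocode(35.5, -97.5))
```

Each lookup discards whole branches of the tree instead of comparing against every place, which is the kind of saving that matters when millions of rows are being geocoded.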

So, in the next few weeks, the Python script will be pushed to GitHub and the reverse-geocoded earthquake dataset to http://frackingdata.info/downloads.  I will post again when this is done.

How cool, from 12 hours to 210 seconds.

Finally, some progress…

Khepry Quixote
12 April 2016

Bill of Rights for Fracking Information

  1. That all of the data and its documentation:
    1. Should be:
      1. In a machine-readable form.
      2. Suitable for aggregation.
      3. Downloadable in a compact form (e.g. ZIP, 7z).
    2. Should be suitable for its nominal purposes of research and reporting by:
      1. Reporters.
      2. Data Analysts.
      3. Citizen Scientists.
      4. Regulators.
    3. Should be released:
      1. In a frequent and timely manner.
      2. With “delta” datasets available, where a “delta” is the set of differences between the current and previous releases.
        1. The “delta” datasets should contain the following machine-readable “images”:
          1. “Previous” image.
          2. “Current” image.
          3. “Changed” image with only the values that are different being reported.
    4. Should NOT reside:
      1. Behind a pay-wall.
      2. Behind a registration-wall.
    5. Should be accessible:
      1. Interactively.
      2. ReST-fully via an API.
    6. Should be curated in a manner consistent with:
      1. The norms of professional, responsible data-warehousing.
        1. For example, the elimination of extraneous TAB, LINEFEED, or DIACRITIC characters that should NOT appear within a column.
        2. The resolution of disparate geodetic datums (e.g. NAD27, NAD83) into a single datum (WGS84) suitable for mapping via geographic information systems or platforms such as Google Maps.
      2. The needs of others to reliably export the data to alternative formats (e.g. CSV, XML, JSON).
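
To make the “delta” idea in item 3 concrete, here is a minimal sketch of how the previous, current, and changed images might be computed from two releases.  It assumes each release is a dictionary keyed by a record identifier; the identifiers, field names, and values below are hypothetical placeholders, not real well data.

```python
import json


def build_delta(previous, current):
    """Return one delta entry per record that differs between two releases.

    Each entry carries the "previous" image, the "current" image, and a
    "changed" image holding only the fields whose values differ.  A record
    that was added has previous=None; one that was removed has current=None.
    """
    delta = []
    for key in sorted(previous.keys() | current.keys()):
        before = previous.get(key)
        after = current.get(key)
        if before == after:
            continue  # Unchanged records are left out of the delta.
        old_fields = before or {}
        new_fields = after or {}
        changed = {
            field: new_fields.get(field)
            for field in old_fields.keys() | new_fields.keys()
            if old_fields.get(field) != new_fields.get(field)
        }
        delta.append(
            {"id": key, "previous": before, "current": after, "changed": changed}
        )
    return delta


# Hypothetical releases keyed by a made-up record id.
PREVIOUS = {
    "w1": {"operator": "Acme", "depth_ft": 9000},
    "w2": {"operator": "Beta", "depth_ft": 7000},
}
CURRENT = {
    "w1": {"operator": "Acme", "depth_ft": 9100},
    "w3": {"operator": "Gamma", "depth_ft": 5000},
}

if __name__ == "__main__":
    print(json.dumps(build_delta(PREVIOUS, CURRENT), indent=2, sort_keys=True))
```

In this sketch a removed record reports every field as changed to null, which is one possible convention; a publisher could just as reasonably flag removals with an explicit status field.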