Category Archives: data

MS Access: Database posted to downloads site

Microsoft Access, while not SQL-92 compliant, is a very popular database program suitable for analytical use by many people that don’t use R, SAS, or Tableau for analysis and reporting purposes.

Concerning FracFocus.org-related data, and back again by popular demand, FrackingData.org is now providing (see link below) a Microsoft Access database in “accdb” format containing various tables as follows:

  • FracFocus.org-related tables
    • dbo_RegistryUpload
    • dbo_RegistryUploadPurpose
    • dbo_RegistryUploadIngredients
  • Earthquakes-related tables
    • NCEDC_earthquakes_reverse_geocoded (worldwide, 1898 to date, magnitude 0 and up)
  • Toxicities-related tables
    • Chemical_Toxicities_Blended_Sorted
    • Chemical_Toxicities_Blended_Grouped
    • Chemical_Toxicities_Blended_Flattened_Boolean
  • Views utilizing the above tables
    • vue_Registry_Upload_Purpose_Ingredients
    • vue_Registry_Upload_Purpose_Ingredients_Toxicities
  • Link(s) to Microsoft Access database(s), compressed with the 7-Zip program:

Henceforth, this database will be available on the same schedule as the CSV, SQLite, and PostgreSQL files and a page holding the latest link can be found on the FracFocus Data page of FrackingData.org’s site (link below):

Khepry Quixote
10 June 2016

Earthquakes: Reverse-geocoder published on GitHub

Making good on my previous promise, I have released the source code for the NCEDC-formatted earthquake CSV file reverse-geocoder, written in Python 3, on GitHub as both as “Gist” and as an Eclipse-PyDev project .

Each of the above links has a README file with instructions on its use, arguments, and dependencies.

I dedicate this project and Gist to those about to endure the dubious “benefits” of fracking operations in the United Kingdom.

Khepry Quixote
7 June 2016

Status Update 2016-06-01: FrackingData_FracFocusRegistry 2016-05 Files Uploaded

As of 01 June 2016, various files (e.g. SQlite, CSV, and PgSQL) derived from FracFocus.org’s April 2016 FracFocusRegistry have been downloaded, extracted, transformed, loaded, archived, and uploaded to the frackingdata.info/downloads site and their respective links also posted to FrackingData’s FracFocus Data Page .

This time, FracFocus posted their SQL Server backup on 23 May 2016, almost a month later than its previous posting of 26 April 2016.

Once again, of significance this time was that the download of the files from the FracFocus.org website and their subsequent extract, transform, load, archiving, and exporting to CSV, SQLite, and PostgreSQL files was performed by a Windows batch script without human intervention. This automated method shaved hours from the extract, transform, load, archive, and export process.  In addition, the batch script now uses WinSCP to automatically upload the files in question to the http://frackingdata.info/downloads page.

When this Windows batch file is sufficiently stable, and I’ve soft-coded the data-cleansing views into the script itself,  I’ll post a link to it in the Source Code section of this blog.  Soft-coding of the data-cleansing views is the last hurdle to publishing this script.

Khepry Quixote 2016-06-01

Earthquakes: Reverse-Geocoded Files Posted to frackingdata.info/downloads

As I promised earlier, I’ve downloaded earthquakes from NCEDC’s web site (1898 to date), reverse-geocoded them via GeoNames and K-D Trees (thereby obtaining their country, state, county, and city/village values), archived the resulting files via 7-ZIP and uploaded both the CSV and SQLite datasets to:

I have authored a program in Python 3 that reverse-geocodes (via GeoNames and K-D Trees) the lat/longs into their respective countries, states, counties, and cities/villages.  I will post a link to the open-source project shortly once I’ve vetted its license and repository.  The program processes nearly 3 million rows in approximately 240 seconds.

Status Update 2016-05-02: FrackingData_FracFocusRegistry 2016-04 Files Uploaded

As of 02 May 2016, various files (e.g. SQlite, CSV, and PgSQL) derived from FracFocus.org’s March 2016 FracFocusRegistry have been downloaded, extracted, transformed, loaded, archived, and uploaded to the frackingdata.info/downloads site and their respective links also posted to FrackingData’s FracFocus Data Page .

The substantial delay between the last posting of the transformed FracFocusRegistry download in early March and this one in May was mostly due to FracFocus NOT posting anything until 26 April 2016.  This tardiness on FracFocus’s part is becoming a pattern.

Once again, of significance this time was that the download of the files from the FracFocus.org website and their subsequent extract, transform, load, archiving, and exporting to CSV, SQLite, and PostgreSQL files was performed by a Windows batch script without human intervention. This automated method shaved hours from the extract, transform, load, archive, and export process.  In addition, the batch script now uses WinSCP to automatically upload the files in question to the http://frackingdata.info/downloads page.

When this Windows batch file is sufficiently stable, and I’ve soft-coded the data-cleansing views into the script itself,  I’ll post a link to it in the Source Code section of this blog.  Soft-coding of the data-cleansing views is the last hurdle to publishing this script.

Khepry Quixote 2016-05-02

Earthquakes – Reverse Geocoding Coming Soon

One of the most vexing sets of data to make usable for a data analyst is the earthquake dataset available via the NCEDC search site.  While the site returns results quickly enough to an anonymous FTP site, they do not contain any columns representing the country, state, county, or city.  These columns are some of the most useful for analysis of questions such as: “Oklahoma now rivals and even exceeds California for the number of significant earthquakes?”

Believe it or not, the answer to the preceding question is “True,” especially when one can analyze reverse-geocode earthquakes using relatively simple SQL queries.

The difficulty was in the reverse-geocoding of the latitudes and longitudes to their respective countries, states, counties, and cities.  Originally, I had authored a Java program that used various ESRI shape files and discerned to which administrative units a lat/long belonged.  That is, if you wanted to wait 12 hours for it to run.

Given the long run time and inconvenience of obtaining the shape files, I declined to publish it except as source code with little, if any, explanation as to its operation and use.  I just didn’t think it was suitable for public consumption yet, as the reverse-geocoding was only tediously repeatable.  I knew there was a better way, and as of a few weeks ago, after some research, I authored a better mousetrap:

  • a Python 3.4 script
  • using the reverse-geocoder package
  • which uses K-D trees
  • and datasets from GeoNames
  • reverse-geocoding 2.8 million rows in approximately 210 seconds

So, in the next few weeks, the Python script will be pushed to GitHub and the reverse-geocoded earthquake dataset to http://frackingdata.info/downloads.  A posting or postings will be pushed when this is done.

How cool, from 12 hours to 210 seconds.

Finally, some progress…

Khepry Quixote
12 April 2016

Breaking the “Fracking Wall”

This post describes why I’ve resolved to break the “fracking wall” surrounding the data sources of oil well locations, fracking chemical disclosures, and earthquake sources.

BACKGROUND

I am a software/database/systems developer/designer/analyst with over thirty (30) of IT experience in a variety of domains: petrochemical plant applications, tax appraisal, county-level governmental agencies, law enforcement applications, point-of-sale systems, data warehousing and analysis, insurance, near-realtime aircraft/vessel dispatch and tracking, mapping applications, search engines, desktop and web applications, health care extraction, transformation, loading (ETL) and analysis. In short, there’s not a lot I haven’t done over my career.

SELF-EDUCATION

One of my continuing challenges is to self-educate on emerging languages, databases, and software on a frequent basis. This I do as a “night job” a few nights each week, every week, every month of every year. Having enjoyed applications involving mapping the most, in the Spring of 2012 I decided on a course of self-education with variety of mapping packages, but covering a single domain with free information: earthquakes. I choose this domain for no other reason than the data was freely available and of modest size, sources of publicly-available data being just a few million records.

EARTHQUAKES

And so I merrily went about my self-education on various mapping packages using the free source of earthquake data, enjoying positive results and pretty graphics along the way. Then, quite to my surprise, “swarms” of earthquakes began to materialize on the various maps I was creating. Interestingly enough, some of those swarms were in Oklahoma, a state of the union in which I had the privilege of living in from 1979 through 1982. What struck me as interesting was that I didn’t recall that many, actually relatively few, earthquakes during those three years I lived in that state. Needless to say, my interest was piqued.

SWARMS EMERGE

So, I began to plot out the earthquakes for Oklahoma on a wider scale, on a year-by-year basis, and I could discern that there were “swarms” of earthquakes materializing in places where there had been very few in the preceding decades. As I have a B.S. in Zoology, a scientific bend to my mind and an absolute passion for the discernment of emerging patterns, mental alarm bells went off that I was seeing an emerging pattern that might have a more anthropogenic than natural origin. Casually, as this was a “night job,” I began searching the Internet for possible causalities and ran across the hypothesis that hydro-fracturing a.k.a. “fracking” operations, specifically the injection of water and chemicals into “fracking” and underground “disposal” wells, was causing the emerging swarms of earthquakes.

EARTHQUAKE SWARMS vs. OIL WELL LOCATIONS

At a magnitude of 5.6, the largest earthquake ever recorded in Oklahoma up to that time struck on November 5, 2011, being preceded by 4.7 through 5.0 foreshocks earlier in the day. It was this earthquake and its foreshocks that really raised my interest as I was mapping not only the locations of the earthquakes on the map but also their intensity via color, the more reddish the stronger the quake. To me, where there’s smoke there’s fire, and that being said I resolved to start mapping out well locations as well. It was my quest for well locations on a state-by-state basis that turn self-education into an avocation of sorts, and introduced me to the “fracking wall.”

HITTING THE “FRACKING WALL”

In an effort to obtain the locations of oil wells, I contacted various state agencies of the State of Oklahoma with virtually no results. It wasn’t that the data wasn’t available, it’s that what data was available was not easily downloaded and most importantly did not contain the latitudes and longitudes of the wells. In other words, I could roll the cigar between my fingers but I could neither light nor puff upon it. I was told by one state official, emphatically, that such location data was not available. Agency-by-agency, I wrote and/or called the appropriate personnel, and although most of the employees were polite, they were also equally unhelpful. It took me several months to find out where the data sets containing oil well location data had been posted. There was one, I repeat one, mention of a link to Oklahoma’s oil well location datasets in an obscure forum in a backwater of the Internet. This was the clue I needed, and finally I was able to plot the oil well locations against the occurrence of earthquakes and confirm that “where there’s smoke, there’s fire.” The refusal of the State of Oklahoma to point me to the location(s) of oil well location data was my first experience with the “fracking wall,” and it wouldn’t be my last although the “fracking wall” would be manifested by different states and agencies in different ways.

BREAKING THE “FRACKING WALL”

Because of the State of Oklahoma’s behavior and lack of cooperation, I resolved to break the “fracking wall” for both myself and all others needing access to the same type of data. In an effort to collect all of the oil well location and fracking chemical disclosure hyperlinks in one place, as well as offer curated datasets of the aforementioned data, I created the frackingdata.org website with curated FracFocus data extracts, chemical toxicities and their datasets, state-by-state sources of well location data, and the source code used to extract the datasets into more usable forms. In short, I created frackingdata.org to be a one-stop shop for anyone wishing to conduct analysis of fracking-related data.

FUTURE INITIATIVES

+ Link earthquake data to oil well locations in a manner convenient to anyone wishing to analyze such data (In progress)

+ Automate the download, extraction, transformation, and loading of FracFocus.org data into datasets more suitable for use by analysts or citizen-scientists. (Done)

+ Transform the FracFocus.org GUID keys into more user-friendly integer keys that also reduce storage by over 25% (Done)

+ Push the curated data sets to ODATA repositories, e.g. Google Fusion Tables, so that analysts can more easily access the data via packages like Tableau, SAS, or R. (In progress)

+ Push more of the source code used to do this extraction, transformation, and loading to repositories like GitHub so that all may share in its presence and perhaps even contribute to its maintenance. (Partially done)

As I have a “day job,” progress is painful but the results are worth it.

Khepry Quixote
11 March 2016

FracFocus Database Dump Posting Tardy

It’s 2 March 2016 and FracFocus has yet to post January’s data as the last posting was 11 January 2016 and would only be cumulative through December 2015, let alone February’s which should be out in a week or so.  FracFocus’s tardiness and lack of consistent posting is starting to smell like either passive resistance or a lack of batch scripting competence.  So much for timely “transparency.”  Through intermediaries, to no avail, I’ve offered to help them automate the extraction and posting of the database dump as it would be a win-win for all concerned. *Sigh*

Khepry Quixote
Little Rock, Arkansas
2 March 2016 @ 21:45

Preliminary Analysis of FracFocus.org Data

Abstract

This post covers the oil/gas well and chemical data extracted from the FracFocus.org website by the staff of SkyTruth.org.  The FracFocus.org website is gradually becoming the default repository for general hydro-fracturing well information as well as the chemicals used on each well during the fracturing process.  The “findings” I present will be sample  Structured Query Language (SQL) queries of a substantial subset of the extracted FracFocus.org data (800,000+ records).

FracFocus.org

When I first saw FracFocus.org’s website, I thought I had found an answer to my oil and gas well location information difficulties, as well of chemicals used in their operations.  Then I tried to use it and encountered another “fracking wall.”  As objective as I can be now and in my professional opinion based upon my experience cited in my biography below, FracFocus.org’s website at that time was intentionally constructed to make information extraction difficult.

I found a Frequently Asked Questions webpage upon which I found the following question:

“Are the records from FracFocus available in a digital format such as Excel?”

It was answered as follows:

“No. FracFocus was originally designed to serve records one at a time in Adobe pdf format in order to ensure accurate, unaltered and uncompromised data. Consequently the chemical information gathered does not currently reside in a database or spreadsheet format.”

Another question on the same page was as follows:

“Why can’t the system show me the information on more than one well at a time?”

It was was answered as follows:

“The purpose of the FracFocus records presentation system was to provide those who may live near a well that has been fractured with information concerning the materials used to fracture the well.  All information other than the information used in the search form is available only in an Adobe pdf format.  As such, information such as Ingredients, Trade Names and CAS numbers is not available for search or data aggregation purposes.”

It was this answer that surprised me the most.  Once again, I was stymied.  At the time this was only a “night job” self-education project, so I moved on to looking for well information from more easily-obtained sources.

SkyTruth.org Extracts Data from FracFocus.org

Several months later, while surfing the web, I noted that SkyTruth.org had extracted data from the FracFocus.org website.  I went to the article and attempted to obtain the data from the links they had posted on the http://frack.skytruth.org/fracking-chemical-database/frack-chemical-data-download web page.  I found it to be a little bit difficult to use for analytical reasons, so I applied some of my programming skills and pulled all of the extracted data into two data sets: one of “reports” that contained information about each well, and another of “chemicals” citing which chemicals were used in each reported well.  I then joined these two data sets into a “blended” data set and pushed all three of the resulting data sets to the “Download” section of an open-source repository at http://code.google.com/p/fracking-analysis/.

Database Preparation

As I wished to do some Structured Query Language (SQL) analysis of the extracted data and take a swipe at creating a high-speed, easily-searchable website using Apache Solr 4 as the back-end search engine, I imported the data into two tables within a MySQL database on an Amazon Web Services (AWS) Relational Database Service (RDS) instance.  I used an AWS RDS instance so that I could expand or contract its capacity and capability at will and as needed.  Having been pointed to a source of Chemical Abstract Number  information coupled with recognized and suspected toxicities, I consolidated this data into a table suitable for import into the MySQL database as well.

Once these three tables, as well as a “view” that joined the chemicals to their respective well reports, had been created in the MySQL database, I backed it up and posted it to the open-source repository at http://code.google.com/p/fracking-analysis/downloads/detail?name=frackanalysis_db_backup.zip&can=2&q=.

The tables presently contained within the database are as follows:

  • “chemicals” – a table of approximately 800,000 records citing information about each chemical documented as having been used on the reported well.
  • “reports” – a table of approximately 27,000 records citing information about each well reported as of September 21, 2012.
  • “uvw-chemicals-reports” – a view joining the each record in the “chemicals” table with its respective “reports” table.
  • “cas_chemicals” – a table of approximately 8,000 rows citing information about each chemical with its noted toxicity (e.g. cancer, respiratory, blood) and its category (e.g. recognized or suspected).

The above database and the tables within it are the basis of the queries cited below.

Simple Analytical Queries

Let’s start off with some simple analytical queries of the database.  These will be prefaced with an explanation of their purpose in the format of a natural language query.

“What is the maximum number of chemicals entered for a reported well?”

SELECT MAX(`Count`) Max_Count FROM (SELECT COUNT(*) `Count`    FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;

Which gives one an answer of 310.

“What is the minimum number of chemicals entered for a reported well?”

SELECT MIN(`Count`) Min_Count FROM (SELECT COUNT(*) `Count`    FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;

Which gives one an answer of 1.

“What is the average number of chemicals entered for a reported well?”

SELECT AVG(`Count`) Avg_Count FROM (SELECT COUNT(*) `Count`    FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;

Which gives one an answer of 29.0704.

More Complex Analytical Queries

Let’s continue with some more complex analytical queries of the database.  These will be prefaced with an explanation of their purpose in the format of a natural language query.

“What is the maximum number of chemicals entered for a reported well that are recognized carcinogens?”

SELECT MAX(`Count`) Max_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY pdf_seqid) subqry1;

Which gives one an answer of 6.

“What is the average number of chemicals entered for a reported well that are recognized carcinogens?”

SELECT AVG(`Count`) Avg_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY pdf_seqid) subqry1;

Which gives one an answer of 1.2445.

A More Interesting Analytical Query

Let’s continue with some more interesting analytical queries of the database.  These will be prefaced with an explanation of their purpose in the format of a natural language query.

“What are the most reported chemicals that are recognized carcinogens in descending order?”

SELECT * FROM (SELECT cc.`Chemical_Name`, COUNT(*) `Count` FROM   frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY cc.`Chemical_Name`) subqry1 ORDER BY `Count` DESC;

Which gives one the resultset below:

NAPHTHALENE 6691

BENZYL CHLORIDE   1636

FORMALDEHYDE      1563

EPICHLOROHYDRIN   378

ATTAPULGITE 334

NITRILOTRIACETIC ACID, TRISODIUM SALT MONOHYDRATE     259

ACRYLAMIDE  233

BIS(2-CHLOROETHYL) ETHER      204

ETHYLBENZENE      134

THIOUREA    96

ACETALDEHYDE      29

1,4-DIOXANE 11

ETHYLENE OXIDE    11

ANTIMONY TRIOXIDE 5

NITRILOTRIACETIC ACID   1

propylene glycol butyl ether  1

Summary

The queries above are just a few of the many that could be made against the cited database and its tables.  A more thorough set of queries seem to be in order by professionals more versed in health studies.  Now that a database suitable for analysis is available, such may yet be possible.    As the records in the reports table contain latitude and longitude values, queries relevant to location seem to be in order as well.  A list of recommendations concerning further development of this database appears below.

Recommendations

  • A table of CAS chemicals along with their toxicity on a relative scale be found and imported into the database.  Perhaps information from the Material Safety Data Sheets (MSDS) can be used to obtain the relative toxicity as the MSDS ranks chemicals on a scale from 0 to 4.
  • A table of cities is obtained from a source like the National Atlas Data Download and the reported wells plotted in relation to said cities.
  • A table of wells drilled be imported into the database so that a query of wells drilled versus wells reported to the FracFocus.org site can be made to assess the level of reporting compliance in jurisdictions in which reporting is mandatory.
  • A table of schools is imported into the database so that proximity-oriented queries can be made against both wells and chemicals.
  • A table of earthquakes is imported into the database so that proximity-oriented and time-sequenced queries can be made against reported wells.
  • An open-source repository is created for storing and retrieval of both the data and analytical queries.
  • An open-source, high-speed search engine web site is created as an aid to the data analysis community as well as a demonstration site for presentations to policy makers as to what is possible with a relative minimum of effort.

Biography

David Darling is a U.S. Army veteran (tactical nuclear weapons) with a B.S. in Zoology and over thirty years of professional experience in Information Technology as a software developer, corporate-level database developer/administrator and search engine developer/administrator. He presently authors software in Java, Java Server Faces, PrimeFaces, coupled with the Apache open-source technologies of Velocity, Solr, and Hadoop.  Additional authorship in databases is done in SQL Server, MySQL, PostgreSQL/PostGIS, and Oracle. He especially enjoys any applications dealing with mapping, spatial data, and Geographical Information Services (GIS). He has extensive experience in petrochemical, law enforcement, local/county/regional governments, voter registration, taxation and property tax appraisal, telecommunications, real-time tracking (spatial/status), and health-care information.

Bill of Rights for Fracking Information

  1. That all of the data and its documentation:
    1. Should be
      1. in a machine-readable form
      2. Suitable for aggregation
      3. And downloadable in a compact form (e.g. ZIP, 7z).
    2. Should be suitable for its nominal purposes of research and reporting by
        1. Reporters
        2. Data Analysts
        3. Citizen Scientists
        4. Regulators
    3. Should be released
      1. In a frequent and timely manner.
      2. With “delta” datasets available, with “delta” being differences between the current and previous releases.
        1. The “delta” datasets should contain the following machine-readable “images”:
          1. “Previous” image.
          2. “Current” image.
          3. “Changed” image with only the values that are different being reported.
    4. Should NOT reside:
      1. Behind a pay-wall.
      2. Behind a registration-wall.
    5. Should be accessible:
      1. Interactively.
      2. ReST-fully via an API.
    6. Should be curated in a manner consistent with:
      1. The norms of professional, responsible data-warehousing.
        1. For example, the elimination of extraneous TAB, LINEFEED, or DIACRITIC characters that should NOT appear within a column.
        2. The resolution of disparate geographical projections (e.g. NAD27, NAD83) into a unified geographical projection (WGS84) suitable for mapping via geographic information systems or platforms such as Google Maps (WGS84).
      2. The needs of others to reliably export the data to alternative formats (e.g. CSV, XML, JSON).