2005/01/04

We know so little, and what we do know...

Is hidden.
  • There is a truly enormous number of raw figures on the web. This is a link to pages of links to hundreds of thousands, if not millions, of pages of tables of numbers from the EU, the UN, and the US Government. That is only the beginning.

    These figures concern every facet of the human condition and more. Those numbers are not boring! Each number on its own is boring, but there are stories hidden in that pile of data, waiting for people to find them.

  • Despite this, a vanishingly small percentage of it is accessible via a machine-usable API. What I mean is this:
    • No metadata, UDDI-like registry, ontologies or taxonomies. The information architecture that does exist is ad hoc at every level: site, sub-site, and project.
    • No SOAP, XML-RPC, REST, or other web-service means of access.
    • No machine access except by screen-scraping.
    • Subscription-only access to statistics that are, in most cases, funded by taxpayers, with prices targeted at institutions and NGOs.
    • Data trapped behind query interfaces that open as framed, JavaScript-driven, multi-POST wizards in popup windows.
    • Stupid restrictions (e.g. 5000 rows max per query) that force needless circumventions.

  • Of this huge body of data, a good 40-60% of the numbers (based on my random sampling) are in PDF. Extracting tables of figures from PDF into machine-readable form is complex, error-prone and processor-intensive. Of the remaining 40-60%, half is in Excel (which I can live with) and the rest is in CSV or something similar.

  • However, none of this data is out of reach, at least in theory. What is needed:
    • One smart repository
    • One smart screen-scraper
    • Something to extract tables from PDF documents
    • Trained monkeys to operate the scraping gizmo and classifying gadget.
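
To make the "screen-scraper" item a little more concrete, here is a minimal sketch of the dumbest possible version: pull every HTML table off a page and turn it into CSV. It uses only the Python standard library. The names TableScraper, scrape_tables, rows_to_csv and SAMPLE_PAGE are my own inventions for illustration, the sample page is made-up placeholder data rather than real statistics, and the code assumes reasonably well-formed, non-nested tables. A genuinely smart scraper would need to cope with far worse markup than this.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every <table> on a page as lists of rows.

    Assumption: tables are not nested and cells are properly closed.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return every table found in the HTML string as a list of rows."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.tables


def rows_to_csv(rows):
    """Serialize one scraped table to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up placeholder page, standing in for a real statistics site.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Country</th><th>Year</th><th>Value</th></tr>
  <tr><td>A</td><td>2003</td><td> 1,234 </td></tr>
  <tr><td>B</td><td>2003</td><td>567</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for table in scrape_tables(SAMPLE_PAGE):
        print(rows_to_csv(table))
```

Feeding it the HTML of a fetched page instead of SAMPLE_PAGE is the obvious next step. The parser is event-driven on purpose, so it never has to hold a whole document tree in memory, which matters when the pile is millions of pages deep. It does nothing for the PDF problem or for the popup wizards.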