NYCFacets

NYCFacets is a Smart Open Data Exchange that catalogs all the metadata for New York City (NYC) related datasources.

It was created in November 2011 in response to the NYCBigApps 3.0 challenge and was awarded the Grand Prize!

Background
NYC is currently executing an ambitious multi-year roadmap to become the Premier Digital City of the Future. A critical part of its strategy calls for exposing as much City Data as possible through an API so that businesses and entrepreneurs can build innovative Smart City solutions.

For the past three years, it has been making good on this strategy by holding NYCBigApps - an annual crowdsourcing contest in which it exposes select City Data and challenges Open Data innovators to create software solutions that use the data.

For the first two NYCBigApps, before the roadmap was formalized, City data was exposed through the "NYC Datamine" - a hodge-podge of files in various formats sitting on a web server. For this year's challenge, it created "NYC Open Data" - an Open Government Framework featuring APIs for City Data. It is powered by Socrata, the same platform recently adopted by data.gov and by several other cities in the US and around the world.

The problem
After NYCBigApps 3.0 was launched in October 2011, NYC convened several meetups. A recurring complaint among developers at these meetups was that there was no easy way to make sense of the 850+ datasets in the NYC Open Data Catalog. Even though there were an API, metadata and multiple search mechanisms for navigating the catalog, its contents weren't exposed in a way that promoted quick navigation, discovery and exploration.

Furthermore, even though NYC Open Data was a quantum leap from the previous NYC Datamine, the task of correlating datasets was still left to the developer. There was also no quick way to discern dataset quality. This information overload problem will only get worse once NYC exposes ALL City Data, as planned in the roadmap.

The solution
NYCFacets exposes all the metadata in the NYC Open Data Catalog and leverages the built-in search mechanisms of Semantic MediaWiki (SMW) to improve discovery and exploration. Going further, it also derives additional metadata - "extrametadata", a term coined by NYCFacets' originators - to aid understanding of the underlying datasets by scoring each one along several aspects of quality. It computes several scores (e.g. freshness, downloadcount, viewcount and sparseness) and combines them into a compound index called Pediacities Rank, reminiscent of Google's PageRank.
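The actual Pediacities Rank formula is not described here. As a rough illustration of how per-dataset scores could be folded into one compound index, the following Python sketch uses a weighted sum. The weights, the one-year freshness horizon, the field names and the function names are all assumptions made for this example, not the method NYCFacets uses.

```python
from datetime import date

# Hypothetical weights; the real Pediacities Rank weighting is not published here.
WEIGHTS = {"freshness": 0.4, "downloadcount": 0.2, "viewcount": 0.2, "completeness": 0.2}


def freshness(last_updated, today, horizon_days=365):
    """Return 1.0 for a dataset updated today, falling linearly to 0.0 at horizon_days."""
    age_days = (today - last_updated).days
    return min(1.0, max(0.0, 1.0 - age_days / horizon_days))


def scaled(count, max_count):
    """Scale a raw count into the range 0..1 relative to the largest count in the catalog."""
    return count / max_count if max_count else 0.0


def compound_rank(dataset, max_downloads, max_views, today):
    """Combine the individual scores into one weighted index between 0 and 1."""
    scores = {
        "freshness": freshness(dataset["last_updated"], today),
        "downloadcount": scaled(dataset["downloads"], max_downloads),
        "viewcount": scaled(dataset["views"], max_views),
        # Sparseness is taken as the fraction of empty cells, so completeness is its complement.
        "completeness": 1.0 - dataset["sparseness"],
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

A dataset that was updated recently, is downloaded and viewed often, and has few empty cells ends up near 1; a stale, rarely used, sparse dataset ends up near 0.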

For this early version, NYCFacets is primarily a developer reference for the NYC Open Data community (publishers and developers) to collaboratively use and refine City Data. Future versions will have entry points more suited to researchers (e.g. journalists and academics) and the general public.

Extensions used
NYCFacets uses SMW+1.6 Community Edition, Semantic Forms, Semantic Forms Inputs, Automatic Semantic Forms, Semantic Drilldown, Semantic Maps, Semantic Internal Objects, Enhanced Retrieval (to enable Faceted Search), Data Import Extension, Collaboration Extension, Semantic Gardening Extension, Header Tabs and Widgets. It also uses TSC Basic and the Rule Knowledge Extension from the SMW+ suite to compute scores.

The NYCFacets bot crawls the NYC Open Data Catalog every Saturday and imports all the metadata (both explicit and derived - roughly 1.58M triples as of Feb 25, 2012) into SMW the same day. Gardening Bots then scan the imported data and automatically flag data quality issues. Corrections and curations are entered through Automatic Semantic Forms, which lets site editors adjust the data without having to develop forms manually, since SMW categories and properties are also derived from the crawled data.
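To make the notion of "triples" concrete, here is a minimal Python sketch of how one crawled catalog record might be flattened into subject-property-value triples, including one derived property. The record layout, the property names and the `to_triples` helper are hypothetical; the actual bot and its import format are not documented here.

```python
def to_triples(record):
    """Flatten one catalog record into (subject, property, value) triples."""
    subject = record["id"]
    triples = []
    for prop, value in record.items():
        if prop == "id":
            continue  # the id is the subject, not a property
        # A multi-valued field (e.g. tags) yields one triple per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, prop, item))
    # Example of derived metadata: a count computed from the explicit column list.
    if "columns" in record:
        triples.append((subject, "column_count", len(record["columns"])))
    return triples


# Hypothetical record, for illustration only.
example = {
    "id": "dataset-001",
    "name": "Example Parks Dataset",
    "tags": ["parks", "recreation"],
    "columns": ["name", "borough", "acres"],
}
example_triples = to_triples(example)
```

In this sketch, explicit metadata (name, tags, columns) and derived metadata (column_count) end up in the same flat list, which mirrors how both kinds are imported together.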

Future plans
NYCFacets will be expanded to cover additional NYC-related datasources outside the city's catalog. These sources include semantic mappings of traditional relational databases, APIs and Linked Data. On the shortlist are 2010 US Census Data, data.nytimes.com, and NYCBigApps partner APIs. It will also catalog apps and websites that use these datasources - becoming a discovery mechanism for NYC-related apps and websites as well.

Several domain/integration ontologies will also be created to allow federated queries and feeds, not only of the metadata, but of the underlying data as well. A publicly-available SPARQL endpoint is also planned.

Pediacities Rank will be further expanded with a more robust reasoner (TSC Professional) to allow inferencing of more sophisticated scores and classifications (e.g. auto-detection of duplicate, test, similar or sparse datasets) to compute Trust. More collaborative features, such as Bugzilla integration and annotations, are also planned to actively engage the user community in refining the metadata.

New result format printers are also planned to facilitate easier connectivity to powerful visualization frameworks like Many Eyes, Google Charts and Tableau Public.

Finally, a complementary SMW-powered site targeted at the general public will be launched in Summer 2012. This new site will use the exposed metadata and the federated query feeds from NYCFacets.