This is a selection of projects I have worked on in recent years.
Vizier - Streamlined Data Curation
Vizier aims to streamline data curation and to enable domain experts without computer science training to curate their own data. Vizier features an intuitive interface combining elements of notebooks and spreadsheets, allowing analysts to quickly view, edit, and revise data. This capability is complemented by a framework for automated data cleaning steps that integrate seamlessly with manual curation operations. At the heart of Vizier is a system for managing uncertainty and provenance of curation workflows and data, enabling users to track higher-level curation operations as well as the lineage of the data.
This project is a collaboration with Juliana Freire (NYU), Oliver Kennedy (University at Buffalo), and Boris Glavic (Illinois Institute of Technology). An overview of project ideas can be found in our paper at HILDA 2016.
Standard Cortical Observer
The Standard Cortical Observer (SCO) provides a platform to support reproducible and comparable research results on computational models that predict brain responses to arbitrary sensory inputs. As part of the project, three major components were developed: (1) a computational model (the Standard Cortical Observer Model 1.0) of how visual images are transformed into patterns of activity in the visual parts of the human brain, (2) a standard dataset of images and human fMRI measurements for validation of the SCO model and future models, similar to the standard datasets used in machine vision and machine learning, and (3) a computational platform including a data store, web API, and graphical user interface.
The project is a collaboration with Jonathan Winawer and Noah Benson from the NYU Department of Psychology and Center for Neural Science. The work was funded in part by an MSDSE Data Science Seed Grant. All system components are available on GitHub: SCO Model, Data Store, Engine, Web Service, Worker, Python Client, and User Interface.
i-EKbase - Intelligent Environmental Knowledgebase
The i-EKbase system is designed to monitor large farming areas using remote sensing data. The main sources of input are data from the Landsat and MODIS satellites. i-EKbase integrates remote sensing data with various other sources (e.g., Bureau of Meteorology weather observations and Australian Soil Data). Farmers are provided with information and guidance on local biodiversity, soil quality, water availability, irrigation, topography, and early pest and plant disease prevention to improve crop yield management.
Semantic Sensor Data
Semantic enrichment of sensor data addresses the problems of (re-)use, integration, and discovery. A critical issue is how to generate semantic sensor data from existing data sources. In this project, we developed an approach to semantically augment an existing sensor data infrastructure to re-publish the data as Linked Open Data. In our use case we show how semantic sensor data can help with the growing challenge of selecting sensors that are fit for purpose.
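The kind of re-publishing described above can be illustrated with a small sketch. It converts a raw sensor reading into RDF triples (N-Triples syntax) using terms from the W3C SOSA/SSN sensor vocabulary; the base namespace, identifiers, and reading format here are invented for illustration and are not the project's actual data model.

```python
# Hypothetical sketch: turn a raw sensor reading into Linked Data (N-Triples).
# The SOSA vocabulary terms are real W3C terms; the BASE namespace and the
# reading format are invented for this example.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SOSA = "http://www.w3.org/ns/sosa/"
BASE = "http://example.org/sensors/"  # hypothetical namespace

def observation_to_ntriples(sensor_id, obs_id, value, unit, timestamp):
    """Serialize one sensor observation as N-Triples statements."""
    obs = f"<{BASE}obs/{obs_id}>"
    return "\n".join([
        f"{obs} <{RDF_TYPE}> <{SOSA}Observation> .",
        f"{obs} <{SOSA}madeBySensor> <{BASE}{sensor_id}> .",
        f'{obs} <{SOSA}hasSimpleResult> "{value} {unit}" .',
        f'{obs} <{SOSA}resultTime> "{timestamp}" .',
    ])

nt = observation_to_ntriples("gauge-42", "1001", 2.35, "m", "2013-05-01T10:00:00Z")
print(nt)
```

Once observations are exposed as triples like these, standard Linked Data tooling (SPARQL endpoints, vocabulary-based discovery) can be used to find sensors that are fit for purpose.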
Link My Data
One of the biggest obstacles to the reuse of third-party sensor data is a lack of knowledge about data properties (e.g., provenance and quality), leading to a lack of trust in the data. Link My Data is a first step towards overcoming this problem. Link My Data provides a platform for data curation that allows users to share knowledge about individual sensors and sensor observations. The system supports annotation and transformation of sensor data on the Web to improve data quality and (re-)usability.
South Esk Hydrological Sensor Web
Limited freshwater resources in many parts of Australia have led to a highly regulated system of water allocation. Poor situation awareness can result in over-extraction of water from river systems, compromising river ecosystems. To increase situation awareness, we developed a continuous flow forecasting system based on the Open Geospatial Consortium Sensor Web Enablement standards. A prototype Hydrological Sensor Web has been established in the South Esk river catchment in north-eastern Tasmania. Observations from the aggregated sensor assets drive a rainfall-runoff model that predicts river flows at key monitoring points in the catchment.
The South Esk Hydrological Sensor Web was our test-bed for research on the management and re-use of sensor observations and sensor metadata. I was involved in developing a provenance management system for the continuous flow forecasting system. The generation of predicted river flows involves complex interactions between instruments, simulation models, computational facilities, and data providers. Correct interpretation of information produced at the various stages of the information life-cycle requires detailed knowledge of the data creation and transformation processes. Such provenance information allows hydrologists and decision-makers to make sound judgments about the trustworthiness of hydrological information.
This project won the Asia Pacific ICT Alliance (APICTA) Award for Sustainability and Green IT in 2012 and the Australian iAward for Green IT and Sustainability in 2011.
DBWiki - A Structured Wiki for Collaborative Data Management
Both relational databases and wikis have strengths and weaknesses for use in collaborative data management and data curation. Relational databases offer many advantages, such as scalability, query optimization, and concurrency control, but they are not easy to use and lack other features needed for collaboration. Wikis have proved enormously successful as a means to collaborate because they are easy to use, encourage sharing, and provide built-in support for archiving, history-tracking, and annotation. However, wikis lack support for structured data, for efficiently querying data at scale, and for fine-grained data provenance. To achieve the best of both worlds, we implemented a general-purpose platform for collaborative data management called DBWiki. Our system not only facilitates the collaborative creation of a structured database; it also provides features not usually found in database technology, such as versioning, provenance tracking, citability, and annotation.
XArch - XML Archiver
XArch is an archive management system that allows one to create, populate, and query archives containing multiple database versions. XArch is based on a nested-merge approach that stores multiple database versions efficiently in a compact archive. The system allows one to create new archives, merge new versions of data into existing archives, and execute both snapshot and temporal queries using a declarative query language. XArch has an extensible I/O layer and can currently archive data in XML format as well as relational databases.
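The nested-merge idea can be illustrated with a toy sketch: each node in the archive records the set of versions in which it appears, so subtrees that do not change between versions are stored only once, and a snapshot query reconstructs any single version. This is an illustration of the principle only, not XArch's actual data model or storage format.

```python
# Toy sketch of nested-merge archiving (illustrative, not XArch's format):
# every node carries the set of version numbers in which it occurred.

def merge(archive, snapshot, version):
    """Merge one database snapshot (a nested dict) into the archive."""
    for key, value in snapshot.items():
        node = archive.setdefault(key, {"versions": set(), "children": {}, "values": {}})
        node["versions"].add(version)
        if isinstance(value, dict):
            merge(node["children"], value, version)
        else:
            # record a scalar value together with the versions it belongs to
            node["values"].setdefault(value, set()).add(version)
    return archive

def snapshot_query(archive, version):
    """Reconstruct the database as of a given version (a snapshot query)."""
    result = {}
    for key, node in archive.items():
        if version in node["versions"]:
            if node["children"]:
                result[key] = snapshot_query(node["children"], version)
            else:
                result[key] = next(v for v, vs in node["values"].items()
                                   if version in vs)
    return result

# Merge two versions of a tiny database; the unchanged "name" value is
# stored once, with both version numbers attached to it.
archive = merge({}, {"country": {"name": "Australia", "population": 20000000}}, 1)
archive = merge(archive, {"country": {"name": "Australia", "population": 21000000}}, 2)
```

A temporal query would instead traverse the version sets directly, e.g. to ask in which version a value first changed.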
This is a list of selected publications.
The exception that improves the rule
Workshop on Human-In-the-Loop Data Analytics (HILDA), 2016
A Use Case in Semantic Modelling and Ranking for the Sensor Web
International Semantic Web Conference (ISWC), 2014
Towards Content-Aware SPARQL Query Caching for Semantic Web Applications
Web Information Systems Engineering (WISE), 2013
From RESTful to SPARQL: A Case Study on Generating Semantic Sensor Data
Workshop on Semantic Sensor Networks (SSN) at ISWC, 2013
Link My Data: Community-based Curation of Environmental Sensor Data
Intl. Symposium on Spatial and Temporal Databases (SSTD), Demo Track, 2013
Discovering conditional inclusion dependencies
ACM Conf. on Information and Knowledge Management (CIKM), 2012
Improving data quality by source analysis
ACM J. Data and Information Quality, Vol. 2, Issue 4, March 2012
The Database Wiki Project: A General-Purpose Platform for Data Curation and Collaboration
SIGMOD Record, Vol. 40, No. 3, September 2011
Using Links to prototype a Database Wiki
Symposium on Database Programming Languages (DBPL), Seattle, WA, 2011
DBWiki: A Structured Wiki for Curated Data and Collaborative Data Management
ACM International Conference on Management of Data (SIGMOD), Demo Track, 2011
Detecting Inconsistencies in Distributed Data
IEEE International Conference on Data Engineering (ICDE), 2010
Curating the CIA World Factbook
International Journal of Digital Curation, Volume 4, Issue 3, 2009
XArch: Archiving Scientific and Reference Data
ACM International Conference on Management of Data (SIGMOD), Demo Track, 2008
Sorting Hierarchical Data in External Memory for Archiving
Proceedings of the VLDB Endowment, Volume 1, Issue 1, 2008
Describing Differences between Databases
ACM Conf. on Information and Knowledge Management (CIKM), 2006
Columba: An Integrated Database of Proteins, Structures, and Annotations
BMC Bioinformatics, 6(1):8, 2005
Mining for Patterns in Contradictory Data
ACM Workshop on Information Quality for Information Systems, 2004
COLUMBA: Multidimensional Data Integration of Protein Annotations
Workshop on Data Integration in Life Sciences (DILS), 2004
Data Quality in Genome Databases
Proceedings of the Conference on Information Quality (IQ), 2003
Problems, Methods, and Challenges in Comprehensive Data Cleansing
HUB-IB-164, Humboldt University Berlin, 2003