I am a Research Engineer at NYU's Center for Data Science, developing tools and infrastructure to support data scientists. I have a research interest in databases, in particular data curation, data integration, data quality and sensor data management and analytics. Bioinformatics is another interest of mine, in particular genome assembly and annotation.

I received my Diploma in Computer Science from the Technical University of Berlin. I was a member of the Berlin-Brandenburg Graduate School on Distributed Information Systems and received my PhD from Humboldt University Berlin. The main focus of my dissertation was on quality of genome data and on conflict resolution in data integration. I spent four years as a Research Fellow in the Database Group at the University of Edinburgh, working on aspects of data curation (e.g., managing evolving databases, provenance management, and annotation). Before joining NYU, I was a Research Scientist at CSIRO where I led a team of Research Scientists and Engineers working on aspects of metadata management and interoperability.


This is a selection of projects I have worked on over the past years.

i-EKBase - Intelligent Environmental Knowledgebase

The i-EKBase system is designed to monitor large farming areas using remote sensor data. The main source of input is data from Landsat and MODIS satellites. i-EKBase integrates remote sensing data with various other sources (e.g., Bureau of Meteorology weather observations and Australian Soil Data). Farmers are provided with information and guidance related to the local biodiversity, soil quality, water availability, irrigation, topography, and early pest and plant disease prevention to improve crop yield management.

This is joint work with Ritaban Dutta who leads the project. A prototype of the system is available at iekbaseanalytics.csiro.au.

Semantic Sensor Data

Semantic enrichment of sensor data addresses the problems of (re-)use, integration, and discovery. A critical issue is how to generate semantic sensor data from existing data sources. In this project, we developed an approach to semantically augment an existing sensor data infrastructure to re-publish the data as Linked Open Data. In our use case we show how semantic sensor data can help with the growing challenge of selecting sensors that are fit for purpose.

This is joint work with Liliana Cabral, Michael Compton, Ahsan Morshed, and Yanfeng Shu. The work has been presented at SSN 2013 and ISWC 2014.

Link My Data

One of the biggest obstacles to reuse of third-party sensor data is a lack of knowledge about data properties (e.g., provenance and quality) leading to a lack of trust in the data. Link My Data is a first step towards overcoming this problem. Link My Data provides a platform for data curation that allows users to share knowledge about individual sensors and sensor observations. The system supports annotation and transformation of sensor data on the Web to improve data quality and (re-)usability.

Link My Data has been demonstrated at SSTD 2013 (Poster).

South Esk Hydrological Sensor Web

Limited freshwater resources in many parts of Australia have led to a highly regulated system of water allocation. Poor situation awareness can result in over-extraction of water from river systems, compromising river ecosystems. To increase situation awareness, we developed a continuous flow forecasting system based on the Open Geospatial Consortium Sensor Web Enablement standards. A prototype Hydrological Sensor Web has been established in the South Esk river catchment in north-eastern Tasmania. Observations from the aggregated sensor assets drive a rainfall-runoff model that predicts river flows at key monitoring points in the catchment.

The South Esk Hydrological Sensor Web was our test-bed for research on management and re-use of sensor observations and sensor metadata. I was involved in developing a provenance management system for a continuous flow forecasting system. The generation of predicted river flows involves complex interactions between instruments, simulation models, computational facilities and data providers. Correct interpretation of information produced at various stages of the information life-cycle requires detailed knowledge of data creation and transformation processes. Such provenance information allows hydrologists and decision-makers to make sound judgments about the trustworthiness of hydrological information.

This project won the Asia Pacific ICT (APICTA) Award for Sustainability and Green IT, 2012 and the Australian iAward for Green IT and Sustainability, 2011.

Database Wiki

Both relational databases and wikis have strengths and weaknesses for use in collaborative data management and data curation. Relational databases offer many advantages such as scalability, query optimization and concurrency control, but are not easy to use and lack other features needed for collaboration. Wikis have proved enormously successful as a means to collaborate because they are easy to use, encourage sharing, and provide built-in support for archiving, history-tracking and annotation. However, wikis lack support for structured data, efficiently querying data at scale, and fine-grained data provenance. To achieve the best of both worlds, we implemented a general-purpose platform for collaborative data management, called DBWiki. Our system not only facilitates the collaborative creation of a structured database; it also provides features not usually provided by database technology such as versioning, provenance tracking, citability, and annotation.

This is joint work with Peter Buneman, James Cheney, and Sam Lindley. DBWiki has been demonstrated at SIGMOD 2011.

XArch - XML Archiver

XArch is an archive management system that allows one to create, populate, and query archives of multiple database versions. XArch is based on a nested merge approach that efficiently stores multiple database versions in a compact archive. The system allows one to create new archives, to merge new versions of data into existing archives, and to execute both snapshot and temporal queries using a declarative query language. XArch has an extensible IO layer and is currently capable of archiving data in XML format as well as relational databases.
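To give a rough feel for the nested merge idea, here is a minimal Python sketch. It is not XArch's implementation or query language: the Node class, the functions merge, snapshot, and history, and the sample data are all hypothetical, and the sketch assumes each database version is a nested dictionary whose keys identify elements. Every node records the versions in which it occurs, so parts that do not change between versions are stored only once.

```python
from typing import Any, Dict, List, Set


class Node:
    """One archive node: the versions it occurs in, plus children or leaf values."""

    def __init__(self) -> None:
        self.versions: Set[int] = set()
        self.children: Dict[str, "Node"] = {}
        self.values: Dict[Any, Set[int]] = {}  # leaf value -> versions holding it


def merge(archive: Node, data: Any, version: int) -> None:
    """Merge one database version into the archive, sharing unchanged parts."""
    archive.versions.add(version)
    if isinstance(data, dict):
        for key, child in data.items():
            merge(archive.children.setdefault(key, Node()), child, version)
    else:
        archive.values.setdefault(data, set()).add(version)


def snapshot(archive: Node, version: int) -> Any:
    """Snapshot query: reconstruct the database as it was at one version."""
    for value, versions in archive.values.items():
        if version in versions:
            return value
    return {
        key: snapshot(child, version)
        for key, child in archive.children.items()
        if version in child.versions
    }


def history(archive: Node, path: List[str]) -> Dict[Any, List[int]]:
    """Temporal query: every value seen at a path, with the versions holding it."""
    node = archive
    for key in path:
        node = node.children[key]
    return {value: sorted(versions) for value, versions in node.values.items()}


# Hypothetical sample data: two versions that differ in a single leaf value.
archive = Node()
merge(archive, {"item": {"label": "a", "count": 1}}, version=1)
merge(archive, {"item": {"label": "a", "count": 2}}, version=2)
print(snapshot(archive, 1))
print(history(archive, ["item", "count"]))
```

In this sketch the unchanged label is stored once and tagged with both versions, while only the changed count carries two values.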

This is joint work with Peter Buneman and Ioannis Koltsidas. XArch has been demonstrated at SIGMOD 2008 and is available for free download.


This is a list of selected publications.


A Use Case in Semantic Modelling and Ranking for the Sensor Web

Liliana Cabral, Michael Compton, Heiko Müller

International Semantic Web Conference (ISWC), 2014


Towards Content-Aware SPARQL Query Caching for Semantic Web Applications

Yanfeng Shu, Michael Compton, Heiko Müller, Kerry Taylor

Web Information Systems Engineering (WISE), 2013

From RESTful to SPARQL: A Case Study on Generating Semantic Sensor Data

Heiko Müller, Liliana Cabral, Ahsan Morshed, Yanfeng Shu

SSN@ISWC 2013: 51-66

Link My Data: Community-based Curation of Environmental Sensor Data

Heiko Müller, Chris Peters, Peter Taylor, Andrew Terhorst

Intl. Symposium on Spatial and Temporal Databases (SSTD), Demo Track, 2013


Discovering conditional inclusion dependencies

Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann

ACM Conf. on Information and Knowledge Management (CIKM), 2012

Improving data quality by source analysis

Heiko Müller, Johann-Christoph Freytag, Ulf Leser

ACM J. Data and Information Quality, Vol. 2, Issue 4, March 2012


The Database Wiki Project: A General-Purpose Platform for Data Curation and Collaboration

Peter Buneman, James Cheney, Sam Lindley, Heiko Müller

SIGMOD Record, Vol. 40, No. 3, September 2011

Using Links to prototype a Database Wiki

James Cheney, Sam Lindley, Heiko Müller

Symposium on Database Programming Languages (DBPL), Seattle, WA, 2011

DBWiki: A Structured Wiki for Curated Data and Collaborative Data Management

Peter Buneman, James Cheney, Sam Lindley, Heiko Müller

ACM International Conference on Management of Data (SIGMOD), Demo Track, 2011


Detecting Inconsistencies in Distributed Data

Wenfei Fan, Floris Geerts, Shuai Ma, Heiko Müller

IEEE International Conference on Data Engineering (ICDE), 2010


Curating the CIA World Factbook

Peter Buneman, Heiko Müller, Chris Rusbridge

International Journal of Digital Curation, Issue 3, Volume 4, 2009


XArch: Archiving Scientific and Reference Data

Heiko Müller, Peter Buneman, Ioannis Koltsidas

ACM International Conference on Management of Data (SIGMOD), Demo Track, 2008

Sorting Hierarchical Data in External Memory for Archiving

Ioannis Koltsidas, Heiko Müller, Stratis Viglas

Proceedings of the VLDB Endowment, Volume 1, Issue 1, 2008


Describing Differences between Databases

Heiko Müller, Johann-Christoph Freytag, Ulf Leser

ACM Conf. on Information and Knowledge Management (CIKM), 2006


Columba: An Integrated Database of Proteins, Structures, and Annotations

Kristian Rother, Silke Trißl, Heiko Müller, Thomas Steinke, Ina Koch, Robert Preissner, Cornelius Frömmel, Ulf Leser

BMC Bioinformatics, 6(1):8, 2005


Mining for Patterns in Contradictory Data

Heiko Müller, Ulf Leser, Johann-Christoph Freytag

ACM Workshop on Information Quality for Information Systems, 2004

COLUMBA: Multidimensional Data Integration of Protein Annotations

Kristian Rother, Heiko Müller, Silke Trissl, Ina Koch, Thomas Steinke, Robert Preissner, Cornelius Frömmel, Ulf Leser

Workshop on Data Integration in Life Sciences (DILS), 2004


Data Quality in Genome Databases

Heiko Müller, Felix Naumann, Johann-Christoph Freytag

Proceedings of the Conference on Information Quality (IQ), 2003

Problems, Methods, and Challenges in Comprehensive Data Cleansing

Heiko Müller, Johann-Christoph Freytag

HUB-IB-164, Humboldt University Berlin, 2003

Contact Me

Mail Address

Heiko Müller

Center for Data Science

60 5th Avenue, Room 622

New York, NY, 10011