I am a Research Engineer at NYU's Center for Data Science, developing tools and infrastructure to support data scientists. I have a research interest in databases, in particular data curation, data integration, data quality and sensor data management and analytics. Bioinformatics is another interest of mine, in particular genome assembly and annotation.

I received my Diploma in Computer Science from the Technical University of Berlin. I was a member of the Berlin-Brandenburg Graduate School on Distributed Information Systems and received my PhD from Humboldt University Berlin. The main focus of my dissertation was on quality of genome data and on conflict resolution in data integration. I spent four years as a Research Fellow in the Database Group at the University of Edinburgh, working on aspects of data curation (e.g., managing evolving databases, provenance management, and annotation). Before joining NYU, I was a Research Scientist at CSIRO, where I led a team of Research Scientists and Engineers working on aspects of metadata management and interoperability.


This is a selection of projects I have worked on over the past years.

Vizier - Streamlined Data Curation

Vizier aims to streamline data curation and enable domain experts who do not have computer science expertise to curate their own data. Vizier features an intuitive interface combining elements of notebooks and spreadsheets, allowing analysts to quickly see, edit, and revise data. This capability is complemented by a framework for automated data cleaning steps that are seamlessly integrated with manual curation operations. The heart of Vizier is a system for managing uncertainty and provenance of curation workflows and data, enabling the user to keep track of higher-level curation operations as well as track the lineage of data.

This project is a collaboration with Juliana Freire (NYU), Oliver Kennedy (University at Buffalo), and Boris Glavic (Illinois Institute of Technology). An overview of project ideas can be found in our paper at HILDA 2016.

Standard Cortical Observer

The Standard Cortical Observer (SCO) provides a platform to support reproducible and comparable research results on computational models that predict brain responses to arbitrary sensory inputs. As part of the project, three major components were developed: (1) a computational model (The Standard Cortical Observer Model 1.0) of how visual images are transformed into patterns of activity in the visual parts of the human brain, (2) a standard dataset of images and human fMRI measurements for validation of the SCO model and future models, similar to standard datasets used for machine vision and machine learning, and (3) a computational platform including a data store, web API, and graphical user interface.

The project is a collaboration with Jonathan Winawer and Noah Benson from the NYU Department of Psychology and Center for Neural Science. The work was funded in part by an MSDSE Data Science Seed Grant. All system components are available on GitHub: SCO Model, Data Store, Engine, Web Service, Worker, Python Client, and User Interface.

i-EKbase - Intelligent Environmental Knowledgebase

The i-EKbase system is designed to monitor large farming areas using remote sensor data. The main source of input is data from the Landsat and MODIS satellites. i-EKbase integrates remote sensing data with various other sources (e.g., Bureau of Meteorology weather observations and Australian Soil Data). Farmers are provided with information and guidance related to the local biodiversity, soil quality, water availability, irrigation, topography, and early pest and plant disease prevention to improve crop yield management.

This is joint work with Ritaban Dutta. The system is the foundation for iekbase.com.

Semantic Sensor Data

Semantic enrichment of sensor data addresses the problems of (re-)use, integration, and discovery. A critical issue is how to generate semantic sensor data from existing data sources. In this project, we developed an approach to semantically augment an existing sensor data infrastructure to re-publish the data as Linked Open Data. In our use case we show how semantic sensor data can help with the growing challenge of selecting sensors that are fit for purpose.

This is joint work with Liliana Cabral, Michael Compton, Ahsan Morshed, and Yanfeng Shu. The work has been presented at SSN 2013 and ISWC 2014.

Link My Data

One of the biggest obstacles to reuse of third-party sensor data is a lack of knowledge about data properties (e.g., provenance and quality), leading to a lack of trust in the data. Link My Data is a first step towards overcoming this problem. Link My Data provides a platform for data curation that allows users to share knowledge about individual sensors and sensor observations. The system supports annotation and transformation of sensor data on the Web to improve data quality and (re-)usability.

Link My Data has been demonstrated at SSTD 2013 (Poster).

South Esk Hydrological Sensor Web

Limited freshwater resources in many parts of Australia have led to a highly regulated system of water allocation. Poor situation awareness can result in over-extraction of water from river systems, compromising river ecosystems. To increase situation awareness, we developed a continuous flow forecasting system based on the Open Geospatial Consortium Sensor Web Enablement standards. A prototype Hydrological Sensor Web has been established in the South Esk river catchment in north-eastern Tasmania. Observations from the aggregated sensor assets drive a rainfall-runoff model that predicts river flows at key monitoring points in the catchment.

The South Esk Hydrological Sensor Web was our test-bed for research on management and re-use of sensor observations and sensor metadata. I was involved in developing a provenance management system for the continuous flow forecasting system. The generation of predicted river flows involves complex interactions between instruments, simulation models, computational facilities and data providers. Correct interpretation of information produced at various stages of the information life-cycle requires detailed knowledge of data creation and transformation processes. Such provenance information allows hydrologists and decision-makers to make sound judgments about the trustworthiness of hydrological information.

This project won the Asia Pacific ICT (APICTA) Award for Sustainability and Green IT, 2012 and the Australian iAward for Green IT and Sustainability, 2011.

Database Wiki

Both relational databases and wikis have strengths and weaknesses for use in collaborative data management and data curation. Relational databases offer many advantages such as scalability, query optimization and concurrency control, but are not easy to use and lack other features needed for collaboration. Wikis have proved enormously successful as a means to collaborate because they are easy to use, encourage sharing, and provide built-in support for archiving, history-tracking and annotation. However, wikis lack support for structured data, efficient querying of data at scale, and fine-grained data provenance. To achieve the best of both worlds, we implemented a general-purpose platform for collaborative data management, called DBWiki. Our system not only facilitates the collaborative creation of a structured database; it also provides features not usually provided by database technology such as versioning, provenance tracking, citability, and annotation.

This is joint work with Peter Buneman, James Cheney, and Sam Lindley. DBWiki has been demonstrated at SIGMOD 2011.

XArch - XML Archiver

XArch is an archive management system that allows one to create, populate, and query archives of multiple database versions. XArch is based on a nested merge approach that efficiently stores multiple database versions in a compact archive. The system allows one to create new archives, to merge new versions of data into existing archives, and to execute both snapshot and temporal queries using a declarative query language. XArch has an extensible IO layer and is currently capable of archiving data in XML format as well as relational databases.

This is joint work with Peter Buneman and Ioannis Koltsidas. XArch has been demonstrated at SIGMOD 2008 and is available for free download.
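
To give a flavour of merge-based archiving, here is a minimal Python sketch. It is an illustration only and not XArch's actual code or query language: nested dictionaries with hashable leaf values stand in for keyed XML, and the names Node, merge, snapshot, and history are hypothetical. Each element is stored once and stamped with the set of versions in which it occurs, so unchanged data is shared across versions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One keyed element in the archive, stamped with the versions it appears in."""
    versions: set = field(default_factory=set)
    # Maps a leaf value to the set of versions in which the element held it.
    values: dict = field(default_factory=dict)
    # Maps a child key to its archived child node.
    children: dict = field(default_factory=dict)


def merge(node, data, version):
    """Merge one version of nested-dict data into the archive node, in place."""
    node.versions.add(version)
    if isinstance(data, dict):
        for key, child in data.items():
            merge(node.children.setdefault(key, Node()), child, version)
    else:
        # Leaf value (assumed hashable): record which versions carry it.
        node.values.setdefault(data, set()).add(version)


def snapshot(node, version):
    """Snapshot query: reconstruct the data as it looked in one version."""
    for value, stamps in node.values.items():
        if version in stamps:
            return value
    return {
        key: snapshot(child, version)
        for key, child in node.children.items()
        if version in child.versions
    }


def history(node, path):
    """Temporal query: map each value of the leaf at `path` to its versions."""
    for key in path:
        node = node.children[key]
    return {value: sorted(stamps) for value, stamps in node.values.items()}


# Hypothetical example data: two versions of a tiny country database.
archive = Node()
merge(archive, {"UK": {"capital": "London", "population": 60}}, 1)
merge(archive, {"UK": {"capital": "London", "population": 66},
                "FR": {"capital": "Paris"}}, 2)

first_version = snapshot(archive, 1)
population_history = history(archive, ["UK", "population"])
```

In this sketch the unchanged value "London" is stored once with the version stamps 1 and 2, while the population element keeps one entry per distinct value.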


This is a list of selected publications.


The exception that improves the rule

Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Müller

Workshop on Human-In-the-Loop Data Analytics (HILDA), 2016


A Use Case in Semantic Modelling and Ranking for the Sensor Web

Liliana Cabral, Michael Compton, Heiko Müller

International Semantic Web Conference (ISWC), 2014


Towards Content-Aware SPARQL Query Caching for Semantic Web Applications

Yanfeng Shu, Michael Compton, Heiko Müller, Kerry Taylor

Web Information Systems Engineering (WISE), 2013

From RESTful to SPARQL: A Case Study on Generating Semantic Sensor Data

Heiko Müller, Liliana Cabral, Ahsan Morshed, Yanfeng Shu

SSN@ISWC 2013: 51-66

Link My Data: Community-based Curation of Environmental Sensor Data

Heiko Müller, Chris Peters, Peter Taylor, Andrew Terhorst

Intl. Symposium on Spatial and Temporal Databases (SSTD), Demo Track, 2013


Discovering conditional inclusion dependencies

Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann

ACM Conf. on Information and Knowledge Management (CIKM), 2012

Improving data quality by source analysis

Heiko Müller, Johann-Christoph Freytag, Ulf Leser

ACM J. Data and Information Quality, Vol. 2, Issue 4, March 2012


The Database Wiki Project: A General-Purpose Platform for Data Curation and Collaboration

Peter Buneman, James Cheney, Sam Lindley, Heiko Müller

SIGMOD Record, Vol. 40, No. 3, September 2011

Using Links to prototype a Database Wiki

James Cheney, Sam Lindley, Heiko Müller

Symposium on Database Programming Languages (DBPL), Seattle, WA, 2011

DBWiki: A Structured Wiki for Curated Data and Collaborative Data Management

Peter Buneman, James Cheney, Sam Lindley, Heiko Müller

ACM International Conference on Management of Data (SIGMOD), Demo Track, 2011


Detecting Inconsistencies in Distributed Data

Wenfei Fan, Floris Geerts, Shuai Ma, Heiko Müller

IEEE International Conference on Data Engineering (ICDE), 2010


Curating the CIA World Factbook

Peter Buneman, Heiko Müller, Chris Rusbridge

International Journal of Digital Curation, Issue 3, Volume 4, 2009


XArch: Archiving Scientific and Reference Data

Heiko Müller, Peter Buneman, Ioannis Koltsidas

ACM International Conference on Management of Data (SIGMOD), Demo Track, 2008

Sorting Hierarchical Data in External Memory for Archiving

Ioannis Koltsidas, Heiko Müller, Stratis Viglas

Proceedings of the VLDB Endowment, Volume 1, Issue 1, 2008


Describing Differences between Databases

Heiko Müller, Johann-Christoph Freytag, Ulf Leser

ACM Conf. on Information and Knowledge Management (CIKM), 2006


Columba: An Integrated Database of Proteins, Structures, and Annotations

Kristian Rother, Silke Trißl, Heiko Müller, Thomas Steinke, Ina Koch, Robert Preissner, Cornelius Frömmel, Ulf Leser

BMC Bioinformatics, 6(1):8, 2005


Mining for Patterns in Contradictory Data

Heiko Müller, Ulf Leser, Johann-Christoph Freytag

ACM Workshop on Information Quality for Information Systems, 2004

COLUMBA: Multidimensional Data Integration of Protein Annotations

Kristian Rother, Heiko Müller, Silke Trissl, Ina Koch, Thomas Steinke, Robert Preissner, Cornelius Frömmel, Ulf Leser

Workshop on Data Integration in Life Sciences (DILS), 2004


Data Quality in Genome Databases

Heiko Müller, Felix Naumann, Johann-Christoph Freytag

Proceedings of the Conference on Information Quality (IQ), 2003

Problems, Methods, and Challenges in Comprehensive Data Cleansing

Heiko Müller, Johann-Christoph Freytag

HUB-IB-164, Humboldt University Berlin, 2003

Contact Me

Mail Address

Heiko Müller

Center for Data Science

60 5th Avenue, Room 760

New York, NY 10011