Data Integration

73 papers with code • 0 benchmarks • 7 datasets

Data integration (also called information integration) is the process of consolidating data from a set of heterogeneous data sources into a single uniform data set (materialized integration) or a unified view of the data (virtual integration). Data integration pipelines involve subtasks such as schema matching, table annotation, entity resolution, value normalization, data cleansing, and data fusion. Application domains of data integration include data warehousing, data lakes, and knowledge base consolidation.
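The pipeline subtasks above can be sketched end to end on toy data. The following is a minimal, illustrative example, not a reference implementation: all records, column names, and matching rules are made up, and the entity-resolution step is a naive exact match on a normalized key.

```python
# Minimal sketch of a materialized data-integration pipeline:
# schema matching -> value normalization -> entity resolution -> data fusion.
# All source records, schemas, and rules below are illustrative assumptions.

def normalize(value):
    """Value normalization: trim whitespace and lowercase strings."""
    return value.strip().lower() if isinstance(value, str) else value

# Two heterogeneous sources describing the same real-world entity.
source_a = [{"Name": " Alice Smith ", "Phone": "555-0100"}]
source_b = [{"full_name": "alice smith", "email": "alice@example.org"}]

# Schema matching: map each source schema onto a shared target schema.
schema_map_a = {"Name": "name", "Phone": "phone"}
schema_map_b = {"full_name": "name", "email": "email"}

def to_target(record, schema_map):
    """Rename matched columns and normalize their values."""
    return {schema_map[k]: normalize(v) for k, v in record.items() if k in schema_map}

records = [to_target(r, schema_map_a) for r in source_a] + \
          [to_target(r, schema_map_b) for r in source_b]

# Entity resolution (naive): records sharing a normalized name are
# assumed to denote the same entity. Data fusion: merge attributes,
# keeping the first non-missing value seen for each one.
fused = {}
for rec in records:
    entity = fused.setdefault(rec["name"], {})
    for key, val in rec.items():
        entity.setdefault(key, val)

print(fused["alice smith"])
# {'name': 'alice smith', 'phone': '555-0100', 'email': 'alice@example.org'}
```

Real systems replace each of these steps with far more sophisticated components (e.g. learned schema matchers, probabilistic entity resolution, conflict-resolution rules in fusion), which is exactly what many of the papers below study.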

Most implemented papers

Elastic Coupled Co-clustering for Single-Cell Genomic Data

cuhklinlab/elasticC3 29 Mar 2020

Recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution, and datasets from multiple domains are now available, including datasets that profile different types of genomic features and datasets that profile the same type of genomic feature across different species.

A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning

DOE-NCI-Pilot1/CCLFeatureComparison 30 Apr 2020

By combining various cancer cell line (CCL) drug screening panels, the available data has grown significantly in size, making it possible to study how advances in deep learning can improve drug response predictions.

The scalable Birth-Death MCMC Algorithm for Mixed Graphical Model Learning with Application to Genomic Data Integration

wangnanwei/Birth-death-MCMC-Model-Selection 8 May 2020

Recent advances in biological research have seen the emergence of high-throughput technologies with numerous applications that allow the study of biological mechanisms at an unprecedented depth and scale.

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

yaoshuwang/SelNet-Estimation 20 May 2020

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion.

An Empirical Meta-analysis of the Life Sciences (Linked?) Open Data on the Web

maulikkamdar/LSLODQuery 7 Jun 2020

While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources.

Kernel learning approaches for summarising and combining posterior similarity matrices

acabassi/combine-psms 27 Sep 2020

Here we build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization

yueyu1030/SumGNN 4 Oct 2020

Furthermore, most previous work focuses on binary DDI prediction, whereas multi-typed DDI pharmacological effect prediction is a more meaningful but harder task.

BayReL: Bayesian Relational Learning for Multi-omics Data Integration

ehsanhajiramezanali/BayReL NeurIPS 2020

High-throughput molecular profiling technologies have produced high-dimensional multi-omics data, enabling systematic understanding of living systems at the genome scale.

Profiling Entity Matching Benchmark Tasks

wbsg-uni-mannheim/EntityMatchingTaskProfiler International Conference on Information & Knowledge Management 2020

In order to enable the exact reproducibility of evaluation results, matching tasks need to contain exactly defined sets of matching and non-matching record pairs, as well as a fixed development and test split.

GripNet: Graph Information Propagation on Supergraph for Heterogeneous Graphs

NYXFLOWER/GripNet 29 Oct 2020

Heterogeneous graph representation learning aims to learn low-dimensional vector representations of different types of entities and relations to empower downstream tasks.