The Mathematics of Data Integration and Data Quality

The Data Exchange Podcast: Ryan Wisnesky on how the mathematical theory of structures can be used in data integration, entity resolution, knowledge management and beyond.


In this episode of the Data Exchange, I speak with Ryan Wisnesky, CTO and co-founder of Conexus, a startup that builds techniques from mathematics into novel tools for data integration, data management, and knowledge management.


As listeners of this podcast and subscribers to our newsletter know, we have recently been highlighting the active community of companies, projects, and developers coalescing around data management, data pipelines, and DataOps. As Ryan points out, category theory, the mathematical theory of structures and systems of structures, has applications in many areas, such as data wrangling, data integration, entity resolution, and knowledge management. Category theory provides a framework for understanding when two objects might be equivalent, which leads to many interesting applications in data management:

    ❛ What category theory gives you in this context is like a modular way to separate the merging of data from the finding out what’s related to what. And so there’s these algebraic constructions that take you from things like “these two people are related”, to “here’s how I create a database in which you can’t distinguish them”.

    … So when you say entity resolution, as I was saying record linkage, those are the same concept. And so where category theory shines is putting those to work. It’s one thing to find, like, “Oh, these two people, they’re probably related because they have similar names,” and then making your database such that there are no queries that can distinguish them, because you wanted to say that they’re the same person. So category theory is very good at putting record links to work, and it gives you an optimally merged database, relative to your record links.
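To make the separation Ryan describes a bit more concrete, here is a minimal Python sketch of the "merging" half only: given record links that some other process has already found, it builds a merged table in which linked records collapse into a single entity. This is not Conexus's tooling or the categorical construction itself. It is a plain union-find quotient, and the function name `merge_by_links`, the record ids, and the sample people are all hypothetical, chosen purely for illustration.

```python
from collections import defaultdict


def merge_by_links(records, links):
    """Collapse linked records into entities (illustrative sketch only).

    records: dict mapping a record id to a dict of attributes.
    links:   iterable of (id_a, id_b) pairs asserted to be the same entity.
    Returns (entity_of, merged): entity_of maps each record id to its
    entity id, and merged maps each entity id to its combined attributes.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent pointers to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller id becomes the representative of the merged class.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    entity_of = {rid: find(rid) for rid in records}

    # Union the attribute values of every record in the same class.
    combined = defaultdict(lambda: defaultdict(set))
    for rid, attrs in records.items():
        for key, value in attrs.items():
            combined[entity_of[rid]][key].add(value)

    merged = {
        entity: {key: sorted(values) for key, values in attrs.items()}
        for entity, attrs in combined.items()
    }
    return entity_of, merged


# Hypothetical data: a matcher decided records 1 and 2 are the same person.
records = {
    1: {"name": "Ann Lee"},
    2: {"name": "A. Lee"},
    3: {"name": "Bob Ray"},
}
links = [(1, 2)]
entity_of, merged = merge_by_links(records, links)
```

Because queries against `merged` only ever see entity ids, nothing downstream can tell records 1 and 2 apart, which is the property described in the quote. How the links were found (name similarity, a learned matcher, manual review) stays entirely outside this function.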

Access to high-quality data lies at the heart of modern AI applications. To that end, Ryan and his collaborators have written on how category theory can be used to improve data quality and thus help improve the performance of AI applications.



[Image by Gerd Altmann from Pixabay.]