Data quality is key to great AI products and services

The Data Exchange Podcast: Abe Gong on building tools that help teams improve and maintain data quality.

Subscribe: Apple • Android • Spotify • Stitcher • Google • RSS

In this episode of the Data Exchange, I speak with Abe Gong, CEO and co-founder at Superconductive, a startup founded by the team behind the Great Expectations (GE) open source project. GE is one of a growing number of tools aimed at improving data quality through validation and testing. Other projects in this area include TensorFlow Data Validation, assertr, dataframe-rules-engine, deequ, data-describe, and Apache Griffin.

Download the 2021 Trends in Data and AI Report, and learn emerging trends in Data, Machine Learning, and AI.

In this episode, Abe provided an overview of the GE open source project and its growing community of users and contributors. We also discussed a variety of topics in data engineering and DataOps, including data quality, pipelines, and data integration. Abe described what led them to start the GE project:

    ❛ We originally conceived of it as data testing. In the software world, if you build complex systems, and you don’t test them, you just know they will break. Ever since CI/CD became a thing about 15 years ago, this has become table stakes for serious software engineers. So we saw ourselves as enabling the same kind of thing for data engineering and data science.

    There’s one twist, which is that in the data world, code changes, but data changes much more frequently. As a data scientist or data engineer, you usually don’t control all of the inputs to your system. So the way Great Expectations is written, you can use it to wrap pipelines in tests as the code changes. But it’s actually usually more important to test the data as it changes.

    The fundamental concept is an Expectation, which is an assertion about how data should look or how it should work. That can be everything from schema checks, to checks about what type a column is, or what the column names are. But it also includes tests for the contents of cells, which can include regular expressions or statistical tests.

    You can build up a pretty complex vocabulary of things like correlations and relationships among columns, or across tables. For example, a very common one is to assert that the row counts between various tables don’t deviate by too much. So as I go from Table A to Table B, it’s okay to lose a little bit of data during transformation. But at the end of the day, I should have 98% of the same number of rows.
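The checks Abe describes can be sketched in a few lines of plain Python. This is a toy illustration of the Expectation concept, not Great Expectations' actual API; the function names, arguments, and return shapes below are hypothetical stand-ins.

```python
import re

def expect_row_count_ratio(table_a, table_b, min_ratio=0.98):
    """Assert that table_b retains at least min_ratio of table_a's rows
    (e.g. "a little" data may be lost in a transformation, but not much)."""
    ratio = len(table_b) / len(table_a)
    return {"success": ratio >= min_ratio, "observed_ratio": ratio}

def expect_column_values_to_match_regex(rows, column, pattern):
    """Assert that every value in `column` matches the regular expression."""
    regex = re.compile(pattern)
    failures = [r[column] for r in rows if not regex.fullmatch(str(r[column]))]
    return {"success": not failures, "unexpected_values": failures}

# Example: a transformation from table_a to table_b drops a malformed row.
table_a = [{"id": i, "zip": z}
           for i, z in enumerate(["02139", "10001", "94103", "60601", "7310"])]
table_b = [r for r in table_a if len(r["zip"]) == 5]

# Only 4 of 5 rows survive (80%), so the 98% expectation fails.
print(expect_row_count_ratio(table_a, table_b))
# All remaining ZIP codes are five digits, so the regex expectation passes.
print(expect_column_values_to_match_regex(table_b, "zip", r"\d{5}"))
```

Each check returns a small result dict rather than raising, mirroring the way data-validation tools typically report which rows or values violated an expectation instead of halting the pipeline outright.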

A few years ago, Ihab Ilyas and I wrote a post describing the emergence of machine learning tools for data quality. We highlighted the academic project HoloClean, a probabilistic cleaning framework for ML-based automatic error detection and repair. In addition to machine learning, today we’re seeing the rise of exciting new tools that use knowledge graphs and category theory.

Subscribe to our Newsletter:
We also publish a popular newsletter where we share highlights from recent episodes, trends in AI / machine learning / data, and a collection of recommendations.

Related content and resources:


[Image by Sipotek-Visual-Inspection-Machine from Wikimedia.]