The Data Exchange Podcast: Abe Gong on building tools that help teams improve and maintain data quality.
In this episode of the Data Exchange, I speak with Abe Gong, CEO and co-founder of Superconductive, a startup founded by the team behind the Great Expectations (GE) open source project. GE is one of a growing number of tools that aim to improve data quality through validation and testing. Other projects in this area include TensorFlow DV, assertr, dataframe-rules-engine, deequ, data-describe, and Apache Griffin.
Abe provided an overview of the GE open source project and its growing community of users and contributors. We also discussed a variety of topics in data engineering and DataOps, including data quality, pipelines, and data integration. Abe described what led them to start the GE project:
- ❛ We originally conceived of it as data testing. In the software world, if you build complex systems, and you don’t test them, you just know they will break. Ever since CI/CD became a thing about 15 years ago, this has become table stakes for serious software engineers. So we saw ourselves as enabling the same kind of thing for data engineering and data science.
There’s one twist, which is that in the data world, code changes, but data changes much more frequently. As a data scientist or data engineer, you usually don’t control all of the inputs to your system. So the way Great Expectations is written, you can use it to wrap pipelines in tests as the code changes. But it’s actually usually more important to test the data as it changes.
The fundamental concept is an Expectation, which is an assertion about how data should look or how it should work. That can be everything from schema checks, to checks about what type is it, or what the column names are. But it also includes tests for the contents of cells, which can include regular expressions or statistical tests.
You can build up a pretty complex vocabulary of things like correlations and relationships among columns, or across tables. For example, a very common one is to assert that the row counts between various tables don’t deviate by too much. So as I go from Table A to Table B, it’s okay to lose a little bit of data during transformation. But at the end of the day, I should have 98% of the same number of rows.
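To make the idea of an Expectation concrete, here is a minimal sketch in plain Python of the kinds of checks Abe describes: a schema check, a type check, a regular-expression check on cell contents, and a row-count comparison between two tables. This is an illustration only and does not use the Great Expectations API; the function names, the sample tables, and the `min_ratio` parameter are all hypothetical, with the 98% threshold taken from the example in the quote.

```python
import re

# Hypothetical helpers for illustration; these are NOT Great Expectations calls.
# A "table" here is simply a list of dicts, one dict per row.

def expect_columns_to_match(rows, expected_columns):
    """Schema check: every row has exactly the expected column names."""
    expected = set(expected_columns)
    return all(set(row) == expected for row in rows)

def expect_column_values_of_type(rows, column, expected_type):
    """Type check: every value in the column is an instance of expected_type."""
    return all(isinstance(row[column], expected_type) for row in rows)

def expect_column_values_to_match_pattern(rows, column, pattern):
    """Content check: every value in the column fully matches the regex."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(str(row[column])) for row in rows)

def expect_row_count_ratio_at_least(source_rows, target_rows, min_ratio=0.98):
    """Cross-table check: the target keeps at least min_ratio of the source rows."""
    if not source_rows:
        # Assumption: an empty source is only consistent with an empty target.
        return len(target_rows) == 0
    return len(target_rows) / len(source_rows) >= min_ratio

# Made-up sample data: Table A has 100 rows, Table B lost one row in transformation.
table_a = [{"id": i, "email": f"user{i}@example.com"} for i in range(100)]
table_b = table_a[:99]

results = {
    "schema": expect_columns_to_match(table_a, ["id", "email"]),
    "id_is_int": expect_column_values_of_type(table_a, "id", int),
    "email_format": expect_column_values_to_match_pattern(
        table_a, "email", r"[^@\s]+@[^@\s]+\.[a-z]+"
    ),
    "row_count_a_to_b": expect_row_count_ratio_at_least(table_a, table_b),
}

for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

Because each check takes the data as its input and returns a simple pass or fail, the same set of checks can be rerun every time a new batch of data arrives, not only when the pipeline code changes, which is the distinction Abe draws above.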
A few years ago, Ihab Ilyas and I wrote a post describing the emergence of machine learning tools for data quality. We highlighted the academic project HoloClean, a probabilistic cleaning framework for ML-based automatic error detection and repair. In addition to machine learning, today we’re seeing the rise of exciting new tools that use knowledge graphs and category theory.
Related content:
- A video version of this conversation is available on our YouTube channel.
- Data Cascades: Why we need feedback channels throughout the machine learning lifecycle
- One Simple Chart: Data Engineering jobs in the U.S.
- Ryan Wisnesky: “The Mathematics of Data Integration and Data Quality”
- Mayank Kejriwal: “Building and deploying knowledge graphs”
- Ihab Ilyas and Ben Lorica: “The quest for high-quality data”
- Assaf Araki and Ben Lorica: “The Growing Importance of Metadata Management Systems”
- Sonal Goyal and Ben Lorica in conversation with Jenn Webb: “Creating Master Data at Scale with AI”
[Image by Sipotek-Visual-Inspection-Machine from Wikimedia.]