The Data Exchange Podcast: Marco Ribeiro on why accuracy on benchmarks is not sufficient for evaluating NLP models.
In this episode of the Data Exchange I speak with Marco Ribeiro, Senior Researcher at Microsoft Research, and lead author of the award-winning paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList”. As machine learning gains importance across many application domains and industries, there is a growing need to formalize how ML models get built, deployed, and used. MLOps is an emerging set of practices for productionizing the machine learning lifecycle, one that draws ideas from CI/CD. But even before we talk about deploying a model to production, how do we inject more rigor into the model development process?
Marco and his collaborators address this question – at least in the context of natural language models – in their well-received paper. Recall that NLP model training typically follows a simple process: split your data into training and validation sets, build a model on the training set, and measure its efficacy on the validation set. As Marco and his collaborators point out:
- While performance on held-out data is a useful indicator, held-out datasets are often not comprehensive, and contain the same biases as the training data, such that real-world performance may be overestimated. Further, by summarizing the performance as a single aggregate statistic, it becomes difficult to figure out where the model is failing, and how to fix it.
CheckList is an open source project for testing your NLP models. Behavioral or black-box testing is a longstanding testing methodology that focuses on validating input-output behavior of a software system.
- If you are testing a piece of software and you have access to the implementation, you might write tests that fit the implementation well, and end up not testing cases where the implementation doesn’t fit well. But in behavioral testing you treat it as a black box: you only know what to expect given an input. … In ML, we assume the model is a black box and we test different behaviors: what happens if I have the following examples? What happens if I perturb an example in a certain way? Does the model behave in the way I expect?
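The paper makes this concrete with test types such as the Minimum Functionality Test (MFT): small input/expected-output pairs that probe one capability without ever looking at the model's internals. Here is a minimal sketch in plain Python; `toy_model` is a hypothetical stand-in for any text classifier (CheckList itself ships much richer tooling):

```python
# Minimal sketch of a behavioral (black-box) test for a sentiment model.
# `toy_model` is a hypothetical stand-in: any callable mapping text -> label
# can be dropped in, since the test never inspects the implementation.
def toy_model(text: str) -> str:
    negative_words = {"terrible", "awful", "bad", "worst"}
    words = set(text.lower().replace(".", "").split())
    return "negative" if words & negative_words else "positive"

# Minimum Functionality Test (MFT): input/expected-output pairs that probe
# one capability, regardless of how the model works inside.
mft_cases = [
    ("The service was terrible.", "negative"),
    ("I loved the food.", "positive"),
]

failures = [(text, expected) for text, expected in mft_cases
            if toy_model(text) != expected]
print(f"{len(failures)} of {len(mft_cases)} MFT cases failed")  # 0 of 2
```

The same harness works for any model that exposes a predict function, which is the point of black-box testing: the test suite outlives any particular implementation.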
CheckList provides tools for testing natural language models across many different capabilities including:
- Vocabulary+POS (important words or word types for the task)
- Taxonomy (synonyms, antonyms, etc.)
- Robustness (to typos, irrelevant changes, etc.)
- NER (appropriately understanding named entities)
- Temporal (understanding order of events)
- Role Labeling (understanding roles such as agent, object, etc.)
- Logic (ability to handle symmetry, consistency, and conjunctions)
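For the Robustness capability, the paper also describes invariance tests (INV): apply a label-preserving perturbation, such as a typo, and check that the prediction does not change. A self-contained sketch under the same toy-model assumption (the real library provides its own perturbation helpers):

```python
# Invariance test (INV) sketch: a typo should not flip the prediction.
# `toy_model` is a hypothetical stand-in for a real sentiment classifier.
def toy_model(text: str) -> str:
    return "negative" if "terrible" in text.lower() else "positive"

def typo_variants(text: str):
    """Yield copies of `text` with one adjacent-character swap per position."""
    for i in range(len(text) - 1):
        yield text[:i] + text[i + 1] + text[i] + text[i + 2:]

def invariance_failures(model, text: str):
    """Return perturbed inputs whose prediction differs from the original's."""
    original = model(text)
    return [t for t in typo_variants(text) if model(t) != original]

# The brittle substring match breaks on most typos inside "terrible":
fails = invariance_failures(toy_model, "This movie was terrible.")
print(len(fails) > 0)  # True: the toy model is not robust to typos
```

An aggregate accuracy number would never surface this kind of brittleness; a targeted invariance test makes it visible immediately.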
As machine learning and natural language models continue to grow in importance, companies need to inject more rigor into their model development, deployment, and monitoring processes. Borrowing from longstanding practices in software engineering, CheckList should be a welcome addition to the toolbox of any developer building natural language models.
Subscribe to our Newsletter:
We also publish a popular newsletter where we share highlights from recent episodes, trends in AI / machine learning / data, and a collection of recommendations.
- A video version of this conversation is available on our YouTube channel.
- Download the 2020 NLP Survey Report and learn how companies are using and implementing natural language technologies.
- Matthew Honnibal: “Building open source developer tools for language applications”
- Amy Heineike: “Machines for unlocking the deluge of COVID-19 papers, articles, and conversations”
- Alan Nichol: “Best practices for building conversational AI applications”
- Weifeng Zhong: “Using machine learning to detect shifts in government policy”
- Mayank Kejriwal: “Building and deploying knowledge graphs”
- Ameet Talwalkar: “Democratizing Machine Learning”