Data Augmentation in Natural Language Processing

The Data Exchange Podcast: Ed Hovy and Steven Feng on current challenges and future directions for research in data augmentation and in natural language models.


Subscribe: Apple • Android • Spotify • Stitcher • Google • RSS.

This week’s guests are Steven Feng, Graduate Student, and Ed Hovy, Research Professor, both from the Language Technologies Institute at Carnegie Mellon University. We discussed their recent survey paper on Data Augmentation Approaches in NLP (GitHub). Data augmentation is an active field of research covering techniques for increasing the diversity of training examples without explicitly collecting new data. One key reason such strategies are important is that augmented data can act as a regularizer, reducing overfitting when training models.

We discussed current challenges and future directions for research in data augmentation for NLP. I also took the opportunity to discuss broad trends in NLP and Language Technologies with Steven and Ed, including:

  • The rise of Large Language Models (LLMs)
  • Challenges posed by LLMs, specifically for academic researchers who may not have access to massive compute resources or big data sets.
  • The role of benchmarks in NLP and machine learning research
  • DARPA and basic research: For some additional context, at the time of our conversation Ed was on a two-year leave from CMU and working at DARPA guiding several NLP programs.

    Steven Feng: So there are several things that make a good or ideal data augmentation technique. Usually, there’s a trade-off between the ease of implementation and usage versus the performance benefits. Further, the augmented data should usually have a balanced distribution that is neither too similar to nor too different from the original data. The former may result in overfitting, and the latter will result in augmented data that is not representative of the given domain at all. Data augmentation has typically been explored more in computer vision, through very simple techniques like color jittering, rotation, flipping, cropping images, and so forth, and has seen slower adoption in NLP. The main reason for this is likely the challenges presented by the discrete nature of language data, which makes it harder to maintain the desired invariances. In fact, these desired invariances are themselves less obvious for NLP than for vision. In vision, one can think of very simple ones like translation invariance, rotation invariance, illumination invariance, and so forth; they’re definitely less obvious for NLP or text. It’s also much harder to encode these invariances either directly into the model or as a lightweight module that you can apply stochastically during training. For NLP, you usually generate the augmented data beforehand, store it offline, and load it during training.
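
To make the offline workflow Steven describes more concrete, here is a minimal Python sketch that generates augmented copies of sentences ahead of training. It is only an illustration: the two token-level perturbations (random swap and random deletion) are stand-ins chosen for simplicity and are not presented as the guests' recommended method, and the function names and parameters (`random_swap`, `random_delete`, `augment_offline`, `n_copies`, `p_delete`) are hypothetical. It also assumes whitespace tokenization and that these perturbations leave the label unchanged, which will not hold for every task.

```python
import random


def random_swap(tokens, rng):
    """Return a copy with two randomly chosen positions swapped."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment_offline(sentences, n_copies=2, p_delete=0.1, seed=0):
    """Build augmented sentences before training (hypothetical helper).

    The result would typically be written to disk and loaded alongside the
    original data, rather than produced on the fly during training.
    """
    rng = random.Random(seed)  # seeded so the stored data is reproducible
    augmented = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        if not tokens:
            continue
        for _ in range(n_copies):
            perturbed = random_delete(random_swap(tokens, rng), rng, p_delete)
            augmented.append(" ".join(perturbed))
    return augmented


if __name__ == "__main__":
    originals = ["the cat sat on the mat", "data augmentation helps"]
    for line in augment_offline(originals, n_copies=2, seed=42):
        print(line)
```

A small `p_delete` and a single swap keep the augmented text close to the original, while larger values push it further away; this is the "neither too similar nor too different" balance mentioned above, and the right setting is task dependent.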

    Ed Hovy: There’s a new DARPA program being discussed called Learning with Less Labeling (LwLL) that is exactly along the lines of what we’re talking about today. The goal is to scale up a problem with a hundredfold less data. That is, to say: I’m going to give you the same question, but instead of translating between two languages, you’re translating between 20 languages or even 200 languages, and with less data.

2021 NLP Survey

The 2021 NLP Industry Survey is now open and we need your help. The survey takes less than 5 minutes to fill out and in exchange we’ll send you a copy of the survey results + a FREE pass to the 2021 NLP Summit (a virtual conference slated for October).


[Photo by Ludomił Sawicki on Unsplash.]