Building domain-specific natural language applications

The Data Exchange Podcast: David Talby on Spark NLP and turning NLP research into enterprise solutions.

Subscribe: iTunes, Android, Spotify, Stitcher, Google, and RSS.

In this episode of the Data Exchange, I speak with David Talby, co-creator of Spark NLP, an open source, highly scalable, production-grade natural language processing (NLP) library. Spark NLP has become one of the more popular NLP libraries and is available on PyPI, Conda, Maven, and Spark Packages. With recent research advances in large-scale natural language models, there is strong interest in domain-specific natural language applications. Besides their work on Spark NLP, David and his collaborators are building natural language models tuned specifically for healthcare applications.

Our conversation spanned many topics, including:

  • Spark NLP: its current status and some common and surprising use cases.
  • Recent developments in NLP research and their implications for companies.
  • Spark NLP for Healthcare
  • [Full transcript of our conversation is below.]

Short excerpt:

Ben: I last spoke with you around November 2017, at the time when Spark NLP was just being unveiled. So, in this podcast, I’d like to get an update because the project has made many strides over the last two-plus years. So, first of all, at a high level, give us a sense of the ecosystem—users, contributors, and things like that.

David: When we last spoke, Spark NLP was just starting out, and I’d say we’ve been very pleased with the adoption within the community. Since 2017, we’ve been releasing software at least every two weeks. We had somewhere between 50 and 60 releases in the last couple of years, which means if you use the library today, you can basically assume that someone has tested it and debugged it on whatever infrastructure you’re running it on: any cloud structure, any private combination, any Java or Python development environment.

We have 33 contributors, the last I checked. We have a very active community. On Slack, we talk to users every day. Because it’s an Apache-licensed project, we don’t really know how many people use it, but we did see that on PyPI in September there were 50,000 downloads. That’s downloads per month, and that’s only on PyPI. We’re also on Maven and Spark Packages, and there are other ways to get the library as well.

Ben: So, of course, when you build an open source project like this, you can’t really anticipate the types of use cases and applications. What are some of the notable, and also surprising, use cases you’ve come across over the last two years—and are there specific verticals or domains where you think Spark NLP is particularly gaining a lot of followers?

David: We know in open source there’s a lot of use in finance and insurance industries. In financial systems, there’s a lot of need for NLP, high-accuracy NLP—not only to apply the new deployment techniques, but also to be able to do it at scale. So, these are not play projects, but are actually coordinated to work on large amounts of data, sometimes in real time.

[Photo from Free Stock A woman sitting in a library.]