Generating high-fidelity and privacy-preserving synthetic data

Jinsung Yoon and Sercan Arik on a new, state-of-the-art neural architecture that is capable of representing diverse data modalities.

Subscribe: Apple • Spotify • Stitcher • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Jinsung Yoon (Senior Research Scientist) and Sercan Arik (Staff Research Scientist and Manager) are part of the Google team behind EHR-Safe, a set of tools for generating highly realistic and privacy-preserving synthetic Electronic Health Records.

Jinsung Yoon and Sercan Arik will be delivering a keynote at the Healthcare NLP Summit, a FREE online conference and the biggest gathering of NLP practitioners.

Anonymizing data with conventional methods can be a tedious and expensive process. The use of synthetic data opens up new possibilities for data sharing. Two properties are essential for synthetic data to be useful:

  1. The synthesized data are of high fidelity (e.g., a diagnostic model trained on them achieves downstream performance similar to one trained on the real data).
  2. The synthesized data meet certain privacy measures (i.e., they do not reveal a patient’s identity).
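As a rough illustration of how these two properties might be checked, the sketch below trains the same simple classifier once on real records and once on synthetic records, then compares accuracy on held-out real data. It also counts synthetic rows that are verbatim copies of real rows. This is not EHR-Safe's evaluation protocol: the function names, the nearest-centroid model, the toy Gaussian data, and the exact-copy check are all assumptions chosen to keep the example self-contained, and the "synthetic" rows here are simply a fresh sample standing in for a generator's output.

```python
import random
from statistics import mean


def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def fidelity_gap(real_x, real_y, synth_x, synth_y, test_x, test_y):
    """Accuracy when trained on real data minus accuracy when trained on
    synthetic data, both measured on the same held-out real test set."""
    acc_real = accuracy(fit_centroids(real_x, real_y), test_x, test_y)
    acc_synth = accuracy(fit_centroids(synth_x, synth_y), test_x, test_y)
    return acc_real - acc_synth


def exact_copies(real_rows, synth_rows):
    """Count synthetic rows that reproduce a real row verbatim.
    A crude privacy proxy only; it is not a formal privacy guarantee."""
    real_set = {tuple(r) for r in real_rows}
    return sum(tuple(r) in real_set for r in synth_rows)


def make_toy_rows(n, seed):
    """Hypothetical two-class toy data (not health records)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        center = 4.0 * label
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(label)
    return rows, labels


real_x, real_y = make_toy_rows(200, seed=1)
synth_x, synth_y = make_toy_rows(200, seed=2)  # stand-in for generator output
test_x, test_y = make_toy_rows(200, seed=3)

print("fidelity gap:", fidelity_gap(real_x, real_y, synth_x, synth_y, test_x, test_y))
print("exact copies:", exact_copies(real_x, synth_x))
```

A gap near zero suggests the synthetic data support the downstream task about as well as the real data, and a nonzero copy count would be an immediate warning sign. A real evaluation would use stronger models and more rigorous privacy metrics than this exact-match check.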

In this episode, we discuss their new, state-of-the-art neural architecture that is capable of representing diverse data modalities while maintaining data privacy. While EHR-Safe targets a very specific domain and data artifact (Electronic Health Records), we explore possible extensions to structured data in other domains (e.g., financial services), as well as extensions to different data types (visual data and text).

FREE Online conference: April 4-5

If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.