The Data Exchange

Generating high-fidelity and privacy-preserving synthetic data

Jinsung Yoon and Sercan Arik on a new, state-of-the-art neural architecture that is capable of representing diverse data modalities.


Subscribe: Apple • Spotify • Stitcher • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Jinsung Yoon (Senior Research Scientist) and Sercan Arik (Staff Research Scientist and Manager) are part of the Google team behind EHR-Safe, a set of tools for generating highly realistic and privacy-preserving synthetic Electronic Health Records.

Jinsung Yoon and Sercan Arik will be delivering a keynote at the Healthcare NLP Summit, a FREE online conference and the biggest gathering of NLP practitioners.

Anonymizing data with conventional methods can be a tedious and expensive process. The use of synthetic data opens up new possibilities for data sharing. Two properties are essential for synthetic data to be useful:

  1. The synthesized data are of high fidelity (e.g., a diagnostic model trained on them performs about as well downstream as one trained on the real data).
  2. The synthesized data meet certain privacy measures (i.e., they do not reveal a patient’s identity).
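
To make the two properties concrete, here is a minimal sketch of how each might be checked on toy data. It is not EHR-Safe's evaluation procedure: the record generator, the nearest-centroid "diagnostic model", and the names `make_records`, `fidelity_gap`, and `exact_copies` are all assumptions made for illustration, and the "synthetic" set is simply a fresh sample standing in for a generator's output.

```python
import random
from statistics import mean


def make_records(n, seed):
    """Toy stand-in for patient records: ((feature_1, feature_2), label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label else -2.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        rows.append((features, label))
    return rows


def fit_centroids(rows):
    """Hypothetical 'diagnostic model': one mean feature vector per label."""
    centroids = {}
    for label in {lab for _, lab in rows}:
        points = [f for f, lab in rows if lab == label]
        centroids[label] = tuple(mean(dim) for dim in zip(*points))
    return centroids


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid carries the true label."""
    hits = sum(
        min(centroids, key=lambda lab: sq_dist(centroids[lab], f)) == label
        for f, label in rows
    )
    return hits / len(rows)


def exact_copies(synthetic, real):
    """Crude privacy check: count synthetic rows identical to a real row."""
    real_set = set(real)
    return sum(row in real_set for row in synthetic)


real_train = make_records(500, seed=0)
real_test = make_records(500, seed=1)
synthetic = make_records(500, seed=2)  # placeholder for a generator's output

# Fidelity: train on real vs. train on synthetic, evaluate both on real data.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic), real_test)
fidelity_gap = abs(acc_real - acc_synth)

# Privacy: no synthetic record should reproduce a real training record.
copies = exact_copies(synthetic, real_train)

print(f"real-trained accuracy:      {acc_real:.3f}")
print(f"synthetic-trained accuracy: {acc_synth:.3f}")
print(f"fidelity gap:               {fidelity_gap:.3f}")
print(f"exact copies of real rows:  {copies}")
```

A small fidelity gap suggests the synthetic data supports the downstream task about as well as the real data does. The exact-copy count is only the weakest possible privacy check; practical privacy evaluations are considerably more involved.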

In this episode we discuss their new, state-of-the-art neural architecture that is capable of representing diverse data modalities while maintaining data privacy. While EHR-Safe targets a very specific domain and data artifact (Electronic Health Records), we explore possible extensions to structured data in other domains (e.g., financial services), as well as extensions to different data types (visual data and text).


FREE Online conference: April 4-5



Interview highlights – key sections from the video version:

  1. Electronic Health Records, and data formats covered by EHR-Safe
  2. What problems and challenges does EHR-Safe address?
  3. Implementation details
  4. Stationarity assumptions
  5. Data augmentation
  6. What developments in machine learning led to EHR-Safe?
  7. Evaluating synthetic data
  8. Limitations of synthetic data
  9. Synthetic data in the context of other tools for confidential computing
  10. Synthetic data for text (doctor’s notes) and medical images
  11. How the healthcare community has reacted to EHR-Safe
  12. How tools developed for EHR-Safe may translate to other domains beyond healthcare
  13. An update from Sercan on the use of deep learning for tabular data problems
  14. Trends to watch in synthetic data generation

 

If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.
