Petros Zerfos and Hima Patel on Simplifying AI Data Pipelines with IBM’s Data Prep Kit.
Petros Zerfos and Hima Patel of IBM Research are part of the team behind Data Prep Kit (DPK), an open-source toolkit that helps process and prepare raw text and code data at scale for use in large language model applications. We explore how Data Prep Kit handles text, code, and documents, and discuss its scalability, cloud-native architecture, and future enhancements. We also touch on DPK’s integration with popular tools, including Ray. [Ray Summit 2024 comes to San Francisco September 30-October 2. Use the code AnyscaleBen15 for a 15% discount when you register!]
Interview highlights – key sections from the video version:
- High-Level Basics of Data Preparation
- Core Functions of Data Prep Kit for Structured Data
- Capabilities of DPK in Document and Code Processing
- PDF Extraction and the Role of DPK
- Multimodal and Document Understanding with DPK
- Exploration of DPK’s Ray Integration
- DPK’s Flexibility and Integration with Vector Databases
- Using DPK for Large-Scale Data Processing
- DPK’s Scalability and Application in Different Modalities
- Developer Relations and Community Contributions
- Challenges and Future Directions for DPK
- Multilingual Capabilities and DPK
- Next Steps

Related content:
- A video version of this conversation is available on our YouTube channel.
- Inside the Data Strategies of Top AI Labs
- Choosing the Right Vector Search System
- Generative AI: Navigating the Challenges of Enterprise Adoption
- Chang She → Unlocking the Power of Unstructured Data
- Brian Raymond → ETL for LLMs
- Jerry Liu → An Open Source Data Framework for LLMs
