The Data Exchange

Unlocking Unstructured Data with LLMs

Shreya Shankar on Semantic Extraction, DocETL Pipelines, and Enterprise Applications.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Shreya Shankar is a PhD student at UC Berkeley in the EECS department. This episode explores how Large Language Models (LLMs) are revolutionizing the processing of unstructured enterprise data like text documents and PDFs. It introduces DocETL, a framework using a MapReduce approach with LLMs for semantic extraction, thematic analysis, and summarization at scale. The discussion covers the DocETL workflow, system design considerations, practical use cases, validation techniques, and the future of LLM-powered data pipelines. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]

Subscribe to the Gradient Flow Newsletter

Interview highlights – key sections from the video version:

  1. Shreya’s Background and the Unstructured Data Challenge
  2. Traditional NLP vs. LLM-based Approaches
  3. Introducing the DocETL Framework
  4. Non-Determinism and Creative Data Tasks
  5. Enterprise Pipelines and Architecture Considerations
  6. Integration with Other Tools and Plugins
  7. Observability, Guardrails, and Data Validation
  8. Advanced Reasoning Models in Data Workflows
  9. Fine-Tuning, Multiple LLMs, and Use-Case Variations
  10. Expanding to Multi-Modal Processing
  11. Comparing DocETL with Similar Systems
  12. Scaling Semantic Pipelines and Cost Trade-Offs
  13. Closing Thoughts and Future Directions






Support our work by subscribing to our newsletter📩


Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Understanding Unstructured Data and LLM Solutions

What fundamental challenge in enterprise data processing are you tackling?

For decades, enterprises have struggled to make sense of unstructured data like text documents, PDFs, images, and videos. Historically, there was no effective technology to automate this process at scale. The massive amount of text data that businesses need to analyze has been difficult to process without the right tools. LLMs now enable extraction of semantic data from these unstructured sources, addressing this long-standing gap in data processing capabilities.

How does the LLM-based approach differ from traditional methods for handling unstructured text?

Traditional approaches typically followed two paths: building bespoke pipelines using specialized NLP libraries (where engineers would train models for specific tasks), or using time-consuming crowdsourcing for data annotation. LLMs fundamentally change this paradigm by allowing for semantic extraction and analysis without the need for task-specific models or extensive manual annotation. This makes processing unstructured data more accessible, flexible, and often simpler to implement.

What is semantic data processing in this context?

Semantic data processing involves extracting meaningful information from text beyond keywords or patterns. It enables tasks like identifying themes across documents, aggregating information based on semantic similarity, and generating insights. The key innovation is the ability to define programmatically what constitutes a “theme” through LLM prompts. This bridges the gap where users can’t always articulate precise requirements upfront but can recognize valuable patterns when they see them.
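
As a minimal illustration of that idea (not DocETL syntax), the sketch below defines a “theme” entirely inside the prompt rather than in code; `call_llm` is a hypothetical stand-in for whatever model client is actually used.

```python
# Hypothetical sketch: the notion of a "theme" lives in the prompt text,
# not in a regex or a trained classifier.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call (e.g., an API request)."""
    raise NotImplementedError

def extract_themes(document: str) -> list[str]:
    prompt = (
        "You are analyzing customer feedback. A 'theme' is a recurring "
        "complaint or request expressed in the customer's own words.\n"
        "Return a JSON array of the themes present in the text below.\n\n"
        f"Text:\n{document}"
    )
    return json.loads(call_llm(prompt))
```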

DocETL Architecture and Workflow

What is DocETL and how does its workflow operate?

DocETL is a dissertation project that applies the MapReduce paradigm to unstructured data using LLMs: map operations apply an LLM prompt to each document (for example, to extract entities or themes), and reduce operations aggregate those per-document outputs into higher-level summaries or reports. The system orchestrates these operations at scale and optimizes for accuracy by decomposing complex tasks into manageable sub-tasks for the LLM, allowing for effective processing of large document collections.
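
To make the map/reduce framing concrete, here is a minimal Python sketch of the pattern just described, using the customer-review example that comes up later in the conversation. It illustrates the paradigm, not DocETL’s actual API; `call_llm` is a hypothetical model-client wrapper.

```python
# Illustrative map/reduce-over-documents pattern (not DocETL's actual API).
# Map: run an LLM prompt per document. Reduce: aggregate the per-document
# outputs into a higher-level report.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

def map_extract(doc: str) -> list[str]:
    """Map step: extract pain points from a single review."""
    prompt = f"List the pain points in this review as a JSON array:\n{doc}"
    return json.loads(call_llm(prompt))

def reduce_summarize(all_pain_points: list[str]) -> str:
    """Reduce step: group extracted pain points into a themed report."""
    prompt = (
        "Group these pain points into themes and write a short report:\n"
        + "\n".join(f"- {p}" for p in all_pain_points)
    )
    return call_llm(prompt)

def run_pipeline(documents: list[str]) -> str:
    extracted = [p for doc in documents for p in map_extract(doc)]
    return reduce_summarize(extracted)
```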

What types of data can DocETL process?

The primary input is text data represented as strings. Common sources include PDF documents (sometimes requiring OCR preprocessing), raw text documents, transcripts, JSON-formatted logs, or CSV files containing large text fields. Essentially, any data that can be represented as text is a potential input for DocETL processing.
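
As a hedged sketch of the preprocessing this implies, the helper below normalizes a few common sources into plain-text records. The `text` column name for CSVs is an assumption, and PDF/OCR extraction is assumed to have already produced text.

```python
# Sketch: normalize heterogeneous sources into plain-text records before
# any LLM processing. Library choices are illustrative.
import csv
import json

def load_records(path: str) -> list[str]:
    if path.endswith(".json"):          # e.g., JSON-formatted logs
        with open(path) as f:
            return [json.dumps(rec) for rec in json.load(f)]
    if path.endswith(".csv"):           # e.g., CSV with a large text field
        with open(path, newline="") as f:
            return [row["text"] for row in csv.DictReader(f)]  # assumed column name
    with open(path) as f:               # raw text, transcripts, or OCR'd PDF output
        return [f.read()]
```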

What are the primary tasks DocETL is used for?

The core capabilities revolve around semantic extraction, thematic analysis, and summarization at scale. For example, in customer reviews, users might extract pain points or specific product features, then generate reports that aggregate insights across these themes.

System Design and Implementation Considerations

How do different systems approach implementing LLM pipelines?

Systems vary significantly in their implementation, and these differences often reflect different assumptions about user needs and learning curves.

Who is the target user for DocETL tools?

DocETL is designed to be accessible to various users, though with different levels of engagement. The long-term vision is to abstract away data processing terms like ‘map’ and ‘filter’ entirely to make the system accessible to non-technical domain experts.

What is Doc Wrangler and how does it relate to DocETL?

Doc Wrangler is a specialized IDE built for creating DocETL pipelines. It provides enhanced observability, makes prompt engineering easier, and includes features like automatic prompt writing, incremental pipeline execution, and LLM-powered prompt editing. It helps users go from zero to a working pipeline quickly, addressing UX challenges around building semantic data processing systems. After development in Doc Wrangler, pipelines can be exported to scale across entire datasets using DocETL.

How can DocETL integrate with existing enterprise data architectures?

A common pattern is using DocETL to process unstructured sources and generate structured tables. These tables can then be loaded into standard relational databases or data warehouses, making the extracted semantic information queryable using familiar tools. This allows the LLM-processed data to become part of the existing data ecosystem, potentially serving as a “bronze” or “silver” layer for semantic insights derived from unstructured sources.
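
A minimal sketch of that landing pattern, using SQLite from the standard library as a stand-in for whatever database or warehouse an organization actually runs; the `review_themes` schema is illustrative.

```python
# Sketch: land LLM-extracted records in a relational table so they can be
# queried with ordinary SQL alongside existing structured data.
import sqlite3

def load_to_table(records: list[dict]) -> None:
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS review_themes "
        "(review_id TEXT, theme TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO review_themes VALUES (:review_id, :theme, :sentiment)",
        records,
    )
    conn.commit()
    conn.close()

# Example record produced upstream by the extraction pipeline:
# load_to_table([{"review_id": "r1", "theme": "slow checkout", "sentiment": "negative"}])
```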

Practical Use Cases and Applications

What are common enterprise use cases for these LLM pipelines?

Typical use cases include extracting and grouping themes from customer feedback and support tickets, analyzing domain-specific documents such as clinical notes, and generating summary reports from large document collections. Most business applications focus on solving text-related problems like thematic extraction and report generation.

Are most real-world applications focused on text or multi-modal data?

Despite interest in multi-modal capabilities, the overwhelming majority of practical deployments still focus on text. Most organizations prioritize solving text challenges before expanding to other modalities. Even when users have audio or video data, they often convert it to text through transcription first, aligning with the strengths of current LLM systems. This trend underscores the maturity of text-based tools compared to multi-modal processing.

Can you provide an example of a practical application?

One example involved processing a collection of medical educational PDFs using an LLM to generate flashcards automatically. This reflects how domain-specific data can be transformed into useful learning tools without building custom ML models. Other examples include extracting pain points from support tickets and grouping them by theme, analyzing clinical notes to identify specific symptoms, or summarizing key points from large document collections.

Handling Non-Determinism and Validation

How do LLMs handle non-deterministic tasks, and what are the implications?

We observe two main categories of tasks with different approaches:

  1. Accuracy-critical tasks: For tasks with clear ground truth (like entity extraction), users typically set the LLM temperature to zero to maximize determinism and accuracy.
  2. Creative/exploratory tasks: When asking open-ended questions like “find interesting insights,” non-determinism can be beneficial. Users might run such pipelines multiple times to generate different perspectives.

Managing this involves engineering systems to distinguish acceptable variance from problematic inconsistency. Non-determinism can be both a challenge and a feature depending on the task.
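
A small sketch of the two modes, assuming a hypothetical `call_llm` wrapper whose `temperature` argument mirrors what most LLM APIs expose.

```python
# Sketch: the same client used two ways, depending on the task category.
def call_llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError  # placeholder for a real model call

def extract_entities(doc: str) -> str:
    # Accuracy-critical task: pin temperature to zero for near-deterministic output.
    return call_llm(f"List the entities in this text:\n{doc}", temperature=0.0)

def brainstorm_insights(doc: str, n_runs: int = 3) -> list[str]:
    # Creative/exploratory task: keep some randomness and sample several
    # runs to surface different perspectives.
    prompt = f"Find interesting insights in this text:\n{doc}"
    return [call_llm(prompt, temperature=0.8) for _ in range(n_runs)]
```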

How can users validate the quality of LLM pipeline outputs?

DocETL supports several validation approaches for checking pipeline outputs. Establishing clear definitions of “good” output among stakeholders is an important process consideration these tools aim to support.
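
As one illustration of what such validation can look like in practice, a pipeline might check each output against an expected schema and route a random sample to human review; the field names and 5% sampling rate below are illustrative, not DocETL defaults.

```python
# Sketch: schema check plus random spot-check sampling on LLM outputs.
import random

EXPECTED_KEYS = {"theme", "sentiment"}  # illustrative output schema

def validate(outputs: list[dict], sample_rate: float = 0.05):
    malformed = [o for o in outputs if set(o) != EXPECTED_KEYS]
    to_review = [o for o in outputs if random.random() < sample_rate]
    return malformed, to_review

# Malformed records can be retried or corrected; sampled records go to a
# human reviewer to confirm the output matches the agreed notion of "good".
```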

What are the challenges in designing user interfaces for semantic data processing?

Designing effective UX involves mediating between three distinct elements:

  1. The User: With goals that can be hard to articulate perfectly via prompts
  2. The LLM Pipeline: Models have their own “understanding” which might not align with user intent
  3. The Data: Input data characteristics can significantly impact LLM performance unpredictably

The central challenge is building interfaces that help users express intent, understand pipeline behavior, provide feedback easily, and navigate complexities arising from data and model interactions—all while making tools approachable for non-programmers.

Model Selection and Fine-Tuning

When should practitioners use reasoning models versus standard LLMs?

Based on current observations, the choice should be driven by the specific requirements of each task rather than defaulting to the most powerful model for every job.

Is supervised fine-tuning recommended for LLM pipelines?

Running a DocETL pipeline can generate labeled data that could be suitable for supervised fine-tuning. While this seems like a logical progression, Shreya notes that users haven’t explicitly reported using this approach yet: “When you run a DocETL pipeline that will give you labeled data, then [you could] go train your models and then replug the models… but I haven’t seen people. No one has told me explicitly [that] they’re doing this.” She acknowledges that people are likely using fine-tuned models with DocETL, but isn’t certain whether they’re fine-tuning before or after incorporating DocETL into their workflows. The decision to fine-tune should be based on specific needs rather than assumed as a standard practice.
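
For teams that do want to experiment with this, here is a minimal sketch of turning accepted pipeline outputs into a fine-tuning dataset. The prompt/completion JSONL layout is a common convention rather than a DocETL feature; adapt it to whatever fine-tuning API you target.

```python
# Sketch: write (prompt, accepted output) pairs from pipeline runs to JSONL
# for later supervised fine-tuning.
import json

def to_finetune_jsonl(examples: list[tuple[str, str]], path: str) -> None:
    """examples: (prompt sent to the LLM, output the pipeline accepted)."""
    with open(path, "w") as f:
        for prompt, completion in examples:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```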

Is using multiple LLMs within a single pipeline common?

Yes, using multiple different LLMs within a single pipeline is a common pattern. Teams might use models from OpenAI or Google for extraction and then a different model (e.g., from Anthropic) for summarization. This approach leverages different models’ strengths for different tasks. Using multiple models for consensus is also a practical approach to mitigate the potential brittleness of any single model’s output.
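
A hedged sketch of both patterns, with different models assigned to different steps plus a simple majority-vote consensus. `call_llm(model=...)` is a hypothetical multi-provider wrapper, and the model names are placeholders for whichever providers a team actually uses.

```python
# Sketch: per-step model assignment and majority-vote consensus labeling.
from collections import Counter

def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError  # placeholder for a real multi-provider client

def extract(doc: str) -> str:
    return call_llm(f"Extract the key facts from this text:\n{doc}", model="extraction-model")

def summarize(facts: str) -> str:
    return call_llm(f"Summarize these facts:\n{facts}", model="summarization-model")

def consensus_label(doc: str, models: list[str]) -> str:
    votes = [
        call_llm(f"Label the sentiment (positive/negative/neutral):\n{doc}", model=m)
        for m in models
    ]
    return Counter(v.strip().lower() for v in votes).most_common(1)[0][0]
```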

Future Directions and Scaling Considerations

How does this approach handle or plan to handle multi-modal data?

While the initial focus is on text, reflecting current business priorities, the underlying framework relies on foundation models that are becoming increasingly multi-modal. As capabilities in processing images, video, and audio evolve (like in Gemini), the DocETL framework is positioned to adapt and incorporate these modalities. Extending to multi-modal processing is a natural and anticipated direction for future development.

What are the scaling and cost considerations for LLM-based pipelines?

Currently, DocETL runs on single machines with plans to scale with distributed processing frameworks like Ray. While LLM inference costs were initially a concern for large datasets, prices are decreasing dramatically (approximately 10x yearly), making large-scale semantic processing increasingly economically viable. Models like Gemini offer cost-effective alternatives, and open-weight or on-premise models provide alternative cost structures for organizations with specific requirements.
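
To illustrate the scaling direction mentioned above (not how DocETL is currently implemented), per-document map calls could be fanned out with Ray roughly like this; `call_llm` remains a hypothetical placeholder.

```python
# Sketch: parallelizing per-document LLM map calls with Ray remote tasks.
import ray

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

ray.init()

@ray.remote
def map_one(doc: str) -> str:
    return call_llm(f"Extract the main themes from this text:\n{doc}")

def map_all(documents: list[str]) -> list[str]:
    futures = [map_one.remote(doc) for doc in documents]
    return ray.get(futures)  # blocks until all remote calls finish
```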

What are the open questions or research differences in this field?

Several research groups are exploring semantic data processing with LLMs, with varying design philosophies, and major open questions remain about how these systems should be designed and exposed to users. The field is evolving rapidly, making it challenging to stabilize designs as new capabilities emerge and user preferences shift.
