The Data Exchange

Unlocking Unstructured Data with LLMs

Shreya Shankar on Semantic Extraction, DocETL Pipelines, and Enterprise Applications.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Shreya Shankar is a PhD student at UC Berkeley in the EECS department. This episode explores how Large Language Models (LLMs) are revolutionizing the processing of unstructured enterprise data like text documents and PDFs. It introduces DocETL, a framework using a MapReduce approach with LLMs for semantic extraction, thematic analysis, and summarization at scale. The discussion covers the DocETL workflow, system design considerations, practical use cases, validation techniques, and the future of LLM-powered data pipelines. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]

Subscribe to the Gradient Flow Newsletter

Interview highlights – key sections from the video version:

  1. Shreya’s Background and the Unstructured Data Challenge
  2. Traditional NLP vs. LLM-based Approaches
  3. Introducing the DocETL Framework
  4. Non-Determinism and Creative Data Tasks
  5. Enterprise Pipelines and Architecture Considerations
  6. Integration with Other Tools and Plugins
  7. Observability, Guardrails, and Data Validation
  8. Advanced Reasoning Models in Data Workflows
  9. Fine-Tuning, Multiple LLMs, and Use-Case Variations
  10. Expanding to Multi-Modal Processing
  11. Comparing DocETL with Similar Systems
  12. Scaling Semantic Pipelines and Cost Trade-Offs
  13. Closing Thoughts and Future Directions






Support our work by subscribing to our newsletter📩


Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Understanding Unstructured Data and LLM Solutions

What fundamental challenge in enterprise data processing are you tackling?

For decades, enterprises have struggled to make sense of unstructured data like text documents, PDFs, images, and videos. Historically, there was no effective technology to automate this process at scale. The massive amount of text data that businesses need to analyze has been difficult to process without the right tools. LLMs now enable extraction of semantic data from these unstructured sources, addressing this long-standing gap in data processing capabilities.

How does the LLM-based approach differ from traditional methods for handling unstructured text?

Traditional approaches typically followed two paths: building bespoke pipelines using specialized NLP libraries (where engineers would train models for specific tasks), or using time-consuming crowdsourcing for data annotation. LLMs fundamentally change this paradigm by allowing for semantic extraction and analysis without the need for task-specific models or extensive manual annotation. This makes processing unstructured data more accessible, flexible, and often simpler to implement.

What is semantic data processing in this context?

Semantic data processing involves extracting meaningful information from text beyond keywords or patterns. It enables tasks like identifying themes across documents, aggregating information based on semantic similarity, and generating insights. The key innovation is the ability to define programmatically what constitutes a “theme” through LLM prompts. This bridges the gap where users can’t always articulate precise requirements upfront but can recognize valuable patterns when they see them.
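
As a minimal illustration of that idea (not DocETL syntax), the sketch below defines a “theme” entirely inside the prompt rather than in code; `call_llm` is a hypothetical stand-in for whatever model client is actually used.

```python
# Hypothetical sketch: the notion of a "theme" lives in the prompt text,
# not in a regex or a trained classifier.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call (e.g., an API request)."""
    raise NotImplementedError

def extract_themes(document: str) -> list[str]:
    prompt = (
        "You are analyzing customer feedback. A 'theme' is a recurring "
        "complaint or request expressed in the customer's own words.\n"
        "Return a JSON array of the themes present in the text below.\n\n"
        f"Text:\n{document}"
    )
    return json.loads(call_llm(prompt))
```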

DocETL Architecture and Workflow

What is DocETL and how does its workflow operate?

DocETL is a dissertation project that applies the MapReduce paradigm to unstructured data using LLMs: map operations apply an LLM prompt to each document (for example, to extract entities or themes), and reduce operations aggregate those per-document outputs into higher-level summaries or reports. The system orchestrates these operations at scale and optimizes for accuracy by decomposing complex tasks into manageable sub-tasks for the LLM, allowing for effective processing of large document collections.
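
To make the map/reduce framing concrete, here is a minimal Python sketch of the pattern just described, using the customer-review example that comes up later in the conversation. It illustrates the paradigm, not DocETL’s actual API; `call_llm` is a hypothetical model-client wrapper.

```python
# Illustrative map/reduce-over-documents pattern (not DocETL's actual API).
# Map: run an LLM prompt per document. Reduce: aggregate the per-document
# outputs into a higher-level report.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

def map_extract(doc: str) -> list[str]:
    """Map step: extract pain points from a single review."""
    prompt = f"List the pain points in this review as a JSON array:\n{doc}"
    return json.loads(call_llm(prompt))

def reduce_summarize(all_pain_points: list[str]) -> str:
    """Reduce step: group extracted pain points into a themed report."""
    prompt = (
        "Group these pain points into themes and write a short report:\n"
        + "\n".join(f"- {p}" for p in all_pain_points)
    )
    return call_llm(prompt)

def run_pipeline(documents: list[str]) -> str:
    extracted = [p for doc in documents for p in map_extract(doc)]
    return reduce_summarize(extracted)
```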

What types of data can DocETL process?

The primary input is text data represented as strings. Common sources include PDF documents (sometimes requiring OCR preprocessing), raw text documents, transcripts, JSON-formatted logs, or CSV files containing large text fields. Essentially, any data that can be represented as text is a potential input for DocETL processing.
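
As a hedged sketch of the preprocessing this implies, the helper below normalizes a few common sources into plain-text records. The `text` column name for CSVs is an assumption, and PDF/OCR extraction is assumed to have already produced text.

```python
# Sketch: normalize heterogeneous sources into plain-text records before
# any LLM processing. Library choices are illustrative.
import csv
import json

def load_records(path: str) -> list[str]:
    if path.endswith(".json"):          # e.g., JSON-formatted logs
        with open(path) as f:
            return [json.dumps(rec) for rec in json.load(f)]
    if path.endswith(".csv"):           # e.g., CSV with a large text field
        with open(path, newline="") as f:
            return [row["text"] for row in csv.DictReader(f)]  # assumed column name
    with open(path) as f:               # raw text, transcripts, or OCR'd PDF output
        return [f.read()]
```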

What are the primary tasks DocETL is used for?

The core capabilities revolve around semantic extraction, thematic analysis, and summarization at scale. For example, in customer reviews, users might extract pain points or specific product features, then generate reports that aggregate insights across these themes.

System Design and Implementation Considerations

How do different systems approach implementing LLM pipelines?

Systems vary significantly in their implementation, and these differences often reflect different assumptions about user needs and learning curves.

Who is the target user for DocETL tools?

DocETL is designed to be accessible to various users, though with different levels of engagement. The long-term vision is to abstract away data processing terms like ‘map’ and ‘filter’ entirely to make the system accessible to non-technical domain experts.

What is Doc Wrangler and how does it relate to DocETL?

Doc Wrangler is a specialized IDE built for creating DocETL pipelines. It provides enhanced observability, makes prompt engineering easier, and includes features like automatic prompt writing, incremental pipeline execution, and LLM-powered prompt editing. It helps users go from zero to a working pipeline quickly, addressing UX challenges around building semantic data processing systems. After development in Doc Wrangler, pipelines can be exported to scale across entire datasets using DocETL.

How can DocETL integrate with existing enterprise data architectures?

A common pattern is using DocETL to process unstructured sources and generate structured tables. These tables can then be loaded into standard relational databases or data warehouses, making the extracted semantic information queryable using familiar tools. This allows the LLM-processed data to become part of the existing data ecosystem, potentially serving as a “bronze” or “silver” layer for semantic insights derived from unstructured sources.
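
A minimal sketch of that landing pattern, using SQLite from the standard library as a stand-in for whatever database or warehouse an organization actually runs; the `review_themes` schema is illustrative.

```python
# Sketch: land LLM-extracted records in a relational table so they can be
# queried with ordinary SQL alongside existing structured data.
import sqlite3

def load_to_table(records: list[dict]) -> None:
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS review_themes "
        "(review_id TEXT, theme TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO review_themes VALUES (:review_id, :theme, :sentiment)",
        records,
    )
    conn.commit()
    conn.close()

# Example record produced upstream by the extraction pipeline:
# load_to_table([{"review_id": "r1", "theme": "slow checkout", "sentiment": "negative"}])
```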

Practical Use Cases and Applications

What are common enterprise use cases for these LLM pipelines?

Typical use cases include extracting and grouping themes from customer feedback and support tickets, analyzing domain-specific documents such as clinical notes, and generating summary reports from large document collections. Most business applications focus on solving text-related problems like thematic extraction and report generation.

Are most real-world applications focused on text or multi-modal data?

Despite interest in multi-modal capabilities, the overwhelming majority of practical deployments still focus on text. Most organizations prioritize solving text challenges before expanding to other modalities. Even when users have audio or video data, they often convert it to text through transcription first, aligning with the strengths of current LLM systems. This trend underscores the maturity of text-based tools compared to multi-modal processing.

Can you provide an example of a practical application?

One example involved processing a collection of medical educational PDFs using an LLM to generate flashcards automatically. This reflects how domain-specific data can be transformed into useful learning tools without building custom ML models. Other examples include extracting pain points from support tickets and grouping them by theme, analyzing clinical notes to identify specific symptoms, or summarizing key points from large document collections.

Handling Non-Determinism and Validation

How do LLMs handle non-deterministic tasks, and what are the implications?

We observe two main categories of tasks with different approaches:

  1. Accuracy-critical tasks: For tasks with clear ground truth (like entity extraction), users typically set the LLM temperature to zero to maximize determinism and accuracy.
  2. Creative/exploratory tasks: When asking open-ended questions like “find interesting insights,” non-determinism can be beneficial. Users might run such pipelines multiple times to generate different perspectives.

Managing this involves engineering systems to distinguish acceptable variance from problematic inconsistency. Non-determinism can be both a challenge and a feature depending on the task.
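
A small sketch of the two modes, assuming a hypothetical `call_llm` wrapper whose `temperature` argument mirrors what most LLM APIs expose.

```python
# Sketch: the same client used two ways, depending on the task category.
def call_llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError  # placeholder for a real model call

def extract_entities(doc: str) -> str:
    # Accuracy-critical task: pin temperature to zero for near-deterministic output.
    return call_llm(f"List the entities in this text:\n{doc}", temperature=0.0)

def brainstorm_insights(doc: str, n_runs: int = 3) -> list[str]:
    # Creative/exploratory task: keep some randomness and sample several
    # runs to surface different perspectives.
    prompt = f"Find interesting insights in this text:\n{doc}"
    return [call_llm(prompt, temperature=0.8) for _ in range(n_runs)]
```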

How can users validate the quality of LLM pipeline outputs?

DocETL supports several validation approaches for checking pipeline outputs. Establishing clear definitions of “good” output among stakeholders is an important process consideration these tools aim to support.
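
As one illustration of what such validation can look like in practice, a pipeline might check each output against an expected schema and route a random sample to human review; the field names and 5% sampling rate below are illustrative, not DocETL defaults.

```python
# Sketch: schema check plus random spot-check sampling on LLM outputs.
import random

EXPECTED_KEYS = {"theme", "sentiment"}  # illustrative output schema

def validate(outputs: list[dict], sample_rate: float = 0.05):
    malformed = [o for o in outputs if set(o) != EXPECTED_KEYS]
    to_review = [o for o in outputs if random.random() < sample_rate]
    return malformed, to_review

# Malformed records can be retried or corrected; sampled records go to a
# human reviewer to confirm the output matches the agreed notion of "good".
```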

What are the challenges in designing user interfaces for semantic data processing?

Designing effective UX involves mediating between three distinct elements:

  1. The User: With goals that can be hard to articulate perfectly via prompts
  2. The LLM Pipeline: Models have their own “understanding” which might not align with user intent
  3. The Data: Input data characteristics can significantly impact LLM performance unpredictably

The central challenge is building interfaces that help users express intent, understand pipeline behavior, provide feedback easily, and navigate complexities arising from data and model interactions—all while making tools approachable for non-programmers.

Model Selection and Fine-Tuning

When should practitioners use reasoning models versus standard LLMs?

Based on current observations, the choice should be driven by the specific requirements of each task rather than defaulting to the most powerful model for every job.

Is supervised fine-tuning recommended for LLM pipelines?

Running a DocETL pipeline can generate labeled data that could be suitable for supervised fine-tuning. While this seems like a logical progression, Shreya notes that users haven’t explicitly reported using this approach yet: “When you run a DocETL pipeline that will give you labeled data, then [you could] go train your models and then replug the models… but I haven’t seen people. No one has told me explicitly [that] they’re doing this.” She acknowledges that people are likely using fine-tuned models with DocETL, but isn’t certain whether they’re fine-tuning before or after incorporating DocETL into their workflows. The decision to fine-tune should be based on specific needs rather than assumed as a standard practice.
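
For teams that do want to experiment with this, here is a minimal sketch of turning accepted pipeline outputs into a fine-tuning dataset. The prompt/completion JSONL layout is a common convention rather than a DocETL feature; adapt it to whatever fine-tuning API you target.

```python
# Sketch: write (prompt, accepted output) pairs from pipeline runs to JSONL
# for later supervised fine-tuning.
import json

def to_finetune_jsonl(examples: list[tuple[str, str]], path: str) -> None:
    """examples: (prompt sent to the LLM, output the pipeline accepted)."""
    with open(path, "w") as f:
        for prompt, completion in examples:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```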

Is using multiple LLMs within a single pipeline common?

Yes, using multiple different LLMs within a single pipeline is a common pattern. Teams might use models from OpenAI or Google for extraction and then a different model (e.g., from Anthropic) for summarization. This approach leverages different models’ strengths for different tasks. Using multiple models for consensus is also a practical approach to mitigate the potential brittleness of any single model’s output.
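
A hedged sketch of both patterns, with different models assigned to different steps plus a simple majority-vote consensus. `call_llm(model=...)` is a hypothetical multi-provider wrapper, and the model names are placeholders for whichever providers a team actually uses.

```python
# Sketch: per-step model assignment and majority-vote consensus labeling.
from collections import Counter

def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError  # placeholder for a real multi-provider client

def extract(doc: str) -> str:
    return call_llm(f"Extract the key facts from this text:\n{doc}", model="extraction-model")

def summarize(facts: str) -> str:
    return call_llm(f"Summarize these facts:\n{facts}", model="summarization-model")

def consensus_label(doc: str, models: list[str]) -> str:
    votes = [
        call_llm(f"Label the sentiment (positive/negative/neutral):\n{doc}", model=m)
        for m in models
    ]
    return Counter(v.strip().lower() for v in votes).most_common(1)[0][0]
```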

Future Directions and Scaling Considerations

How does this approach handle or plan to handle multi-modal data?

While the initial focus is on text, reflecting current business priorities, the underlying framework relies on foundation models that are becoming increasingly multi-modal. As capabilities in processing images, video, and audio evolve (like in Gemini), the DocETL framework is positioned to adapt and incorporate these modalities. Extending to multi-modal processing is a natural and anticipated direction for future development.

What are the scaling and cost considerations for LLM-based pipelines?

Currently, DocETL runs on single machines with plans to scale with distributed processing frameworks like Ray. While LLM inference costs were initially a concern for large datasets, prices are decreasing dramatically (approximately 10x yearly), making large-scale semantic processing increasingly economically viable. Models like Gemini offer cost-effective alternatives, and open-weight or on-premise models provide alternative cost structures for organizations with specific requirements.
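
To illustrate the scaling direction mentioned above (not how DocETL is currently implemented), per-document map calls could be fanned out with Ray roughly like this; `call_llm` remains a hypothetical placeholder.

```python
# Sketch: parallelizing per-document LLM map calls with Ray remote tasks.
import ray

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

ray.init()

@ray.remote
def map_one(doc: str) -> str:
    return call_llm(f"Extract the main themes from this text:\n{doc}")

def map_all(documents: list[str]) -> list[str]:
    futures = [map_one.remote(doc) for doc in documents]
    return ray.get(futures)  # blocks until all remote calls finish
```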

What are the open questions or research differences in this field?

Several research groups are exploring semantic data processing with LLMs, with varying design philosophies, and major open questions remain about how these systems should be designed and exposed to users. The field is evolving rapidly, making it challenging to stabilize designs as new capabilities emerge and user preferences shift.
