The Fenic Approach to Production-Ready Data Processing

Kostas Pardalis on Inference-First Data Frames, Markdown as Structure, Semantic Query Operations, and Production AI Debugging.

Kostas Pardalis, co-founder of Typedef, discusses Fenic, an open-source data frame framework designed specifically for AI applications that treats inference as a first-class operation within the query engine. The key innovation is extracting maximum structure from seemingly unstructured data (particularly through markdown as a data type) while providing production-ready features like row-level lineage, caching, and semantic operations for debugging and optimizing multi-step AI pipelines. Real-world applications include content companies building dynamic narrative classification systems and cybersecurity teams processing mixed structured/unstructured threat intelligence data.

Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Background and Motivation

Why did you feel the need to create new data infrastructure like Typedef and Fenic specifically for AI applications?

My co-founder and I have extensive backgrounds in data infrastructure—from ETL solutions to federated query engines at companies like RudderStack and Starburst. We observed that existing data platforms like Spark, Trino, and Snowflake were fundamentally built for Business Intelligence (BI) workloads. While they’ve been retrofitted for machine learning, they weren’t designed for the unique requirements of modern AI applications.

We saw an inflection point where data itself, not just software, was becoming the core value driver. The generative AI wave accelerated this shift from a multi-year evolution to a matter of months. Instead of force-fitting legacy tools into AI workloads, we believed it was time to build systems from first principles specifically for inference-heavy applications. This led us to create Typedef and the open-source Fenic project—tools designed for the millions of developers now building AI and agentic applications.

What does “inference-first” mean in practice for teams building AI applications?

“Inference-first” means treating LLM calls and model inference as native, first-class operations within the data processing engine, not as external black-box functions. In traditional systems like Spark, if you want to call an LLM, you write a User-Defined Function (UDF) that the query optimizer can’t see into or optimize.

In Fenic, we’ve built inference directly into our query operators with concepts like semantic_filter, semantic_join, and semantic_map. This means the query engine is fully aware when inference is happening, allowing it to optimize these operations just like it would CPU or memory operations. For developers, this translates to better performance, reduced costs through intelligent caching, and detailed debugging capabilities that are impossible when LLM calls are treated as opaque external functions.
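
To make the contrast concrete, here is a hypothetical sketch of what such a pipeline could look like. The operator names mirror the concepts mentioned above (semantic_filter, semantic_map); the session setup, method signatures, and file path are illustrative assumptions, not Fenic's actual API.

```python
# Hypothetical sketch only: operator names follow the concepts discussed above;
# the real Fenic API, session setup, and signatures may differ.
import fenic as fc  # assumed import name

session = fc.Session()  # assumed setup
tickets = session.read.parquet("tickets.parquet")  # illustrative path

result = (
    tickets
    # The engine sees these as inference operators it can plan, batch, and cache,
    # rather than as opaque Python UDFs.
    .semantic_filter("Does this ticket describe a billing problem? {body}")
    .semantic_map("Summarize the complaint in one sentence: {body}",
                  output_column="summary")
    .collect()  # execution (and inference cost) happens here
)
```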

Core Architecture and Philosophy

Why focus on structured and semi-structured data when AI is largely driven by unstructured data?

Our thesis is that unstructured data—PDFs, transcripts, documents—actually contains significant latent structure that LLMs can help surface. Once extracted, this structure needs to be treated as a first-class citizen in your data pipeline. That’s why we define new data types like markdown and transcript, similar to how databases eventually adopted JSON as a native type.

The key insight is that at scale, you need to extract implicit information and make it explicit through structure. This allows you to apply traditional data operations—joins, filters, aggregations—alongside AI inference in a unified, optimizable pipeline. You’re not choosing between structured and unstructured; you’re building bridges between both worlds.
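
As a plain-Python illustration of that idea (no Fenic dependency, document contents invented for the example), a markdown file already carries recoverable structure: headings, their levels, and the text under each heading can be turned into rows that ordinary joins, filters, and aggregations can operate on.

```python
import re

doc = """# Q3 Report
## Revenue
Revenue grew 12% quarter over quarter.
## Risks
Supply chain delays remain the main risk.
"""

# Turn the markdown into (level, heading, body) rows.
rows, current = [], None
for line in doc.splitlines():
    m = re.match(r"^(#+)\s+(.*)", line)
    if m:
        current = {"level": len(m.group(1)), "heading": m.group(2), "body": []}
        rows.append(current)
    elif current and line.strip():
        current["body"].append(line.strip())

for r in rows:
    print(r["level"], r["heading"], "->", " ".join(r["body"]))
```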

Why did you choose the data frame API instead of creating something entirely new?

The data frame paradigm provides several critical advantages for AI application developers:

  1. Familiarity: Teams already have years of experience with Pandas and PySpark. We’re extending familiar concepts rather than forcing a complete paradigm shift.
  2. Software engineering best practices: Data frames enable testing, debugging, and monitoring practices that teams have refined over years.
  3. Lazy evaluation and caching: Critical for developing multi-step inference pipelines efficiently.
  4. Row-level operations: Perfect for the row-by-row nature of inference workloads, where each document might be processed differently.

By building on this foundation, teams can apply their existing knowledge while gaining new capabilities specific to AI workloads.
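
Point 3 above is worth unpacking: under lazy evaluation, data frame operations only record a plan, and nothing (including any model call) runs until results are explicitly requested. A minimal toy version of the pattern, independent of Fenic:

```python
# Toy illustration of lazy evaluation: operations build a plan; nothing executes
# until collect() is called, so an engine can inspect and optimize the plan first.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows, self.plan = rows, plan or []

    def filter(self, predicate):
        return LazyFrame(self.rows, self.plan + [("filter", predicate)])

    def map(self, fn):
        return LazyFrame(self.rows, self.plan + [("map", fn)])

    def collect(self):
        out = self.rows
        for op, fn in self.plan:
            out = [r for r in out if fn(r)] if op == "filter" else [fn(r) for r in out]
        return out

lf = LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(lf.plan)       # the recorded plan, inspectable before any work is done
print(lf.collect())  # [20, 40]
```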

How does Fenic improve traditional RAG (Retrieval-Augmented Generation) pipelines?

Traditional RAG implementations suffer from several issues that Fenic addresses:

  1. Intelligent chunking: Instead of naive character-count splitting, Fenic leverages document structure (sections, paragraphs, headings) for semantically meaningful chunks (see the sketch after this list).
  2. End-to-end optimization: Rather than cobbling together separate tools (embedders, vector databases, LLMs) with limited visibility between them, Fenic provides a single optimizable pipeline.
  3. Debugging capabilities: Row-level lineage lets you trace exactly which documents were retrieved, how they were processed, and what prompts were used when a RAG system produces poor results.
  4. Evaluation integration: Easy comparison against golden datasets using the same data frame operations, making it straightforward to calculate accuracy metrics and identify failure patterns.
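
On the chunking point, the difference between character-count splitting and structure-aware splitting is easy to see in plain Python. This sketch assumes the document has already been parsed into (heading, body) sections, as in the earlier markdown example; the sizes and sample text are arbitrary.

```python
def naive_chunks(text, size=40):
    # Fixed-size slices can cut sentences, tables, and code blocks mid-way.
    return [text[i:i + size] for i in range(0, len(text), size)]

def section_chunks(sections, max_chars=800):
    # One chunk per section; oversized sections fall back to paragraph splits,
    # so every chunk stays semantically coherent and keeps its heading.
    chunks = []
    for heading, body in sections:
        if len(body) <= max_chars:
            chunks.append(f"{heading}\n{body}")
        else:
            chunks.extend(f"{heading}\n{p}" for p in body.split("\n\n"))
    return chunks

sections = [
    ("## Revenue", "Revenue grew 12% quarter over quarter."),
    ("## Risks", "Supply chain delays remain the main risk."),
]
print(naive_chunks("Revenue grew 12% quarter over quarter."))
print(section_chunks(sections))
```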

Developer Experience

How does Fenic handle debugging and evaluation of AI pipelines?

Fenic provides two key features that transform the debugging experience:

  1. Explicit caching: Cache outputs at any pipeline step, allowing you to iterate on downstream logic without re-running expensive upstream inference. This dramatically speeds up development cycles.
  2. Row-level lineage: Unlike traditional column-level lineage, Fenic tracks individual row processing history. When you get an unexpected output, you can trace that specific result back through every transformation and prompt used. This is crucial for non-deterministic AI pipelines where each row might be processed differently.

For evaluation, you can easily integrate golden datasets and run comparisons using familiar data frame operations, making it simple to track metrics and identify systematic issues in your pipeline.
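
Row-level lineage is the less familiar of the two features, so here is a stripped-down sketch of the idea (not Fenic's actual implementation): every row carries the history of steps and inputs that produced it, so a surprising output can be traced backwards.

```python
# Each transformation appends (step_name, input_value) to the row's lineage,
# so a bad downstream result can be traced through every step that touched it.
def apply_step(rows, name, fn):
    out = []
    for row in rows:
        new = dict(row)
        new["value"] = fn(row["value"])
        new["lineage"] = row.get("lineage", []) + [(name, row["value"])]
        out.append(new)
    return out

rows = [{"value": "  Refund requested for order 1189 "}]
rows = apply_step(rows, "strip_whitespace", str.strip)
rows = apply_step(rows, "lowercase", str.lower)

print(rows[0]["value"])    # final output
print(rows[0]["lineage"])  # how this specific row got there
```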

How does embedding inference in the query engine differ from using UDFs?

When inference is a first-class operation, the query optimizer can:

  • Schedule and batch LLM calls efficiently
  • Identify opportunities to use smaller, cheaper models for certain operations
  • Cache repeated inference patterns automatically
  • Reorder operations to minimize expensive API calls
  • Provide accurate cost estimates before running queries

With traditional UDFs, the optimizer treats inference as a black box, missing all these optimization opportunities. This difference can translate to significant cost savings and performance improvements in production.
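
Two of those optimizations, de-duplicating repeated prompts and batching calls, are simple enough to sketch generically in Python (this shows the general technique, not Fenic's internals):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_llm_call(prompt: str) -> str:
    # Placeholder for a real model call; identical prompts are answered only once.
    return f"<answer to: {prompt}>"

def batched(items, size):
    # Group calls so they can be sent to the provider in batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]

prompts = ["classify: refund", "classify: refund", "classify: outage", "classify: spam"]
for batch in batched(prompts, size=2):
    print([cached_llm_call(p) for p in batch])  # the duplicate "refund" hits the cache
```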

Integration and Scaling

How does Fenic fit into modern data lakehouse architectures?

Fenic embraces the separation of storage and compute that defines modern data architectures. We’re purely a compute engine that integrates seamlessly with existing storage solutions:

  • Storage formats: Full compatibility with Parquet, Iceberg, Delta Lake, and Lance
  • Built on Arrow: Leverages Apache Arrow for ecosystem interoperability
  • Lakehouse-native: Reads from and writes to your existing lakehouse without requiring data movement

The philosophy is simple: specialized teams have already solved storage well. Fenic focuses on providing superior compute capabilities for AI workloads while working with your existing infrastructure.
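
In practice that looks like reading open formats in place with Arrow-native tooling. The snippet below uses pyarrow directly (the file path and the hand-off to pandas are just for illustration); an Arrow-based engine like the one described can consume the same data without copies or proprietary formats.

```python
import pyarrow.parquet as pq

# Read a Parquet file in place; the path could equally be an object-store URI
# (e.g. an S3 location) if the corresponding filesystem support is installed.
table = pq.read_table("reports.parquet")
print(table.schema)

# Arrow tables convert cheaply to other Arrow-native tools for further work.
df = table.to_pandas()
print(df.head())
```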

How does Fenic handle scaling for large document collections or high-inference workloads?

The primary bottleneck in AI applications is typically LLM inference cost and latency, not data processing. Fenic addresses this through intelligent optimization:

  • Inference-aware optimization: The query engine can batch requests, cache repeated operations, and choose appropriate model sizes based on context
  • Single-node efficiency: The open-source version scales well on a single node for most workloads
  • Distributed compute: A cloud platform is being designed with Ray integration for truly massive workloads

The key is that by understanding inference as a first-class operation, Fenic can apply optimizations that are impossible when LLMs are just external API calls.
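
For context on the distributed option, this is the general shape of fanning inference out with Ray. It is a generic Ray sketch with invented function names and data, not the Fenic cloud integration, which is still being designed.

```python
import ray

ray.init()

@ray.remote
def process_batch(docs):
    # Placeholder for per-document inference on a worker.
    return [f"summary of {d}" for d in docs]

batches = [["doc1", "doc2"], ["doc3", "doc4"]]
futures = [process_batch.remote(b) for b in batches]
print(ray.get(futures))  # gather results from all workers
```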

Real-World Applications

What are some compelling production use cases for Fenic?

Media and Content Companies: One partner moved beyond static content taxonomies to dynamic narrative extraction. They analyze streams of news articles to automatically construct “narrative arcs” showing how stories evolve over time. This creates sophisticated, real-time content organization that updates continuously—something prohibitively complex with traditional tools.

Cybersecurity: Security teams use Fenic to process the massive mix of structured logs and unstructured threat reports that define modern security operations. They’ve automated entity extraction for threat actors, malware families, and attack patterns across vast datasets. The scale makes it impossible to just “throw everything at an LLM”—you need Fenic’s structured approach to make the data tractable.
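
A sketch of what schema-driven entity extraction can look like in that setting: the field names and sample report below are invented for illustration, and the model call is omitted. The point is that extracted entities get validated against an explicit schema before they enter structured tables.

```python
from pydantic import BaseModel  # pydantic v2

class ThreatEntities(BaseModel):
    threat_actors: list[str]
    malware_families: list[str]
    attack_patterns: list[str]

# In a real pipeline this JSON would come from an LLM prompted to follow the schema.
raw = '{"threat_actors": ["FIN7"], "malware_families": ["Carbanak"], "attack_patterns": ["spearphishing attachment"]}'
entities = ThreatEntities.model_validate_json(raw)
print(entities.threat_actors, entities.attack_patterns)
```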

Financial Services: Although still emerging, we’re seeing interest from teams processing earnings reports, regulatory filings, and research documents to extract structured insights at scale.

How does Fenic compare to other document processing tools like unstructured.io or DocETL?

These tools excel at the ingestion layer—converting PDFs, images, and other formats into initial text representations. Fenic operates at the next layer: post-processing, validation, and pipeline orchestration.

A typical workflow might use specialized extraction tools for initial OCR or parsing, then use Fenic to:

  • Clean and validate the extracted data
  • Apply domain-specific schemas and rules
  • Run quality checks and fix errors using additional LLM calls
  • Build production-ready pipelines with monitoring and debugging

Think of extraction tools as solving the “getting data out” problem, while Fenic solves the “turning extracted data into reliable, structured information” problem that’s critical for production AI applications.
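
A minimal sketch of that second layer, with invented field names and rules: take raw rows from an ingestion tool, validate them against simple checks, and flag the ones that need a repair pass (which in a real pipeline might be another LLM call).

```python
raw_rows = [
    {"invoice_id": "INV-1042", "total": "1,250.00"},
    {"invoice_id": "", "total": "n/a"},
]

def validate(row):
    # Simple, domain-specific checks; failures are flagged rather than dropped.
    issues = []
    if not row["invoice_id"]:
        issues.append("missing invoice_id")
    try:
        float(row["total"].replace(",", ""))
    except ValueError:
        issues.append("total is not numeric")
    return issues

for row in raw_rows:
    problems = validate(row)
    print(row["invoice_id"] or "<blank>", "ok" if not problems else f"needs repair: {problems}")
```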

What domains could benefit most from this approach next?

Any domain where knowledge workers process large volumes of unstructured documents:

  • Legal: Contract analysis, case law research, compliance checking
  • Healthcare: Clinical notes processing, research paper analysis, patient record structuring
  • Manufacturing: Technical documentation, maintenance reports, quality control documents

These domains have massive amounts of valuable unstructured data but lack the tools to systematically extract and structure that information for AI applications. Fenic provides the missing infrastructure to unlock this value.