Ciro Greco on Git for Data, AI Agents, and the Future of the Lakehouse.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Ciro Greco, Co-founder & CEO at Bauplan, joins the podcast to discuss a new paradigm for data engineering rooted in software engineering principles. He explains how treating the data lakehouse like a software project — with version control, branching, and transactional pipelines — creates a robust and safe environment for development. This code-first, programmable foundation is essential for the coming wave of AI agents that will automate the creation, debugging, and maintenance of data infrastructure.
Interview highlights – key sections from the video version:
- Bringing Software Engineering Best Practices to Data
- Git for Data: Version Control and Team Scaling
- Safety and Reliability: Transactionality and Rollback
- Code-First Philosophy: Building for AI Automation
- Programmable Lakehouse: Architecture and Roadmap
- Semantic Layers and Context Stores for AI Systems
- Orchestration and Separation of Concerns
- The Future: 10x Productivity with AI-Augmented Teams
- Closing Remarks
Related content:
- A video version of this conversation is available on our YouTube channel.
- Inside the race to build agent-native databases
- Autonomous Agents are Here. What Does It Mean for Your Data?
- Luke Wroblewski → Databases for Machines, Not People
- Mike Freedman and Ajay Kulkarni → Is Your Database Ready for an Army of AI Agents?
- Agentic AI Applications: A Field Guide
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Core Philosophy & Problems
What does “bringing software engineering rigor to data engineering” mean in practice, and what problems are you trying to solve?
The core idea is to bring to the world of data the best practices that allow software engineers to build massive, scalable applications with large teams. Software engineering has demonstrated how to achieve both robustness and simplicity at scale—teams can work together in the same codebase while maintaining productivity and reliability.
Historically, data engineering hasn’t needed this level of rigor because the stakes were lower. An internal dashboard breaking isn’t the same as a customer-facing application going down. But as data becomes mission-critical—especially with AI-powered applications serving millions of users—you need the same level of dependability. The challenge is that data tooling hasn’t caught up, and there’s been a false tension between simplicity and robustness.
The goal is to eliminate this false dichotomy and provide a system where data engineers can work productively without wasting most of their time debugging brittle pipelines. The core value proposition is simple: you can run a complex system with very few data engineers because the system is engineered not to break in ways that leave you with unrecoverable problems.
Why is data engineering fundamentally more complex than traditional software engineering?
Data systems face several unique challenges:
First, data is an open system—you don’t control much of what happens because it depends on user behavior and external systems. Distributions change, spikes are unpredictable, and you’re dealing with inputs you can’t fully anticipate.
Second, data is bulky at scale. Serious data work can’t be done on your laptop, which creates a major disconnect between local development and cloud production. When you work on a sample locally, you can’t be sure it’s representative. More critically, things break on your laptop for completely different reasons than they break in the cloud—cloud failures often involve distributed infrastructure issues that you can’t reproduce locally.
Third, data systems are highly fragmented. You’re managing a distributed, stateful system spread across different silos: application databases, data lakes, separate compute runtimes, warehouses with different architectures. This fragmentation makes the entire system messy to operate and difficult to reason about with traditional unit and integration testing alone.
Versioning, Transactionality & “Git for Data”
How does the “Git for data” concept work in practice, and why is it more than just version control?
“Git for data” started as a loose metaphor, but we’ve made it a concrete, functional part of the platform. Early systems focused on versioning individual datasets, which is necessary but insufficient. To achieve true software-like workflows, you need more.
Our approach combines three critical elements:
- Table Versioning: Using open table formats like Apache Iceberg that have built-in time travel and atomic writes at the table level.
- Multi-Table Versioning: Using a catalog (similar to Project Nessie concepts) that allows you to think in terms of branches across the entire lake, not just single tables. This is crucial because data pipelines rarely affect just one table—you need to version the entire set of tables touched by a pipeline.
- A Version-Aware Runtime: This is the critical missing piece. If your runtime (e.g., Spark) isn’t aware of these multi-table branches, it can’t execute a pipeline transactionally. Without this, a pipeline might update two of four tables and then fail, leaving your data lake in an inconsistent state. No individual table is corrupted, but the numbers across your lake are wrong.
The integration of all three components ensures that nothing you do on your data lake happens in isolation from your code, and nothing gets lost—you can always travel back in time and reproduce things deterministically.
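For illustration, here is a minimal sketch of how those three layers compose. The client and method names (`connect`, `create_branch`, `run_pipeline`, `merge`) are hypothetical placeholders, not Bauplan's actual API.

```python
# Hypothetical sketch of the three layers; names are illustrative only.
from datetime import datetime, timedelta, timezone
import lakehouse_client  # hypothetical SDK standing in for the platform

client = lakehouse_client.connect()

# 1. Table versioning: open table formats give per-table time travel.
yesterday = datetime.now(timezone.utc) - timedelta(days=1)
old_orders = client.query("SELECT * FROM orders", as_of=yesterday)

# 2. Multi-table versioning: a branch spans the whole lake, not one table.
branch = client.create_branch("feature/reprice-orders", from_ref="main")

# 3. Version-aware runtime: the pipeline runs inside the branch, so every
#    table it writes stays on that branch until the whole set of changes
#    is merged back into main together.
client.run_pipeline("repricing_pipeline", ref=branch)
client.merge(branch, into="main")
```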
What does it mean to make pipelines “atomic,” and why does this matter for a data lake?
When a data engineer thinks about their work, they think in pipelines, not individual tables. A single pipeline might update 100 different tables. We believe that when you merge the code for that pipeline, the corresponding data changes should also be merged as a single, atomic operation.
Our runtime is tightly integrated with the versioning system. When you run a pipeline, it operates on an isolated branch. The entire pipeline runs as a single atomic transaction. If it succeeds, all table changes are merged into the main branch at once. If it fails, nothing is merged, and the production data remains untouched and consistent. The isolated branch with the failed state is preserved, making it simple to debug the exact cause of the failure.
This means your data lake is never left in an inconsistent, intermediate state—either all 100 tables are updated successfully, or none of them are. This is “as transactional as you can get” for a lake: atomic merges at the pipeline level.
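As a hedged sketch of that all-or-nothing behavior, continuing the hypothetical client from the earlier example:

```python
# Hypothetical sketch: the pipeline's table changes merge together or not at all.
branch = client.create_branch("run/nightly-metrics", from_ref="main")
try:
    # Every table the pipeline writes lands on the branch, invisible to main.
    client.run_pipeline("nightly_metrics", ref=branch)
except Exception:
    # Failure: main is untouched and consistent; the branch freezes the
    # broken intermediate state so it can be debugged later.
    print(f"Pipeline failed; inspect branch {branch}")
else:
    # Success: all table changes become visible on main in one atomic merge.
    client.merge(branch, into="main")
```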
Why does transactionality matter for analytics workloads when they’re not high-concurrency transactional systems?
While traditional analytics isn’t high-concurrency in the database sense, there’s a practical tradeoff: without strong transactional guarantees, you end up with a system that’s impossible to debug when things break across 10 different fragmented places. As analytics becomes part of everyday product development—not just internal reporting—you want your lake to be as transactional as possible.
More importantly, this becomes critical when agents enter the picture. If you believe that in 5-10 years you’ll have one data engineer managing hundreds of agents, you’re moving toward the same high-concurrency problem that transactional databases solved. Your lake needs to be transactional to handle multiple artificial systems working on the same data pool simultaneously.
What is the “write-audit-publish” pattern, and why should it be everywhere, not just at ingestion?
Write-audit-publish (also called blue-green deployment for data) originated at Netflix and refers to ensuring data is vetted before it hits your production lake. Traditionally, this has been limited to data ingestion.
The question is: why limit this to just ingestion? Every operation on your lake should work this way. Every operation should involve isolating the data, running a transactional job (whether a pipeline or single operation), and then merging atomically if it makes sense—or keeping the branch open for debugging and potential rollback.
This built-in transactionality enables write-audit-publish to be the default for every operation, not just a special pattern for ingestion. Every change is made in isolation, can be vetted, and is then merged atomically, guaranteeing that your production lake is always coherent and dependable. When agents enter the picture and you have enormous numbers of artificial systems working together on the same data pool, this isolation everywhere becomes essential.
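A minimal sketch of write-audit-publish as the default for any operation, assuming the same hypothetical client as above:

```python
# WRITE: land new data on an isolated branch, never directly on main.
branch = client.create_branch("ingest/events-2024-06-01", from_ref="main")
client.import_data("s3://raw-bucket/events/2024-06-01/", table="events", ref=branch)

# AUDIT: run checks against the branch before anything becomes visible.
rows = client.query("SELECT COUNT(*) AS n FROM events", ref=branch)[0]["n"]
null_ids = client.query(
    "SELECT COUNT(*) AS n FROM events WHERE user_id IS NULL", ref=branch
)[0]["n"]

# PUBLISH: merge atomically only if the audit passes; otherwise keep the
# branch open for debugging and roll back by discarding it.
if rows > 0 and null_ids == 0:
    client.merge(branch, into="main")
else:
    print(f"Audit failed; leaving {branch} open for inspection")
```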
How does this approach enable teams to scale without typical bottlenecks?
For scaling teams, this means multiple developers or teams can create their own branches of the data and work independently on new pipelines or fixes in a secure, isolated cloud environment. They can merge their changes back without risking the integrity of the production data lake.
When you have a system that’s essentially “Git for data” and you lower the technical bar for cloud development, your central data team can give real autonomy to product teams, data science teams, and application teams without constant babysitting. Developers get their own branches and environments where they can work securely. The central team retains full observability and can always monitor everything and roll back if needed.
This isn’t just about growing a central data team—it’s about enabling adjacent teams to work independently when the central team becomes a bottleneck, allowing product and data science teams more autonomy without requiring constant hand-holding from the platform team.
Safety, Governance & AI Agents
What does “safety” mean in the context of a data platform?
Safety means designing the system so that no mistake is irreversible. Every operation executes in an isolated branch using the write-audit-publish style. If anything fails—even a misguided DROP TABLE command—it never hits production. You can diff, debug, and roll back deterministically, with full change history.
When something breaks, the system creates a branch that freezes the broken state in time and prevents further changes from propagating to the underlying data. This means your production lake never ends up in an inconsistent state. The rollback mechanism is built into the core workflow by design, not as a reactive measure.
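To make the rollback story concrete, a small hypothetical example (illustrative names only, same stand-in client as before):

```python
# Even a destructive command runs against a branch, never production.
scratch = client.create_branch("experiment/cleanup", from_ref="main")
client.execute("DROP TABLE legacy_customers", ref=scratch)

# Diff the branch against main to see exactly what would change...
print(client.diff(scratch, against="main"))

# ...and discard the branch to roll back; main was never touched.
client.delete_branch(scratch)
```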
How does fixing concurrency enable governance?
Once you have a truly transactional system with proper concurrency controls and principled isolation, governance becomes much more straightforward. We know this because databases figured this out—governance in databases is relatively well-understood and easier than governance on data lakes.
When you can properly isolate different writers and readers in a principled way, access control comes naturally. It’s not that hard to build that layer once the concurrency foundation is solid. Consistent isolation is the foundation on which sensible permissions and lineage actually work. This is why systems like Databricks’ Unity Catalog resonate with enterprises—it’s the first time they can actually handle governance in a gigantic data lake.
How does this transactional, version-controlled approach enable the use of AI agents for data engineering?
The biggest barrier to using AI agents for real data engineering work is safety. We are comfortable letting a coding agent like Claude write code in our IDE because we have Git—we can review the changes, and if something goes wrong, we can easily roll back. But nobody in their right mind would let an agent write directly to a production data lake because that rollback capability doesn’t traditionally exist.
Our platform provides that safety net. Because every operation is isolated in a branch and executed transactionally, an agent can be tasked with writing to the lake. For example, you can tell an agent, “This pipeline broke; please fix it.” The agent can check out a branch, write the code, run the pipeline, and submit the result as a pull request with both the code and data changes. The human engineer can then review the result and merge it confidently. The worst-case scenario is the agent does something wrong on a branch that can be discarded without ever affecting production.
This moves agents from being read-only analysis tools to being active participants that can write, fix, and build. If agents only read data—exploring and analyzing without making changes—you can automate what analysts do. The next level is more interesting: when the agent identifies the fix, it can implement the change and return with a branch and a pull request showing what it did before you merge. That’s true automation of engineering work, not just analyst work.
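A sketch of that “please fix this pipeline” loop might look like the following; the `agent`, `client`, and `open_pull_request` objects are all hypothetical stand-ins, not a documented interface.

```python
# Hypothetical agent loop: fix a broken pipeline on an isolated branch.
branch = client.create_branch("agent/fix-nightly-metrics", from_ref="main")

for attempt in range(5):
    patch = agent.propose_fix(
        task="The nightly_metrics pipeline fails on step 'join_orders'; fix it.",
        context=client.get_logs("nightly_metrics", ref=branch),
    )
    client.apply_code_change(patch, ref=branch)
    result = client.run_pipeline("nightly_metrics", ref=branch)
    if result.succeeded:
        # The pull request bundles the code change with the branch that
        # holds the corresponding data changes, ready for human review.
        open_pull_request(code=patch, data_ref=branch)
        break
# Worst case: the branch is discarded and production data is never affected.
```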
What is the MCP server’s role, and why does it matter for agents?
The MCP (Model Context Protocol) server is the translation layer between a coding agent and the Bauplan platform. Designing a good MCP server is crucial for enabling agents to work with the platform effectively.
The key insight: we’re comfortable with Claude writing our code because we can roll back in Git. When you introduce the same rollback safety to data lakes, agents become viable for actual data engineering work—not just reading and exploring data, but actually writing to the lake.
When customers got access to the MCP server, they spontaneously started using Claude and other agents for actual data engineering without much prompting from us. They’re using the existing Git-for-data abstraction through the MCP to do something unprecedented: asking agents to write data into the lake. This rapid adoption speaks to how fundamentally work patterns are changing—these data engineers already spend their day in tools like Claude Code, so extending that workflow to their data infrastructure is natural.
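What such a server could look like, as a sketch: the FastMCP interface comes from the official MCP Python SDK, but the tools it exposes and the lakehouse client calls inside them are illustrative assumptions rather than a description of Bauplan’s actual MCP server.

```python
from mcp.server.fastmcp import FastMCP
import lakehouse_client  # hypothetical SDK

client = lakehouse_client.connect()
mcp = FastMCP("lakehouse")

@mcp.tool()
def create_branch(name: str, from_ref: str = "main") -> str:
    """Create an isolated data branch the agent can safely write to."""
    return client.create_branch(name, from_ref=from_ref)

@mcp.tool()
def run_pipeline(pipeline: str, ref: str) -> str:
    """Run a pipeline transactionally against a branch."""
    return str(client.run_pipeline(pipeline, ref=ref))

@mcp.tool()
def merge_branch(name: str, into: str = "main") -> str:
    """Atomically publish a branch's table changes after review."""
    return str(client.merge(name, into=into))

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio to a coding agent
```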
Code-First Architecture
Why is a code-first approach essential, especially given that users can now use natural language prompts?
Users will use prompts, but agents will use code. Operating with Bauplan feels like working with a Python package—you pip install it, work primarily in your IDE, and whenever you do something that changes the lake state, you do it through code.
All platform functionality—branching, querying, running pipelines, importing data, dropping tables, merging branches—is exposed as an API in your code. The reason is simple: UIs are great for human interaction and manual tasks but terrible for automation and anything done at scale.
By exposing every platform capability as a Python API, we make the platform natively operable by other programs, especially AI agents. Agents are great at writing and executing code; they are terrible at navigating UIs. A code-first design allows for principled, imperative logic to be built around what the LLM does, providing that crucial layer of security and control.
Current “agent washing” involves platforms built for human UIs with agents bolted on top. But you can’t just put an agent on top of a UI-driven platform and expect it to work effectively for automation. The platform’s core interfaces must be designed from the ground up for programmatic interaction.
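One way to picture that imperative layer, as a hedged sketch (the policy and names are illustrative):

```python
# Because every platform capability is a plain function call, agent
# proposals can be wrapped in ordinary code-level guardrails.
ALLOWED_OPERATIONS = {"create_branch", "run_pipeline", "query", "merge"}
PROTECTED_REFS = {"main"}

def execute_agent_action(client, action: dict):
    """Run an agent-proposed operation only if it passes explicit checks."""
    op, args = action["op"], action.get("args", {})
    if op not in ALLOWED_OPERATIONS:
        raise PermissionError(f"Operation {op!r} is not allowed for agents")
    if op == "merge" and args.get("into") in PROTECTED_REFS:
        raise PermissionError("Merges to main go through a human-reviewed pull request")
    # Plain Python: composes with logging, retries, and tests like any code path.
    return getattr(client, op)(**args)
```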
How does this compare to the DevOps shift to infrastructure-as-code?
This mirrors the cloud era when DevOps shifted from manual processes to infrastructure-as-code. As data becomes critical for systems that embed AI everywhere, we need a way to program infrastructure at scale.
Our vision is that data infrastructure needs to be programmable, much like how DevOps transformed when the cloud emerged. Exposing the entire platform as code allows AI systems to manipulate the platform natively, because AI systems excel at coding and are terrible with UIs. It also provides the principled, imperative layer on top of what LLMs do, which is crucial for security and control.
What specific design choices need to change when building for agents versus humans?
Consider query optimization as an example. A data engineer on Snowflake, paying per query, has an incentive to carefully craft and optimize queries before running them. They minimize iterations to control costs.
Agents work completely differently—they work better with fast iterations over similar queries, fixing one line or one filter at a time. They improve through rapid, incremental retries. Now imagine hundreds of agents hitting your system with suboptimal queries, iterating their way to the right answer. Is your system designed for that high-throughput, high-redundancy scenario? Most warehouses aren’t, because nobody uses them that way today.
This implies fundamentally different infrastructure requirements: high-throughput workloads with smart scheduling and caching to absorb agent iteration patterns efficiently. Additionally, you might not always want to use your most expensive, sophisticated model. Sometimes you want a cheaper, simpler model that’s been trained or fine-tuned for specific tasks. Your infrastructure needs to support fast iteration for these less sophisticated agents rather than force them to work like expensive, thoughtful models.
When you look at actual agent traces, they don’t behave exactly like humans—your system needs to be redesigned with that in mind, allowing agents to work in a way that is natural to them.
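As one illustration of infrastructure that absorbs this pattern, here is a toy result cache keyed on a normalized query string; this is a simplified assumption about how such a layer might work, not any particular engine’s design.

```python
import hashlib
import re

_cache: dict[str, object] = {}

def _normalize(sql: str) -> str:
    """Collapse whitespace and case so near-identical agent retries share a key."""
    return re.sub(r"\s+", " ", sql.strip()).lower()

def cached_query(client, sql: str):
    """Serve repeated variations of the same query from cache instead of cold storage."""
    key = hashlib.sha256(_normalize(sql).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = client.query(sql)  # hypothetical client call
    return _cache[key]
```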
Platform Architecture & Integrations
What is the “Programmable Lakehouse” you describe?
The Programmable Lakehouse is the culmination of these ideas—a next-generation data stack built from first principles for AI systems to operate both the compute and data management layers.
It’s a system where every operation is atomic and versioned, and every capability is exposed through a code-based API. This creates an environment where data engineers can manage swarms of AI agents that perform the bulk of the repetitive work—fixing pipelines, optimizing queries, performing root cause analysis, and even building new transformations. The data engineer’s role evolves from doing the manual work to architecting the system and orchestrating these agents, allowing them to focus on higher-level business problems.
What’s the current status of the platform?
Bauplan is currently in production with early customers—mostly startups plus one large enterprise—but as a seed-stage company, it’s not yet generally available. The initial validation shows that this approach resonates with a new generation of companies.
The core functionality is shipping today: branchable multi-table versioning, atomic pipeline merges, and a Python runtime that treats pipelines like code. Building data infrastructure takes significant time and engineering effort, so there’s substantial work ahead on the roadmap—focusing on general availability, distributed scaling, smarter scheduling, and caching layers tailored to agent-driven workloads. The underlying programmable lakehouse platform needs to be fully-fledged and end-to-end so you can safely use agents on your data and sleep at night.
How does Bauplan interact with orchestration tools like Airflow, Dagster, or Prefect?
We see a clear separation of concerns. Tools like Airflow and Dagster are orchestrators—they handle scheduling, monitoring, and retries at a higher level in the stack. Bauplan is a runtime—we are the engine that actually executes the pipelines and manages data, branches, and atomic merges.
Our clients use their existing orchestrators, and a task in their Airflow DAG or Dagster graph will simply make an API call to Bauplan to run a pipeline, import data, or create a branch. We integrate with all of them.
The key principle is separation of concerns: your orchestration layer should not be your runtime. We’ve seen many projects become unmanageable because complex data transformation logic gets crammed inside massive Airflow projects or orchestration logic. This creates a form of lock-in—not by a vendor, but by your own design choices. At a certain point, the entire system is designed around that orchestrator, making it incredibly difficult to migrate or evolve. By keeping the runtime separate and putting transformation logic in the runtime rather than inside giant DAGs, you maintain a cleaner architecture with better separation of concerns.
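In practice the orchestrator task stays thin. Here is a sketch using Airflow’s TaskFlow API; the `lakehouse_client` calls are hypothetical stand-ins for the runtime API.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_metrics():

    @task(retries=2)
    def run_in_runtime():
        import lakehouse_client  # hypothetical SDK for the runtime
        client = lakehouse_client.connect()
        # The orchestrator only schedules and retries; branching, execution,
        # and the atomic merge all happen in the runtime.
        branch = client.create_branch("airflow/nightly-metrics", from_ref="main")
        client.run_pipeline("nightly_metrics", ref=branch)
        client.merge(branch, into="main")

    run_in_runtime()

nightly_metrics()
```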
Which orchestration approaches make sense for agent workflows?
Keep orchestration thin and separate. Tools that bring simplicity and durable function concepts (like Temporal or Inngest) will be interesting for agentic workflows, but the key principle remains: simplicity wins. Keep orchestration separate from runtime and data management, or you’ll be trapped in quicksand of your own making.
Semantic Layers & Context
What is your view on the role of a semantic layer or “context store”?
A semantic layer, or what some call a “context store,” is critically important—especially for AI agents and operationalizing “chat with your data” use cases. To successfully automate these scenarios or have an agent reason about your data, you must provide it with context—definitions of metrics, business logic, and relationships. Without this, you risk the LLM acting as an opaque oracle, producing numbers that you can’t trust or verify. All the “chat with your data” projects will fail without fixing the context layer.
Lowering the barrier for non-SQL users is great, but if you can’t control the semantic layer, you risk people treating the LLM as an oracle with no way to verify if the data or numbers are correct.
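For a concrete sense of what “context” means here, a minimal sketch of vetted metric definitions handed to an agent; the structure is illustrative, not any specific product’s format.

```python
METRICS = {
    "weekly_active_users": {
        "description": "Distinct users with at least one session in the last 7 days.",
        "sql": (
            "SELECT COUNT(DISTINCT user_id) FROM sessions "
            "WHERE session_start >= CURRENT_DATE - INTERVAL '7' DAY"
        ),
        "owner": "growth-team",
    },
}

def build_agent_context(metric_names: list[str]) -> str:
    """Serialize agreed-upon definitions so an agent's answer can be traced
    back to a vetted formula instead of an opaque guess."""
    blocks = []
    for name in metric_names:
        m = METRICS[name]
        blocks.append(f"{name}: {m['description']}\nSQL: {m['sql']}")
    return "\n\n".join(blocks)
```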
Why isn’t Bauplan building its own semantic layer?
Bauplan is an infrastructure company focused on the compute and data management substrate. We are not building a semantic layer product ourselves for several reasons:
First, composability is critical in 2025 and beyond. This transition to agent-driven workflows won’t happen overnight, and legacy platforms will persist (just like IBM mainframes still exist in banks). Some workflows will migrate to new platforms, others won’t. Locking users into a proprietary semantic layer doesn’t serve this reality.
Second, there is currently no consensus on the best way to define and manage semantics, and the landscape is still evolving. Traditionally, semantics were defined inside BI tools, which creates silos that nobody can tap into. The next generation (companies like Cube.dev) extracted semantics from consumption layers, which is clearly better design—it will pay off when you want to standardize across multiple systems. But there’s no clear consensus or dominant player yet.
Third, we plan to integrate with and leverage emerging standards and tools in this space. For example, the metadata capabilities in Apache Iceberg already allow for propagating context down to the table level, which is something we can capture and utilize. When there’s a clear integration to build, it will be built. We want our users to be able to use whichever context or semantic layer makes sense for them, not be locked into our choice.
Future of Data Engineering
In five years, how do you see the role of a data engineer evolving?
If our vision is correct—and it’s looking more validated now than when we started—the work of a data engineer will change dramatically. We expect spending on data compute to grow, while the size of data engineering teams will shrink relative to the work they accomplish.
A single data engineer will be able to achieve 10x the output of today by offloading the repetitive and difficult work of debugging, root cause analysis, incident response, and pipeline fixing onto AI systems. The data engineer becomes an orchestrator of AI systems, managing swarms of agents that perform the bulk of the repetitive work.
What will data engineers focus on instead of repetitive tasks?
The data engineer’s role will shift to that of an architect and orchestrator of AI agents. In the ideal scenario, they will focus on architecture decisions—the strategic “how do we build this?” questions rather than the tactical firefighting.
More importantly, they’ll be able to get closer to the business and work more closely on understanding the “why” behind what they’re building. This is a significant problem in data engineering today: the infrastructure is so complicated that people get fixated on it and lose track of why they’re doing the work in the first place. By freeing them from the low-level complexities of infrastructure management, data engineers will be able to focus on the most important questions: How should this system be designed? What are the core business problems we are trying to solve? They’ll have a greater impact on the “why” behind the data, not just the “how.”

