Zhen Lu on AI-First Clouds, Production Use Cases, and GPU Reliability.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Zhen Lu, CEO of Runpod, joins the podcast to discuss what it means to build an “AI-first” cloud, moving beyond the architectures of the Web 2.0 era. He shares practical insights into the most common production use cases for AI today, including generative media and fine-tuned small language models for enterprise agents. The conversation covers the critical challenges of achieving reliability in production, from fragile GPUs to silent “gray outages,” and explores what the future of agent-driven software means for developers and infrastructure providers.
Interview highlights – key sections from the video version:
- Defining AI-First Cloud Infrastructure vs Traditional Cloud
- Key Differentiators and Customer Requirements for AI Cloud
- Target Market: From Fine-Tuning to Production Workloads
- GPU Supply Dynamics and Enterprise Access to Compute
- AMD GPU Adoption and Alternative Hardware Options
- Production Use Cases: Generative Media and Creative Applications
- Enterprise Agent Deployments and Small Language Models
- Specialized AI Services: High-Accuracy Transcription
- Reasoning Models vs Traditional Models in Production
- Beyond Serverless: Meeting AI Developers Where They Are
- The Future: Agents Replacing Traditional Software
- Reliability Challenges Across Hardware, Network, and Software
- Composability and Starting Simple in AI Infrastructure
- Future Directions: On-Device Inference and Platform Evolution
Related content:
- A video version of this conversation is available on our YouTube channel.
- Is your LLM overkill?
- How Tech-Forward Organizations Build Custom AI Platforms
- Andrew Rabinovich → Why Digital Work is the Perfect Training Ground for AI Agents
- Anant Bhardwaj → Predictability Beats Accuracy in Enterprise AI
- Hagay Lupesko → Beyond GPUs: Cerebras’ Wafer-Scale Engine for Lightning-Fast AI Inference
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
AI-First Cloud Infrastructure
What distinguishes an AI-first cloud from traditional cloud providers like AWS?
An AI-first cloud is fundamentally different because it’s co-designed from the hardware up through the software to serve AI workloads. Traditional clouds were built for Web 2.0 applications that were primarily I/O-bound, focused on shuttling small amounts of data between services. AI workloads are compute-bound and require moving orders of magnitude more data – from model weights to kernels to media files.
The key differentiator is starting from first principles with hardware and software working in tandem. This means tight integration of GPUs, high-bandwidth networking, and storage layers with software abstractions that expose the right controls. For example, an AI-first cloud provides multi-level caching beyond the storage layers it exposes – caching in shared memory as well as storage sharded across different storage types globally. This requires low-level hardware access that software-only companies can’t achieve, and you can’t get this level of optimization by simply cobbling together existing services from a traditional cloud provider.
What specific capabilities should practitioners expect from an AI-first platform?
Teams should expect hardware-software co-design features like model-aware placement and high-throughput networking optimized for AI workloads. The platform should provide multi-tier caching and sharded storage specifically tuned for model weights and large artifacts, not just generic buckets and volumes. Operational ergonomics are critical: job orchestration, self-healing capabilities, health probes, utilization telemetry, and simple developer access controls so you aren’t managing thousands of SSH keys. The platform needs to handle scheduling, health checks, fault tolerance, and support for heterogeneous clusters without forcing teams to assemble these capabilities from scratch.
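As a concrete illustration of the kind of runtime health checks and utilization telemetry described above, here is a minimal per-worker probe sketch. It assumes NVIDIA hardware and the pynvml bindings; the thresholds and the plain print-as-reporting hook are placeholder assumptions you would replace with your own monitoring stack.

```python
# Minimal GPU health/utilization probe (sketch).
# Assumes NVIDIA GPUs and the `pynvml` package; thresholds and the
# reporting hook are illustrative placeholders, not a product API.
import time
import pynvml

TEMP_LIMIT_C = 90        # assumed thermal threshold
PROBE_INTERVAL_S = 30    # assumed polling interval

def probe_gpus():
    pynvml.nvmlInit()
    try:
        results = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            results.append({
                "gpu": i,
                "util_pct": util.gpu,
                "mem_used_gb": mem.used / 1e9,
                "temp_c": temp,
                "healthy": temp < TEMP_LIMIT_C,
            })
        return results
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    while True:
        for reading in probe_gpus():
            print(reading)   # replace with your telemetry sink
        time.sleep(PROBE_INTERVAL_S)
```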
Which teams actually need an AI-first cloud platform?
Teams that benefit most are those running production workloads, ongoing fine-tuning, or advanced training runs. If you’re just doing basic RAG applications that call third-party APIs like OpenAI, you might not need to run any AI workloads yourself and can get by with traditional cloud services. The need for AI-first infrastructure arises when you decide to run, fine-tune, and operationalize AI models yourself to gain more control, better performance, or meet specific requirements.
For simple supervised fine-tuning experiments, you mainly need framework support and basic GPU access. But as you scale to production inference with SLAs, multi-day training runs, or fleets of heterogeneous GPUs, you need infrastructure that’s self-healing, has robust health checks, and can recover from inevitable hardware failures. The challenge isn’t just getting GPUs – it’s operationalizing them with proper software layers, networking, and the ability to handle non-homogeneous workloads effectively.
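To make the self-healing point concrete, here is a hedged sketch of a training loop that assumes hardware will fail mid-run and resumes from the latest checkpoint rather than expecting a clean pass. It assumes PyTorch; the model, optimizer, data source, and checkpoint path are placeholders.

```python
# Resumable training loop (sketch): assume a GPU or node can die at any
# step and design for restart-from-checkpoint rather than a clean run.
# PyTorch is assumed; model/optimizer/batches are placeholders.
import os
import torch

CKPT_PATH = "checkpoint.pt"   # in production, keep this on shared storage

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps, ckpt_every=100):
    start = load_checkpoint(model, optimizer)   # resume if a prior run died
    for step in range(start, total_steps):
        loss = model(next(batches)).mean()      # placeholder forward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)
```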
Hardware and Compute Reality
Is GPU availability still a major constraint for enterprises?
For foundation model companies training models with tens or hundreds of thousands of GPUs, yes – supply remains constrained. But for most enterprises, GPU availability isn’t the primary challenge anymore. The real difficulty is getting compute that can be operationalized – with proper software layers, co-location, networking, and developer-friendly interfaces.
Many companies that purchased thousands of GPUs discovered they had no idea how to actually deploy and manage them effectively. The challenge lies in the software layer that wraps the hardware: How do you manage networking? How do you divide GPU resources among developers without creating a security and management nightmare? The difficulty is in the software and ergonomics that make the hardware usable for your business.
What’s the current state of AMD GPU usability for AI workloads?
AMD GPUs are better than six months ago and can perform quite well once properly configured, but there’s still an unfortunate barrier in software. Most issues stem from frameworks and models not being updated to support ROCm out of the box. Getting AMD GPUs to work often requires dealing with the pain of an immature software stack, and success depends on tribal knowledge specific to your workload.
Teams currently adopting AMD are either extremely cost-sensitive and willing to deal with setup pain, or have specific incentives to make it work. It’s similar to the adoption period whenever NVIDIA releases a new architecture – requiring updates to nightly builds and dealing with alpha/beta software. For AMD to see wider adoption, the developer experience needs to become much smoother.
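A quick way to sanity-check whether a given stack actually sees an AMD GPU: ROCm builds of PyTorch route AMD devices through the familiar torch.cuda interface and report the HIP version. A minimal check, assuming a ROCm build of PyTorch is installed:

```python
# Sanity check for an AMD/ROCm PyTorch install (sketch).
# ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
# and set torch.version.hip; on NVIDIA/CUDA builds torch.version.hip is None.
import torch

print("PyTorch:", torch.__version__)
print("HIP (ROCm) version:", torch.version.hip)   # None on CUDA builds
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Tiny smoke test: a misconfigured ROCm stack often fails here
    # rather than at import time.
    x = torch.randn(1024, 1024, device="cuda")
    print("Matmul OK:", (x @ x).shape)
```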
Production Use Cases and Patterns
What are the most common AI applications actually running in production?
Three major categories dominate production deployments:
Generative Media: This remains huge and compute-intensive. Applications include fashion virtual try-on systems, real estate staging with AI-generated video walkthroughs, and digital avatars with voice cloning. These started with image generation but have expanded to video and multimodal use cases combining realistic avatars, text, and voice. These applications are both computationally expensive and bandwidth-hungry.
Small Language Model Agents: Companies are running sub-70B parameter models for customer support and internal workflows. These teams typically start by prototyping with providers like OpenAI or Anthropic but migrate to running their own fine-tuned open-weight models. They make this switch to gain more control, predictability, and better performance on their specific tasks while avoiding issues with model deprecation, lack of control, or unpredictable behavior changes.
High-Accuracy Transcription: While transcription might seem like a commodity, specialized services require extremely high accuracy or custom functionality. Companies whose core business relies on transcription run their own models to fine-tune them for specific audio environments and add features like precise timestamping or in-process text categorization. Doing this in a single pass on the same compute worker is more efficient than a multi-step pipeline.
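As an illustration of that single-pass pattern, here is a hedged sketch that transcribes with timestamps and tags each segment on the same worker. It assumes the open-source openai-whisper package; the categorizer is a deliberately trivial keyword check standing in for whatever fine-tuned, in-process classifier a real service would run.

```python
# Single-worker transcription + in-process categorization (sketch).
# Assumes the open-source `openai-whisper` package; the keyword-based
# categorizer is a stand-in for a real fine-tuned classifier.
import whisper

CATEGORIES = {                     # illustrative labels only
    "billing": ["invoice", "charge", "refund"],
    "support": ["error", "crash", "bug"],
}

def categorize(text):
    lowered = text.lower()
    for label, keywords in CATEGORIES.items():
        if any(k in lowered for k in keywords):
            return label
    return "other"

def transcribe_and_tag(audio_path):
    model = whisper.load_model("base")      # swap for a fine-tuned model
    result = model.transcribe(audio_path)
    # Each segment already carries start/end timestamps; categorize it in
    # the same process instead of shipping text to a second pipeline step.
    return [
        {"start": seg["start"], "end": seg["end"],
         "text": seg["text"], "category": categorize(seg["text"])}
        for seg in result["segments"]
    ]

if __name__ == "__main__":
    for row in transcribe_and_tag("call.wav"):
        print(row)
```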
Why are teams choosing to run their own smaller, fine-tuned models instead of using large API providers?
Teams make this choice for three main reasons. First, control and predictability – API providers can deprecate models, which breaks applications that rely on carefully engineered prompts. Enterprises need reliability and can’t afford unexpected system failures. Second, a fine-tuned smaller open-source model can outperform much larger general-purpose models for narrow, specific use cases. Third, once tuned, a smaller model is significantly cheaper to run for inference at scale, making the economics of the application viable.
Additionally, teams face issues with IP risk, data privacy constraints, latency requirements, and governance concerns that push them toward owning more of the stack. The investment in prompts and workflows makes model switching expensive, so teams want to maintain consistency even as base models evolve.
What percentage of production inference uses reasoning models versus standard models?
Reasoning models remain in the minority due to cost, speed, and unpredictability. The challenge for enterprises is budgeting – especially with multi-turn agentic workflows where thinking time is essentially unbounded. Until pricing becomes more outcome-focused rather than compute-time based, adoption will remain limited to specific high-value use cases. For many enterprise tasks, smaller fine-tuned models deliver better cost-reliability tradeoffs.
Model Deployment and Fine-tuning
Why is fine-tuning still important when foundation models keep improving?
Fine-tuning provides two critical benefits that remain valuable regardless of foundation model improvements. First, it allows smaller models to outperform larger general-purpose models for specific tasks, providing superior performance on narrow domains. Second, it gives teams complete control over their deployment – avoiding model deprecation issues, maintaining consistent behavior, and controlling costs.
Teams are discovering that a fine-tuned smaller open-source model can perform better than ChatGPT for narrow use cases while being more predictable and cost-effective. The investment in prompts and workflows makes model switching expensive, so fine-tuning helps teams maintain consistency even as base models evolve. This approach also addresses IP risk and data boundary requirements that many enterprises face.
What about supervised fine-tuning versus more advanced reinforcement learning approaches?
Supervised fine-tuning (SFT) is the workhorse because it’s explainable, repeatable, and easy to operate. The workflow is straightforward and well-understood by most teams. Reinforcement learning fine-tuning is gaining interest and represents a potential unlock in performance, but it’s much harder to implement.
The RL workflow requires non-homogeneous clusters, different environments, and most importantly, a UX that allows domain experts to understand and participate in the process. It needs better orchestration tools and tighter domain-expert workflows. The tooling and workflows aren’t mature enough yet for typical enterprise teams to adopt effectively. Expect adoption where the measurable lift justifies the extra complexity.
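For reference, the “straightforward and well-understood” SFT workflow often looks roughly like the following LoRA sketch. It assumes the Hugging Face transformers, datasets, and peft libraries; the base model name, dataset file, and hyperparameters are illustrative placeholders, not recommendations.

```python
# Supervised fine-tuning with LoRA adapters (sketch).
# Assumes Hugging Face transformers + peft + datasets; model name,
# dataset, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-3.2-1B"     # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL,
                                             torch_dtype=torch.bfloat16)

# Train a small set of adapter weights instead of the full model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
))

dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                           per_device_train_batch_size=4,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```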
Reliability and Operations
What are the main reliability challenges when running AI workloads in production?
Reliability issues occur at three main levels:
Hardware Failures: GPUs are surprisingly fragile with failure rates in the low single-digit percentages – much higher than traditional hardware. Teams need systems that can detect GPU failures, provide rapid replacement, and automatically recover jobs. You have to build your systems with the expectation that GPUs will break and have a recovery plan in place.
Network Bottlenecks: AI workloads move massive amounts of data, making bandwidth – not just latency – a critical constraint. Pushing model weights and media files across global links puts enormous strain on network infrastructure, especially for media workloads at scale.
Gray Outages: These are the most difficult issues – partial failures within customer workloads where compute appears to be running but will never complete. A workload might be stuck in a non-productive state due to subtle issues like network inconsistencies or kernel problems. The instance consumes expensive compute but won’t finish its task. These require collaboration between infrastructure providers and customers to implement proper monitoring hooks, utilization KPIs, and automated remediation strategies.
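One concrete remediation pattern for gray outages is a workload-level progress heartbeat: the job reports a step counter, and a watchdog flags instances whose counter stops advancing even though the process still looks alive. A minimal sketch, where the heartbeat file path, stall window, and remediation hook are all assumptions:

```python
# Gray-outage watchdog (sketch): detect workloads that look alive but have
# stopped making progress. The job writes its step counter to a heartbeat
# file; the watchdog flags instances whose counter has not advanced within
# a stall window. Paths, thresholds, and the remediation hook are assumed.
import json
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/job_heartbeat.json")   # written by the workload
STALL_SECONDS = 15 * 60                       # no progress for 15 min => stuck

def write_heartbeat(step):
    """Called from inside the training/inference loop."""
    HEARTBEAT.write_text(json.dumps({"step": step, "ts": time.time()}))

def watchdog():
    last_step, last_change = None, time.time()
    while True:
        if HEARTBEAT.exists():
            beat = json.loads(HEARTBEAT.read_text())
            if beat["step"] != last_step:
                last_step, last_change = beat["step"], time.time()
        if time.time() - last_change > STALL_SECONDS:
            # Hook point: restart the job, cordon the node, or page a human.
            print(f"gray outage suspected: no progress past step {last_step}")
            last_change = time.time()          # avoid re-alerting every loop
        time.sleep(60)
```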
What emerging practices help ensure production reliability?
Start simple and add complexity only as needed. Write composable code from the start, but deploy as a monolith initially for simplicity. As your application scales and needs become clearer, your platform should allow evolution toward microservices-style deployment without complete re-architecture.
Key practices include treating model weights as first-class data with cache hierarchies, smart prefetch, and sharded storage. Bake in runtime health with per-process probes, GPU/host/Kubernetes-level checks, and automatic failover. Instrument for utilization and throughput, not just success/error counts. Invest in workload-level signals like utilization plateaus and step-time drift. Add CI/CD for prompts, datasets, and model versions – test suites that pin expected behavior across model or provider swaps.
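As one way to “pin expected behavior,” here is a hedged sketch of a behavioral regression test that could run in CI whenever a prompt, dataset, or base model changes. The call_model helper and golden-set file are hypothetical placeholders; swap in your own inference client and curated examples.

```python
# Behavioral regression tests for prompts/models (sketch, pytest-style).
# `call_model` and golden_set.jsonl are hypothetical placeholders for your
# own inference client and curated golden examples.
import json
import pytest

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around whatever model/provider is deployed."""
    raise NotImplementedError

def load_golden_set(path="golden_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_golden_set())
def test_golden_behavior(case):
    # Each case pins a prompt to substrings the answer must (not) contain,
    # so a provider swap or quantization change can't silently drift.
    output = call_model(case["prompt"])
    for required in case.get("must_contain", []):
        assert required.lower() in output.lower()
    for banned in case.get("must_not_contain", []):
        assert banned.lower() not in output.lower()
```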
Focus on solving fundamental hardware and software maturity issues before optimizing for architectural elegance. Iteration and continuous deployment capabilities are more critical than perfect initial architecture.
Data Movement and Storage
Why isn’t “just use buckets and volumes” good enough for AI workloads?
Buckets and block storage are necessary but insufficient for production AI. These workloads need multi-level caching at shared memory, node-local SSD, and distributed cache levels, orchestrated by workload-aware policies. They also require sharded storage across regions and media types to reduce cold starts, control egress costs, and stabilize tail latency during scale-outs.
The platform needs to provide model-aware caching and placement strategies that understand the specific patterns of AI workloads – like the need to repeatedly load the same model weights across multiple workers or the burst patterns of batch inference. This goes far beyond what generic storage solutions can provide.
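A stripped-down version of the model-aware caching idea: check shared memory first, then node-local SSD, and only then fall back to remote object storage. The tier paths and the download_from_bucket helper below are assumptions for illustration, and the model is treated as a single weights file to keep the sketch short.

```python
# Multi-tier lookup for model weights (sketch):
#   shared memory (/dev/shm) -> node-local SSD -> remote object storage.
# Paths and the download helper are illustrative assumptions.
import shutil
from pathlib import Path

SHM_TIER = Path("/dev/shm/models")        # fastest tier, lost on reboot
SSD_TIER = Path("/mnt/local-ssd/models")  # node-local warm tier

def download_from_bucket(model_id: str, dest: Path) -> None:
    """Hypothetical fetch from remote, possibly sharded, object storage."""
    raise NotImplementedError

def resolve_weights(model_id: str) -> Path:
    shm_copy = SHM_TIER / model_id
    ssd_copy = SSD_TIER / model_id
    if shm_copy.exists():                      # hot: already in shared memory
        return shm_copy
    if not ssd_copy.exists():                  # cold: pull from the remote tier
        ssd_copy.parent.mkdir(parents=True, exist_ok=True)
        download_from_bucket(model_id, ssd_copy)
    shm_copy.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ssd_copy, shm_copy)           # promote to the fastest tier
    return shm_copy
```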
Agents and the Future of Software
How will agents change software development and deployment?
Agents will essentially become the software rather than being wrapped in deterministic code. We’re moving toward a world where businesses will use a mix of off-the-shelf agents for common tasks and custom-built agents for their unique, complex problems. While deterministic code won’t disappear – it’s still needed for things like mixture-of-experts routing and scaffolding agent systems – the focus will shift to building and iterating on agents themselves.
The most valuable skill will no longer be wrapping an AI model in deterministic code; it will be building, testing, and iterating on the agent itself. This requires new tools and platforms – for example, CI/CD for prompts – to support this new development lifecycle.
What infrastructure changes are needed as agents become ubiquitous?
Everything needs to be rethought with the assumption that software will increasingly interact with agents rather than humans. Databases, APIs, and other services designed for human interaction patterns will need fundamental redesigns. The challenge is making agents work together effectively, determining where deterministic code interfaces with agents, and building platforms that support rapid iteration as models and capabilities evolve.
Teams should shift engineering emphasis from wrapping agents with deterministic glue to engineering the agents themselves – their tools, memories, policies, evaluations, and iteration loops. The infrastructure needs to support this shift with appropriate abstractions and tooling.
Practical Guidance for Teams
If I’m leading an AI application team, what should I prioritize in the next quarter?
Focus on five key areas:
- Own your reliability loop: Implement comprehensive health checks, job resumption capabilities, and workload-aware caching. Treat failures as inevitable and build recovery into your system from the start.
- Stabilize behavior: Create CI/CD pipelines for prompts, datasets, and models with task-specific evaluations. Track outcomes across versions so changes in providers, quantization approaches, or base models don’t silently break your product. Store prompts and model choices as code, diff them, and gate changes with behavioral test suites including golden sets, adversarial cases, and cost/latency budgets.
- Right-size models: Prefer small fine-tuned models for cost and predictability. Add reasoning capabilities only where they measurably improve outcomes enough to justify the cost. Use quantization and distillation to push inference costs down – train on bigger clusters, then deploy smaller artifacts to cheaper hardware (see the quantized-loading sketch after this list).
- Design for data movement: Co-locate compute and storage when possible, use sharded weights and multi-tier caches. Understand that bandwidth, not just latency, is a critical bottleneck for AI workloads.
- Keep it simple first: Start with monolithic deployments using composable code paths. Break into microservices only when scale proves the need – for example, when you need to independently scale ASR, embedding, reranker, and generator components.
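Related to the right-sizing point above, loading a fine-tuned model with 4-bit quantization is one common way to push inference costs down. A minimal sketch, assuming Hugging Face transformers with bitsandbytes installed on an NVIDIA GPU; the model id is a placeholder for your own checkpoint.

```python
# Load a model with 4-bit quantization for cheaper inference (sketch).
# Assumes transformers + bitsandbytes on an NVIDIA GPU; the model id is a
# placeholder for your own fine-tuned checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "your-org/your-finetuned-model"     # placeholder

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",                         # spread weights across visible GPUs
)

inputs = tokenizer("Summarize this ticket:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```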
Is “serverless, instant, auto-scaling” the right target for AI infrastructure?
These terms are overloaded in the AI context. Instead, prioritize proximal objectives: fast cold starts via model-aware caching, scale-to-zero where it makes sense, resilient job resumption, and smooth developer ergonomics. You’ll get most of the benefits users expect from “serverless” without chasing a marketing label. The platform should let you promote components to microservices without re-architecting everything.

