The Data Exchange

The Practical Realities of AI Development

Lin Qiao on AI Dev Challenges, Model Convergence, Fine-Tuning & Infra Abstraction.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

 

Lin Qiao, CEO of Fireworks AI, dives into the practical challenges AI developers face, from UX/DX hurdles to complex systems engineering. Discover key trends like the convergence of open-source and proprietary models, the rise of agentic workflows, and strategies for optimizing quality, speed, and cost. Learn how modern infrastructure abstracts complexity, enabling teams to focus on building applications and owning their AI strategy.



Transcript

Below is a heavily edited excerpt, in Question & Answer format.

What are the key challenges developers face when building AI applications?

Developers encounter challenges on both UX/DX (user/developer experience) and technical fronts, and the nature of these challenges varies by developer background.

The balancing act involves addressing both sides: creating intuitive tools tailored to specific skill profiles while solving hard engineering problems like GPU allocation, multi-cloud orchestration, and latency management. Building production AI applications is therefore half product thinking, half systems engineering, with teams needing to optimize across three critical dimensions: quality, speed, and cost.

How is cloud infrastructure complexity being addressed for AI developers?

Virtual cloud infrastructure solutions now span multiple cloud providers (typically seven or more) and 30+ regions, handling concerns such as GPU procurement, regional failover, and user-facing SLAs.

This level of abstraction allows development teams to work strictly at the “call the model, ship the feature” layer without worrying about procuring GPUs, managing regional outages, or sustaining user-facing SLAs. By offloading these infrastructure concerns, developers can focus solely on building applications on top of foundation models rather than managing low-level infrastructure logistics.
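As a concrete illustration of that layer, here is a minimal sketch of what application code can look like when inference is fully abstracted. The endpoint URL, API key, and model name are placeholders rather than details from the conversation, and the openai client is assumed only because many hosted inference providers expose OpenAI-compatible APIs.

```python
# Minimal sketch: the application sees only a chat-completions call.
# GPU procurement, regional failover, and capacity management live
# behind the endpoint. URL, key, and model name are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="an-open-weight-model",  # placeholder identifier
    messages=[{"role": "user", "content": "Draft a release note for v2.3."}],
)
print(response.choices[0].message.content)
```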

What trends should developers be paying attention to in the foundation model space?

Key trends include the narrowing gap between proprietary and open-weight models, the split between reasoning and non-reasoning model categories, the rise of agentic workflows, and steady progress in making smaller fine-tuned models production-viable. Each of these is explored in the questions below.

What is the current state of proprietary versus open-weight models?

The performance gap is narrowing significantly, and the choice between proprietary models (OpenAI, Google Gemini) and open-weight models (Llama, DeepSeek, Qwen, Cohere Command R, Gemma) depends heavily on the use case and strategic goals.

No single model is universally best – performance heavily depends on the alignment between a model’s training data and your specific inference workload. The smallest gap between open and proprietary models exists in coding, largely because code outputs are easily verifiable, which provides clear signals for reinforcement learning.
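To make the verifiability point concrete: generated code can be executed against tests, which yields an objective reward signal for reinforcement learning. A minimal sketch with a toy task follows; a real RL pipeline would sandbox execution, and exec() here is purely illustrative.

```python
# Sketch: code outputs are verifiable, so they produce a clean reward
# signal for RL. A real pipeline would isolate execution in a sandbox.
def reward(generated_code: str, test_cases: list[tuple]) -> float:
    passed = 0
    for args, expected in test_cases:
        try:
            namespace = {}
            exec(generated_code, namespace)
            if namespace["solution"](*args) == expected:
                passed += 1
        except Exception:
            pass  # any failure earns no credit
    return passed / len(test_cases)

# Toy example: grade a model-written factorial function.
candidate = "def solution(n):\n    return 1 if n <= 1 else n * solution(n - 1)"
print(reward(candidate, [((0,), 1), ((5,), 120)]))  # -> 1.0
```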

The key recommendation is to architect applications to be model-agnostic from the beginning, allowing flexibility to swap models as capabilities evolve.
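One way to honor that recommendation, sketched below under the assumption of OpenAI-compatible endpoints, is to route all model access through a single configuration-driven wrapper, so that swapping models is a config edit rather than a code change. Provider URLs and model names are placeholders.

```python
# Sketch of a model-agnostic wrapper: provider details live in config,
# so swapping models is a configuration change, not a code change.
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class ModelConfig:
    base_url: str
    api_key: str
    model: str

# Edit these entries as model capabilities evolve; placeholders only.
CONFIGS = {
    "default": ModelConfig("https://api.provider-a.com/v1", "KEY_A", "model-a"),
    "experimental": ModelConfig("https://api.provider-b.com/v1", "KEY_B", "model-b"),
}

def complete(prompt: str, profile: str = "default") -> str:
    cfg = CONFIGS[profile]
    client = OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
    resp = client.chat.completions.create(
        model=cfg.model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```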

How should developers select and use different model types for specific tasks?

Models broadly fall into different categories with distinct strengths. Reasoning models (e.g., DeepSeek R1) are suited to planning and complex multi-step problems, while non-reasoning models (e.g., DeepSeek V3 and coder models) respond faster and work well as executors for well-defined tasks.

A common production pattern is implementing a router mechanism that directs prompts to the appropriate model type based on the task requirements. For instance, routing a complex reasoning task to a reasoning model for planning, then directing each sub-task to a faster executor model. This approach allows teams to build comprehensive workflows where different models handle specialized functions.
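A minimal sketch of that router pattern follows. The keyword-based classifier and the model names are hypothetical stand-ins; production routers typically use a small classifier model rather than string matching.

```python
# Sketch of a planner/executor router. Model names are placeholders,
# and classify() is a toy stand-in for a real classifier model.
REASONING_MODEL = "reasoning-model"   # slower, plans complex tasks
EXECUTOR_MODEL = "executor-model"     # faster, runs well-defined sub-tasks

def call_model(model: str, prompt: str) -> str:
    """Stub: replace with the application's model client."""
    raise NotImplementedError

def classify(prompt: str) -> str:
    # Toy heuristic; real routers often use a small classifier model.
    markers = ("plan", "design", "analyze", "architecture")
    return "reasoning" if any(m in prompt.lower() for m in markers) else "execution"

def route(prompt: str) -> str:
    model = REASONING_MODEL if classify(prompt) == "reasoning" else EXECUTOR_MODEL
    return call_model(model, prompt)

def run_workflow(task: str) -> list[str]:
    # Reasoning model drafts a plan; each sub-task goes to the executor.
    plan = call_model(REASONING_MODEL, f"Break this task into sub-tasks:\n{task}")
    sub_tasks = [line.strip() for line in plan.splitlines() if line.strip()]
    return [call_model(EXECUTOR_MODEL, sub) for sub in sub_tasks]
```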

How can developers optimize real-time AI applications, particularly for audio and video?

For real-time applications, especially those involving audio, latency optimization is critical: users perceive delays immediately in conversational settings. Decomposed pipelines, in which stages such as transcription, language-model inference, and speech synthesis run as separately optimized components, often outperform monolithic end-to-end models. Smaller models paired with fine-tuning are also worth considering, since they can meet tight latency and cost budgets while preserving task-specific quality.
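Because perceived responsiveness in streaming voice applications hinges largely on time to first token (synthesis can begin as soon as tokens start arriving), measuring it directly is a useful habit. A minimal sketch, again assuming an OpenAI-compatible streaming endpoint with placeholder names:

```python
# Sketch: measure time-to-first-token (TTFT) on a streaming response.
# Endpoint, key, and model name are illustrative placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example-inference-provider.com/v1",
                api_key="YOUR_API_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="small-fast-model",
    messages=[{"role": "user", "content": "Hi there!"}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # first visible token arrived
print(f"TTFT: {ttft:.3f}s" if ttft is not None else "no tokens received")
```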

What approaches to model fine-tuning are most practical for application developers?

Fine-tuning is almost always part of building production-ready AI applications, providing the "last-mile alignment" needed to hit specific quality, behavior, and reliability targets. Two primary approaches exist: Supervised Fine-Tuning (SFT), which adapts a model on curated input-output examples, and Reinforcement Fine-Tuning (RFT), which optimizes against a reward signal and works best when outputs can be verified automatically, as with code. Infrastructure and data considerations also factor into which approach is practical for a given team.
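To make the SFT path concrete, below is a sketch of assembling training data in the chat-style JSONL layout that many fine-tuning services accept; the schema is a common convention rather than a detail from the conversation, and the example rows are invented.

```python
# Sketch: curate SFT examples as chat-format JSONL, a layout many
# fine-tuning services accept. All content here is illustrative.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a billing-support assistant."},
            {"role": "user", "content": "Why was I charged twice this month?"},
            {"role": "assistant", "content": "That second line item is a pre-authorization hold, not a charge; it should drop off within 3 business days."},
        ]
    },
    # ...more curated examples demonstrating the target behavior
]

with open("sft_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```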

What’s the state of agentic workflows in production environments?

Agentic workflows are a major driver of foundation model usage, but with clear patterns. The current focus is on single-agent systems, which are what most production implementations deploy today; multi-agent systems are still emerging. A minimal version of the single-agent pattern is sketched below.
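Here is that minimal single-agent sketch: a bounded loop in which the model either requests a tool call or returns a final answer. The decide() helper and the tools are hypothetical stand-ins for a real tool-calling API.

```python
# Sketch of a single-agent loop: the model picks a tool or finishes.
# decide() and TOOLS are hypothetical stand-ins; the step budget is a
# common guardrail against runaway loops.
TOOLS = {
    "search_docs": lambda q: f"[top passages matching {q!r}]",
    "run_sql": lambda sql: "[query results]",
}

def decide(history: list[str]) -> dict:
    """Stub: a tool-calling model would choose the next action here."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 8) -> str:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "final":
            return action["answer"]
        observation = TOOLS[action["tool"]](action["input"])
        history.append(f"{action['tool']} -> {observation}")
    return "stopped: step budget exhausted"
```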

How are different organizations adopting AI, and what patterns are emerging?

Adoption is happening simultaneously across three main segments, rather than following the typical pattern where startups lead and enterprises follow later: AI-native startups, digital-native companies, and traditional enterprises.

As companies find product-market fit with AI, they increasingly want to own their AI strategy end-to-end, avoiding dependencies on centralized model providers. This ownership ensures their proprietary data fuels their own AI improvements rather than benefiting external applications.

What should teams consider when developing their “own-your-own-AI” strategy?

As teams reach product-market fit with AI applications, owning their AI strategy becomes vital. The main considerations are end-to-end control of the stack, the flexibility to swap models, balancing experimentation against production stability, and the underlying infrastructure choices.

Will smaller models replace larger ones for production applications?

Smaller models (≤10B parameters) have physical limitations—fewer parameters store less knowledge—but targeted, fine-tuned small models can already outperform larger ones on well-scoped tasks while meeting strict latency and cost budgets.

The key considerations point toward a two-tier future: large "generalist" models in the backend handling complex reasoning, with small "specialist" models at the edge or in tight latency loops handling specific tasks.
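In code, that two-tier split often appears as a cascade: try the small specialist first and escalate only when it is not confident. A sketch with placeholder models; the confidence check is the part each team implements differently (log-probabilities, schema validation, or a verifier model).

```python
# Sketch of a two-tier cascade. Model names are placeholders;
# is_confident() stands in for a real check such as logprob
# thresholds, schema validation, or a verifier model.
SMALL_MODEL = "small-specialist"   # fine-tuned, low-latency tier
LARGE_MODEL = "large-generalist"   # frontier-scale fallback tier

def call_model(model: str, prompt: str) -> str:
    """Stub: replace with the application's model client."""
    raise NotImplementedError

def is_confident(answer: str) -> bool:
    """Stub confidence check."""
    raise NotImplementedError

def answer(prompt: str) -> str:
    draft = call_model(SMALL_MODEL, prompt)   # fast, cheap common case
    if is_confident(draft):
        return draft
    return call_model(LARGE_MODEL, prompt)    # escalate the hard cases
```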

Does underlying hardware choice (NVIDIA vs. AMD, etc.) matter to application teams?

Hardware considerations are increasingly important as applications scale. The recurring themes are abstraction and management, emerging competition (NVIDIA versus AMD and other entrants), workload-specific optimization, and the timing of optimization efforts.

The simplest path is to use a platform that auto-routes workloads to the most cost-effective hardware, letting you benefit from emerging options without rewriting code.

What’s the bottom line for practitioners building AI applications?

Focus on user-visible latency and quality, build a model-agnostic architecture, fine-tune early for domain fit, and lean on multi-cloud inference platforms to hide hardware headaches. Keep continuous benchmarking in your CI/CD pipeline to evaluate new models as they emerge.
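That continuous benchmarking can be as lightweight as a pytest suite over a versioned golden set, run whenever a candidate model changes. A sketch follows; the golden set, the grader, and the call_model() stub are all hypothetical.

```python
# Sketch: a pytest regression gate over a golden set. Real suites
# usually mix exact-match checks, rubric graders, and latency budgets.
import pytest

GOLDEN_SET = [  # normally loaded from a versioned file
    {"prompt": "Extract the total from: 'Total due: $42.10'", "expected": "42.10"},
]

CANDIDATE_MODEL = "candidate-model"  # placeholder under evaluation

def call_model(model: str, prompt: str) -> str:
    """Stub: replace with the application's model client."""
    raise NotImplementedError

@pytest.mark.parametrize("case", GOLDEN_SET)
def test_candidate_model(case):
    output = call_model(CANDIDATE_MODEL, case["prompt"])
    assert case["expected"] in output  # simplistic grader
```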

AI capability is converging fast across providers—your competitive edge will come from product execution, solving real user problems effectively, and optimizing across the quality-speed-cost triangle, not from reinventing the ML infrastructure wheel.
