The Practical Realities of AI Development

Lin Qiao on AI Dev Challenges, Model Convergence, Fine-Tuning & Infra Abstraction.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

 

Lin Qiao, CEO of Fireworks AI, dives into the practical challenges AI developers face, from UX/DX hurdles to complex systems engineering. Discover key trends like the convergence of open-source and proprietary models, the rise of agentic workflows, and strategies for optimizing quality, speed, and cost. Learn how modern infrastructure abstracts complexity, enabling teams to focus on building applications and owning their AI strategy.

Subscribe to the Gradient Flow Newsletter

Transcript

Below is a heavily edited excerpt, in Question & Answer format.

What are the key challenges developers face when building AI applications?

Developers encounter challenges on both UX/DX (user/developer experience) and technical fronts. The nature of these challenges varies by developer background:

  • Machine learning engineers struggle with algorithm design, infrastructure optimization, and model fine-tuning complexities
  • Application developers without ML expertise face steep learning curves when adopting AI capabilities

The balancing act involves addressing both sides: creating intuitive tools tailored to specific skill profiles while solving hard engineering problems like GPU allocation, multi-cloud orchestration, and latency management. Building production AI applications is therefore half product thinking, half systems engineering, with teams needing to optimize across three critical dimensions:

  • Quality (output accuracy)
  • Speed (throughput and latency)
  • Cost (operational efficiency)

How is cloud infrastructure complexity being addressed for AI developers?

Virtual cloud infrastructure solutions now span multiple cloud providers (typically seven or more) and regions (30+), handling:

  • Automatic GPU provisioning and management
  • Cross-regional availability and failover
  • Performance optimization and SLA maintenance
  • Workload-specific routing to appropriate hardware

This level of abstraction allows development teams to work strictly at the “call the model, ship the feature” layer without worrying about procuring GPUs, managing regional outages, or sustaining user-facing SLAs. By offloading these infrastructure concerns, developers can focus solely on building applications on top of foundation models rather than managing low-level infrastructure logistics.
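
As a simplified illustration of that “call the model, ship the feature” layer, assuming the provider exposes an OpenAI-compatible endpoint, application code can stay this small; the base URL, API key, and model ID below are placeholders, not a specific provider’s values:

```python
# Sketch: calling a hosted model through an OpenAI-compatible endpoint, so the
# application never touches GPU provisioning or regional failover directly.
# The base_url, API key, and model ID below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.your-inference-provider.example/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="placeholder/llama-instruct-model",
    messages=[{"role": "user", "content": "Summarize this support ticket in one sentence."}],
)
print(response.choices[0].message.content)
```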

What trends should developers be paying attention to in the foundation model space?

Key trends include:

  • Rapid convergence: The gap between open-source and proprietary models is shrinking significantly, particularly since early 2024 with high-quality open models (DeepSeek, newer Llama models, Cohere, etc.)
  • Creative application emergence: Completely new user experiences built on foundation models, particularly in coding, document processing, and specialized domain tasks
  • Standardization of toolchains: The technology stack is becoming more standardized, making AI more accessible to developers without deep ML expertise
  • Focus on efficient inference: By late 2025, foundation-model tooling will likely feel like a utility (spin up, benchmark, fine-tune, deploy) without requiring deep ML expertise
  • Latency-first design: Streaming everywhere will become table stakes for user-facing AI

What is the current state of proprietary versus open-weight models?

The performance gap is narrowing significantly. The choice depends heavily on the use case and strategic goals:

Proprietary Models (OpenAI, Google Gemini):

  • Often lead on general-purpose benchmarks and in integrated multimodal capabilities
  • Provide the fastest path for broad, multimodal use cases
  • Easier initial integration for experimentation

Open-Weight Models (Llama, DeepSeek, Qwen, Cohere Command R, Gemma):

  • Advantages in ownership and control: Companies can fine-tune and deploy without external dependencies
  • Domain customization: Ability to specialize for specific use cases
  • Task-specific excellence: Open models excel particularly in coding and tasks that can be mapped to coding problems
  • Meet or exceed proprietary quality on many narrower tasks

No single model is universally best – performance heavily depends on the alignment between a model’s training data and your specific inference workload. The smallest gap between open and proprietary models exists in coding, largely because code outputs are easily verifiable, which provides clear signals for reinforcement learning.

The key recommendation is to architect applications to be model-agnostic from the beginning, allowing flexibility to swap models as capabilities evolve.
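
One lightweight way to keep that flexibility is a thin wrapper the rest of the application codes against. This is a sketch, not a prescribed design; the class names and config keys are illustrative:

```python
# Sketch of a thin model-agnostic layer: the app codes against generate(), and
# the concrete model is chosen by configuration, so swapping a proprietary model
# for an open-weight one (or vice versa) is a config change, not a refactor.
from typing import Protocol

class ChatModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class OpenAICompatibleModel:
    def __init__(self, client, model_id: str):
        self.client = client        # any OpenAI-compatible client
        self.model_id = model_id    # proprietary or open-weight model name

    def generate(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

def build_model(client, config: dict) -> ChatModel:
    # New models are adopted by editing config, not application code.
    return OpenAICompatibleModel(client, config["model_id"])
```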

How should developers select and use different model types for specific tasks?

Models broadly fall into different categories with distinct strengths:

Reasoning models (e.g., DeepSeek R1):

  • Excel at complex tasks requiring planning and multi-step execution
  • Generate verbose chains of thought to solve problems methodically
  • Ideal for breaking down problems into smaller components
  • Used for planning phases in agentic workflows
  • “Think aloud” and handle tasks requiring decomposition

Non-reasoning models (e.g., DeepSeek V3, Coder models):

  • More straightforward in their processing approach
  • Excel as conversational or coding assistants
  • Process input more directly without extended deliberation
  • Often more efficient for well-defined tasks
  • Used as “executors” for steps planned by reasoning models

A common production pattern is implementing a router mechanism that directs prompts to the appropriate model type based on the task requirements. For instance, routing a complex reasoning task to a reasoning model for planning, then directing each sub-task to a faster executor model. This approach allows teams to build comprehensive workflows where different models handle specialized functions.
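
A minimal sketch of that planner/executor routing, assuming an OpenAI-compatible client as in the earlier example; the model IDs are placeholders, and real routers add plan validation, retries, and structured step formats:

```python
# Sketch of the planner/executor pattern: a reasoning model decomposes the task,
# then each step is routed to a faster, cheaper executor model.
def chat(client, model_id: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_task(client, task: str) -> list[str]:
    # 1. Planning: the reasoning model "thinks aloud" and produces numbered steps.
    plan = chat(client, "reasoning-model-placeholder",
                f"Break this task into short numbered steps:\n{task}")
    steps = [line for line in plan.splitlines() if line.strip()]

    # 2. Execution: each sub-task goes to the non-reasoning executor model.
    return [chat(client, "executor-model-placeholder", step) for step in steps]
```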

How can developers optimize real-time AI applications, particularly for audio and video?

For real-time applications, especially those involving audio:

Latency optimization is critical:

  • Focus on time-to-first-token (ideally <200ms) rather than total generation time
  • Optimize for the first 20 tokens to maintain human interaction pacing
  • Enable streaming across models to improve perceived performance
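
As a rough illustration of measuring what matters, here is a sketch that streams a completion and records time-to-first-token, assuming an OpenAI-compatible streaming API; the model ID is a placeholder:

```python
# Sketch: streaming a completion and measuring time-to-first-token (TTFT).
# TTFT, not total generation time, is what users perceive as responsiveness.
import time

def stream_with_ttft(client, model_id: str, prompt: str) -> None:
    start = time.perf_counter()
    saw_first_token = False
    stream = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # emit tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        if delta and not saw_first_token:
            saw_first_token = True
            print(f"\n[time-to-first-token: {(time.perf_counter() - start) * 1000:.0f} ms]")
        print(delta, end="", flush=True)
```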

Decomposed pipelines often outperform end-to-end models:

  • Use separate specialized models for different stages (e.g., audio input → LLM for intelligence → voice output), as sketched after this list
  • This approach offers greater flexibility and customization than integrated solutions
  • Platform tools can help manage the complexity of these multi-model pipelines
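
A simplified sketch of such a decomposed voice pipeline; the transcribe and synthesize callables are stand-ins for whatever ASR and TTS services you use, and the LLM call reuses the OpenAI-compatible client from earlier examples:

```python
# Sketch of one conversational turn through a decomposed voice pipeline.
# Each stage can be swapped or fine-tuned independently, which is the point
# of decomposition; transcribe and synthesize are caller-supplied stand-ins.
def handle_voice_turn(audio_bytes: bytes, transcribe, synthesize, client, llm_id: str) -> bytes:
    user_text = transcribe(audio_bytes)          # stage 1: speech-to-text
    reply = client.chat.completions.create(      # stage 2: LLM "intelligence" step
        model=llm_id,
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content
    return synthesize(reply)                     # stage 3: text-to-speech
```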

Consider smaller models with fine-tuning:

  • Large models often fail to meet real-time latency requirements
  • Smaller, specialized models can be fine-tuned to bridge quality gaps while maintaining speed
  • Balance between model size and performance is crucial for consumer-facing applications

What approaches to model fine-tuning are most practical for application developers?

Fine-tuning is almost always part of building production-ready AI applications; it provides the “last-mile alignment” needed to hit specific quality, behavior, and reliability targets. Two primary approaches exist:

Supervised Fine-Tuning (SFT):

  • Conceptually simpler: provide prompt-completion examples (see the data-format sketch after this list)
  • Challenges remain in creating high-quality labeled training data
  • More accessible to application developers without deep ML expertise
  • Platform support is becoming more common
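
For illustration, SFT training data is commonly packaged as one JSON object per line (JSONL); the exact field names vary by platform, so treat this schema as an example rather than any specific provider’s spec:

```python
# Sketch: prompt-completion training pairs written as JSONL for supervised
# fine-tuning. Field names here are illustrative only.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Classify the sentiment: 'Shipping was slow.'"},
        {"role": "assistant", "content": "negative"},
    ]},
]

with open("sft_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```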

Reinforcement Fine-Tuning (RFT):

  • More complex: involves providing feedback/ratings on model outputs
  • Requires defining reward functions or rubrics (often multifaceted; a sketch follows this list)
  • Currently better suited for ML engineers than application developers
  • Simplifying this UX is an active area of innovation
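
As a hedged illustration, a reward function for a verifiable task such as code generation might look like the sketch below; run_tests is a caller-supplied harness, and production rubrics usually combine several such facets:

```python
# Sketch of a reward function for a verifiable task (code generation graded by
# unit tests). run_tests is a caller-supplied harness returning (passed, total).
def reward(candidate_code: str, run_tests) -> float:
    passed, total = run_tests(candidate_code)
    correctness = passed / total if total else 0.0
    length_penalty = 0.1 if len(candidate_code) > 4000 else 0.0  # discourage bloat
    return correctness - length_penalty
```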

Infrastructure and Data Considerations:

  • Fine-tuning very large models (70B+ parameters) requires significant specialized infrastructure
  • Using proprietary models to generate fine-tuning data is sometimes done as a shortcut but may not yield optimal quality for specialized domains
  • The industry needs solutions for fast, low-cost, low-effort quality customization cycles
  • Human expert labeling often remains necessary but is costly

What’s the state of agentic workflows in production environments?

Agentic workflows are a major driver of foundation model usage, but with clear patterns:

Current focus is on single-agent systems:

  • Most production implementations involve single agents solving specific problems
  • Examples include coding agents, document processing agents, and specialized domain agents
  • Building stable single-agent systems is a prerequisite before moving to multi-agent complexity

Production implementations typically include:

  • Fine-tuning for “last-mile alignment” to specific tasks
  • Function calling for tool integration (see the sketch after this list)
  • Model routing to direct specific sub-tasks to specialized models
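
To make the function-calling point concrete, here is a sketch using the OpenAI-style “tools” schema that many serving stacks also accept; the client is assumed to be configured as in the earlier examples, and get_order_status is a made-up tool:

```python
# Sketch of function calling for tool integration. get_order_status is a
# hypothetical tool; the model ID is a placeholder.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="function-calling-model-placeholder",
    messages=[{"role": "user", "content": "Where is order 12345?"}],
    tools=tools,
)
# If the model decides to call the tool, the structured request appears here.
tool_call = response.choices[0].message.tool_calls[0]
```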

Multi-agent systems are still emerging:

  • Standards like MCP (Model Context Protocol) are in early adoption
  • Complex coordination between agents remains challenging
  • The field is focused on establishing stability with single agents before scaling to multi-agent systems

How are different organizations adopting AI, and what patterns are emerging?

Adoption is happening simultaneously across three main segments, rather than following the typical pattern where startups lead and enterprises follow later:

AI-native startups:

  • Experimenting with entirely new user experiences enabled by foundation models
  • Building novel applications not possible before foundation models

Digital-native companies:

  • Organizations with existing large user bases renovating their products with generative AI capabilities
  • Enhancing existing offerings with AI capabilities

Traditional enterprises:

  • Focusing primarily on workforce productivity enhancement across various tasks
  • Implementing AI to streamline internal operations and workflows

As companies find product-market fit with AI, they increasingly want to own their AI strategy end-to-end, avoiding dependencies on centralized model providers. This ownership ensures their proprietary data fuels their own AI improvements rather than benefiting external applications.

What should teams consider when developing their “own-your-own-AI” strategy?

As teams reach product-market fit with AI applications, owning their AI strategy becomes vital:

End-to-end control:

  • Avoid reliance on centralized model providers
  • Maintain control from data collection through model deployment
  • Ensure proprietary data benefits your applications, not external ones
  • Preserve data sovereignty

Model flexibility:

  • Design applications to be model-agnostic from the beginning
  • Expect to swap models as capabilities evolve
  • Use abstraction layers for model switching/upgrades

Balance experimentation and production:

  • Use the best available models during experimentation
  • Invest in optimization and customization after finding product-market fit
  • Consider smaller, fine-tuned models for production efficiency once the approach is validated

Infrastructure considerations:

  • Evaluate build vs. buy decisions for AI infrastructure
  • Consider managed services that abstract complexity while maintaining control
  • Plan for scaling infrastructure as application usage grows

Will smaller models replace larger ones for production applications?

Smaller models (≤10B parameters) have physical limitations—fewer parameters store less knowledge—but targeted, fine-tuned small models can already outperform larger ones on well-scoped tasks while meeting strict latency and cost budgets.

Key considerations:

  • Physics still matters: smaller models inherently capture less information than larger ones
  • Fine-tuning can bridge quality gaps for specific, narrow tasks
  • Smaller models offer significant advantages in speed and cost efficiency
  • Optimization typically happens after product-market fit is established
  • Initial focus should be on capability and quality rather than efficiency

Expect a two-tier future: large “generalist” models in the backend handling complex reasoning, with small “specialist” models at the edge or in tight latency loops handling specific tasks.

Does underlying hardware choice (NVIDIA vs. AMD, etc.) matter to application teams?

Hardware considerations are increasingly important as applications scale:

Abstraction and management:

  • Modern platforms abstract hardware details, allowing developers to focus on application logic
  • Cross-provider compatibility ensures resilience against regional outages

Emerging competition:

  • AMD is closing the hardware performance gap with NVIDIA
  • The primary challenge remains software stack maturity for non-NVIDIA hardware
  • Increased competition ultimately benefits the ecosystem through better pricing and innovation

Workload-specific optimization:

  • Different hardware architectures excel at different workload patterns
  • Effective platforms match workloads to the most appropriate hardware
  • Frameworks like PyTorch support multiple hardware types, enabling flexible deployment
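
For example, PyTorch code written against a generic device handle runs unchanged on NVIDIA GPUs, AMD GPUs (whose ROCm builds also expose the “cuda” device), or CPU:

```python
# Sketch: device-agnostic PyTorch. The same code path runs on NVIDIA, AMD
# (ROCm builds expose the "cuda" device), or CPU without modification.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 4, device=device)
y = x @ x  # identical code regardless of the underlying accelerator
```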

Optimization timing:

  • Hardware optimization typically comes after finding product-market fit
  • Initial focus should be on capability and quality rather than efficiency
  • Once scaling begins, hardware optimization becomes increasingly important for cost and performance

The simplest path is to use a platform that auto-routes workloads to the most cost-effective hardware, letting you benefit from emerging options without rewriting code.

What’s the bottom line for practitioners building AI applications?

Focus on user-visible latency and quality, build a model-agnostic architecture, fine-tune early for domain fit, and lean on multi-cloud inference platforms to hide hardware headaches. Keep continuous benchmarking in your CI/CD pipeline to evaluate new models as they emerge.
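
One lightweight way to wire that benchmarking in, sketched under the assumption that you already maintain a small golden prompt set and a generate() callable from your model-agnostic layer (build_generate is a hypothetical factory):

```python
# Sketch of a CI quality gate: score a small golden set against the currently
# configured model and fail the build on regression.
GOLDEN_SET = [
    {"prompt": "What is the refund window for digital goods?", "must_contain": "refund"},
]

def evaluate(generate) -> float:
    hits = sum(1 for ex in GOLDEN_SET
               if ex["must_contain"].lower() in generate(ex["prompt"]).lower())
    return hits / len(GOLDEN_SET)

def test_model_quality():
    generate = build_generate()  # hypothetical factory from your model-agnostic layer
    assert evaluate(generate) >= 0.9, "quality regression against the golden set"
```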

AI capability is converging fast across providers—your competitive edge will come from product execution, solving real user problems effectively, and optimizing across the quality-speed-cost triangle, not from reinventing the ML infrastructure wheel.