Lin Qiao on AI Dev Challenges, Model Convergence, Fine-Tuning & Infra Abstraction.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Lin Qiao, CEO of Fireworks AI, dives into the practical challenges AI developers face, from UX/DX hurdles to complex systems engineering. Discover key trends like the convergence of open-source and proprietary models, the rise of agentic workflows, and strategies for optimizing quality, speed, and cost. Learn how modern infrastructure abstracts complexity, enabling teams to focus on building applications and owning their AI strategy.
Interview highlights – key sections from the video version:
- Fireworks AI Hyper-growth and Scale Metrics
- Developer & Enterprise Adoption Landscape
- Open-source vs Proprietary Model Convergence
- DeepSeek Breakthrough & Release Cadence Impact
- Meta Cadence and Security Concerns on Chinese Models
- Model Selection: Aligning Models to Tasks
- Multimodal Progress & Audio Pipeline Design
- Reasoning vs Non-reasoning Models & Smart Routing
- Fine-tuning UX: SFT vs RLHF Challenges
- Early Production Agents & Single-agent Workflows
- Coding & Document Processing Agents in the Wild
- Small Models, Distillation & Post-PMF Optimization
- AMD Hardware Outlook for Inference
- Fireworks Infrastructure Layers & Closing Remarks
Related content:
- A video version of this conversation is available on our YouTube channel.
- Robert Nishihara → The Data-Centric Shift in AI: Challenges, Opportunities, and Tools
- Travis Addair → The Evolution of Reinforcement Fine-Tuning in AI
- Hagay Lupesko → Beyond GPUs: Cerebras’ Wafer-Scale Engine for Lightning-Fast AI Inference
- Nestor Maslej → 2025 Artificial Intelligence Index
- What AI Teams Need to Know for 2025
- AI Agents: 10 Key Trends & Challenges You Need to Know
- AI Unlocked – Overcoming The Data Bottleneck
- David Hughes → Prompts as Functions: The BAML Revolution in AI Engineering
- Vaibhav Gupta → Unleashing the Power of BAML in LLM Applications
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
What are the key challenges developers face when building AI applications?
Developers encounter challenges on both UX/DX (user/developer experience) and technical fronts. The nature of these challenges varies by developer background:
- Machine learning engineers struggle with algorithm design, infrastructure optimization, and model fine-tuning complexities
- Application developers without ML expertise face steep learning curves when adopting AI capabilities
The balancing act involves addressing both sides: creating intuitive tools tailored to specific skill profiles while solving hard engineering problems like GPU allocation, multi-cloud orchestration, and latency management. Building production AI applications is therefore half product thinking, half systems engineering, with teams needing to optimize across three critical dimensions (a minimal tracking sketch follows this list):
- Quality (output accuracy)
- Speed (throughput and latency)
- Cost (operational efficiency)
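To make those three dimensions concrete, here is a minimal, hypothetical sketch of a per-request record a team might log so that quality, speed, and cost are tracked together; the pricing constants and quality score are placeholder assumptions, not figures from the conversation.

```python
from dataclasses import dataclass

# Hypothetical per-1M-token prices; real prices vary by model and provider.
ASSUMED_INPUT_PRICE_PER_1M = 0.20
ASSUMED_OUTPUT_PRICE_PER_1M = 0.80

@dataclass
class RequestMetrics:
    """One logged record covering the quality/speed/cost triangle."""
    quality_score: float       # e.g. 0-1 from an eval rubric or LLM-as-judge
    latency_s: float           # end-to-end wall-clock time for the request
    prompt_tokens: int
    completion_tokens: int

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens * ASSUMED_INPUT_PRICE_PER_1M
                + self.completion_tokens * ASSUMED_OUTPUT_PRICE_PER_1M) / 1_000_000

# Logging one record per request makes regressions on any axis visible over time.
print(RequestMetrics(quality_score=0.92, latency_s=1.4,
                     prompt_tokens=850, completion_tokens=300).cost_usd)
```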
How is cloud infrastructure complexity being addressed for AI developers?
Virtual cloud infrastructure solutions now span multiple cloud providers (typically seven or more) and regions (30+), handling:
- Automatic GPU provisioning and management
- Cross-regional availability and failover
- Performance optimization and SLA maintenance
- Workload-specific routing to appropriate hardware
This level of abstraction lets development teams work strictly at the “call the model, ship the feature” layer, without procuring GPUs, managing regional outages, or sustaining user-facing SLAs themselves. With those concerns offloaded, developers can focus on building applications on top of foundation models rather than on low-level infrastructure logistics.
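In practice, “call the model, ship the feature” often means hitting an OpenAI-compatible endpoint exposed by the hosting platform. The sketch below assumes such an endpoint; the base URL and model identifier are illustrative, so check your provider’s documentation for the exact values.

```python
# Minimal sketch of the "call the model, ship the feature" layer.
# Assumes an OpenAI-compatible hosted endpoint; URL and model id are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # any OpenAI-compatible host
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative id
    messages=[{"role": "user", "content": "Summarize this support ticket in one sentence: ..."}],
)
print(response.choices[0].message.content)
```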
What trends should developers be paying attention to in the foundation model space?
Key trends include:
- Rapid convergence: The gap between open-source and proprietary models is shrinking significantly, particularly since early 2024 with high-quality open models (DeepSeek, newer Llama models, Cohere, etc.)
- Creative application emergence: Completely new user experiences built on foundation models, particularly in coding, document processing, and specialized domain tasks
- Standardization of toolchains: The technology stack is becoming more standardized, making AI more accessible to developers without deep ML expertise
- Focus on efficient inference: By late 2025, foundation-model tooling will likely feel like a utility (spin up, benchmark, fine-tune, deploy) without requiring deep ML expertise
- Latency-first design: Streaming everywhere will become table stakes for user-facing AI
What is the current state of proprietary versus open-weight models?
The performance gap is narrowing significantly. The choice depends heavily on the use case and strategic goals:
Proprietary Models (OpenAI, Google Gemini):
- Often lead on broad general-purpose benchmarks and in integrated multimodal capabilities
- Provide the fastest path for broad, multimodal use cases
- Easier initial integration for experimentation
Open-Weight Models (Llama, DeepSeek, Qwen, Cohere Command R, Gemma):
- Advantages in ownership and control: Companies can fine-tune and deploy without external dependencies
- Domain customization: Ability to specialize for specific use cases
- Task-specific excellence: Open models excel particularly in coding and tasks that can be mapped to coding problems
- Meet or exceed proprietary quality on many narrower tasks
No single model is universally best – performance heavily depends on the alignment between a model’s training data and your specific inference workload. The smallest gap between open and proprietary models exists in coding, largely because code outputs are easily verifiable, which provides clear signals for reinforcement learning.
The key recommendation is to architect applications to be model-agnostic from the beginning, allowing flexibility to swap models as capabilities evolve.
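One common way to stay model-agnostic is to hide providers behind a thin interface and choose the concrete model from configuration. The sketch below is a generic pattern, not a specific library; the registry entries are illustrative assumptions.

```python
from typing import Protocol
from openai import OpenAI

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAICompatibleModel:
    """Adapter for any OpenAI-compatible endpoint, proprietary or open-weight host."""
    def __init__(self, base_url: str, api_key: str, model: str):
        self._client = OpenAI(base_url=base_url, api_key=api_key)
        self._model = model

    def complete(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

# Swapping or upgrading a model becomes a configuration change, not a code change.
MODEL_REGISTRY = {  # illustrative entries only
    "default": OpenAICompatibleModel("https://api.fireworks.ai/inference/v1",
                                     "FIREWORKS_KEY", "accounts/fireworks/models/deepseek-v3"),
    "fallback": OpenAICompatibleModel("https://api.openai.com/v1",
                                      "OPENAI_KEY", "gpt-4o-mini"),
}
```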
How should developers select and use different model types for specific tasks?
Models broadly fall into different categories with distinct strengths:
Reasoning models (e.g., DeepSeek R1):
- Excel at complex tasks requiring planning and multi-step execution
- Generate verbose chains of thought to solve problems methodically
- Ideal for breaking down problems into smaller components
- Used for planning phases in agentic workflows
- “Think aloud” and handle tasks requiring decomposition
Non-reasoning models (e.g., DeepSeek V3, Coder models):
- More straightforward in their processing approach
- Excel as conversational or coding assistants
- Process input more directly without extended deliberation
- Often more efficient for well-defined tasks
- Used as “executors” for steps planned by reasoning models
A common production pattern is implementing a router mechanism that directs prompts to the appropriate model type based on the task requirements. For instance, routing a complex reasoning task to a reasoning model for planning, then directing each sub-task to a faster executor model. This approach allows teams to build comprehensive workflows where different models handle specialized functions.
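A minimal sketch of that router pattern follows, reusing the thin `complete()`-style model interface from earlier; the routing heuristic and prompts are illustrative assumptions rather than a production policy.

```python
def needs_planning(task: str) -> bool:
    """Toy routing heuristic; real routers often use rules or a small classifier model."""
    keywords = ("plan", "multi-step", "analyze", "refactor", "investigate")
    return len(task) > 400 or any(k in task.lower() for k in keywords)

def run_task(task: str, reasoning_model, executor_model) -> str:
    """Reasoning model plans; a faster executor model handles each sub-task."""
    if not needs_planning(task):
        return executor_model.complete(task)

    plan = reasoning_model.complete(
        f"Break this task into short, numbered, independent steps:\n{task}"
    )
    results = [
        executor_model.complete(f"Complete this step and return only the result:\n{step}")
        for step in plan.splitlines() if step.strip()
    ]
    return "\n".join(results)
```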
How can developers optimize real-time AI applications, particularly for audio and video?
For real-time applications, especially those involving audio:
Latency optimization is critical:
- Focus on time-to-first-token (ideally <200ms) rather than total generation time
- Optimize for the first 20 tokens to maintain human interaction pacing
- Enable streaming across models to improve perceived performance (a measurement sketch follows at the end of this answer)
Decomposed pipelines often outperform end-to-end models:
- Use separate specialized models for different stages (e.g., audio input → LLM for intelligence → voice output)
- This approach offers greater flexibility and customization than integrated solutions
- Platform tools can help manage the complexity of these multi-model pipelines
Consider smaller models with fine-tuning:
- Large models often fail to meet real-time latency requirements
- Smaller, specialized models can be fine-tuned to bridge quality gaps while maintaining speed
- Balance between model size and performance is crucial for consumer-facing applications
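To make the time-to-first-token target above measurable, here is a minimal sketch using an OpenAI-compatible streaming call; the endpoint, model id, and the 200 ms threshold echo the guidance in this answer rather than any official benchmark.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # illustrative host
                api_key="YOUR_API_KEY")

def time_to_first_token(prompt: str, model: str) -> float:
    """Seconds until the first streamed content token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("inf")

ttft = time_to_first_token("Say hello.", "accounts/fireworks/models/llama-v3p1-8b-instruct")
print(f"TTFT: {ttft * 1000:.0f} ms (target: under ~200 ms for real-time voice)")
```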
What approaches to model fine-tuning are most practical for application developers?
Fine-tuning is almost always part of building production-ready AI applications: it provides the “last-mile alignment” needed to hit specific quality, behavior, and reliability targets. Two primary approaches exist:
Supervised Fine-Tuning (SFT):
- Conceptually simpler: provide prompt-completion examples (see the data sketch after this list)
- Challenges remain in creating high-quality labeled training data
- More accessible to application developers without deep ML expertise
- Platform support is becoming more common
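For illustration, SFT training data is usually just prompt-completion or chat-style records; the sketch below writes a tiny JSONL file in a generic chat format, since exact schemas vary by fine-tuning platform.

```python
import json

# Generic chat-style SFT records; exact field names vary by fine-tuning platform.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a billing support assistant for Acme."},
            {"role": "user", "content": "Why was I charged twice this month?"},
            {"role": "assistant", "content": "A duplicate charge usually means ... (gold answer)"},
        ]
    },
]

with open("sft_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```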
Reinforcement Fine-Tuning (RFT):
- More complex: involves providing feedback/ratings on model outputs
- Requires defining reward functions or rubrics (often multifaceted)
- Currently better suited for ML engineers than application developers
- Simplifying this UX is an active area of innovation
Infrastructure and Data Considerations:
- Fine-tuning very large models (70B+ parameters) requires significant specialized infrastructure
- Using proprietary models to generate fine-tuning data is sometimes done as a shortcut but may not yield optimal quality for specialized domains
- The industry needs solutions for fast, low-cost, low-effort quality customization cycles
- Human expert labeling often remains necessary but is costly
What’s the state of agentic workflows in production environments?
Agentic workflows are a major driver of foundation model usage, but with clear patterns:
Current focus is on single-agent systems:
- Most production implementations involve single agents solving specific problems
- Examples include coding agents, document processing agents, and specialized domain agents
- Building stable single-agent systems is a prerequisite before moving to multi-agent complexity
Production implementations typically include:
- Fine-tuning for “last-mile alignment” to specific tasks
- Function calling for tool integration (see the sketch after this list)
- Model routing to direct specific sub-tasks to specialized models
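As an example of the function-calling piece, here is a minimal sketch against an OpenAI-compatible chat API; the tool name, schema, and model id are made up for illustration.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # hypothetical tool, for illustration only
        "description": "Fetch the shipping status of an order by its id.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # illustrative id
    messages=[{"role": "user", "content": "Where is order 12345?"}],
    tools=tools,
)

# If the model chose to call the tool, its arguments come back as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```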
Multi-agent systems are still emerging:
- Standards like MCP (Model Context Protocol) are in early adoption
- Complex coordination between agents remains challenging
- The field is focused on establishing stability with single agents before scaling to multi-agent systems
How are different organizations adopting AI, and what patterns are emerging?
Adoption is happening simultaneously across three main segments, rather than following the typical pattern where startups lead and enterprises follow later:
AI-native startups:
- Experimenting with entirely new user experiences enabled by foundation models
- Building novel applications not possible before foundation models
Digital-native companies:
- Organizations with existing large user bases renovating their products with generative AI capabilities
- Enhancing existing offerings with AI capabilities
Traditional enterprises:
- Focusing primarily on workforce productivity enhancement across various tasks
- Implementing AI to streamline internal operations and workflows
As companies find product-market fit with AI, they increasingly want to own their AI strategy end-to-end, avoiding dependencies on centralized model providers. This ownership ensures their proprietary data fuels their own AI improvements rather than benefiting external applications.
What should teams consider when developing their “own-your-own-AI” strategy?
As teams reach product-market fit with AI applications, owning their AI strategy becomes vital:
End-to-end control:
- Avoid reliance on centralized model providers
- Maintain control from data collection through model deployment
- Ensure proprietary data benefits your applications, not external ones
- Preserve data sovereignty
Model flexibility:
- Design applications to be model-agnostic from the beginning
- Expect to swap models as capabilities evolve
- Use abstraction layers for model switching/upgrades
Balance experimentation and production:
- Use the best available models during experimentation
- Invest in optimization and customization after finding product-market fit
- Consider smaller, fine-tuned models for production efficiency once the approach is validated
Infrastructure considerations:
- Evaluate build vs. buy decisions for AI infrastructure
- Consider managed services that abstract complexity while maintaining control
- Plan for scaling infrastructure as application usage grows
Will smaller models replace larger ones for production applications?
Smaller models (≤10B parameters) have physical limitations—fewer parameters store less knowledge—but targeted, fine-tuned small models can already outperform larger ones on well-scoped tasks while meeting strict latency and cost budgets.
Key considerations:
- Physics still matters: smaller models inherently capture less information than larger ones
- Fine-tuning can bridge quality gaps for specific, narrow tasks
- Smaller models offer significant advantages in speed and cost efficiency
- Optimization typically happens after product-market fit is established
- Initial focus should be on capability and quality rather than efficiency
Expect a two-tier future: large “generalist” models in the backend handling complex reasoning, with small “specialist” models at the edge or in tight latency loops handling specific tasks.
Does underlying hardware choice (NVIDIA vs. AMD, etc.) matter to application teams?
Hardware considerations are increasingly important as applications scale:
Abstraction and management:
- Modern platforms abstract hardware details, allowing developers to focus on application logic
- Cross-provider compatibility ensures resilience against regional outages
Emerging competition:
- AMD is closing the hardware performance gap with NVIDIA
- The primary challenge remains software stack maturity for non-NVIDIA hardware
- Increased competition ultimately benefits the ecosystem through better pricing and innovation
Workload-specific optimization:
- Different hardware architectures excel at different workload patterns
- Effective platforms match workloads to the most appropriate hardware
- Frameworks like PyTorch support multiple hardware types, enabling flexible deployment
Optimization timing:
- Hardware optimization typically comes after finding product-market fit
- Initial focus should be on capability and quality rather than efficiency
- Once scaling begins, hardware optimization becomes increasingly important for cost and performance
The simplest path is to use a platform that auto-routes workloads to the most cost-effective hardware, letting you benefit from emerging options without rewriting code.
What’s the bottom line for practitioners building AI applications?
Focus on user-visible latency and quality, build a model-agnostic architecture, fine-tune early for domain fit, and lean on multi-cloud inference platforms to hide hardware headaches. Keep continuous benchmarking in your CI/CD pipeline to evaluate new models as they emerge.
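A continuous-benchmarking step can be as simple as scoring each candidate model against a small golden set and failing the pipeline on regressions. The sketch below is generic: the golden file, the exact-match scorer, and the 0.85 threshold are assumptions, and `model.complete()` stands in for whatever thin model interface your application already uses.

```python
import json

PASS_THRESHOLD = 0.85  # assumed quality bar; tune per product and task

def exact_match(expected: str, actual: str) -> float:
    """Toy scorer; real suites use task-specific checks or LLM-as-judge rubrics."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def evaluate(model, golden_path: str = "golden_set.jsonl") -> float:
    """Mean score of a candidate model over the golden set; run this in CI."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]
    scores = [exact_match(c["expected"], model.complete(c["prompt"])) for c in cases]
    return sum(scores) / len(scores)

# Gate the rollout: fail the CI job if a new model regresses below the bar.
# assert evaluate(candidate_model) >= PASS_THRESHOLD
```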
AI capability is converging fast across providers—your competitive edge will come from product execution, solving real user problems effectively, and optimizing across the quality-speed-cost triangle, not from reinventing the ML infrastructure wheel.
