Why Digital Work is the Perfect Training Ground for AI Agents

Andrew Rabinovich on Upwork’s Radical Bet on Reinforcement Learning — Building RLEF from Scratch.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Andrew Rabinovich, CTO and Head of AI at Upwork, discusses how digital work marketplaces serve as ideal testing grounds for AI agents, exploring their custom reinforcement learning approach called RLEF (Reinforcement Learning from Experience). He covers Upwork’s Uma meta-agent system, advanced RAG implementations with knowledge graphs, and their vision for human-AI collaboration where agents handle manual work while humans provide evaluation and orchestration. The conversation delves into practical challenges of building production AI systems, from multi-modal feedback mechanisms to the technical architecture needed for agents that can perform real-world digital work tasks.

Subscribe to the Gradient Flow Newsletter



Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Upwork’s AI Architecture and Implementation

What is Uma and how does it function within Upwork’s product ecosystem?

Uma (Upwork’s Mindful AI) is a meta-agent that facilitates the entire workflow between clients and freelancers on the platform. When a client comes to Upwork, Uma’s primary job is to understand their needs through conversation, then execute concrete steps: drafting job posts, identifying and ranking suitable freelancers, helping freelancers write compelling proposals, assisting clients in evaluating proposals, and answering platform or domain-specific questions. Rather than being a single monolithic system, Uma orchestrates multiple specialized components to deliver a seamless end-to-end experience.

Can you describe the technical architecture behind Uma? Why not use a single large reasoning model?

We use a “mixture of experts” architecture built on open-source foundation models with two distinct levels of fine-tuning. The first level establishes Uma’s consistent personality across all interactions, ensuring a unified experience regardless of the task. The second level creates highly specialized “skills” – narrow, task-specific models fine-tuned for particular functions like job post creation, freelancer search and ranking, proposal writing, or question answering.

A lightweight routing model sits on top, directing user requests to the appropriate skill model based on the context and requirements. We deliberately avoid relying on heavyweight reasoning models for most tasks due to latency constraints – clients expect real-time interactions. Through extensive testing, we’ve found that for well-specified, end-to-end tasks that can be explicitly trained, narrow fine-tuned models consistently outperform slower multi-step reasoning chains on both speed and reliability. While certain tasks like candidate evaluation may include reasoning steps, most functions don’t require explicit reasoning when you have models specifically trained for those exact tasks.
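
To make the router-plus-skills pattern concrete, here is a minimal Python sketch. The skill names and keyword rules are illustrative stand-ins; in production, the router would itself be a small fine-tuned classifier, as described above.

```python
# Minimal sketch of the routing pattern described above -- not Upwork's
# actual implementation. Skill names and keyword rules are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    handler: Callable[[str], str]  # stand-in for a narrow, fine-tuned model

SKILLS = {
    "job_post": Skill("job_post", lambda q: f"[draft job post for: {q}]"),
    "search": Skill("search", lambda q: f"[ranked freelancers for: {q}]"),
    "proposal": Skill("proposal", lambda q: f"[proposal draft for: {q}]"),
    "qa": Skill("qa", lambda q: f"[answer to: {q}]"),
}

def route(query: str) -> str:
    """Stand-in for the lightweight routing model: pick a skill, delegate."""
    q = query.lower()
    if "job post" in q:
        return SKILLS["job_post"].handler(query)
    if "find" in q or "hire" in q:
        return SKILLS["search"].handler(query)
    if "proposal" in q:
        return SKILLS["proposal"].handler(query)
    return SKILLS["qa"].handler(query)

print(route("Help me write a job post for a data engineer"))
```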

How do you implement Retrieval-Augmented Generation (RAG) within this system?

We deploy multiple RAG systems backed by a knowledge graph that serves as the central nervous system for our data architecture. The knowledge graph connects to diverse data sources across various databases and provides two critical functions. First, it helps the router determine which RAG system to query for specific information. Second, it enables sophisticated query expansion – for example, when someone searches for a “front-end developer,” the knowledge graph understands the relationships to skills like JavaScript, React, HTML, and CSS, expanding the query to retrieve more comprehensive and relevant context.

This approach is essential for incorporating real-time, dynamic information that can’t be baked into static model weights – things like freelancer availability, current project postings, or recent contract completions. The knowledge graph allows us to aggregate context from multiple sources before feeding it to the appropriate language model.
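
As an illustration of the query-expansion step, here is a hedged sketch; the graph contents and function names are hypothetical, not Upwork's schema.

```python
# Hypothetical knowledge-graph query expansion; the graph contents and
# function names are illustrative, not Upwork's schema.
SKILL_GRAPH = {
    "front-end developer": ["javascript", "react", "html", "css"],
    "data engineer": ["python", "sql", "spark"],
}

def expand_query(query: str, graph: dict[str, list[str]]) -> list[str]:
    """Expand a role query into related skill terms before retrieval."""
    return [query] + graph.get(query.lower(), [])

def retrieve(query: str) -> list[str]:
    """Fan the expanded terms out to the RAG backends chosen by the router."""
    return [f"docs matching '{term}'" for term in expand_query(query, SKILL_GRAPH)]

print(retrieve("Front-End Developer"))
```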

Optimizing Search and Retrieval Systems

What were the key levers for improving RAG performance at scale?

Our biggest advantage comes from the massive amounts of self-curating data generated by the platform itself. When clients sign contracts with freelancers and revenue is transacted, these events provide strong ground-truth signals about search and recommendation quality. A successful contract worth significant revenue is essentially a powerful positive label indicating that our retrieval and matching worked correctly. This feedback loop allows us to continuously identify which data sources and features are most valuable.

On the technical implementation side, we’ve moved beyond fixed-size chunking to develop proprietary algorithms using sliding-window approaches. This flexible chunking strategy significantly improved relevance in multi-source retrieval scenarios. However, the most critical factor has been our deep institutional expertise in search and information retrieval. Upwork has a long history of bidirectional search capabilities – clients searching for freelancers and freelancers receiving job recommendations. We’ve upgraded these systems to state-of-the-art algorithms over the past 18 months.
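
The production algorithms are proprietary, but the underlying sliding-window idea can be sketched in a few lines; the window and stride sizes below are arbitrary placeholders.

```python
def sliding_window_chunks(words: list[str], window: int = 200, stride: int = 100) -> list[str]:
    """Overlapping chunks: consecutive windows share (window - stride) words,
    so content near a chunk boundary is never split across retrieval units."""
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):  # final window reached the end
            break
        start += stride
    return chunks
```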

The key insight is that RAG is fundamentally a search problem – you must retrieve the right context for the model. Retrieval, search, ranking, and reranking need to be treated as a holistic end-to-end system, not piecemeal components. Our team’s deep bench in these areas has been fundamental to building high-performing RAG systems.

Digital Work as an AI Training Ground

Why is digital work particularly well-suited as a training environment for AI agents?

Digital work offers a unique sweet spot for AI agent development. Unlike self-driving cars, where mistakes have catastrophic real-world consequences and therefore demand extensive simulation, or games like Go, where the simulation is a perfect model of reality, digital work allows safe exploration with real-world complexity. An AI agent can attempt to design a website, write code, or create graphics with unconventional approaches without physical harm. If it produces something completely novel or even “wrong,” it becomes a valuable learning opportunity rather than a disaster.

The key advantage is that the work is naturally constrained by professional deliverables and norms, yet open-ended enough to allow creative exploration. It’s an environment where agents can potentially discover novel solutions that humans might never have considered – the equivalent of AlphaGo’s famous “move 37” that surprised human experts. The digital nature means agents can iterate rapidly, explore broadly, and learn from both successes and failures in a controlled but realistic setting.

How do you define rewards for subjective tasks like “design a good website” where there’s no clear win/loss condition?

This is the central challenge distinguishing real-world tasks from games. The real world is largely non-deterministic, and you can’t predefine a mathematical value function for creativity or quality. Our solution leverages Upwork’s ecosystem of hundreds of thousands of human experts across every conceivable digital domain. These experts become the source of the reward function.

When an agent produces an output – say a website design – we treat it like a job to be reviewed. Our matching system finds appropriate human experts in that specific domain to evaluate the output. These experts provide the reward signal, creating a scalable feedback loop that allows agents to learn in complex, subjective domains. This approach transforms the subjective evaluation problem into a matching and aggregation challenge, which we’re well-equipped to handle given our marketplace expertise.
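
One way to picture this is as an aggregation function over matched experts' reviews; the expertise-weighted mean below is an assumption for illustration, not the platform's actual formula.

```python
# Hedged sketch: turn matched experts' reviews into one scalar reward.
# The expertise-weighted mean is an assumption, not Upwork's actual formula.
def aggregate_reward(reviews: list[dict]) -> float:
    """Each review: {"verdict": 1.0 (thumbs up) or 0.0 (down),
    "expertise": weight in (0, 1] from the matching system}."""
    total = sum(r["expertise"] for r in reviews)
    return sum(r["verdict"] * r["expertise"] for r in reviews) / total

print(aggregate_reward([
    {"verdict": 1.0, "expertise": 0.9},
    {"verdict": 0.0, "expertise": 0.4},
]))  # -> ~0.69
```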

Reinforcement Learning from Experience (RLEF)

What is RLEF and how does it differ from traditional RLHF approaches?

RLHF (Reinforcement Learning from Human Feedback) typically involves humans ranking between multiple machine-generated outputs on predefined tasks – for example, choosing which of two summaries they prefer. This approach effectively confines the model’s learning to the space of human preferences within known task boundaries.

Our RLEF (Reinforcement Learning from Experience) approach is designed for much broader exploration. Instead of ranking predetermined options, agents are free to propose any solution they can devise, even wildly unconventional ones. Human experts then provide simpler reward signals – essentially thumbs up/down on overall quality plus targeted feedback on specific aspects. This enables exploration of a vastly larger solution space. The goal isn’t just to optimize for human preferences within known boundaries, but to potentially discover entirely new approaches that lie outside conventional human thinking.

The key difference is that RLEF allows agents to “do whatever they want” and learn from environmental feedback, similar to classical reinforcement learning but enhanced with LLM representations for understanding and generating complex outputs. Fine-tuning approaches are inherently limited by the engineer’s imagination – if you don’t know about a particular dimension of the solution space, it won’t be in your training set. RLEF aims to break through these human knowledge limits.
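
Read as pseudocode, the loop resembles classical policy-gradient RL with experts supplying the reward. The sketch below is one reading of that description, assuming a PyTorch-style policy object; Upwork's actual learning mechanism is described as novel and is not public.

```python
# Conceptual sketch of an RLEF-style step -- one reading of the interview,
# not Upwork's library. `policy`, `sample_task`, and `expert_reward` are
# hypothetical interfaces; the REINFORCE-style update is an assumption.
def rlef_step(policy, sample_task, expert_reward, optimizer):
    task = sample_task()                 # open-ended, e.g. "design a landing page"
    output = policy.generate(task)       # agent may propose any solution
    r = expert_reward(task, output)      # thumbs up/down plus targeted feedback,
                                         # aggregated to a scalar reward
    loss = -policy.log_prob(output, task) * r   # reinforce rewarded behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```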

Why build a new RL library from scratch rather than using existing frameworks?

Existing open-source RL libraries don’t quite fit the paradigm we’re developing. RLEF represents a new style of reinforcement learning that combines classical RL principles with the powerful representation capabilities of LLMs. In our system, the LLM acts as both the communication layer (understanding tasks and generating outputs) and the representation layer (encoding states and actions), but the core learning mechanism is novel.

We believe building this custom framework is necessary to achieve systems that can genuinely surpass current human knowledge ceilings. It’s a mix of classical RL ideas with the RL+LLM paradigm, designed specifically for learning in open-ended, creative domains where traditional RL approaches would struggle with the complexity of the state and action spaces.

Implementation Challenges and Solutions

How do you handle multi-modal outputs and feedback, given that over 50% of Upwork’s work is non-textual?

Standard text prompts are insufficient for providing feedback on graphic designs, UI mockups, audio editing, or video production. We’re building sophisticated UX affordances that allow humans to provide precise, actionable, multi-modal feedback. For example, experts can circle specific regions in an image and annotate “this section’s color saturation is wrong,” mark timestamps in audio files, or indicate spacing issues in UI mockups.

To enable this, we’re extending the Model Context Protocol (MCP) into what we call the “Up format” – a protocol that captures three-way interactions between data, machines, and humans in a way that machines and humans alike can understand and learn from. This protocol needs to handle multi-turn interactions where agents can ask clarifying questions, humans can provide progressive refinement, and both can reference specific elements of complex outputs. The interface design here is critical – it needs to be intuitive enough for human experts to use efficiently while being structured enough for agents to parse and learn from.
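
The concrete record format is not public, but the kinds of feedback described suggest structures along these lines; every field name here is our invention, not the actual “Up format”.

```python
# Hypothetical record types for multi-modal feedback, loosely inspired by
# the "Up format" described above; all field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    target: str    # e.g. "image", "audio", "ui_mockup"
    locator: dict  # region box, timestamp range, or element id
    note: str      # the expert's targeted comment

@dataclass
class FeedbackTurn:
    author: str                       # "human_expert" or "agent"
    verdict: float | None = None      # overall thumbs up/down, if given
    annotations: list[Annotation] = field(default_factory=list)

turn = FeedbackTurn(
    author="human_expert",
    verdict=0.0,
    annotations=[Annotation(
        target="image",
        locator={"box": [120, 80, 340, 260]},
        note="this section's color saturation is wrong",
    )],
)
```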

How do you handle evaluation at scale, especially as agent capabilities improve?

Initially, all agent outputs are evaluated by human experts since current agents achieve only 30-40% task completion rates depending on the domain. These human evaluations serve a dual purpose: they provide immediate feedback for agent learning and create training data for developing automated evaluation systems.

Our long-term strategy involves training specialized “LLM as judge” models for each category of work. The key insight here is that verification is typically easier than generation – similar to the P vs NP distinction in computer science. A model that can reliably judge whether a logo is professional may require far less capability than one that can create professional logos from scratch.

The critical research question we’re tackling is determining the threshold at which an LLM judge becomes reliable enough to automate evaluation. This likely varies by domain – judging code functionality might require fewer examples than judging creative design quality. We’re systematically exploring how many human evaluation examples are needed for different categories before automated judges can provide meaningful assessments.
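
A simple version of that threshold test is to measure judge-human agreement on held-out labeled examples per work category; the 0.9 threshold below is a placeholder, not a figure from the interview.

```python
# Illustrative reliability gate for an LLM judge: automate evaluation for a
# category only once judge-human agreement clears a per-domain threshold.
# The 0.9 default is a placeholder, not a figure from the interview.
def judge_is_reliable(judge, labeled_examples, threshold: float = 0.9) -> bool:
    """labeled_examples: (output, human_verdict) pairs for one work category."""
    agree = [judge(output) == verdict for output, verdict in labeled_examples]
    return sum(agree) / len(agree) >= threshold
```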

Platform Evolution and Ecosystem Strategy

What does the five-year vision look like for how clients will interact with Upwork?

We’re evolving from a marketplace for matching talent to a platform for delivering finished work. In the future, clients will simply describe their goals – “build a new mobile app for my business” or “create a marketing campaign” – and after a detailed conversation with Uma about scope, budget, and timeline, Uma will deliver the completed product.

Behind the scenes, Uma will act as an orchestrator, decomposing projects into tasks and assembling hybrid teams of specialized AI agents and human experts. The human role will shift up the value chain – instead of doing manual execution, humans will focus on high-level problem formulation, creative direction, quality assurance, and handling edge cases that require genuine understanding or creativity. Machines will handle most routine execution, with humans providing guidance and evaluation at critical junctures.

How will third-party agents integrate into the Upwork platform?

We’re positioning Upwork as an “app store” for AI agents. Just as we currently have hundreds of thousands of human freelancers, we envision millions of specialized agents, provided they meet quality standards for their respective domains. We won’t build all agents ourselves – third-party developers can submit agents that, once they pass our quality bar and evaluation process, can offer services on the marketplace.

We’re also developing a framework for “second-party agents” where existing freelancers on the platform can create their own automated assistants or co-pilots. These freelancers have deep domain expertise and understand marketplace dynamics, positioning them perfectly to build valuable automation tools. As long as these custom agents meet our quality standards, they can be deployed within the marketplace for others to use as well. This creates a powerful flywheel where human expertise gets encoded into agents that augment other humans’ capabilities.

What changes should users expect in the next 6-12 months?

The categories of work available won’t change dramatically, but the speed and cost of delivery will improve significantly. Current tools like Cursor or Lovable don’t create fundamentally new types of outputs, but they democratize creation by reducing required technical skills and time investment. Similarly, we’ll see agents acting as powerful co-pilots and assistants, boosting human freelancer productivity.

The key differentiator in our approach is avoiding the convergence problem seen in current systems where all outputs look similar because they’re sampling from the same training distribution. Through RLEF, we’re aiming for agents that can generate out-of-distribution concepts while maintaining quality through human expert evaluation. This should lead to more diverse and innovative outputs rather than the homogenized results we often see from current AI tools.

Perspectives on Foundation Models and AGI

What would you like to see from foundation model providers in the next year?

The priority should be smaller, faster models rather than ever-larger reasoning systems. Current empirical evidence shows that reasoning chains often don’t lead to correct answers, and many tasks can be solved correctly without explicit reasoning steps. There’s tremendous redundancy in frontier-scale models – they don’t need to be so large for most practical applications.

More fundamentally, the field needs to move beyond learning statistical patterns from internet text. To make real progress, models need to incorporate symbolic AI constructs and grounded understanding of real-world principles like physics and causality. While current LLMs provide an incredibly efficient interface for interacting with data, they’re missing crucial components for genuine intelligence.

For practitioners, this means being selective about when to use large reasoning models versus smaller, task-specific models. The key is matching the tool to the task – not every problem requires or benefits from chain-of-thought reasoning or massive parameter counts.

Do you believe LLMs represent the path to AGI?

No, LLMs alone are not the path to AGI. They provide an excellent representational interface and are remarkably efficient at organizing and accessing information, but intelligence requires more than statistical pattern matching over text. Similar to how convolutional neural networks provided useful representations for moving from perception to concepts in computer vision, LLM embeddings organize data effectively but don’t create knowledge or perform genuine inference.

True AGI will likely require alternative architectures that incorporate world models, symbolic reasoning, and richer forms of learning beyond next-token prediction. For teams building AI applications today, this means recognizing LLMs as powerful tools for specific tasks – particularly those involving language understanding and generation – while being realistic about their limitations and planning for hybrid systems that combine multiple AI approaches for complex problems.