Building An Experiment Tracker for Foundation Model Training

Aurimas Griciūnas on Overcoming Experiment Tracker Bottlenecks in Large-Scale Model Training.


Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Aurimas Griciūnas is the Chief Product Officer of Neptune.AI, a startup building experiment tracking tools for foundation model training. This episode delves into the intricate world of Large Language Models (LLMs), covering everything from training and scaling to operational challenges and emerging trends. We explore the complexities of LLM training, the importance of robust and scalable experiment tracking and visualization, and the critical role of fault tolerance and checkpointing in managing long-running processes. The discussion also touches on the transition from research to production, fine-tuning techniques, and the significance of proper evaluation methods. We examine the challenges of scaling GPU clusters and infrastructure, as well as the importance of observability and guardrails in AI applications. Finally, we look at cutting-edge concepts like agentic AI and multi-agent systems that are shaping the future of LLMs.


If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.


Transcript.

Below is a heavily edited excerpt, in Question & Answer format.

What are the key differences between ML Ops and LLM Ops?

LLM Ops can be viewed as ML Ops specifically tailored for large language models, but with significant differences in scale and complexity. While Neptune started six years ago focusing on traditional ML Ops experiment tracking, the rise of LLMs has shifted their focus. The main distinction is that in LLM Ops, research has become production – previously separate professional domains have merged, with researchers joining companies training foundation models. Training frontier models is now more of a production workflow than experimentation, as these models take months to train with only one opportunity to get it right, creating what Aurimas calls “a healthy amount of paranoia” in teams running these training jobs.

What specific scaling challenges do LLM training operations face that traditional ML tools can’t handle?

LLM training presents unprecedented scaling challenges across multiple dimensions:

  1. Data volume: Teams need to track thousands or tens of thousands of unique metrics for each layer of these massive models, requiring extremely high-throughput logging systems (see the logging sketch after this answer).
  2. Visualization requirements: The data must be visualized precisely and quickly, as researchers spend their days examining charts for performance indicators or anomalies. You can’t afford to lose data resolution that might hide anomalies signaling system errors.
  3. Resource scale: Modern LLM training clusters involve 100,000+ GPUs, sometimes spanning multiple data centers, with training runs lasting months rather than days.
  4. Fault tolerance: Given the expense of these training runs, robust checkpoint management and the ability to restart from specific points without data loss are critical.

Traditional ML tools weren’t built for this scale, which is why Neptune had to rebuild their entire core infrastructure to support these demands.
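
To make the data-volume point concrete, here is a minimal logging sketch using the Neptune Python client's run-namespace pattern. The project name, layer count, and metric names are illustrative assumptions, exact call signatures may differ across client versions, and real foundation-model runs track far more series per step.

```python
import random

import neptune  # Neptune Python client; requires NEPTUNE_API_TOKEN to be configured

NUM_LAYERS = 96  # illustrative; frontier models track many more components

# Hypothetical project name, for illustration only.
run = neptune.init_run(project="my-workspace/llm-pretraining")

def log_layer_stats(run, step, layer_stats):
    """Log a few per-layer statistics for one training step.

    With ~100 layers and a handful of metrics per layer, every step already
    produces hundreds of unique series -- the kind of volume that pushes a
    tracker toward asynchronous, high-throughput ingestion.
    """
    for layer_idx, stats in enumerate(layer_stats):
        prefix = f"train/layers/{layer_idx:03d}"
        run[f"{prefix}/grad_norm"].append(stats["grad_norm"], step=step)
        run[f"{prefix}/weight_norm"].append(stats["weight_norm"], step=step)
        run[f"{prefix}/activation_mean"].append(stats["activation_mean"], step=step)

# Simulated training loop with placeholder statistics.
for step in range(3):
    layer_stats = [
        {"grad_norm": random.random(),
         "weight_norm": random.random(),
         "activation_mean": random.random()}
        for _ in range(NUM_LAYERS)
    ]
    log_layer_stats(run, step, layer_stats)

run.stop()
```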

How has Neptune redesigned their architecture to handle LLM training at scale?

Neptune completely rebuilt their backend systems to handle massive-scale LLM training:

  1. They moved from a fully consistent to an eventually consistent system to improve performance.
  2. They implemented an asynchronous rather than synchronous ingestion pipeline through Kafka.
  3. They developed sophisticated visualization techniques that preserve anomalies even when rendering millions of data points: rather than simple downsampling, statistics are calculated precisely per bucket while outliers remain visible (a simplified version is sketched after this list).
  4. They made all operations idempotent to ensure data integrity, with robust fault tolerance and the ability to replay data if backend systems go down.
  5. They implemented “forking capability” to branch experiments from checkpoints, allowing teams to inherit data from previous running jobs while exploring alternative approaches – a feature previously only available in proprietary systems like Google’s internal ML platform.
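
The bucketed-statistics idea in point 3 can be shown with a small, self-contained sketch; this illustrates the general technique, not Neptune's implementation. Each bucket keeps an aggregate alongside its min and max, so a single spike survives downsampling instead of being averaged away.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Bucket:
    start: int        # index of the first point covered by the bucket
    mean: float       # aggregate drawn at normal zoom
    min_value: float  # extremes preserved so outliers stay visible
    max_value: float

def downsample_preserving_outliers(values: Sequence[float], num_buckets: int) -> List[Bucket]:
    """Reduce a long metric series to at most `num_buckets` summaries.

    Plain averaging would flatten a one-step loss spike; keeping per-bucket
    min/max guarantees the spike still shows up on the chart.
    """
    size = max(1, -(-len(values) // num_buckets))  # ceiling division
    buckets = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        buckets.append(Bucket(
            start=start,
            mean=sum(chunk) / len(chunk),
            min_value=min(chunk),
            max_value=max(chunk),
        ))
    return buckets

# One million points with a single spike at step 500_000.
series = [0.5] * 1_000_000
series[500_000] = 42.0
summary = downsample_preserving_outliers(series, num_buckets=1_000)
spike = max(summary, key=lambda b: b.max_value)
print(spike.start, spike.max_value)  # 500000 42.0 -- the spike survives downsampling
```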

What are practical uses of experiment tracking for teams that aren’t training frontier models?

While frontier model training grabs headlines, experiment tracking tools like Neptune offer significant value for enterprise teams:

  1. Fine-tuning tracking: Teams can track the lineage of model fine-tuning, allowing them to branch from previous fine-tuning checkpoints and continue training while maintaining the complete history.
  2. Workflow management: The “forking capability” lets teams branch experiments from specific checkpoints, inheriting previous data while exploring new directions. This is useful for any iterative ML process, not just frontier model training (see the toy sketch at the end of this answer).
  3. Anomaly detection: Visualizing training metrics helps identify issues in both pre-training and fine-tuning, allowing teams to catch problems early.

The enterprise need for these capabilities is growing as more companies pursue ambitious fine-tuning projects and even train their own domain-specific foundation models – not just tech companies, but potentially large financial services companies and other enterprises with domain-specific data and needs.
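
To make the forking idea concrete, here is a toy, self-contained model of what a fork does. It is not Neptune's API, just an illustration under the assumption that a child run inherits the parent's metric history up to the fork step and logs everything afterwards on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ExperimentRun:
    """Toy model of a tracked run; illustrates forking, not a real client API."""
    run_id: str
    # metric name -> list of (step, value) points
    history: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)

    def log(self, name: str, step: int, value: float) -> None:
        self.history.setdefault(name, []).append((step, value))

    def fork(self, new_run_id: str, fork_step: int) -> "ExperimentRun":
        """Create a child run that inherits all points up to `fork_step`.

        The child shares the parent's lineage, so charts can show the parent's
        curve up to the fork and the child's curve afterwards.
        """
        inherited = {
            name: [(s, v) for (s, v) in points if s <= fork_step]
            for name, points in self.history.items()
        }
        return ExperimentRun(run_id=new_run_id, history=inherited)

# Branch a fine-tuning variant from the checkpoint at step 2_000
# without re-logging the earlier history.
base = ExperimentRun("base-finetune-01")
for step in range(3_000):
    base.log("train/loss", step, 1.0 / (step + 1))

variant = base.fork("finetune-lr-sweep-02", fork_step=2_000)
variant.log("train/loss", 2_001, 0.00049)  # new points diverge from the parent
```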

What three things should enterprise AI teams focus on right now in LLM Ops?

For enterprise teams working with LLMs, Aurimas recommends focusing on:

  1. Proper prompt engineering: Rather than pre-training foundation models, focus on getting the most from existing models through effective prompting.
  2. Implementing guardrails: This critical but often overlooked step ensures safe and responsible AI deployment (a minimal example follows this list).
  3. Building comprehensive observability: This includes defining evaluation metrics before deployment and implementing systems to track model performance in production.

Only after mastering these fundamentals should enterprises consider more advanced approaches like agentic systems, which bring their own complexity.
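
As a minimal sketch of what a guardrail can look like in code, the example below applies cheap, deterministic checks to a model response before it is returned. The patterns and thresholds are illustrative assumptions; production systems typically layer classifiers, allow/deny lists, and human review on top.

```python
import re
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""

# Illustrative checks only; real guardrails combine many signals
# (toxicity classifiers, PII detectors, topic allow-lists, and so on).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MAX_RESPONSE_CHARS = 4_000

def check_output(text: str) -> GuardrailResult:
    """Run cheap, deterministic checks on a model response before returning it."""
    if len(text) > MAX_RESPONSE_CHARS:
        return GuardrailResult(False, "response too long")
    if EMAIL_PATTERN.search(text):
        return GuardrailResult(False, "possible PII (email address) in output")
    return GuardrailResult(True)

def respond(model_output: str) -> str:
    verdict = check_output(model_output)
    if not verdict.allowed:
        # In production, also log the blocked response and reason for observability.
        return "Sorry, I can't share that response."
    return model_output

print(respond("Reach the admin at admin@example.com"))  # blocked by the PII check
```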

What future developments is Neptune planning to address evolving LLM scaling challenges?

Looking forward, Neptune is focusing on several areas to address evolving LLM training needs:

  1. Advanced search and filtering: Making it easier to find specific data among hundreds of thousands of unique metrics.
  2. Experiment monitoring capabilities: Moving beyond tracking to monitoring, with alerts, thresholds, and intelligent anomaly detection to identify issues before humans can spot them (a simple spike-detection rule is sketched after this list).
  3. Knowledge embedding: Capturing the expertise that experienced LLM researchers develop about what works in training, either through automated pattern recognition or by allowing researchers to define rules for what the system should look for.
  4. Per-GPU tracking: Supporting the collection of metrics at individual GPU level, which could increase data volume by 1000x but provide crucial insights for optimizing massive training clusters.

These capabilities will help make researchers and platform teams more efficient as models continue to grow in size and complexity.
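
To illustrate the kind of rule such monitoring might start from, here is a small, self-contained spike detector. It is an assumption for illustration, not a description of Neptune's roadmap: a value is flagged when it drifts more than a few standard deviations from a trailing window of recent values.

```python
from collections import deque
from statistics import mean, stdev

class SpikeDetector:
    """Flag a metric value that deviates sharply from its recent history."""

    def __init__(self, window: int = 200, num_std: float = 4.0):
        self.values = deque(maxlen=window)
        self.num_std = num_std

    def update(self, value: float) -> bool:
        """Return True if `value` looks anomalous relative to the trailing window."""
        is_spike = False
        if len(self.values) >= 30:  # wait for some history before judging
            mu, sigma = mean(self.values), stdev(self.values)
            if sigma > 0 and abs(value - mu) > self.num_std * sigma:
                is_spike = True
        self.values.append(value)
        return is_spike

# Simulated loss curve that diverges once at step 700.
detector = SpikeDetector()
for step in range(1_000):
    loss = 2.0 - 0.001 * step
    if step == 700:
        loss = 15.0  # simulated divergence
    if detector.update(loss):
        print(f"ALERT: loss spike at step {step}: {loss:.2f}")
```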


This post and episode are part of a collaboration between Gradient Flow and Neptune.AI. See our statement of editorial independence.