The Hidden Failure Modes of AI Agents

Helen Gu on AI Reliability, Silent Failures, SLMs, and the Future of Ops.

Ben Lorica speaks with Helen Gu, CEO of InsightFinder AI and professor at North Carolina State University, about why reliability is becoming a central challenge for enterprise AI. They discuss AI incident response, monitoring agents in production, domain-specific small language models, causal inference for operational data, and why the “last 1%” of accuracy and reliability is often the hardest part of deploying AI systems. The conversation also covers the limits of general foundation models, the need for AI-specific playbooks, and why computer science education still matters in an era of coding agents.

Related content:

A video version of this conversation is available on our YouTube channel.
Zhou Yu → Why Your AI Agent Isn’t Ready to Ship (And How to Know When It Is)
Your AI agent looks capable. But can it actually finish the job?
The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It
Richard Garris and Barry Dauber → The Gap Between AI Hype and Enterprise Reality
Arun Kumar (of UCSD and RapiFire AI) → Are Multi-Agent Systems More Complex Than They Need to Be?

Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, so today we have Helen Gu, professor at North Carolina State University, but more importantly for this episode, she is the CEO of InsightFinder AI, which you can find at insightfinder.com. The tagline from the website: “Reliability for the AI era — deliver reliable AI and IT services in production, fix issues in real time with the ARI operational agent.” And with that, Helen, welcome to the podcast.

Helen Gu: Thank you. Thanks for inviting me, Ben.

Ben Lorica: It seems like, in reading about InsightFinder, the company started around cloud management technologies, and obviously now a lot of that has to do with AI. So, is it fair to say, Helen, that the DNA of the company still remains with the IT ops community? Is that still the target for the tools you’re building?

Helen Gu: Yeah. From the beginning, the company’s mission has basically been to improve reliability for systems using AI technology. That mission has never changed. The systems we started with were IT systems, but now you can see more agentic AI systems being built, and so we’ve expanded into AI systems and agentic AI systems as well.

Ben Lorica: So you’ve moved beyond your original target audience of IT ops and DevOps, to now even knowledge work — is that what you’re saying?

Helen Gu: Not exactly moved out — it’s more that we’re building a more unified platform for both. What we observe is that to actually make your system services reliable, you need to monitor your infrastructure and your IT side, as well as your models, your protocols, and your data. So essentially you need a more holistic solution to achieve reliability.

Ben Lorica: Since you talk to a lot of companies, where do they still have a hard time going from demo to production? What are some of the key challenges?

Helen Gu: For AI systems, there are a lot of challenges. Most of the time, an AI system is trained on certain training data, and when you deploy those AI models or AI systems in the real world, you can’t predict what the real data will look like. It might look similar to your training data, or it might not. When you have that kind of unexpected data, workload, or scenarios, AI models can give wrong answers — and sometimes very misleading ones. This is basically where we see the biggest gap in production environments, and it poses a lot of risk to companies, especially in heavily regulated industries like finance, insurance, pharmaceuticals, and healthcare. That’s where InsightFinder wants to help our customers fill the gap — to catch any problems that might cause high financial impact or security risk, catch them automatically and early, and fix them before they actually have an impact.

Ben Lorica: In terms of helping AI teams — it sounds like you definitely help once I’ve deployed the agent, with tools to monitor and make sure it’s working well. But do you help pre-deployment? There’s a wave of new startups focused on this topic — they do things like generate synthetic data for the types of edge cases you alluded to, and maybe help with evals pre-deployment. Is that something you folks do as well?

Helen Gu: Yes, absolutely. We work with our customers throughout the lifecycle. We have a product capability called LLM Labs, which allows customer data scientists in particular to evaluate different versions of prompts when they’re tuning them. Prompt engineering is a very important step to making an AI system work in a robust way. We help our customers keep track of prompt versions and compare different versions, and we do comprehensive evaluation across prompts, data, and models. When you do this kind of comprehensive comparison, you can pick the right model, right prompt, and right dataset for each use case, rather than just relying on one model — say, going to OpenAI or Anthropic for everything. In real-world scenarios, just changing some of the wording or the order of your prompts can cause a big impact on the accuracy of the output, so we help our customers compare and fine-tune all those aspects.
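
The comparison Helen describes, scoring each combination of model and prompt version against the same dataset, can be sketched in a few lines of Python. This is a minimal illustration and not InsightFinder's LLM Labs implementation: `evaluate_grid`, `exact_match`, and the stand-in model callables are hypothetical names introduced here, and a real evaluation would use a richer scorer than exact match.

```python
from itertools import product


def exact_match(output, expected):
    """Score 1.0 when the model output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def evaluate_grid(models, prompts, dataset, score=exact_match):
    """Score every (model, prompt) pair on the same dataset.

    models:  dict of name -> callable(prompt_text) -> output_text
             (hypothetical stand-ins for real model clients)
    prompts: dict of version name -> template containing "{question}"
    dataset: non-empty list of {"question": ..., "expected": ...} examples
    Returns (model_name, prompt_name, mean_score) tuples, best first.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    results = []
    for (model_name, model), (prompt_name, template) in product(
        models.items(), prompts.items()
    ):
        scores = [
            score(model(template.format(question=ex["question"])), ex["expected"])
            for ex in dataset
        ]
        results.append((model_name, prompt_name, sum(scores) / len(scores)))
    # Highest mean score first; ties keep their original grid order.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Keeping the dataset fixed while varying one axis at a time is what makes the prompt-sensitivity effect visible: the same model can rank first with one wording and last with another.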

Ben Lorica: In fact, when you talk to some of these people building agents and find that they’re using an Anthropic model or OpenAI, and then you suggest moving to a cheaper model —

Helen Gu: Yes.

Ben Lorica: — they’ll say, “Well, our prompts are so sensitive. If we move to the cheaper model, that’s a lot of work for us.” So in terms of automatically tuning prompts, did you look into tools like DSPy?

Helen Gu: We’re not actually focused on just prompt tuning itself, because we’re looking at a broader problem — more like whole workflow management. Right now, a lot of real-world use cases require multi-agent systems, so you’re not just deploying one agent. The outcome depends on different models and different agents, and for different tasks you’re probably better off using different models. Tuning one model also has an impact on other agents downstream, so it’s a very complex workflow management process. We provide a platform for data scientists to view the whole workflow from beginning to end across all the agents, and we can track where a deviation starts and how the impact propagates throughout the workflow. The tuning steps are still semi-automatic — we want to give data scientists the evidence first: why there’s a hallucination, why there’s a logical inconsistency, why there’s potential data leakage or malicious information. We give them the evidence and then some recommendations, but the final decision still rests with the data scientists, because prompt tuning is very use-case specific. We have different customers across different sectors, and the way they evaluate their systems is also very different. That’s why our approach differs from other vendor solutions that rely on foundational AI models to evaluate foundational models — like using OpenAI to evaluate OpenAI, which probably doesn’t make sense, and is also very expensive. What we do instead is build customized SLMs for our customers for each different domain, and we allow them to easily customize the evaluation judge LLM based on their specific use cases. So we’re not building just one judge LLM — we’re building specific judge LLMs for different use cases. The idea is to do very fine-grained tracing and analysis, expose any vulnerabilities or inaccurate responses to the data scientists as much as possible, and then let them tune based on our guidance. Maybe down the road we can do automatic tuning, but I think we’re still far away from that.

Ben Lorica: So one takeaway seems to be — and one analogy is that in the early days of generative AI, the “Hello World” example was RAG, and it turns out that in RAG there are a lot of knobs to tune: the chunking strategy, information extraction, search and retrieval, and so on. Based on what you’re saying, you have a platform that gives data scientists a view of the different knobs they can turn, with a principled sweep — like the equivalent of hyperparameter tuning — that allows them to get some reassurance that they’re setting up their agent in an optimal way. That includes the prompt, might involve generating some synthetic data to push the agent beyond the sample dataset, and also involves some fine-tuning of a small language model that acts as a judge. What’s the family of language models you’re using as a base?

Helen Gu: We’ve tried different bases, and most of the models we target are very small, because we want to achieve cost efficiency for our customers. We have a suite of popular small language models — like Llama, for example.

Ben Lorica: Open-weight models. So how much fine-tuning data is needed? I guess that’s hard to answer since it’s task-specific, but is the user responsible for putting together the fine-tuning data?

Helen Gu: Actually, we help our customers generate it automatically. What we do is hook into their production pipeline — for example, if they use Temporal, we have native integration with it. When they deploy the model through the CI/CD system, we automatically extract all the data: traces like inputs and outputs, as well as performance data, token consumption, everything. Then we start doing evaluation in the background on their real production workload and identify any potential problems — whether it’s a performance issue, an accuracy issue, or a security issue — and display everything in one dashboard for them to review. For detected anomalies — whether performance, accuracy, or security anomalies — we also generate evidence explaining why something is wrong, and we use that evidence as guidance to automatically generate training data. For example, a very simple case: if you ask any foundational model — Anthropic, Gemini, or OpenAI — about a Windows error code like 0x0000009F, all of them will give you an answer, and all of them will be wrong. This has been going on for three years. They don’t fix the problem. If you ask any domain-specific technology question, those foundation models often give you the wrong answer, because they’re trained on public web data. You can find a lot of recipes for lemon juice on the internet, so ChatGPT and Anthropic are pretty good at that. But if you ask about a Windows error code, very few people have actually asked or posted about that — even Microsoft doesn’t make that kind of information very public — so those models perform very badly on domain-specific technical questions. Because we have our domain-specific SLMs — we started with IT operations, so we’re experts in operations — we build our AI agents, like ARI, to perform all kinds of technical tasks. We have judge LLMs built specifically for the IT domain, which is why we can detect those errors and automatically correct them. For example, Anthropic might tell you a certain error code means some Windows library linking error, which is completely wrong — it is actually a watchdog timeout crash code in Windows. We can automatically correct that and generate training data so customers can fine-tune their model.

Ben Lorica: There are other teams in the IT ops and DevOps domain that have gone to the point of building foundation models. The canonical example is Datadog — they’re building time series foundation models, and now they’ve also started talking about what they call a “world model,” which is basically a foundation model that uses alerts, logs, traces, events, metrics, even source code. The idea is that this foundation model would allow you to do things like predictions, proactive alerting, maybe simulation, and so on. Are there researchers building foundation models for this type of operational task?

Helen Gu: Absolutely. This has been in the research domain for several years. Here’s our take: different AI systems and algorithms are designed for different data. A large language model is basically text token prediction — it’s not new technology. It’s a well-known canonical text processing technology.

Ben Lorica: Don’t tell Anthropic and OpenAI that — they’re about to go public.

Helen Gu: They’re just text token prediction, but the reason they’re so successful is because they’re trained on a massive amount of data — and that training cost is huge. Training just once takes millions of dollars, which is why those companies need to raise so much money.

Ben Lorica: Right.

Helen Gu: So let’s talk about operational data. Operational data — you’re looking at time series data —

Ben Lorica: There are many people who have put forward time series foundation models, and the impression I get is they’re pretty good, but it depends on your use case and how accurate you need to be. What’s your sense of time series foundation models in general?

Helen Gu: So, essentially, you have time series data, you have log data, and that machine data is fundamentally different from human language. First of all, it’s a lot noisier than human language, and a lot more diversified.

Ben Lorica: And even compared to images, right? Time series are a lot more unpredictable.

Helen Gu: More complex than images. A lot of people think image data is very complex — it’s not, because the color pixel space is limited. You have 256 dimensions, that’s it. But if you look at machine data, we’re dealing with hundreds of millions of dimensions easily. Different machine learning technologies are designed for different data. That’s the reason we invented unsupervised machine learning technology — published almost 15 years ago, and also licensed by Google. We developed that technology because we had high-dimensional, noisy data in mind. That’s the reason we don’t believe building a single foundational model is a practical or right solution. A, it’s very expensive —

Ben Lorica: You mean a single foundation model that you can throw any time series against?

Helen Gu: Yeah. And B, it’s very hard to achieve real-time performance. Remember, we’re dealing with hundreds of millions of data points, and you need to generate insights within seconds —

Ben Lorica: Or even milliseconds.

Helen Gu: Yeah, milliseconds. We serve credit card companies — we need to detect problems within milliseconds, because every second there are millions of transactions. It’s basically very hard to achieve that kind of real-time insight using a foundational model, even with the fastest GPUs in the world. So we don’t believe building one foundational model is the right approach. From day one, we’ve been building composite AI technology — different AI models for different kinds of data, focused on real-time analysis. Another key difference is that foundational models don’t support real-time learning. If you ask an LLM like ChatGPT to predict Intel’s stock price today, it can’t do it, because each training run takes a lot of time and money — they’re not going to do real-time learning. In contrast, if you want to apply AI agents to dynamic system monitoring like IT systems, those systems change every second. You need to do learning and adaptation all the time. For example, your workload can fluctuate because of some triggering event — a big stock price move, a major media event — and a lot of systems will change based on that. If you’re doing real-time prediction and anomaly detection, you need to adapt your model to the new trend, so that the new normal becomes recognized as normal behavior. Foundational models are fundamentally not designed for that.
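
The idea of adapting so that "the new normal becomes recognized as normal behavior" can be illustrated with a small online detector. The sketch below assumes an exponentially weighted running mean and variance, which is one common textbook technique and is not a description of InsightFinder's algorithms; the class name `AdaptiveDetector` and its parameters are hypothetical.

```python
import math


class AdaptiveDetector:
    """Flag points far from an exponentially weighted baseline, then adapt to them."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # how quickly the baseline follows new data
        self.threshold = threshold  # number of standard deviations that counts as anomalous
        self.warmup = warmup        # points to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Return True if x is anomalous relative to the current baseline."""
        self.count += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        # Fold every point into the baseline, anomalous or not, so that a
        # sustained shift in the workload gradually becomes the new normal.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly
```

With a steady series near 100 followed by a sustained jump to 200, the first point at the new level is flagged, and after enough observations at 200 the detector stops flagging it. Each update is constant-time, which is the property that matters for the latency budgets discussed above.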

Ben Lorica: The other approach — somewhat related but more targeted at relational data — is something pioneered by Jure Leskovec’s group at Stanford. Using graph neural networks, he has a company, Kumo (K-U-M-O.ai), and the idea is to take all that relational data, build a graph, and use some sort of relational foundation model to do forecasts, churn prediction, and so on. It’s not quite the same as what you describe — constantly changing machine data in IT and DevOps requiring low-latency predictions. But what’s interesting about what he’s doing is that he’s leveraging relationships. Maybe the time series data you alluded to — maybe there are hundreds of them, and maybe they’re not all unrelated.

Helen Gu: They are heavily related, and that’s actually one of the key capabilities we provide, called causal inference. For a lot of our customers, what they care about is: how do I know what caused this problem? We don’t just tell people, “Oh, there’s something wrong here” — we tell them why there’s something wrong and how to fix it. The relationship you mentioned is key to figuring out the relation between different anomalies and incidents or outages or model problems. Traditional time series analysis is more about trend analysis, and that kind of prediction is very limited. Most of the time, customers feel those predictions are useless because they focus on a single metric — you can predict a disk filling up, simple threshold violations, but what customers want is to predict outages: when my service is unavailable to customers, why, and how do I fix it? This requires causal inference, because most of the time it’s not just one metric or one log entry going wrong. It’s a complex distributed system, heavily replicated, so it doesn’t just break in a second from one point. It usually starts from early symptoms, early problems, and then propagates through the distributed system. It’s very important to understand how your systems interact with each other and how different anomalies propagate and impact your whole system. That’s the other part of what we do — real-time causal inference.
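
The propagation pattern described here, where early symptoms in one component spread through dependent services, suggests a simple way to narrow down where a problem started. The following is a deliberately naive heuristic and not real causal inference or InsightFinder's method: it assumes a known dependency map and the time each service first looked anomalous, and both `candidate_root_causes` and the input shapes are hypothetical.

```python
def candidate_root_causes(depends_on, first_anomaly_at):
    """Return anomalous services not explained by an earlier upstream anomaly.

    depends_on:       dict of service -> list of services it depends on
    first_anomaly_at: dict of service -> time (any comparable number) at
                      which the service first showed an anomaly
    A service is "explained" if at least one of its dependencies became
    anomalous at or before the same time.
    """
    roots = []
    for service, t in first_anomaly_at.items():
        upstream = depends_on.get(service, [])
        explained = any(
            dep in first_anomaly_at and first_anomaly_at[dep] <= t
            for dep in upstream
        )
        if not explained:
            roots.append(service)
    # Earliest unexplained anomaly first.
    return sorted(roots, key=lambda s: first_anomaly_at[s])
```

For example, if `checkout` depends on `payments`, which depends on `db`, and anomalies appear at times 10 (`db`), 25 (`payments`), and 40 (`checkout`), the only candidate returned is `db`. Temporal order plus topology is only a starting point; correlated but unrelated anomalies would fool it.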

Ben Lorica: I assume the more you work with different companies, the more capable your agents become, right? Because basically they learn over time how these systems interact.

Helen Gu: Yeah, the system can be fine-tuned based on common distributed system knowledge — that’s basically how we built the ARI agent at the beginning. However, we don’t mix different customers’ data together, so even for one customer with different use cases and different applications —

Ben Lorica: I’m not saying you mix their data — I’m saying you understand how these systems interact at a high level.

Helen Gu: At a high level, yes, but there’s still a learning process. For each customer, we make the model and agent highly customizable, and we give customers many mechanisms to provide feedback. We built a feedback center that collects all kinds of feedback from real users, and we continuously do reinforcement learning for our AI agents. Our belief is that no AI is perfect. Nowadays, if you use Anthropic or ChatGPT for regular questions — how do I build this, how do I make a recipe — and they give you the wrong answer, no big deal. But if you have model drift or hallucination out there in a production system —

Ben Lorica: By the way, Helen, for our audience — when you use the word “agent,” can you describe what that means? Imagine I’m a user of your system. What does an agent mean?

Helen Gu: Yeah, so we call it the AI agent, and ours is called ARI. When you log in, you can interact with ARI. ARI will do a lot of things. First of all, without you prompting it, ARI will summarize your system health immediately when you log in, and then guide you through: “Do you want to get the root causes for this incident?” You can also ask ARI to take actions — like “Create a Jira ticket for me” — or ARI will tell you, “You have an incident, but it’s related to an AWS outage,” because ARI is checking Down Detector, checking weather information, doing all kinds of work in the background. When you interact with ARI, it’s already equipped with all the insights you’d traditionally have to gather manually through dashboards on something like Datadog. ARI has already done that work, so you get precise, concise, extracted information from our AI agent. The challenge for humans is not making decisions — humans are very good at making decisions when given a limited number of choices. But give a human a lot of noise and a lot of information, and they’ll make very bad judgments. Humans are also very bad at predictions. That’s basically what our AI agent is good at. That’s what we mean when we talk about the agent.

Ben Lorica: So ARI, as I understand it, is your agent that helps me as an ops person. But what if a company has already built their own ops-related agents? What’s your role in that scenario?

Helen Gu: We built ARI on top of MCP. So essentially, if a customer has their own agents, we expose our insights as an MCP server, and their agents can query our MCP server and tap into all the information ARI has, and link that to their own agent. It’s very generic and easy to integrate.

Ben Lorica: Interesting. You used the phrase “reinforcement learning,” and I’m a big reinforcement learning fan — I always think it’s just around the corner from being democratized, but not quite. When you use the phrase “fine-tuning,” there’s classic fine-tuning with labeled examples, but increasingly there’s reinforcement fine-tuning. In what sense do you mean reinforcement learning?

Helen Gu: There are different kinds of learning. Fine-tuning can be supervised — here’s a prompt, here’s the answer, here’s my expected result. That’s what most other vendors like Braintrust are doing.

Ben Lorica: There’s a ton of services doing that now.

Helen Gu: Right, that’s supervised learning. What we do is more on the unsupervised learning side — we don’t have expected answers, because we’re evaluating real production data. What we do is take the prompt, the dataset, and the answer, and reason about the reasoning steps. We look at whether the response is based on facts — we examine the reasoning and inference paths from the LLM models and check whether each reasoning step is logical and makes sense. This evaluates how the LLM works, not based on an expected result. That’s what we call unsupervised. Then there’s a third category — what we call reinforcement learning — where we don’t have expected results, but we generate evaluations and responses, and the user can give feedback —

Ben Lorica: Thumbs up, thumbs down.

Helen Gu: Yes, that feedback. We collect all the positive and negative feedback, generate training data from it, and for positive feedback we retrain the model so you get the accurate answer with higher probability. Whatever model you’re dealing with is statistical — a lot of people who don’t understand machine learning well think everything is deterministic, but that’s not true. Most LLMs are statistical. That’s why if you ask the same question to an LLM multiple times, you’ll get different answers. And they also sometimes customize answers based on your context. If you ask “What is one plus one?” and they say two, and you say “No, it’s three,” and you keep doing that, they’ll change their mind — because they take context into their reasoning. They’ll say, “In our past conversation, you told me one plus one is three, so when you ask me again, I’ll say three.” That’s what statistical learning is about — every interaction increases a probability. It’s not based on fact; it’s based on how many times you reiterate. That’s basically the reinforcement learning we have.

Ben Lorica: So is the reinforcement learning in the context of ARI in the following sense: ARI is an agent you offer, it uses a certain language model — possibly a reasoning-enhanced model — and I’m helping ARI get better by deconstructing ARI’s reasoning steps?

Helen Gu: Yeah, you can give feedback. Each conversation has a thumbs up and thumbs down button — you can just click that. Every piece of feedback from engineers using our tool, we take in. Of course, how to fine-tune based on that is controlled by the user — that’s another reason why we focus on SLM solutions, because those SLMs are very easy to adjust. You have full control, and you can decide how you want to fine-tune or retrain the model with that training data. It’s a very customer-driven solution. We want to give the platform tools to customers so they can do that themselves.
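
The loop described in this exchange, where thumbs-up and thumbs-down clicks are collected and turned into training data, might look roughly like the sketch below. It assumes a simple event shape with `prompt`, `response`, and `rating` fields and a prompt/completion JSONL output; those field names and both functions are hypothetical, and the sketch keeps only positive examples, following Helen's remark that positive feedback is what the model is retrained on.

```python
import json


def feedback_to_training_records(events):
    """Turn thumbs-up feedback events into prompt/completion training records.

    events: iterable of dicts with "prompt", "response", and "rating"
            ("up" or "down"); this shape is an assumption for illustration.
    Thumbs-down events are skipped here; a fuller pipeline might route
    them to review or use them as negative examples.
    """
    return [
        {"prompt": e["prompt"], "completion": e["response"]}
        for e in events
        if e["rating"] == "up"
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r) for r in records)
```

Because the output is a plain file of examples, the customer stays in control of when and how a small model is actually fine-tuned on it.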

Ben Lorica: By the way, what’s your cutoff for the “S” in SLM? What number of parameters constitutes “small”?

Helen Gu: That’s a great question. We typically go to very small models. Nano models are typically not very good, so we look at a couple of billion parameters — that should be good enough for most of our use cases. We use the GPU to run those models and fine-tune them, and it works pretty well.

Ben Lorica: You know, as we alluded to earlier, you originated out of ops, DevOps, IT ops. I’m assuming this is a space that’s starting to use AI agents — ARI, or maybe their own. But historically this is a space quite familiar with incident response. A lot of them have incident response playbooks, but maybe not many have AI-specific incident response playbooks. What is incident response for our listeners? It usually has these steps: prepare, identify what an incident is, contain it, eradicate it, recover, and then have some sort of post-mortem and lessons learned. To what extent do the teams you’re meeting have actual AI-specific incident response playbooks?

Helen Gu: I would say very few. Most incident response is still rule-based or policy-driven — you set up rules saying, “When you see CPU exceeding this threshold, when you see this error code, do this.” That’s one of the biggest pain points for a lot of our clients.

Ben Lorica: But now you also have agents and AI services that will generate their own incidents. How many of these teams have playbooks for incidents specific to their AI apps?

Helen Gu: Essentially, what we do here is give operators more context about problems — not just sending alerts. Traditional incident response tools are still kind of alert-response tools. Look at PagerDuty — they have tons of alerts, and most of the time —

Ben Lorica: At least in that context, they have some notion of what an incident is. My concern is that for AI teams, in some cases they don’t even agree on what an incident is.

Helen Gu: Yes. Some AI incident terminology overlaps with traditional IT, right? For example, if you go to an AI system and it’s not responding — that’s very common with a lot of cheap models. But is hallucination an incident?

Ben Lorica: Is hallucination an incident?

Helen Gu: Hallucination is definitely an incident, especially when you’re not just chatting with a bot but actually taking actions. If you use ChatGPT to predict whether Apple’s stock price will increase, and you purchase a million dollars of Apple stock based on that prediction — that’s a hallucination incident. There are also incidents caused by sensitive information leakage — guardrail-related issues.

Ben Lorica: Guardrail-related, yeah.

Helen Gu: Just today I saw news about a hack into a Meta AI agent where they found sensitive customer information from companies like Target. It’s very dangerous if you don’t put enough guardrails in place, because those models are trained on massive amounts of data and put in front of humans — including potentially malicious users who can ask all kinds of questions to trick the LLM into giving sensitive information.

Ben Lorica: And AI models and systems are software — they can go down, suffer outages, have latency issues, get jailbroken. There are a lot of incidents specific to AI that I’m afraid many teams haven’t developed a process for dealing with. And obviously the holy grail in incident response is lowering mean time to recovery. I’m assuming this is where ARI helps a lot — in a principled and targeted way, not only identifying an incident but maybe suggesting remediations?

Helen Gu: Yeah, we monitor model drift, we monitor hallucinations, we capture sensitive information leakage. We can intercept malicious prompts before they hit the LLM, and we can intercept malicious or sensitive information in responses before they’re shown to the user. There are a lot of guardrails we can help our customers put in place. And it’s very important that you capture that information and adjust your model accordingly — that’s what I mean by a closed feedback loop. Otherwise you’ll make the same mistake again and again. A lot of silent failures —

Ben Lorica: Silent failures are a killer.

Helen Gu: Exactly. I tell customers: if your system crashes, no big deal — it’s easy, because you detected it, you know it’s important, you’ll fix it. But if you have model drift or hallucination out there —

Ben Lorica: Or an agent that’s supposed to check some CRM system, and it turns out that CRM system hasn’t been updating for months, so the agent keeps saying, “Yeah, I just checked, it’s fine.”

Helen Gu: Exactly. Those kinds of things — if you don’t have the right platform in place to help you keep track, there are a lot of potential pitfalls you’re going to fall into.
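
Ben's stale-CRM example is a silent failure that a freshness check can surface. The sketch below assumes the agent can read a last-updated timestamp for each data source it consults; the function names, the 24-hour default, and the message wording are hypothetical and only illustrate the idea of refusing to report "fine" on stale data.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, now=None, max_age=timedelta(hours=24)):
    """Return (is_fresh, age) for a data source the agent depends on."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - last_updated
    return age <= max_age, age


def agent_report(source_name, last_updated, now=None):
    """Report on a source, warning instead of saying 'OK' when data is stale."""
    is_fresh, age = check_freshness(last_updated, now)
    if not is_fresh:
        return (
            f"WARNING: {source_name} has not updated in {age.days} days; "
            "result not trustworthy"
        )
    return f"{source_name} checked: OK"
```

A check like this turns "the agent ran without errors" into "the agent ran on current data", which is the distinction that silent failures hide.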

Ben Lorica: So Helen, on the foundation model front — based on the types of applications InsightFinder is interested in, what do you find most interesting? Is it improvements in multimodality? Systems like Mythic for cybersecurity? What developments in foundation models are particularly intriguing to you?

Helen Gu: We’re not really focused on foundation models — we have very minimal usage of foundational models, and most of our AI models are unsupervised machine learning that we’ve built ourselves.

Ben Lorica: But your customers might be relying on them, so you have to pay attention, right?

Helen Gu: Right. Customers typically have foundational models, and they’re starting to use gateways for routing. Price is definitely the number one concern for a lot of customers, hallucination is also important, and then guardrails — those are the three main concerns. We see a lot of cost adjustments as foundational models release new versions at lower prices, but our customers observe a lot of performance issues with those newer models. A newer model doesn’t necessarily mean better accuracy. I’ve seen customers complain to us: “Anthropic upgraded their model, and now we have silent failures again.” They don’t know — is it because our AI agent has a problem, or is it because the model changed? Relying on third-party public models carries a lot of risk and unknowns for our customers. A lot of customers are still in the early stages of deploying those in mission-critical pieces. Coding is the easy one — that’s technical, people use it for that. But for other use cases outside of coding, there are a lot of challenges, especially in finishing the last mile. We always say: finishing 99% is easy, but getting that remaining 1% accurate and reliable is the hardest part.

Ben Lorica: You mentioned that cost is obviously a top-level concern — I joke that the CFO is now the CTO, the Chief Token Officer, because cost is so top of mind for the C-suite. Does that mean, Helen, that based on the companies you work with, they recognize that it’s not a good idea to rely on just one model provider, and that they should explore open-weight models?

Helen Gu: Yes, absolutely. In our solutions, customers like that we use open-source models — they have full control, we can do on-premise deployment, and we can help them fine-tune easily. Those are some of the benefits. And you always want to avoid a single point of failure. Just recently, Claude had a several-hour outage, and it’s funny — our team was doing coding, and one of the engineers said, “Why am I not getting answers?” So we asked ARI, “Is Claude down?” and ARI said yes. For development, okay, some development slows down — they can still write code. But imagine you’re doing that for something more critical. That could be a real disaster.

Ben Lorica: So, closing segment — it’s tradition for any guest who’s a professor in computer science: I like to ask about the future of computer science. I’m assuming your students are reading the same headlines — not as many jobs, “Why should I get into CS?” And the ones in PhD programs are thinking, “Why should I finish? I should just go to the Bay Area and get a job at one of these labs.” So what say you, Helen? What’s happening in computer science? Are people getting depressed out there?

Helen Gu: I definitely see an impact on graduating students. If you’re graduating this year with a bachelor’s degree, it’s probably hard to find a job.

Ben Lorica: Is it noticeable — really noticeable from this year to last year?

Helen Gu: Yeah, definitely noticeable. However, I tell my students that the skills you need to excel in this new world are different. It’s still very hard for AI to write high-performance, highly reliable distributed systems, which is probably why my class is always very popular. I tell students: you need to learn how to focus on architecture, algorithms, and design. Those things also require creativity and creative thinking. I think it’s very dangerous when people think, “Oh, everything is about the cloud now.” I don’t believe the cloud can solve all the problems, and I believe innovation and diversification are very important for the whole world to evolve. Look at the early days: in the IBM mainframe era, there was no PC, and everybody thought, “Everything we do is on the supercomputer or the mainframe.” But that turned out to be a bad solution because very few people could access those resources, and that wasn’t going to drive evolution. Later on, we had PCs, and then you started to see all kinds of applications and operating systems being created. I think AI is still at a very early stage. We see one solid use case in coding, but there are a lot of other use cases that require domain knowledge, design, and customization. In fact, I would say getting an advanced degree, a master’s or PhD, is actually more important now, because it’s no longer just about writing code. You need to be able to articulate your ideas, write them down, and communicate them. That’s what we train PhD and master’s students to do. An LLM knows whatever you teach it, but it cannot think outside the box. It cannot say, “Wait, I should not use a ring architecture here. I should use a hierarchical tree for this system.” It doesn’t know the context, and it cannot invent new things. Computer science education has always focused on that, and now it’s more important than ever. I always tell my students: it’s not important to learn what something is. It’s important to learn how to do it, how to solve the problem, and why you do it. Then you can invent new things.

Ben Lorica: It seems like what’s clear is this: if you’re a programmer and you’re open to these new tools, I think you’re fine. But if you’re a programmer who resists these tools, you might be in trouble. And I think, despite headlines about downsizing, the general sense I get is that companies are retaining their developers — but those developers are unfortunately even more overworked now, because coding agents are generating so much code that they have to fix or improve. And I think you might have a front-row seat to this in the IT ops and DevOps world, in the sense that your tools probably aren’t causing layoffs, but they’re causing people to become more productive — and maybe expectations get higher, because now they’re expected to be able to fix more things.

Helen Gu: Yeah. It’s actually good news for startups, because if you have novel ideas, you can develop products much faster at much lower cost. And also, because there’s so much code being shipped without thorough testing, tools like ours are —

Ben Lorica: Even the users of your tools inside companies — it’s not like once you adopt them, you reduce your ops team. No, you keep your ops team, but now your ops people have superpowers.

Helen Gu: Exactly. They need to supervise more systems and more agents.

Ben Lorica: More agents.

Helen Gu: Exactly. In that sense, I think it’s a good thing: it will speed up innovation and true technology advancement.

Ben Lorica: On the other hand, the more pessimistic view is that the entry-level first job — getting your foot in the door — seems harder. You mentioned your graduates this year are having a harder time compared to last year. My sense is that first job is a little harder to get.

Helen Gu: Yeah, but on the positive side — we hired five engineers just this past week.

Ben Lorica: Are they fresh graduates?

Helen Gu: Yeah, fresh graduates, and two of them are coming from my class. Those students used to be attracted by big corporations because of higher salaries and amenities we don’t have. But now students are looking at innovative startup companies where they can contribute to more innovative ideas, rather than just being a coder or product developer in a big corporation. I think that’s actually a good thing. I tell my students: you will find jobs, and salary is not the most important thing for your first job. A lot of very successful leaders started out as junior engineers. You want to be in an environment where you get a lot of chances to do different things. That’s what startups offer, and big corporations can’t. I think that’s a good thing for the younger generation, because they’re exposed to more challenges.

Ben Lorica: So, in closing — would you recommend someone major in computer science?

Helen Gu: Yeah. I think right now is a good time. I don’t believe we have reached a point where we need fewer programmers. We still need a lot of designers, a lot of architects, a lot of system implementers, and that’s becoming more and more important. Think about it: now you need to support not just coders, but millions of agents and millions of users who don’t know how to code. When those people produce systems, you still need to monitor them and manage them. That’s what computer systems are about. We invented operating systems for personal computers, and now we probably need a new operating system for the AI world. That’s sort of what we’re trying to build here.

Ben Lorica: And with that, thank you, Helen.

Helen Gu: Thank you.
