Stefania Druga on Socratic Copilots, Math Misconceptions, and Multimodal AI on Phones.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
In this episode, host Ben Lorica talks with Sakana AI research scientist Stefania Druga, formerly a research scientist at Google DeepMind, about building AI tools for young learners and what that teaches us about AI design for everyone. Stefania shares how she developed Socratic copilots for kids, a multimodal math tutor that detects misconceptions from paper worksheets, and why agency, tinkerability, and on-device open-weight models matter for real-world deployments. They also explore human-in-the-loop education (including parents and teachers), the emerging UX of multi-agent systems, and how these ideas translate into practical patterns teams can use in their own AI applications. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- Intro & What Kids Teach Us About AI Design
- Gen Z as Power Users & the Missing “Agency Knob” in AI Tools
- Designing a Socratic Coding Copilot with Cognimates
- Tinkerable Prompts, Multimodal On-Device AI & the MathMind Tutor
- Open-Weight Multimodal Models on Phones (Gemma, Llama, Mistral)
- Math as a Neuro-Symbolic Testbed & Formal Specifications for AI
- Communities, Grants & Funders in AI Education (AI for K–12 and Beyond)
- Khan Academy’s Khanmigo: Socratic Tutoring & Getting the Math Right
- From Advanced Math Models to Distilled, Tool-Using Systems on Edge Devices
- Parents in the Loop: Helping Families Navigate Kids’ AI Use
- Multi-Agent Futures, Simulations & Social Learning with Kids and Agents
- Enterprise Multi-Agent Hype, Reliability Challenges & Closing Thoughts
Related content:
- A video version of this conversation is available on our YouTube channel.
- Emmanuel Ameisen → How Language Models Actually Think
- Nick Schrock → Beyond the Dashboard: Collaborative Analytics in Slack
- Jakub Zavrel → How to Build and Optimize AI Research Agents
- Raiza Martin → From NotebookLM to Audio Companions: Why Google’s AI Team Went Startup
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: All right, so today we have Stefania Druga. She is an independent researcher, most recently a research scientist at Google DeepMind, where she worked on multimodal AI applications. So Stefania, welcome to the podcast.
Stefania Druga: Thank you. Thanks for having me.
Ben Lorica: So you’ve actually had a very interesting journey. You’ve built AI education tools for young people, I guess all the way down to seven-year-olds. And then afterwards you worked on cutting-edge multimodal AI at DeepMind. So, what have kids taught you about AI design that most of us AI engineers miss?
Stefania Druga: I think kids teach us a lot in general, but in particular about AI design. It’s been quite a journey. I started working on AI education in 2015, when I was part of the Scratch team at the MIT Media Lab, and started building the first AI education platform for kids, called Cognimates. Kids could train their custom models with images and text and then use those custom models in their own projects, in a visual programming language.
And yeah, they would do things I never thought of, or adults never think of. Like build a model to identify weird hairlines, trained on pictures of people with very unusual hairlines. Or for the text models, they wanted a model that could recognize and give backhanded compliments.
All sorts of things that are weird and quirky and fun and not necessarily utilitarian in nature. I think when we talk about large language models and AI today, it is very much from this utilitarian perspective of how it can automate part of our work or how it can make our work faster. But young people don’t have that approach, right? For them, driving a car is fun. Having a self-driving car is not fun. So it’s kind of figuring out from that very explorative and curious mindset how we make sense of AI and how we use AI. And I think they have a lot of insights that could inspire adults.
Ben Lorica: Yeah. And you and others have also noted that actually a lot of the users of AI are Gen Z, yet most tools aren’t really designed with them in mind. So what do you think is the biggest disconnect between how we currently build AI and how many of these young people want to use it?
Stefania Druga: The biggest disconnect is that we don’t have a knob for agency to control how much we delegate to the tools and how much we take control or stay in the driving seat. So most Gen Z use off-the-shelf generative AI products like ChatGPT, Gemini, Claude. And these tools have this baked-in assumption in their design that they need to do the work for you instead of maybe asking you questions so you figure out how to do the work, or brainstorming, or collaborating.
So I like a much more Socratic approach. I very much believe in Socratic learning and learning by doing, and I think a big part of learning is asking and being asked good questions. So a huge role I see for generative AI in the future, for learning and for the new generation, is as a tool that can teach you things, ask you questions, show you your blind spots, and brainstorm with you, but not a tool that you just delegate the work to.
And you know, we’re seeing a disconnect now because obviously young people are using AI all the time, and then in schools there is all this loss of trust, with teachers complaining about cheating while the teachers themselves use generative AI, right? So there’s kind of this big elephant in the room where we don’t have clear guidelines and a good conversation around what the best practices are for how and when and if we should use AI. And the tools are not always designed for that.
Ben Lorica: So you mentioned the Socratic approach. So how do you technically implement a Socratic approach in this world of text interfaces?
Stefania Druga: Yeah, I will give you several examples and they’re not all on text interfaces because I’m building multimodal AI apps. So the first example in Cognimates, which is this block programming language on top of Scratch, I created a copilot. It’s the first copilot for kids coding, to my knowledge. And this copilot is not doing the code for the kids. This copilot is actually asking them questions. So to give you an example, if a child says, “Okay, how do I make the dude move?”—this is a verbatim quote from one of the questions they would ask the copilot referring to a character in their game on the screen—the copilot would be like, “Oh, which way do you want it to move?” Or “How do you want to break down the motion event?” Right?
So it’s asking them questions instead of saying “Put this block there” or “Use this block and then that block,” just giving them the solution. If they’re really stuck, like if they’re asking the same question more than three times, then it’s going to start leading them, giving them hints toward the solution. But the main goal is to ask them questions. Or if the kids paste in an image of their code blocks, it explains to them how the code works. It can also generate images for their games; we put a lot of work into the filters and the styles so these images fit the Scratch programming model and kids’ games. And they can talk to it too. They don’t need to type, right?
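For builders, here is a minimal sketch of the escalation pattern described above: the copilot answers with questions by default and only starts hinting once the child has asked essentially the same question three times. All names and prompts below are illustrative assumptions, not Cognimates internals, and `call_llm` is a placeholder for whatever chat model you plug in.

```python
# A minimal sketch of the "questions first, hints when stuck" pattern.
# None of these names come from Cognimates; call_llm() is a placeholder.

SOCRATIC_PROMPT = (
    "You are a coding copilot for kids using a block programming language. "
    "Never give the solution directly. Reply with one short, friendly "
    "question that helps the child break the problem into smaller steps."
)

HINT_PROMPT = (
    "The child has been stuck on this question repeatedly. Give one "
    "concrete hint that leads toward the solution, phrased encouragingly, "
    "without writing the blocks for them."
)

def call_llm(system_prompt: str, user_message: str) -> str:
    raise NotImplementedError("plug in your chat model of choice here")

class SocraticCopilot:
    STUCK_THRESHOLD = 3  # same question this many times => start hinting

    def __init__(self) -> None:
        self.question_counts: dict[str, int] = {}

    def respond(self, child_question: str) -> str:
        key = child_question.strip().lower()  # naive "same question" check
        self.question_counts[key] = self.question_counts.get(key, 0) + 1
        if self.question_counts[key] >= self.STUCK_THRESHOLD:
            return call_llm(HINT_PROMPT, child_question)   # gently lead
        return call_llm(SOCRATIC_PROMPT, child_question)   # ask back
```

In practice you would detect “the same question” with embeddings or paraphrase matching rather than exact string comparison, but the control flow stays the same.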
So when I was designing this, it actually started with research. I’ve done several studies in the past three years to figure out how to build this copilot for kids, with young people from 11 countries. So not only English speaking, not only North America. I’m from Romania, I speak seven languages, so I work with young people from all over the world. And before building the tool, the approach was to do a Wizard of Oz study, faking the copilot, having a person behind the scenes who pretended to be the AI, just so we could figure out what kids want, what questions they ask, and what the most important use cases are.
Then we went and built it, and we realized, oh, kids really want a peer, a system that can help them clarify their thinking. Because a lot of programming and game building is about computational thinking: how do you break down a complex event into steps that are good computational units? So that was one thing, helping them understand how to break down their complex ideas for games into smaller steps they can actually implement, by asking questions. Another one was helping them with debugging when they were stuck.
And the third one—and this actually surprised me, but this was really prevalent in the data—was affirmations. So whenever they do something that is fun or cool, the copilot says like, “That’s awesome. I like that effect.” And “What if you add this sound?” And we’ve seen—and the parents told us as well—the kids attending our longitudinal study would spend double the amount of time coding because they had this infinitely patient copilot that would ask them questions, help them debug, and then give them affirmations that reinforce their creative identity when they did something.
So with those clear design directions, I went and built the tool, and then of course after building the tool I evaluated it again and tested it with kids. I’m presenting a paper at the ACM IDC conference, Interaction Design and Children, in two weeks. It’s online, so I can definitely include the link to the paper; it’s on my website as well. But yeah, this is a process I hope gets replicated, not just with middle schoolers, whom I worked with primarily, but in education generally: co-design with your users. Because these interactions and interfaces and user flows are evolving very fast, I think it’s very important to understand what young people today want, how they work and how they think, and to design with them, not only for them.
Ben Lorica: Yeah, it’s interesting the way you describe it, because the typical developer or knowledge worker now, when they interact with these things, in some ways they over-specify the prompt, right? They’ve probably interacted with the LLM before and gotten it into their head that you’ve got to specify as much as possible in order to get the right result. But what you’re describing is kind of interesting because you’re building incrementally. Somehow we’ve gotten away from that as grown-ups in the way we prompt.
Stefania Druga: It’s all about tinkerability and having the right level of abstraction. Like what are the Lego blocks that we operate with? And if they’re too abstract, I think sometimes a prompt is just not tinkerable enough. Like you’re giving this blob of text and then you get the final output. And then maybe you try to refine or edit, but it doesn’t allow for enough expressivity and tinkerability to be like, “Oh, you know, I’m just gonna build… let’s say I’m generating an image… like the background and then like the outfit and then like the…” So it needs to be composable and also allow the user to be in control in the driving seat in the creative process or in the learning process.
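One way to picture “tinkerable Lego blocks” for prompts is to expose the pieces of a generation request as separate, swappable fields instead of one blob of text. The sketch below is hypothetical, not how Cognimates or any particular product implements it:

```python
# A toy illustration of a "tinkerable" prompt: the child can swap one
# piece (background, outfit, style) and regenerate, keeping the rest.

from dataclasses import dataclass

@dataclass
class SpritePrompt:
    character: str = "a friendly robot"
    outfit: str = "a red scarf"
    background: str = "a pixel-art forest"
    style: str = "flat cartoon, bright colors"

    def render(self) -> str:
        return (f"{self.character} wearing {self.outfit}, "
                f"in front of {self.background}, in a {self.style} style")

prompt = SpritePrompt()
prompt.background = "a space station"  # tweak one block...
print(prompt.render())                 # ...and regenerate the image
```

The point is composability: each field is a block the user stays in control of, rather than one opaque prompt that must be rewritten wholesale.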
And I mentioned that I don’t only work with text. And I think right now the part that is very exciting for me is multimodal and things that can work on a phone, right? Because a lot of us and young people spend a lot of time on their phones and it’s also more accessible worldwide. And we do have open source models that are multimodal and can run on device. So you don’t even need to send your data to the cloud. Like today the cloud was down, right? So you wouldn’t have a problem. The model runs on your phone. It can parse images, it can parse videos, it can parse sensor data, text. And then it can also produce all of these modalities, not only text.
And I worked recently on two projects that are multimodal and mobile-first. The first one is around math, because math is the domain where we’ve seen the biggest learning losses during the pandemic, and it’s also one of the reasons why so many people drop out of STEM after middle school. In math, specifically algebra, together with my collaborator Nancy Otero, we first created a benchmark of misconceptions: what are all the possible mistakes middle schoolers can make when they learn algebra? From not understanding the order of operations, to not understanding negative numbers, or the equal sign, like the fact that you can move elements from one side of the equal sign to the other. And this is based on prior work and a lot of research that was done on algebra learning.
So we created a benchmark of 55 misconceptions. Then we tested to see if multimodal LLMs can pick up a misconception based on pictures of kids’ math exercises, handwritten on paper. We ran the results by teachers, the domain experts, asking, “Do you agree with this labeling? Do you agree that these examples show this type of misconception?” And the teachers confirmed all of the labeling done by the models for these misconceptions.
Then I built an app called MathMind that works in real time: you can use a webcam or your phone, and as you are solving math on paper it’s going to ask you questions, like, “Oh, what do you think happens if you divide this by five?” or “What happens if you multiply it?” And if it detects a misconception, if it detects that you don’t understand the core concept, it’s going to propose additional exercises so you get to practice that concept.
And I think that’s extremely powerful, because there are lots of cognitive science papers showing that we retain and learn much better working on paper. So it makes a lot of sense to solve math on paper, but then to have this tool that can ask you questions and detect things in real time. It’s not about getting the right answer; it’s about understanding the concepts, because all of these concepts build on each other. There’s this idea of the zone of proximal development from pedagogy: if you miss a core concept, it’s very hard to keep on building. And for teachers it is useful as well, at the beginning of a class, to say, “Okay, as my class is solving this, I just want to see how many people did not understand this concept before I move on,” right? So that’s an example of a tool that is not only text, and I think it can be very powerful. And I guess the insight there is creating these domain-specific benchmarks and working with domain experts, in my case math teachers, to understand what’s the most valuable way to evaluate the tool for this domain.
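As a rough sketch of what misconception tagging might look like in code: a vision-capable model is shown a worksheet photo plus a catalog of misconception descriptions and asked to pick one. The misconception list below is a tiny illustrative subset (the actual benchmark has 55 entries), and `call_multimodal_llm` is a placeholder, not MathMind’s actual implementation.

```python
# Sketch of misconception tagging from a worksheet photo.
# Illustrative subset only; the real benchmark has 55 misconceptions.

MISCONCEPTIONS = {
    "M01": "Applies operations left to right, ignoring order of operations",
    "M02": "Drops or mishandles negative signs",
    "M03": "Treats '=' as 'compute now' rather than a balance of two sides",
}

def call_multimodal_llm(prompt: str, image_path: str) -> str:
    raise NotImplementedError("plug in any vision-capable model here")

def tag_misconception(image_path: str) -> str:
    catalog = "\n".join(f"{mid}: {desc}" for mid, desc in MISCONCEPTIONS.items())
    prompt = (
        "Here is a photo of a student's handwritten algebra work. "
        "If the work shows one of these misconceptions, answer with its ID; "
        "otherwise answer NONE.\n" + catalog
    )
    return call_multimodal_llm(prompt, image_path).strip()

# Downstream, a label like "M02" would trigger practice exercises on
# negative numbers, and labels can be spot-checked by teachers, as in
# the expert-validation step described above.
```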
Ben Lorica: So who is building these open-weight models that you’re using as your starting point?
Stefania Druga: Yeah, so I’ve used Gemma 3n a lot. Gemma is the open-weight series of models from Google. And this is not because I worked there as a research scientist. Gemma is actually quite good.
Ben Lorica: Yeah, Gemma is actually quite good.
Stefania Druga: Yeah, they’re really good models. And the latest model, the 3n, is not only multimodal but multilingual, so it works very well in multiple languages, and it’s small enough that it’s easy to run on a phone or on a laptop. So that’s pretty cool. Llama of course has good models that are small. Mistral is another good one. I think those would be the top ones I’ve been using.
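For readers who want to try this, here is a minimal local-inference sketch using the Ollama Python client on a laptop (on a phone you would use an on-device runtime instead; the point is simply that no cloud call is involved). It assumes you have Ollama installed and have pulled a Gemma 3n build, e.g. `ollama pull gemma3n`; the exact model tag may differ from what’s shown.

```python
# Minimal local inference with the Ollama Python client.
# Assumes `ollama pull gemma3n` has been run; no data leaves the machine.

import ollama

response = ollama.chat(
    model="gemma3n",  # model tag is an assumption; check `ollama list`
    messages=[{
        "role": "user",
        "content": "Ask me one Socratic question about adding fractions.",
    }],
)
print(response["message"]["content"])
```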
Ben Lorica: And the latency? So we’re talking milliseconds, not seconds? And also what about battery consumption?
Stefania Druga: I haven’t done extensive tests for battery consumption, so I don’t have the final results for that. But I haven’t seen anything egregious. It’s just like any other app basically so far.
Ben Lorica: Yeah, interesting. So math actually is kind of the perfect testbed in many ways, right? Because basically there’s a right and a wrong answer.
Stefania Druga: Exactly. And it’s the future of applied multimodal AI, and of AI in general: it’s going to be neuro-symbolic. What that means is that there is a part that the LLM does. LLMs are very good at fuzzy logic: we ask “How can I move the dude?” and it figures out what I mean. But then there is a formal-system part, which is having concrete specifications and ways to verify them. And math is perfect for that, right? Because we know what the ground truth is. The challenge is how we create formal specifications in other domains.
And I can give you some other examples of how I went about doing that. But the most promising results right now at the cutting edge of AI are coming from this intersection of formal methods, like program synthesis, and large language models. One example is AlphaGeometry from DeepMind, which performed at medal level on Olympiad geometry problems because it used a grammar describing all geometric rules to constrain the search space of possible solutions. And that grammar was developed by a researcher, and it took him three years. So it takes time. It takes time to fully formalize a domain. But it’s what we need.
Ben Lorica: So for our listeners and myself who actually are not paying that close attention to the use of these tools in education, can you give us a sense for the size of the community working on these things? The same sort of solutions that you’re working on? Is it mostly academics? Are there startups? Or are there research grants? So what is the level of support for this work?
Stefania Druga: That’s a great question. I’m trying to figure out the most diplomatic way to answer. Okay, so the first community started when I began working on AI education in 2014, 2015. When I said “AI education,” everyone was like, “Oh, you mean use AI to teach kids?” And I was like, “No, no, no. Teach kids how AI works,” right? So it was very early stages, and I think in 2016 we had the first community call of people interested in this space, called AI for K12. There’s a website and a mailing list and a very active community of researchers, educators, and EdTech startups. AI4K12.org is a great place to see what other people are working on, and it’s a good community. It was supported by NSF. Pretty diverse, with people from all over the world joining. But it’s still pretty niche. It’s not a huge community.
And then there’s the learning tools community from Schmidt Futures: Eric Schmidt’s foundation funded a lot of initiatives and projects in this space, in particular focusing on math learning last year. So that’s another community. Renaissance Philanthropy, with Kumar Garg and Tom Kalil, also funds a lot of initiatives in this space and kind of pushes…
Ben Lorica: What about Khan Academy?
Stefania Druga: Yeah, so Khan Academy is part of the learning tools community, and Khan Academy and Khanmigo are great examples. There are some issues, though. I actually worked with the Khanmigo team during my postdoc, together with Angela Duckworth, who wrote Grit, and other people. They brought in a developmental psychology expert like Angela, a behavioral economist like Sendhil Mullainathan (who was my postdoc advisor), and AI researchers like myself. And it was interesting because they wanted Khanmigo to be all about intrinsic motivation and really understanding how you give positive encouragement to the kids.
But what I discovered at the time—and this was like a year and a half ago, right? So things have most certainly evolved—is that the math was wrong. It’s all fun to work on positive encouragement and growth mindset, but the math needs to be correct, right? The basics need to be covered. They were relying primarily on OpenAI and GPT models, and we know that large language models didn’t use to be very good at math. They’re better, but not perfect, right? Because they’re trained primarily on text.
And yeah, I think that was a challenge. I did meet and talk to the person who runs their AI lab and asked, “Hey, do you have evaluations? How do you test whether the math is always correct?” Because if the math is wrong, you know, 20% of the time, that’s a really big deal, especially when you’re learning the foundations. Their tool is Socratic, and they definitely share this approach of teaching by asking questions and giving a lot of agency to the students. But for me it was an early example of why it matters so much to have evaluations and benchmarks that are education specific, right?
Ben Lorica: So my PhD is actually in math, so I kind of pay attention to what the research mathematicians are doing in this space. And I’m increasingly reading that they’re getting more and more impressed about the capability of some of these leading-edge foundation models. But my question to you is, let’s say for example, a month from now, one of these foundation models gets really good at that advanced mathematics. How quickly before you can distill a small model so that you benefit on the phone?
Stefania Druga: So there was a project, Minerva, that was an LLM specifically for math, and people put a lot of effort and time into it. And then there’s the AlphaGeometry architecture I described, which is more on the geometry side. But I think a really good model that is always correct at math is not going to be a pure transformer under the hood, not your classic LLM. It’s going to be a combination of a transformer that does next-token prediction, which works very well on text, together with maybe tool use. So you’re always transforming the math problem into Python code, you execute it, and then you see if it compiles or not, right?
Ben Lorica: Or in the case of these research mathematicians, they might be using one of these automatic theorem provers, right?
Stefania Druga: Lean, yeah, exactly. So we need to have a piece of the system that is verifiable. And I do think we are there; we do have those working solutions. How quickly do we make it work on a phone, and how fast? It’s totally doable right now. We don’t need to wait. And of course you can distill a model. There are open source systems like Unsloth, from Daniel Han, a super young researcher from Australia who has kind of single-handedly created this community where, as soon as there’s a model, they distill it and make it available, and it goes very fast. But even if you don’t want to use local models and use APIs instead, the APIs are becoming more and more affordable, and there is a wide range of options for people to pick and choose from. So we can build those tools right now and make them run fast and reliably on local, edge devices.
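To make the “verifiable piece” concrete, here is a small sketch of the kind of symbolic check a neuro-symbolic tutor can run, using SymPy to test whether a proposed algebra step preserves the solution set rather than trusting the model’s arithmetic. This is an illustrative pattern, not code from MathMind or AlphaGeometry.

```python
# The "verifiable piece" in miniature: the LLM proposes an algebra step
# as text, and SymPy checks it symbolically instead of trusting the model.

import sympy as sp

def step_is_valid(before: str, after: str) -> bool:
    """True if two equation strings have the same solution set for x."""
    x = sp.symbols("x")
    eq_before = sp.Eq(*map(sp.sympify, before.split("=")))
    eq_after = sp.Eq(*map(sp.sympify, after.split("=")))
    return sp.solveset(eq_before, x) == sp.solveset(eq_after, x)

print(step_is_valid("2*x + 4 = 10", "2*x = 6"))  # True: a valid step
print(step_is_valid("2*x + 4 = 10", "x = 7"))    # False: flag for review
```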
Ben Lorica: So human in the loop in the case of education means parents in the loop, right? So what extra steps do you have to make in order to be comfortable that whatever you build is ready to be deployed and be scrutinized by tiger moms and tiger dads?
Stefania Druga: You know, I speak at a lot of conferences, and the most common question I get is: I have a kid; what should I do with my child? I was getting this question so often that I sat down and wrote a very long handbook for parents, as a support for: if you want to talk to your kids about AI, or if you want to know the landscape of AI education and AI tools and everything you should know for your family, here’s a breakdown. And again, we can hopefully add the link to that in the podcast.
But it’s something I researched for a long time, because during the pandemic I worked with the same community of families, 20 families from 10 different states in the US. I worked with them for two and a half years, right? So I saw their kids growing up, and I saw how the parents were mediating the use of AI in the house and how they would learn different AI concepts together with their kids: how to take apart voice assistants, how to design their own voice assistant, or how to learn through games how machine learning and recommendation systems work, and about bias and how to address it. So I think there’s a lot of work to be done for families.
And parents are very overwhelmed. And I understand, right? Like I’m overwhelmed as a researcher in AI because things are moving so fast. I can only imagine as a busy parent having to keep track of everything from smart toys that collect data from your kid to all the apps to the things that are being used at school, college applications… And there is this constant fear of not wanting your child to be left behind, but then also not wanting them to be on devices all the time, which is extremely valid, right?
So the answer there, I mean, I don’t have the final answer, but I think it’s important to make a plan. And it could be very simple just to have conversations. It could be a dinner conversation with your child of: how are you using AI? Or what do you think about AI? Or how do you think it’s gonna change what you’re gonna study for college or what you do after college? And coming from a place of curiosity, right? Not necessarily from a place of telling them what to do, but opening the conversation. I think that that’s very important. Connecting with other parents and trying to share best practices and learn best practices. But it’s a lot. It’s a lot to navigate.
Ben Lorica: So let’s close with UX. So we talked about implementing this Socratic method, but one of the things that people are talking about—now granted it’s more science fiction at this point than reality—is multi-agents, right? So I guess at some point some kid will be using some tool where they’re orchestrating a bunch of agents somehow. So what kinds of innovations in UX are you seeing in order to prepare us for this world? Or are you seeing anything? Or do you collaborate with UX people?
Stefania Druga: Yes, yes. So let me break it down; there are multiple questions in there. The multi-agent part is very interesting because when I was doing this study on the Scratch copilot, at the end we had a design session with kids: How do you imagine this being in the future? What features do you want? How would you like to use it with friends and family? And this theme of multiple agents emerged, right? So many of them wanted that, and they wanted to be able to run simulations.
So you have one copilot where you kind of, let’s say you want to do Scratch… Pacman. And you tell it, “Make the maze in this way.” And you have another copilot, you give it slightly different instructions. And you’re programming part of it, but then you can see how all these like five different agents in parallel are programming Pacman slightly differently and learn from what they do.
So this is something they already expressed the desire to have. And we talked about the Scratch community because the reason Scratch is so big is not because it’s block programming, but because it’s social learning. Because it’s a community where kids share their projects and remix each other’s projects. So I asked them in the future, “What happens if some of these games are done by agents and how would you like to know that? And if you make a game with an agent, how would you like to share that with other people?” right? And we talked a lot about that. And it’s definitely something that they want. And it’s definitely something they want to be very transparent about, to say like, “This is 50% done with AI. 70% done with AI.” Or “This is 100% done by an agent.” And kind of having a hybrid online community where it’s agents and kids and kids and agents and everything in between.
And it’s not science fiction. The technology for this already exists. I am collaborating with the folks from Alphamorphs, a startup based in San Francisco. They created a technology called InfiniBranch, which basically allows you to create a lot of virtual machines, virtual environments, where you can test agents and see agents in action. The idea for them was that with this new crop of operators (OpenAI launched Operator, which can take actions on the web and book a flight for you, search for things, or add things to your calendar; Google launched Project Mariner), we’re clearly going to have agents that take actions in the future, right? And how do we evaluate them? How do we observe them?
So Alphamorphs created this technology where you can deploy hundreds of these agents and observe them. And I told them about what kids wanted in Cognimates and they were like, “Totally. Let’s make it happen.” Right? So it’s not science fiction. It’s something that we can test and implement now. It’s definitely gonna be kind of this era of simulations and tools for thought. I think it’s one of the most exciting areas because it means we parallelize exploration and learning. Like we don’t need to do: now we do this experiment and then we do that experiment. You could run ten experiments all at once or a hundred. Pretty excited about that part.
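The “run ten experiments at once” idea is easy to prototype even without a fleet of virtual machines: launch several agent variants with slightly different instructions in parallel and compare their outputs. The sketch below is hypothetical; `run_agent` is a placeholder, not InfiniBranch’s actual API.

```python
# Parallel exploration in miniature: N copilots, slightly different
# instructions, run side by side. run_agent() is a placeholder runtime.

from concurrent.futures import ThreadPoolExecutor

VARIANTS = [
    "Build the Pacman maze with wide corridors.",
    "Build the Pacman maze as a spiral.",
    "Build the Pacman maze with teleport tunnels.",
]

def run_agent(instructions: str) -> str:
    raise NotImplementedError("plug in your agent runtime here")

with ThreadPoolExecutor(max_workers=len(VARIANTS)) as pool:
    results = list(pool.map(run_agent, VARIANTS))

for variant, result in zip(VARIANTS, results):
    print(variant, "->", result)  # compare how each agent approached it
```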
Ben Lorica: Yeah, I guess to qualify what I meant by science fiction, I meant in the enterprise because a lot of enterprise people get ahead of themselves, right? So let’s get one agent working well first before we aim to have a bunch of agents talking with each other and coordinating, right? And I think a lot of the vendors frankly are getting ahead of themselves in terms of marketing.
Stefania Druga: Absolutely. You know, it’s one thing to do a demo and show a proof of concept for a technology; it’s another thing to get it to work 100% reliably.
Ben Lorica: Yeah. And with that, thank you, Stefania.
Stefania Druga: Yeah, thank you so much for having me.
