Why ‘Structure’ Is All You Need: A Deep Dive into Next-Gen AI Retrieval

Tom Smoker on GraphRAG, Knowledge Graphs, Property Graphs, and LLM Hallucination.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Tom Smoker is the co-founder of WhyHow.ai, a startup that transforms unstructured data into structured knowledge, including knowledge graphs, enabling enterprises to deploy accurate and explainable AI solutions with its open-source graph tooling. In this episode, we delve into how structured data underpins advanced AI applications, spotlighting GraphRAG as a powerful evolution of Retrieval-Augmented Generation. We discuss knowledge graphs, metadata filtering, and property vs. RDF graphs, revealing how these approaches reduce hallucination and improve explainability in LLMs. The conversation also covers the critical role of operational data, semantic search, and data governance in building next-generation AI systems. Ultimately, we underscore that “structure is all you need” for better performance, scalability, and reliability in AI-driven solutions.  [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]

Transcript.

Below is a heavily edited excerpt, in Question & Answer format.

What is your definition of GraphRAG? Do you have a specific definition or framework?

I don’t have a strict definition. The R in RAG is about retrieval, which can be better when it comes from structured data. You get more consistent, often more compressed retrieval with structured queries on structured data. While there’s no consensus definition for Knowledge Graph or GraphRAG, I tend to think of it as simply bringing structured data into the retrieval process, whatever form that takes.
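To make the "structured queries on structured data" idea concrete, here is a minimal sketch of GraphRAG-style retrieval: instead of similarity search over raw text chunks, we look up facts in a small knowledge graph (here just a list of subject–predicate–object triples) and hand the matches to the LLM as context. The triples are illustrative, not from any real dataset.

```python
# Toy knowledge graph as subject-predicate-object triples.
TRIPLES = [
    ("WhyHow.ai", "builds", "graph tooling"),
    ("GraphRAG", "is_a", "retrieval technique"),
    ("GraphRAG", "uses", "structured data"),
    ("vector RAG", "uses", "embeddings"),
]

def retrieve_facts(entity: str) -> list[str]:
    """Return every triple mentioning the entity, as a compact sentence."""
    return [
        f"{s} {p.replace('_', ' ')} {o}"
        for s, p, o in TRIPLES
        if entity in (s, o)
    ]

# These facts would be prepended to the user's question in the prompt --
# a more compressed, more consistent context than raw text chunks.
context = retrieve_facts("GraphRAG")
print(context)
```

Note how retrieval here is deterministic: the same entity always returns the same facts, which is part of the consistency argument above.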

Before GraphRAG became popular, you worked with knowledge graphs. What are your favorite examples of GraphRAG actually in production?

Companies like LinkedIn and Pinterest have spent years structuring their data consistently, which now benefits them with RAG applications. Their knowledge graphs are fantastic when they work.

A more unusual example I worked on was in veterinary radiology. The LLM kept recommending conditions for Labradors when the radiologist was working with a Bulldog, despite clear prompting. GraphRAG helped constrain the context to Bulldog-specific information, solving a narrow but important problem.

What are the key bottlenecks for teams wanting to implement GraphRAG, and how do they get started?

The biggest question is whether you actually need a graph in the first place. There’s a spectrum of how much structure you need – you don’t have to jump straight from unstructured data to a full knowledge graph. Most people should stop before reaching the end goal, as they’ll get the majority of the value before diminishing returns set in.

For teams that have already squeezed as much as they can from traditional RAG, start by getting your data in a more structured way. The simplest approach is prompting an LLM to extract structured data in formats like JSON. However, I should note that automated knowledge graph construction often creates bad graphs. If you just press “play” without intention, you won’t get much lift over vector RAG.
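The simplest structuring step described above can be sketched as follows. `call_llm` is a hypothetical stand-in for whatever client you use; its response is canned here so the parsing and validation path is runnable. The key point is to validate the model's output rather than trusting it.

```python
import json

EXTRACTION_PROMPT = (
    "Extract entities and relationships from the text below. "
    "Respond with JSON only, in this shape: "
    '{"entities": [...], "relations": [["subject", "predicate", "object"], ...]}\n\n'
)

def call_llm(prompt: str) -> str:
    # Hypothetical: replace with a real API call.
    # Canned response so the sketch runs end to end.
    return ('{"entities": ["Bulldog", "dermatitis"], '
            '"relations": [["Bulldog", "prone_to", "dermatitis"]]}')

def extract(text: str) -> dict:
    raw = call_llm(EXTRACTION_PROMPT + text)
    data = json.loads(raw)  # fails loudly if the model strays from JSON
    assert {"entities", "relations"} <= data.keys()
    return data

result = extract("Bulldogs are prone to skin-fold dermatitis.")
```

Pressing "play" on a loop like this over a whole corpus is exactly the unintentional graph construction the answer warns about; the extraction schema and validation deserve real design attention.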

When does traditional vector RAG work well for someone, and when should they consider moving to GraphRAG?

Vector RAG works well when your chunks are well-structured. A lot of that work happens upfront. For example, with legal contracts or academic papers that follow certain structures, if you spend time properly chunking, labeling, and using good metadata filtering, you can get significant value from vector RAG.
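The "well-structured chunks plus metadata filtering" pattern can be sketched without any vector database: filter chunks on document-level metadata first, then rank the survivors. Token overlap stands in for a real embedding similarity, and the chunks are invented.

```python
def overlap(query: str, text: str) -> int:
    """Crude stand-in for embedding similarity: shared word count."""
    return len(set(query.lower().split()) & set(text.lower().split()))

chunks = [
    {"text": "Termination requires 30 days notice", "doc_type": "contract", "section": "termination"},
    {"text": "Payment is due within 60 days",       "doc_type": "contract", "section": "payment"},
    {"text": "Prior work on retrieval systems",      "doc_type": "paper",    "section": "related work"},
]

def retrieve(query: str, chunks: list[dict], **filters) -> list[dict]:
    # Metadata filter first: only chunks matching every filter survive.
    pool = [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
    # Then rank by (stand-in) similarity.
    return sorted(pool, key=lambda c: overlap(query, c["text"]), reverse=True)

top = retrieve("notice period for termination", chunks, doc_type="contract")
```

Filtering before ranking is what makes the upfront labeling work pay off: irrelevant document types never compete in the similarity ranking at all.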

GraphRAG becomes valuable when you need a more compressed representation (fewer tokens means better rate limits and lower costs at scale) or when non-technical domain experts need explainability about why the system is producing certain results.

You mentioned the different graph communities – property graphs and RDF/Semantic Web. How are these communities evolving with the rise of GraphRAG?

Property graphs are more accessible and easier to get started with, while RDF offers more consistency, specification, and deterministic reasoning. In the age of LLMs, most people are just trying to get started, and it’s already challenging enough to implement a graph system.

I don’t see many new use cases coming to RDF, while property graphs are where most innovation is happening. People who already liked RDF still use it, but most newcomers go to property graphs. JSON-LD (an extension of JSON allowing ontological consistency) is showing interesting potential because LLMs have been trained on web data that includes this format.
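For readers unfamiliar with JSON-LD, here is a tiny example: ordinary JSON whose `@context` maps short keys to shared vocabulary terms (schema.org in this sketch), giving ontological consistency without leaving JSON. Parsed with the standard library only; the person data is made up.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "https://schema.org/name",
    "knows": "https://schema.org/knows"
  },
  "@type": "Person",
  "name": "Alice",
  "knows": {"@type": "Person", "name": "Bob"}
}
""")

# Resolve a short key to its globally shared identifier via the context.
full_iri = doc["@context"]["name"]
print(full_iri)
```

Because the document is still plain JSON, existing tooling works unchanged, while the `@context` is what carries the ontological commitment.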

The bigger question might be whether GraphRAG even needs graph structures at all, or if other structured data approaches might be sufficient.

What’s your prediction for GraphRAG by the end of 2025?

The RAG world will define the future more than the graph part. While progress in RAG has been slower than expected, structured data will definitely grow in importance. I believe more people will structure their data, but they may store it in different systems that aren’t necessarily graphs.

There will need to be more structure and explainability, especially as we move toward an agentic future where reducing token usage becomes critical for cost management. By next year, GraphRAG will be much better understood, and many more people will have structured their data.

You’ve mentioned operational data in live systems. How important is this compared to the current focus on documents?

The most interesting data is often in operational systems – payment systems like Stripe, logistics systems like FedEx, CRM systems – rather than in PDFs and text documents that developers tend to focus on.

When people talk about unstructured data being a gold mine, they assume it was captured well and just needs accessing. In reality, PDFs and other unstructured sources often contain messy data that requires significant effort to extract value from.

By contrast, networked operational data is already structured, lends itself naturally to hierarchy and graphs, and streams in consistently without extensive processing. The holy grail of a GraphRAG system would be one where physical meets digital across legislation and process – like a large-scale multinational supply chain with live sensor and logistics data.