The Security Debate: How Safe is Open-Source Software?

Mars Lan on Open Source Security, Supply Chain Vulnerabilities, and Graphs in Knowledge Management.

Mars Lan is Co-Founder & CTO at Metaphor¹, an AI-powered social platform that enhances data governance by empowering all employees, not just data teams, to collaborate, search, and share insights through an intuitive, AI-driven interface. In this episode, we dive into critical issues surrounding open-source security, software supply chain vulnerabilities, and the role of AI in securing modern applications. We explore how knowledge graphs, from technical graphs to semantic and social graphs, improve data management, and we discuss their impact on AI workflows. We also cover the challenges of building knowledge graphs and the importance of social trust in data usage, and we highlight current limitations in security approaches for both open-source and proprietary software.

Below is a heavily edited excerpt, in Question & Answer format.

What inspired you to investigate security vulnerabilities in open source projects like DataHub and OpenMetadata?

Security has always been an interest of mine, though not my primary expertise. At Metaphor, we work with clients in financial services, insurance, and healthcare – industries where security is non-negotiable. This led us to focus on third-party and supply chain security.

Like most software companies, we rely heavily on open source libraries, frameworks, and tools. This creates a complex chain of dependencies in which a vulnerability anywhere in the chain can affect your software. While examining our own security practices, we became curious about how open source projects handle security. There’s a common perception that open source is inherently more secure because “many eyes” can review the code, so we wanted to test this assumption.

I was particularly familiar with DataHub (a project I led at LinkedIn) and OpenMetadata, both of which operate in the same data catalog space as Metaphor. These projects naturally depend on numerous libraries to connect with various data systems, which makes them interesting security case studies.

What did you discover when you examined these projects for security vulnerabilities?

I cloned both repositories and enabled GitHub’s Dependabot, which scans for known vulnerabilities. Surprisingly, I discovered 20-30 high and medium severity vulnerabilities in each project, some of which had been open for years without being addressed.
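As a rough illustration of this kind of triage, the sketch below tallies open alerts by severity and finds the one that has been open longest. The alert records are invented sample data, shaped loosely like Dependabot alert records; the field names (`state`, `created_at`, `security_advisory.severity`) are assumptions, the package names and dates are hypothetical, and this is not the script used in the investigation.

```python
from collections import Counter
from datetime import date

# Hypothetical sample alerts. Field names are assumptions modeled on
# Dependabot alert records; all values are invented for illustration.
SAMPLE_ALERTS = [
    {"package": "lib-a", "state": "open", "created_at": "2021-03-14",
     "security_advisory": {"severity": "high"}},
    {"package": "lib-b", "state": "open", "created_at": "2023-08-02",
     "security_advisory": {"severity": "medium"}},
    {"package": "lib-c", "state": "fixed", "created_at": "2022-01-20",
     "security_advisory": {"severity": "high"}},
    {"package": "lib-d", "state": "open", "created_at": "2022-11-05",
     "security_advisory": {"severity": "high"}},
]


def open_alerts(alerts):
    """Keep only alerts whose state is still "open"."""
    return [a for a in alerts if a["state"] == "open"]


def count_by_severity(alerts):
    """Count open alerts per severity level."""
    return Counter(a["security_advisory"]["severity"] for a in open_alerts(alerts))


def oldest_open_alert(alerts):
    """Return the open alert with the earliest creation date, or None."""
    still_open = open_alerts(alerts)
    if not still_open:
        return None
    return min(still_open, key=lambda a: date.fromisoformat(a["created_at"]))


if __name__ == "__main__":
    print(dict(count_by_severity(SAMPLE_ALERTS)))
    print(oldest_open_alert(SAMPLE_ALERTS)["package"])
```

Sorting by age as well as severity is what surfaces the "open for years" cases, which a simple count would hide.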

This challenged the “many eyes” theory of open source security. While theoretically many people could review the code, in practice, few were paying attention to security vulnerabilities. The open nature created a false sense of security – everyone assumes someone else is handling it.

What’s particularly concerning is that Dependabot only shows these vulnerabilities to repository owners by default. But anyone with malicious intent can easily clone the repository, enable Dependabot, and get a complete list of exploitable vulnerabilities.

How did the project maintainers respond when you reported these issues?

Initially, neither project responded for a full week after I published my blog post, even though I had tagged the companies behind them. I then opened GitHub issues and posted in their Slack channels to ensure visibility.

The responses were quite different. OpenMetadata initially tried to downplay the issues, but after some back-and-forth, they acknowledged the problems and committed to fixing them. Within five weeks, they had resolved most of the vulnerabilities, with only two or three remaining.

DataHub, on the other hand, maintained complete silence: no response to my GitHub issue or my Slack message. Their vulnerability list remained largely unchanged, and the number of high-severity issues has actually increased from 16 to 18 since my initial post.

Does this mean open source software is less secure than proprietary software?

I wouldn’t make that blanket statement. The advantage of open source is transparency – you can apply tools like Dependabot to see what’s wrong and potentially pressure maintainers to fix issues. With proprietary software, you’re somewhat flying blind regarding dependencies.

The main takeaway is that security isn’t automatic – it doesn’t magically happen just because a project is open source. Security ultimately depends on teams prioritizing it. Some of the most secure software used daily by millions is proprietary, not because proprietary is inherently more secure, but because those companies invest heavily in security.

The “many eyes” theory only works if those eyes are actively looking for and fixing security issues. Without that commitment, an open source project can be just as vulnerable as any other software.

How does Metaphor use knowledge graphs to help users discover and understand data?

At Metaphor, we’re building a data catalog that helps users find, understand, and trust data. We’ve observed that analysts and data scientists at large companies don’t struggle with having enough data – they struggle with finding the right data, determining if it’s trustworthy, and understanding how it’s computed.

We’ve built three interconnected graphs to address these challenges:

Technical Graph: This is automatically built from database schemas and query logs. It shows how tables and columns relate to each other – essentially capturing data lineage (what table derives from what).
Semantic Graph: This translates the technical graph into business language. Instead of seeing cryptic table names like “customer_table_raw” and “customer_table_refined,” users see “Customer” as a business entity. This involves clustering similar technical elements and mapping them to business concepts.
Social Graph: This captures how people interact with data. It combines organizational information with actual data usage patterns to show which teams work with which data assets. This is crucial because people often trust data based on who else is using it.
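To make the first two graphs concrete, here is a minimal sketch that stores lineage as a dictionary, walks it to find everything a table derives from, and then translates the result into business entities. The two customer table names come from the example above; the other tables, the entity mapping, and both function names are hypothetical, and this is not Metaphor's implementation.

```python
from collections import deque

# Hypothetical technical graph: each table maps to the tables it derives from.
DERIVED_FROM = {
    "customer_table_raw": [],
    "orders_raw": [],
    "customer_table_refined": ["customer_table_raw"],
    "customer_orders_summary": ["customer_table_refined", "orders_raw"],
}

# Hypothetical semantic graph: technical tables grouped under business entities.
BUSINESS_ENTITY = {
    "customer_table_raw": "Customer",
    "customer_table_refined": "Customer",
    "customer_orders_summary": "Customer",
    "orders_raw": "Order",
}


def upstream_tables(table, derived_from):
    """Return every table that `table` depends on, directly or transitively."""
    seen = set()
    queue = deque(derived_from.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another lineage path
        seen.add(current)
        queue.extend(derived_from.get(current, []))
    return seen


def upstream_entities(table, derived_from, entity_of):
    """Translate a table's upstream lineage into business entity names."""
    return {entity_of[t] for t in upstream_tables(table, derived_from) if t in entity_of}


if __name__ == "__main__":
    print(sorted(upstream_tables("customer_orders_summary", DERIVED_FROM)))
    print(sorted(upstream_entities("customer_orders_summary", DERIVED_FROM, BUSINESS_ENTITY)))
```

The second function is the point of the semantic layer: a user asking where a summary table comes from sees "Customer" and "Order" rather than three cryptic table names.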

When users search for data or ask questions in natural language (like “sales data for East Coast”), we use these graphs to provide contextually relevant results that include both technical information and social signals like who else uses this data. The combination of all three graphs, enhanced with AI, creates a powerful system for data discovery and understanding.
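One way to picture how a social signal can influence search results is the toy ranking below. It is only a sketch under stated assumptions: keyword overlap stands in for whatever AI-based matching the real system uses, the catalog entries and team names are invented, and the scoring rule (text match first, number of teams using the table as a tie-breaker) is a hypothetical choice, not Metaphor's algorithm.

```python
# Hypothetical catalog entries: table name, business description,
# and the set of teams observed using the table.
CATALOG = [
    {"table": "sales_east_tmp",
     "description": "scratch copy of east coast sales data",
     "used_by": set()},
    {"table": "sales_east_daily",
     "description": "daily sales data east coast stores",
     "used_by": {"finance", "regional-ops"}},
    {"table": "hr_headcount",
     "description": "employee headcount by department",
     "used_by": {"people-team"}},
]


def search(query, catalog):
    """Rank tables by keyword overlap, breaking ties by how many teams use them."""
    terms = set(query.lower().split())
    scored = []
    for entry in catalog:
        overlap = len(terms & set(entry["description"].lower().split()))
        if overlap == 0:
            continue  # no textual match at all, so leave it out
        social = len(entry["used_by"])  # crude social signal
        scored.append((overlap, social, entry["table"]))
    # Higher overlap first, then higher social score, then name for stability.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [table for _, _, table in scored]


if __name__ == "__main__":
    print(search("sales data for East Coast", CATALOG))
```

In this sample both sales tables match the query equally well on text, so the table that two teams actually use is ranked above the unused scratch copy.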

How large can these knowledge graphs get and how do you technically implement them?

They can get quite large. Some of our customers have millions of tables, and if you multiply that by the average number of columns, and then consider every possible connection between columns, the technical graph alone becomes enormous.
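A back-of-the-envelope calculation shows why. The figures below (one million tables, 20 columns per table) are hypothetical round numbers, not customer data, and the edge count is a worst-case upper bound in which every pair of columns is connected; real lineage graphs are far sparser.

```python
def technical_graph_size(num_tables, avg_columns_per_table):
    """Return (column nodes, upper bound on column-to-column edges)."""
    columns = num_tables * avg_columns_per_table
    # Worst case: every unordered pair of columns is connected.
    max_edges = columns * (columns - 1) // 2
    return columns, max_edges


if __name__ == "__main__":
    columns, max_edges = technical_graph_size(1_000_000, 20)
    print(f"{columns:,} column nodes, up to {max_edges:,} possible edges")
```

With these assumed inputs the graph has 20 million column nodes and on the order of 2 × 10^14 possible column pairs.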

The semantic graph is more constrained since it maps to a company’s business concepts. The social graph can also grow large as it captures all user interactions with data – queries, dashboard views, definition updates, etc.

We’re in the early stages of using graph database technology. Currently, we’re using Amazon Neptune with the openCypher query language, though as we scale, we’ll likely move to hosting our own graph database.

[Ben Lorica is an advisor to Metaphor and other startups.]
