The Data Exchange Podcast: Sijie Guo on how Apache Pulsar is able to handle both queuing and streaming, and both online and offline applications.
In this episode of the Data Exchange I speak with Sijie Guo, founder of StreamNative, a new startup focused on making enterprise messaging technologies – specifically Apache Pulsar – easy to use on the cloud. Sijie was previously a cofounder of Streamlio (acquired by Splunk) and prior to that he led the messaging team at Twitter. He is also the main organizer behind the Pulsar Summit (April in San Francisco), a new conference whose Call for Speakers closes on January 31st.
I’ve written about the importance of foundational data technologies, and data ingestion and messaging are the starting point for modern data applications. As data and machine learning continue to grow in importance, it’s critical for companies to make sure they have the right messaging systems in place.
Our conversation spanned many topics, including:
- The role of messaging in modern data applications and platforms.
- The two main types of messaging applications: queuing and streaming.
- Apache Pulsar as a unified messaging platform, able to handle both queuing and streaming, and both online and offline applications.
- A status update on Apache Pulsar.
(Full transcript of our conversation is below.)
Ben: For someone who’s not technical and who just wants to have an idea of what these systems can do, at a high level, what are the two types of messaging—what is queuing and what is streaming? And what are some good examples for each?
Sijie: In terms of communication, messaging typically divides into two patterns. One is queuing. A simple way to think about this is the queue at a bank: you wait in line for a banker to help you. That is like a worker queue, with task-oriented workloads that are processed on a per-event basis. Since each event is processed on its own, these systems don't really care about ordering. Message queuing is common in industries like e-commerce and retail, where it is used to process payments, transactions, and billing statements. It is one of the most common communication channels, taking the messages, all the events, from the end user to your system.
Ben: So, what about streaming?
Sijie: In terms of streaming, you get a sequence of events that you want to collect on a per-entity or per-stream basis. For example, for an IoT device that collects one kind of data point, like temperature, you'd have a sensor recording all the temperature changes. You'd want to collect those data points, or events, in a particular order so you can visualize or analyze the changes in sequence.
Streaming systems are focused on things like user behavior, fraud detection, and maybe event processing and log collection. They fit situations where you want to collect the events from a certain device or entity in a particular order and analyze behavior from that sequence of events.
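To make the two patterns concrete, here is a minimal Python sketch using only the standard library. It is not Pulsar code, and nothing in it comes from the conversation itself: the function names, the round-robin hand-off that stands in for "the next free banker", and the sample payments and sensor readings are all assumptions made up for illustration.

```python
from collections import defaultdict, deque


def run_work_queue(tasks, workers):
    """Queuing: each task goes to exactly one worker, and any worker will do."""
    pending = deque(tasks)
    handled = defaultdict(list)
    turn = 0
    while pending:
        # Round-robin is an assumed stand-in for "whichever worker is free next".
        worker = workers[turn % len(workers)]
        handled[worker].append(pending.popleft())
        turn += 1
    return dict(handled)


def read_streams(events):
    """Streaming: group events per entity, keeping arrival order within each one."""
    streams = defaultdict(list)
    for entity, value in events:
        streams[entity].append(value)
    return dict(streams)


# Hypothetical sample data for the bank-queue and temperature-sensor analogies.
payments = ["pay-1", "pay-2", "pay-3", "pay-4"]
assignments = run_work_queue(payments, ["teller-a", "teller-b"])

readings = [
    ("sensor-1", 20.5),
    ("sensor-2", 18.0),
    ("sensor-1", 21.0),
    ("sensor-1", 21.7),
]
per_sensor = read_streams(readings)
```

In the queue, what matters is that every payment is handled exactly once, regardless of which teller picks it up. In the stream, what matters is that each sensor's readings stay in the order they arrived, so changes can be analyzed as a sequence.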
Related content:
- A recent thread on Hacker News compared Apache Pulsar with other popular systems, including Kafka.
- “One simple chart: Who is interested in Apache Pulsar?”
- “Apache Pulsar Versus Apache Kafka”
- Karthik Ramasamy: “Architecting and building end-to-end streaming applications”
- Sijie Guo on “Comparing Pulsar and Kafka: unified queuing and streaming”
- Becoming a machine learning company means investing in foundational technologies
[Image from pxfuel.]