Azeem Ahmed on the evolution of Shopify’s data and machine learning platforms, and the power of the lakehouse architecture.
Azeem Ahmed is Director of Engineering at Shopify, where he leads the team that builds the primitives and APIs used by all data scientists, machine learning engineers, and members of Shopify’s engineering team. Prior to Shopify, Azeem led data and analytics infrastructure teams at LinkedIn and Consensys. Our conversation focused on the evolution and design of data and machine learning platforms within Shopify. Azeem and I also discussed broader trends, including the rise of modern data platforms and the maturation of data lakehouses.
We think about three large primitives: the ingest interface, the transform interface, and the publish interface. All of these apply to “data sets,” which could be tables, models, reports, dashboards, and all the other things that you mentioned. When you think of ingest, transform, publish, these are all operating on top of storage. We are building the lakehouse architecture: our storage is GCS, with the Iceberg table format plus Parquet. … Trino is our query engine.
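To make the three primitives concrete, here is a minimal Python sketch of ingest, transform, and publish operating on named data sets over a shared storage layer. It is an illustration only and not Shopify's API: `Catalog`, `Dataset`, `to_dollars`, and the sample order rows are all hypothetical, and an in-memory dictionary stands in for GCS, Iceberg, and Parquet.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Set

Row = Dict[str, object]


@dataclass
class Dataset:
    """A named data set; in-memory rows stand in for a table."""
    name: str
    rows: List[Row] = field(default_factory=list)


class Catalog:
    """Hypothetical storage layer that all three primitives operate on."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}
        self.published: Set[str] = set()

    def ingest(self, name: str, rows: Iterable[Row]) -> Dataset:
        """Ingest: land raw rows in storage under a data set name."""
        dataset = Dataset(name, list(rows))
        self._datasets[name] = dataset
        return dataset

    def transform(
        self, source: str, target: str, fn: Callable[[Row], Optional[Row]]
    ) -> Dataset:
        """Transform: derive a new data set; rows mapped to None are dropped."""
        mapped = (fn(row) for row in self._datasets[source].rows)
        dataset = Dataset(target, [row for row in mapped if row is not None])
        self._datasets[target] = dataset
        return dataset

    def publish(self, name: str) -> Dataset:
        """Publish: mark an existing data set as available to consumers."""
        dataset = self._datasets[name]  # raises KeyError if it was never created
        self.published.add(name)
        return dataset


def to_dollars(row: Row) -> Optional[Row]:
    """Hypothetical cleaning step: drop rows without a total, convert cents."""
    if row["total_cents"] is None:
        return None
    return {"id": row["id"], "total": row["total_cents"] / 100}


catalog = Catalog()
catalog.ingest(
    "orders_raw",
    [{"id": 1, "total_cents": 1250}, {"id": 2, "total_cents": None}],
)
catalog.transform("orders_raw", "orders_clean", to_dollars)
report = catalog.publish("orders_clean")
```

The design point is that every primitive takes and returns a data set by name, so a table, a model, or a dashboard can pass through the same three steps.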
… Where I think Ray is different, and where it excels, is that it is built around the idea of a unit of compute that you need to scale. It is not about taking a piece of data that you need to distribute everywhere and process; it is about a unit of compute that you need to distribute and parallelize. That’s really useful in ML training, reinforcement learning, deep learning, and other tasks that are CPU heavy and GPU heavy.
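The "unit of compute" idea can be sketched in a few lines of Python. The example below deliberately uses the standard library's `concurrent.futures` as a local stand-in rather than Ray itself, so it runs without a cluster. The function `train_and_score` and the toy objective are hypothetical. The unit being distributed is a function call (one training run per learning rate), not a partition of a data set.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_score(learning_rate: float) -> float:
    """Hypothetical unit of compute: gradient descent on f(w) = (w - 3)^2.

    Returns the final loss after 50 steps for the given learning rate.
    """
    w = 0.0
    for _ in range(50):
        grad = 2 * (w - 3.0)
        w -= learning_rate * grad
    return (w - 3.0) ** 2


learning_rates = [0.01, 0.1, 0.5]

# Each call is an independent task handed to a worker. With Ray, the same
# shape would typically decorate the function with @ray.remote, submit work
# with train_and_score.remote(lr), and collect results with ray.get(...).
with ThreadPoolExecutor(max_workers=3) as pool:
    losses = list(pool.map(train_and_score, learning_rates))

# Pick the learning rate whose run ended with the lowest loss.
best_lr = min(zip(losses, learning_rates))[1]
```

A thread pool only illustrates the programming model. For CPU-heavy or GPU-heavy work, the tasks would need separate processes or machines, which is the gap a system like Ray is meant to fill.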
We’re early in our experiment with Ray; I think we’re about six months into it now. We have a system now where users can run a command from the command line and get a Jupyter Notebook. They have our library, which they can use to spin up a workspace, and the workspace gives them a cluster. After that, they can write Python.
Highlights in the video version:
Introduction to Azeem, Director of Engineering and Data Platform at Shopify
Describe your role at Shopify
Data platform has three primitives: ingest, transform, and publish
Companies can move forward with the lakehouse
Deconstructed or disaggregated database
Lakehouses are not limited to SQL
Breakdown between structured and unstructured data
Lakehouse architecture and unstructured data
Starting a team using a modern data platform
Ray and the Shopify journey
Two challenges: tuning and customizing
Vision around the ML platform
Who uses the ML platform inside Shopify?
Onboarding and training data scientists to Ray
Core components of an ML platform
Experiment management and tracking
Data teams struggle with experiment platform
Experiment platform challenges
We don’t pay attention to experimentation
Shopify ML use cases
Merchant loan and product classification
NLP, computer vision, and recommender systems
Academic research community and graphs
Is the metadata platform based on an open source project?
Use cases for reinforcement learning
Machine learning platform twelve months from now
What about the broader lakehouse?
Shopify will be using Ray twelve months from now
- A video version of this conversation is available on our YouTube channel.
- “Top Places to Work for Data Engineers”
- “An Enterprise Software Roadmap for Sky Computing”
- Travis Addair: “The Future of Machine Learning Lies in Better Abstractions”
- Nikhil Muralidhar: “MLOps Anti-Patterns”
- Che Sharma: “Modern Experimentation Platforms”
[Image: “Lamps Bazaar Vintage Lantern Lanterns” from Maxpixel.]