Why You Should Optimize Your Deep Learning Inference Platform

The Data Exchange Podcast: Yonatan Geifman and Ran El-Yaniv on the benefits that accrue from using an inference acceleration platform.


In this episode of the Data Exchange, I speak[1] with Yonatan Geifman, CEO and co-founder of Deci, as well as with Ran El-Yaniv, Chief Scientist and co-founder of Deci and Professor of Computer Science at Technion. As companies deploy machine learning and deep learning to critical products and services, the number of predictions that models have to render can easily reach millions per day (even hundreds of trillions, in the case of Facebook).

These “prediction services” continue to grow in importance – 80% of content on Netflix is discovered through recommenders – and thus companies need to build platforms to serve predictions to an ever-growing number of users and services. Deci builds tools to help companies accelerate and scale their inference platforms to meet the requirements of their specific applications and use cases. They do so through an array of tools that look at inference holistically and systematically:

[Image: Inference Acceleration Stack, from Deci.ai, used with permission.]


Short excerpt:

    Yonatan: ❛ An inference platform will have several layers. At the bottom, we have the hardware stack where we see CPUs, GPUs, and specialized hardware (ASICs) – there are a lot of startups working on new chips for deep learning. We see chips that are more focused on training, as well as chips that are more focused on inference. In our case, we are talking about dedicated hardware for deep learning inference …

    On top of that layer, we have the software drivers for the hardware, which determine how the model utilizes the hardware. The most familiar example is the CUDA drivers for Nvidia GPUs.

    On top of that, we have the graph compilers: graph compilers are components that compile a deep learning model so that it runs more efficiently on the hardware. … Tools like TensorRT, Apache TVM, and OpenVINO are in the graph compiler layer.

    … On top of the graph compiler, we see some open source tools and techniques that are well-known in academia (such as pruning and quantization).
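
To make the last two techniques in the excerpt concrete, here is a minimal sketch of magnitude pruning and symmetric int8 quantization applied to a toy list of weights. It is not Deci's method or any particular library's API: the function names, the example weights, and the choice of a single shared scale are all assumptions made for illustration, and production tools work on whole tensors and handle calibration, per-channel scales, and retraining.

```python
# Illustrative only: toy versions of magnitude pruning and symmetric int8
# quantization on a flat list of weights. All names here are hypothetical.


def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # how many weights to drop
    if k == 0:
        return list(weights)
    # Indices of the k weights closest to zero.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in q]


# Made-up weights, standing in for one small layer.
weights = [0.82, -0.03, 0.40, -1.27, 0.01, 0.66]

pruned = prune_by_magnitude(weights, sparsity=0.5)  # half the weights become 0.0
q, scale = quantize_int8(pruned)                    # small integers plus one float
restored = dequantize(q, scale)                     # close to `pruned`, not exact

print(pruned)
print(q, scale)
print(restored)
```

Pruning trades a little accuracy for fewer non-zero weights, and quantization trades a little precision for smaller, cheaper arithmetic. In this sketch the round-trip error per weight is bounded by roughly half the scale factor.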



[1] This post and episode are part of a collaboration between Gradient Flow and Deci. See our statement of editorial independence.

[Photo by Robynne Hu on Unsplash]