Efficient AI Predictions through Explainability-driven Neural Network Quantization

Solving increasingly complex real-world problems continuously contributes to the success of deep neural networks (DNNs) (Schütt et al. 2017; Senior et al. 2020). DNNs have long been established in numerous machine learning tasks and, to this end, have been improved significantly over the past decade.

This improvement is often achieved by over-parameterizing models, i.e., their performance gains are attributed to a growing topology with more layers and more parameters per layer (Simonyan and Zisserman 2014; He et al. 2016). Processing such a large number of parameters, however, comes at the expense of memory, computational efficiency, and energy. These immense storage and energy requirements contradict the demand for efficient deep learning applications on an increasing number of hardware-constrained devices, e.g., mobile phones, wearable devices, Internet of Things devices, autonomous vehicles, or robots.
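As a rough, back-of-envelope illustration (not part of the original text), consider the roughly 138 million parameters of VGG-16 (Simonyan and Zisserman 2014): at 32-bit floating-point precision the weights alone occupy over half a gigabyte, and the footprint shrinks proportionally with the weight bit-width, which is exactly the lever that quantization pulls. A minimal sketch of this arithmetic:

```python
# Back-of-envelope weight-storage footprint of an over-parameterized DNN
# at different bit-widths (illustrative numbers, not from the paper).

def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in megabytes."""
    return num_params * bits_per_param / 8 / 1e6

VGG16_PARAMS = 138_000_000  # approximate parameter count of VGG-16

for bits in (32, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {model_size_mb(VGG16_PARAMS, bits):6.1f} MB")

# Expected output:
# 32-bit weights:  552.0 MB
#  8-bit weights:  138.0 MB
#  4-bit weights:   69.0 MB
#  2-bit weights:   34.5 MB
```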

Beyond these efficiency constraints, typical applications on such devices, e.g., healthcare monitoring, speech recognition, or autonomous driving, require low latency and/or data privacy. These latter requirements are addressed by running the aforementioned applications directly on the respective devices (also known as “edge computing”) instead of transferring data to third-party cloud providers prior to processing.