Inference scaling

What is it?

Optimising the performance of machine learning (ML) models is often based on factors intrinsic to the model's architecture, such as the number of neurons, the number of layers, or the numerical precision of its parameters. Techniques such as quantisation and pruning follow this approach. Tuning performance by such methods requires changing the intrinsic features of a model and often affects its accuracy. For example, lowering precision from a floating-point to an integer representation results in a loss of numerical accuracy. The implications of that loss may differ across machine learning domains.
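To make the precision trade-off concrete, the following sketch applies a simple symmetric 8-bit quantisation to a handful of hypothetical weight values and measures the rounding error introduced (the weights and the quantisation scheme are illustrative, not from any particular framework):

```python
# Hypothetical float weights (illustration only)
weights = [0.12, -0.53, 0.98, -0.07]

# Symmetric 8-bit quantisation: map the range [-max, max] onto [-127, 127]
scale = max(abs(w) for w in weights) / 127
quantised = [round(w / scale) for w in weights]  # int8-range values

# Mapping back to float reveals the error the precision change introduced
dequantised = [q * scale for q in quantised]
max_error = max(abs(w - d) for w, d in zip(weights, dequantised))
print(max_error)  # non-zero: accuracy is lost in the 8-bit representation
```

The error is bounded by half the quantisation step, but whether such an error is tolerable depends on the domain, as noted above.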

Why is it necessary?

In some ML settings, trading off a model's accuracy for improved performance may not be acceptable. This means other channels for optimising performance should be explored. We evaluate inference performance from a protocol perspective. Currently, two architectural styles are used to serve ML models: REST and gRPC. These two approaches provide different performance profiles when serving models. Understanding how these architectural designs affect inference performance for a deployed model can be helpful when designing inference architectures and procuring infrastructure resources. REST primarily uses JSON as the serialisation format for data sent between client and server, while gRPC uses protocol buffers as the serialisation format by default. In fact, the prediction protocol standard requires standardised ML servers to support both REST and gRPC designs.
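The serialisation difference is easy to see in payload size. The sketch below compares a JSON-encoded request body against the same values packed as raw 32-bit floats; the raw packing only approximates protocol buffers' binary compactness (it is not the actual protobuf wire format), and the feature values are hypothetical:

```python
import json
import struct

# A hypothetical inference input: 100 float features for one request
features = [0.1 * i for i in range(100)]

# REST typically serialises the request as JSON text
json_payload = json.dumps({"instances": [features]}).encode("utf-8")

# gRPC's protocol buffers are binary; packing the same values as raw
# 32-bit floats approximates that compactness (illustration only, not
# the real protobuf wire format)
binary_payload = struct.pack(f"{len(features)}f", *features)

print(len(json_payload), len(binary_payload))
# The text encoding is several times larger than the binary one
```

Smaller payloads mean less time spent serialising and transmitting each request, which is one reason the two styles show different performance profiles.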

How does it work?

1. To test a model’s performance and scalability characteristics, the model is deployed using a standard ML serving server such as TensorFlow Serving [1]. Once the model is deployed, REST and gRPC inference requests are generated using a load-testing framework such as Locust [2]. This setup simulates the scaling characteristics of the model on a given infrastructure setting for a determined number of users. Latency results, as shown in Figure 1, demonstrate the comparative effect of increasing batch size during inference when using a REST and gRPC architecture independently.

Figure 1: REST and gRPC inference comparison (University of Helsinki)
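The load-generation step described above can be sketched as a minimal Locust file. This targets TensorFlow Serving's REST predict endpoint; the model name ("my_model"), the input values, and the user wait times are placeholders, so treat this as a configuration sketch rather than a finished test plan:

```python
# locustfile.py -- minimal sketch of the REST load-generation step
from locust import HttpUser, task, between


class RestInferenceUser(HttpUser):
    # Each simulated user waits 1-2 seconds between requests
    wait_time = between(1, 2)

    @task
    def predict(self):
        # TensorFlow Serving's REST predict endpoint expects a JSON body
        # with an "instances" list (one entry per batch element)
        self.client.post(
            "/v1/models/my_model:predict",
            json={"instances": [[0.1, 0.2, 0.3, 0.4]]},
        )
```

Running, for example, `locust -f locustfile.py --host http://localhost:8501` then lets the number of simulated users be ramped up while latency is recorded, producing curves like those in Figure 1.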

Further, we can characterise the throughput behaviour of a given inference architecture, as shown in Figure 2.

Figure 2: Request throughput in REST vs gRPC (University of Helsinki)
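Throughput and latency are linked: under steady load, Little's law gives throughput as concurrency divided by mean latency. The sketch below derives a throughput estimate from hypothetical latency samples and a hypothetical user count, which is one simple way to characterise an architecture's throughput behaviour from load-test measurements:

```python
# Hypothetical per-request latencies (ms) recorded during a load test
latencies_ms = [12.0, 15.5, 11.2, 14.8, 13.1]
concurrent_users = 50  # hypothetical steady concurrency

mean_latency_s = sum(latencies_ms) / len(latencies_ms) / 1000
# Little's law: throughput = concurrency / mean latency (steady state)
throughput_rps = concurrent_users / mean_latency_s
print(f"{throughput_rps:.0f} requests/second")
```

Comparing such estimates for REST and gRPC under the same concurrency makes the gap between the two curves in Figure 2 quantifiable.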

2. To provide comparable benchmarks, these experiments will be conducted on benchmarked hardware (GPU or CPU) [3] to increase the credibility of the obtained results.

Links and further reading

1. Olston, Christopher, et al. "TensorFlow-Serving: Flexible, high-performance ML serving." arXiv preprint arXiv:1712.06139 (2017).
2. Locust. 2023. Locust: An open source load testing tool. (Last visited: 08/05/2023).
3. Reddi, Vijay Janapa, et al. "MLPerf Inference Benchmark." 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020.