Inference scaling

What is it?

This is an evaluation framework that supports optimizing machine learning (ML) deployments for inference performance. Performance optimization of ML models typically focuses on intrinsic factors, such as the number of neurons, the depth of layers, or the numerical precision of computations. Techniques like quantization and pruning are widely used to improve efficiency by reducing a model’s computational or memory footprint. However, these methods inherently modify the model’s internal architecture, which can lead to trade-offs in accuracy. For instance, moving from higher-precision floating-point representations to lower-precision integer formats introduces a loss of numerical precision, which may degrade the model’s predictive performance.
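
As a toy illustration of this precision trade-off (not part of the framework itself), the snippet below quantizes a few float32 weights to int8 with a simple symmetric linear scheme and measures the reconstruction error:

    # Toy illustration of precision loss from quantization: mapping float32
    # weights to int8 and back introduces a small reconstruction error.
    import numpy as np

    weights = np.random.randn(5).astype(np.float32)

    # Symmetric linear quantization to int8.
    scale = np.abs(weights).max() / 127.0
    quantized = np.round(weights / scale).astype(np.int8)
    dequantized = quantized.astype(np.float32) * scale

    print("original:   ", weights)
    print("dequantized:", dequantized)
    print("max error:  ", np.abs(weights - dequantized).max())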

Contact

Dennis Muiruri
University of Helsinki

Why is it necessary?

The consequences of model accuracy degradation vary across different machine learning domains, depending on the sensitivity of the task to numerical errors. Therefore, it is essential to explore alternative avenues for performance optimization beyond modifying attributes that are intrinsic to the model. 

In this project, we explored inference performance characteristics from a protocol perspective, focusing on REST and gRPC as the two communication protocols most widely supported across model serving frameworks. These protocols exhibit distinct performance characteristics depending on the payload type and the serving framework. Their impact on inference latency and throughput is therefore critical for designing inference architectures and for allocating deployment infrastructure resources efficiently.

How does it work?

To test a model’s performance and scalability characteristics, the model is deployed using a standard ML serving server such as TensorFlow Serving [1]. Once the model is deployed, inference calls to its REST and gRPC endpoints are generated using a load-testing framework such as Locust [2] or a custom script. Using this setup, various scaling characteristics of the model and of the model serving infrastructure can be simulated.
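
As a rough sketch of such a setup (not the project’s actual test harness), the Locust script below drives the REST endpoint of a TensorFlow Serving deployment; the host, model name (my_model), and 784-element input are placeholders that must match the real deployment. A gRPC client would follow the same pattern using the stubs shipped in the tensorflow-serving-api package.

    # Minimal Locust sketch that load-tests the REST endpoint of a
    # TensorFlow Serving deployment. Model name, host, and input shape
    # are placeholders.
    import random

    from locust import HttpUser, task, between


    class RestInferenceUser(HttpUser):
        # TensorFlow Serving's REST API listens on port 8501 by default.
        host = "http://localhost:8501"
        wait_time = between(0.01, 0.1)  # pause between requests per simulated user

        @task
        def predict(self):
            # A single flat input vector; replace with the model's real input shape.
            instance = [random.random() for _ in range(784)]
            self.client.post(
                "/v1/models/my_model:predict",
                json={"instances": [instance]},
            )

Running this file with the locust command-line tool and ramping up the number of simulated users varies the request intensity seen by the model server.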

Our experiments, as illustrated in Figure 1, evaluate the performance of REST and gRPC in two serving frameworks, TensorFlow Serving (TFServing) and TorchServe, under varying load intensities, where the load is controlled by the lambda parameter of a Poisson distribution.
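
The Poisson-controlled load can be approximated with a simple open-loop generator: inter-arrival times of a Poisson process with rate lambda are exponentially distributed, so sleeping for an exponentially distributed interval between requests yields the desired mean intensity. The sketch below uses a placeholder endpoint and payload and is not the project’s load generator.

    # Sketch of a Poisson-driven load generator: inter-arrival times of a
    # Poisson process with rate `lam` (requests/second) are exponentially
    # distributed. URL and payload are placeholders.
    import random
    import time

    import requests


    def poisson_load(url: str, payload: dict, lam: float, n_requests: int) -> None:
        for _ in range(n_requests):
            requests.post(url, json=payload)
            # Higher lambda -> shorter average wait -> higher load intensity
            # (request latency is ignored in this simple sketch).
            time.sleep(random.expovariate(lam))


    if __name__ == "__main__":
        poisson_load(
            "http://localhost:8501/v1/models/my_model:predict",
            {"instances": [[0.0] * 784]},
            lam=50.0,        # target mean request rate (requests/second)
            n_requests=1000,
        )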

Results

The findings from our experiments indicate that inference protocols, payload characteristics, caching mechanisms, load balancing, and integration options should all be considered when deploying ML systems.

  1. Protocol selection and payload type: Binary payloads are preferable to JSON or other text formats. Protobuf serialization is better optimized than JSON, which gives the gRPC protocol a performance advantage. In settings where REST must be used, binary payloads over REST should be considered. Hybrid inference architectures that use both protocols at strategic parts of the system interface can benefit from the strengths of each. For externally facing APIs, optimized REST APIs should be configured to send the more efficient binary payloads to the server (see the payload-size sketch after this list).
  2. Caching: If similar request payloads are expected in a given deployment setting, a caching layer should be considered. This improves overall system performance through reduced request latency, and caching can also be supported at the model level (a minimal cache sketch follows this list). On the downside, an added caching layer may introduce new complexity into the overall architecture.
  3. Resource utilization: gRPC delivers higher load to the model server because it scales better at increased request intensity levels.
  4. Supporting multiple frameworks: For production use, especially in performance-critical applications, porting a model to a framework with demonstrably better performance can be considered. The main constraint is the added complexity in the development workflow.
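
To make the payload-size point in item 1 concrete, the sketch below serializes the same input tensor as a REST JSON body and as a gRPC protobuf PredictRequest and compares their sizes. The model and input names are placeholders, and the tensorflow and tensorflow-serving-api packages are assumed to be installed.

    # Compare the wire size of the same tensor as a REST JSON body versus a
    # gRPC protobuf PredictRequest. Model and input names are placeholders.
    import json

    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2

    data = np.random.rand(1, 784).astype(np.float32)

    # REST: JSON text payload.
    json_body = json.dumps({"instances": data.tolist()})

    # gRPC: binary protobuf payload.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "my_model"                          # placeholder model name
    request.inputs["input"].CopyFrom(tf.make_tensor_proto(data))  # placeholder input name
    proto_body = request.SerializeToString()

    print(f"JSON payload:     {len(json_body)} bytes")
    print(f"Protobuf payload: {len(proto_body)} bytes")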
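
The caching layer mentioned in item 2 can be as simple as keying responses by a hash of the request payload. Below is a minimal sketch with a placeholder endpoint URL, not the project’s implementation.

    # Minimal payload-keyed prediction cache placed in front of a model server.
    # Identical request payloads are answered from the cache instead of the model.
    import hashlib
    import json

    import requests

    _cache: dict[str, dict] = {}


    def cached_predict(url: str, payload: dict) -> dict:
        # Canonical JSON ensures equivalent payloads hash to the same key.
        key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if key not in _cache:
            _cache[key] = requests.post(url, json=payload).json()
        return _cache[key]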

Links and further reading:

  • [1] Olston, Christopher, et al. “TensorFlow-Serving: Flexible, high-performance ML serving.” arXiv preprint arXiv:1712.06139 (2017).
  • [2] Locust. 2023. Locust: An open source load testing tool. (Last visited: 08/05/2023).
  • [3] Reddi, Vijay Janapa, et al. “MLPerf Inference Benchmark.” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020.