Inference scaling
What is it?
This is an evaluation framework for optimizing machine learning (ML) deployments for inference performance. Performance optimization of ML models typically focuses on intrinsic factors, such as the number of neurons, the depth of the layers, or the numerical precision of computations. Techniques like quantization and pruning are widely used to improve efficiency by reducing a model's computational or memory footprint. However, these methods inherently modify the model's internals, which can lead to trade-offs in accuracy. For instance, moving from higher-precision floating-point representations to lower-precision integer formats introduces a loss of numerical precision that may degrade the model's predictive performance.
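As a concrete illustration of the kind of intrinsic optimization this framework deliberately leaves untouched, the sketch below applies post-training quantization to a saved TensorFlow model. The model path and output filename are placeholders, not artifacts from this project.

```python
# Illustrative only: post-training quantization of a saved TensorFlow model.
# "saved_model_dir" is a placeholder path, not an artifact from this project.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()  # smaller, lower-precision model bytes

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```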
Why is it necessary?
The consequences of model accuracy degradation vary across different machine learning domains, depending on the sensitivity of the task to numerical errors. Therefore, it is essential to explore alternative avenues for performance optimization beyond modifying attributes that are intrinsic to the model.
In this project, we explored inference performance characteristics from a protocol perspective, focusing on REST and gRPC, the two communication protocols most widely supported across model serving frameworks. These protocols exhibit distinct performance characteristics depending on the payload type and the serving framework. Their impact on inference latency and throughput is critical for optimizing the design of inference architectures and for the efficient allocation of deployment infrastructure resources.
The experiment framework provides a way to evaluate the following aspects of a model deployment.
- Evaluation of inference protocols: Performance differences between endpoints can be attributed to the underlying data exchange protocols for a given payload type. gRPC outperforms REST, showing lower latencies across the evaluated serving frameworks. The payload type matters because of serialization overhead.
- Determining ML serving framework differences: Some frameworks can be deemed more appropriate for production than others based on factors such as their performance profiles. Models served with TorchServe achieve better performance than models served with TensorFlow Serving after controlling for differences in payload characteristics.
- Caching effects: Model serving frameworks exhibit different caching effects. A stronger caching effect is observed in TorchServe than in TensorFlow Serving.
- Scalability under load: How serving frameworks scale inference under increased load. Higher load leads to better CPU resource utilization in both serving frameworks, but gRPC delivers a higher load than REST and therefore achieves better performance.
How does it work?
To test a model's performance and scalability characteristics, the model is deployed using a standard ML serving server such as TensorFlow Serving [1]. Once the model is deployed, inference calls to the REST and gRPC model server endpoints are generated using a load-testing framework such as Locust [2] or a custom script. With this setup, various scaling characteristics of the model and of the model serving infrastructure can be simulated.
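For example, a minimal Locust user that drives the REST predict endpoint of a TensorFlow Serving container might look like the sketch below. The model name ("vit"), the default REST port 8501, and the dummy input shape are assumptions for illustration, not the exact configuration used in the experiments.

```python
# Minimal Locust sketch for load-testing a TensorFlow Serving REST endpoint.
# Run with:  locust -f locustfile.py --host http://localhost:8501
# Model name "vit" and the dummy input shape are illustrative assumptions.
import random

from locust import HttpUser, constant, task


class RestInferenceUser(HttpUser):
    # Each simulated user waits one second between requests;
    # the aggregate load scales with the number of spawned users.
    wait_time = constant(1)

    @task
    def predict(self):
        # Dummy 8x8x3 tensor; a real run would send image-sized inputs.
        instance = [[[random.random() for _ in range(3)] for _ in range(8)]
                    for _ in range(8)]
        self.client.post("/v1/models/vit:predict", json={"instances": [instance]})
```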
Our experiments, as illustrated in Figure 1, evaluate the performance of REST and gRPC in two serving frameworks—TensorFlow Serving (TFServing) and TorchServe—under varying load intensities where the load is controlled by the lambda parameter of a Poisson distribution.
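One way to realize such Poisson-controlled load in a custom script is to draw exponentially distributed inter-arrival gaps, since the inter-arrival times of a Poisson process with rate lambda have mean 1/lambda. The sketch below follows this approach; the URL and payload are placeholders rather than the exact experiment configuration.

```python
# Sketch of a Poisson-driven request generator: inter-arrival times of a
# Poisson process with rate LAM are exponential with mean 1/LAM, so raising
# LAM raises the load intensity. URL and payload are illustrative placeholders.
import random
import threading
import time

import requests

LAM = 20.0  # lambda: mean requests per second
URL = "http://localhost:8501/v1/models/vit:predict"
PAYLOAD = {"instances": [[0.0, 0.0, 0.0]]}  # dummy payload


def fire():
    requests.post(URL, json=PAYLOAD, timeout=10)


def run(duration_s=60):
    end = time.time() + duration_s
    while time.time() < end:
        time.sleep(random.expovariate(LAM))  # exponential gap between arrivals
        # Fire in a background thread so slow responses do not distort the arrival rate.
        threading.Thread(target=fire, daemon=True).start()


if __name__ == "__main__":
    run()
```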
These key insights emerged from the experiments:
- Impact of payload types: Binary payloads lead to better performance.
- TensorFlow Serving framework: A significant performance difference emerges between the REST and gRPC protocols, attributable to the difference in payload types. TensorFlow Serving's REST API accepts JSON-based payloads, the typical approach in web applications, where payloads are serialized to JSON before being sent between client and server. gRPC, on the other hand, uses protocol buffers as its serialization format (see the sketch after this list). The latency difference between REST and gRPC shown in the left image of Figure 1 demonstrates this effect.
- TorchServe: The TorchServe framework uses byte arrays; the right image of Figure 1 shows a smaller difference between the REST and gRPC protocols.
- Cache hits vs cache misses: Cached payloads inherently led to better performance (lower latency). This observation was consistent across both protocols and frameworks, as seen in both images of Figure 1. This experiment served as a control for the setup, ensuring that recorded performance measurements were not artificially good due to caching (cache hits). It also reflects a realistic scenario in deployed systems, where cache misses can dominate the CPU load.
- Performance optimization under high load conditions: Improved performance (lower latency) at higher workloads is a result of higher CPU utilization. gRPC endpoints can deliver more payload to the CPU than REST, hence the higher CPU utilization. In both the TensorFlow Serving and TorchServe frameworks, higher loads led to better performance, as shown in both images of Figure 1.
- Binary payloads
- TensorFlow Serving framework: To control for the performance difference based on payload types, we conducted an experiment with TensorFlow Serving in which the REST and gRPC endpoints were both configured to use binary payloads. The left image of Figure 2 shows that the significant difference between REST and gRPC on TensorFlow Serving diminishes.
- REST vs gRPC: Consistent with previous results, gRPC endpoints provide better performance than REST across the two frameworks when the same payload is used for the different endpoints.
- Difference between frameworks: There is a distinct performance difference between the frameworks, as shown in the left image of Figure 2. TorchServe provides better performance than TensorFlow Serving. Offline evaluation of models with the same architecture (a ViT model) but different implementation frameworks indicated that the TensorFlow models resulted in a higher number of function calls than the TorchServe-implemented models.
- Performance distribution: The right subplot of Figure 2 shows that REST and gRPC have similar distribution shapes, but the distributions obtained from different frameworks are distinct.
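To make the payload difference concrete, the rough sketch below shows how the same input could reach TensorFlow Serving's two endpoints: JSON text over REST versus a protobuf PredictRequest over gRPC. The model name ("vit"), input tensor name ("input_1"), host, and default ports (8501 for REST, 8500 for gRPC) are assumptions for illustration, not the exact settings from the experiments.

```python
# Sketch: the same tensor sent to TensorFlow Serving over REST (JSON) and gRPC (protobuf).
# Model name "vit", input name "input_1", host, and ports are illustrative assumptions.
import json

import grpc
import numpy as np
import requests
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

image = np.random.rand(1, 224, 224, 3).astype(np.float32)

# REST: the tensor is serialized to JSON text, inflating the payload and adding
# encode/decode overhead on both client and server.
rest_body = json.dumps({"instances": image.tolist()})
rest_response = requests.post("http://localhost:8501/v1/models/vit:predict", data=rest_body)

# gRPC: the tensor travels as a compact binary TensorProto inside a protobuf message.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "vit"
request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(image))
grpc_response = stub.Predict(request, 10.0)  # 10 s deadline
```

TensorFlow Serving's REST API can also carry binary data as base64-encoded strings (objects of the form {"b64": "..."}) for string-typed inputs, which is one way a binary-payload REST configuration like the one used for Figure 2 can be set up.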
The findings from our experiments indicate that inference protocols, payload characteristics, caching mechanisms, load balancing, and integration options should all be considered when deploying ML systems.
- Protocol selection and payload type: Binary payloads are preferred over JSON and other text formats. Protobuf serialization is better optimized than JSON, which leads to better performance for the gRPC protocol. In settings where REST must be used, binary payloads over REST should be considered. Hybrid inference architectures that use both protocols in strategic parts of the system interface could benefit from the strengths of each. For externally facing APIs, optimized REST APIs should be configured to send more efficient binary payloads to the server.
- Caching: If similar request payloads are expected in a deployment setting, a caching layer should be considered (a minimal sketch follows this list). This would improve overall system performance through reduced request latency and support for caching at the model level. On the downside, an added caching layer may introduce new complexity into the overall architecture.
- Resource utilization: gRPC delivers higher loads to the model server due to its ability to scale at increased request intensity levels.
- Supporting multiple frameworks: Porting a model from one framework to another can be considered for production purposes, adopting frameworks with demonstrably better performance, especially for performance-critical applications. The main constraint is the added complexity in the development workflow.
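The caching layer mentioned above could be as simple as keying responses on a hash of the request payload, as in the minimal sketch below. The endpoint URL is a placeholder, and the in-memory dictionary stands in for whatever cache backend a real deployment would use; nothing here is specific to either serving framework.

```python
# Minimal sketch of a request-level cache in front of the model server,
# assuming exact-duplicate payloads occur often enough to make caching worthwhile.
# PREDICT_URL is an illustrative placeholder.
import hashlib
import json

import requests

PREDICT_URL = "http://localhost:8501/v1/models/vit:predict"

_cache = {}  # payload hash -> cached prediction


def _key(payload):
    # Hash a canonical JSON encoding so identical requests map to the same key.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def cached_predict(payload):
    key = _key(payload)
    if key not in _cache:  # cache miss: pay the full inference latency
        _cache[key] = requests.post(PREDICT_URL, json=payload, timeout=10).json()
    return _cache[key]  # cache hit: served without touching the model server
```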
Links and further reading
1. Olston, Christopher, et al. "TensorFlow-Serving: Flexible, high-performance ML serving." arXiv preprint arXiv:1712.06139 (2017).
2. Locust. 2023. Locust: An open source load testing tool. (Last visited: 08/05/2023).
3. Reddi, Vijay Janapa, et al. "MLPerf inference benchmark." 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020.