Asynchronous Inference Graph Execution for Model Routing in Machine Learning Systems

© 2024 by IJCTT Journal
Volume-72 Issue-10
Year of Publication : 2024
Authors : Gangadharan Venkataraman
DOI :  10.14445/22312803/IJCTT-V72I10P101

How to Cite?

Gangadharan Venkataraman, "Asynchronous Inference Graph Execution for Model Routing in Machine Learning Systems," International Journal of Computer Trends and Technology, vol. 72, no. 10, pp. 1-4, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I10P101

Abstract
This paper presents a routing mechanism for machine learning systems that is built on the asynchronous execution of inference graphs. The mechanism supports model chaining, champion/challenger evaluation, and traffic splitting, which allows flexible and efficient model deployment strategies. We describe the architecture and implementation of the routing mechanism, along with its application to real-world ML pipelines.
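
The three routing patterns named in the abstract can be illustrated with a minimal sketch using Python's asyncio. This is not the paper's implementation: the model functions (`tokenizer`, `champion`, `challenger`) and the helper names are hypothetical stand-ins, and a real deployment would replace them with calls to a model server.

```python
import asyncio
import random

# Hypothetical stand-ins for deployed models (assumption, not from the paper).
async def tokenizer(text):
    await asyncio.sleep(0)  # yield control, as a network call would
    return text.split()

async def champion(tokens):
    await asyncio.sleep(0)
    return len(tokens)

async def challenger(tokens):
    await asyncio.sleep(0)
    return len(tokens) + 1

async def chain(x, *stages):
    """Model chaining: feed each stage's output into the next stage."""
    for stage in stages:
        x = await stage(x)
    return x

async def champion_challenger(x, champ, chall, shadow_log):
    """Run both models concurrently; serve the champion, record the challenger."""
    served, shadow = await asyncio.gather(champ(x), chall(x))
    shadow_log.append((served, shadow))
    return served

async def traffic_split(x, routes, rng):
    """Traffic splitting: pick one model per request according to its weight."""
    models, weights = zip(*routes)
    model = rng.choices(models, weights=weights, k=1)[0]
    return await model(x)

async def main():
    shadow_log = []
    tokens = await chain("route this request", tokenizer)
    served = await champion_challenger(tokens, champion, challenger, shadow_log)
    rng = random.Random(0)  # seeded only to make the sketch repeatable
    routes = [(champion, 0.9), (challenger, 0.1)]
    split = [await traffic_split(tokens, routes, rng) for _ in range(100)]
    return served, shadow_log, split

if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch each node of the graph is a coroutine, so independent branches (such as the champion and the challenger) can run concurrently with `asyncio.gather`, while chained stages are awaited in order.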

Keywords
Inference Service, Model Routing, Asynchronous Execution, Model Chaining, Champion/Challenger, Traffic Splitting.
