Secondly, a machine learning technique is employed to build a binary classification model that combines the performance characteristics of heterogeneous serverless computing frameworks, enabling online switching of the model inference service framework. Finally, a testi...
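A minimal sketch of the idea, under stated assumptions: the paper does not give its features or model, so the features (batch size, model size), the framework names, and the logistic-regression choice below are all illustrative, not the authors' method. The point is only to show a binary classifier mapping workload characteristics to a framework decision.

```python
import math
import random

def train_logreg(samples, labels, lr=0.1, epochs=500):
    """Plain logistic regression via per-sample gradient descent (no external deps)."""
    n_feat = len(samples[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def choose_framework(w, b, features):
    # Route the request: probability >= 0.5 means framework B is predicted faster.
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return "framework_B" if p >= 0.5 else "framework_A"

# Synthetic training data (assumption): features are [batch_size, model_size_gb],
# normalized to [0, 1]; label 1 means framework B was faster in offline benchmarks.
random.seed(0)
samples, labels = [], []
for _ in range(200):
    batch = random.uniform(1, 32)
    size = random.uniform(0.1, 10)
    samples.append([batch / 32, size / 10])
    labels.append(1 if batch > 16 else 0)  # toy rule: B wins on large batches

w, b = train_logreg(samples, labels)
print(choose_framework(w, b, [30 / 32, 2 / 10]))  # large-batch request
print(choose_framework(w, b, [2 / 32, 2 / 10]))   # small-batch request
```

In a real system the labels would come from benchmarking each framework on past workloads, and the trained model would be queried online at request-routing time.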
and, slowly but surely, more models are being brought into production. When making the step towards production, inference time starts to play an important role. When a model is external user-facing, you typically want to get your inference time into the millisecond range, and no longer than...
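Before optimizing toward that millisecond budget, it helps to measure what you actually have. A small sketch, assuming a stand-in `model_predict` call (the sleep simulates roughly 1 ms of model work); the warmup loop and p50/p99 reporting are the parts worth copying:

```python
import statistics
import time

def model_predict(x):
    # Placeholder for a real model call; simulates ~1 ms of work.
    time.sleep(0.001)
    return x * 2

def measure_latency(fn, inputs, warmup=5):
    # Warm up first: cold starts and lazy initialization skew the numbers.
    for x in inputs[:warmup]:
        fn(x)
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms

timings = measure_latency(model_predict, list(range(50)))
p50 = statistics.median(timings)
p99 = statistics.quantiles(timings, n=100)[98]  # 99th percentile
print(f"p50={p50:.2f} ms  p99={p99:.2f} ms")
```

Tail percentiles (p99) matter more than the mean for user-facing services, since a small fraction of slow requests is what users notice.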
(2023.09) [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention (@UC Berkeley etc.) [vllm] 18k Stars

Prompt Compression

(2023.04) [Selective-Context] Compressing Context to Enhance Inference Efficiency of Large Language Models (@Surrey) [Selective-Context] 165 St...