FlexGen further compresses both the weights and the KV cache to 4 bits with negligible accuracy loss. A key idea of FlexGen is to exploit the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods, but the I/O efficiency of offloading can be greatly boosted...
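
To make the 4-bit compression concrete, the following is a minimal sketch of group-wise min-max quantization in NumPy. It is an illustration under stated assumptions, not FlexGen's actual code: the function names `quantize_4bit` and `dequantize_4bit`, the group size of 64, and the nibble-packing layout are all hypothetical choices made for this example.

```python
import numpy as np

def quantize_4bit(x, group_size=64):
    """Asymmetric min-max quantization to 4 bits, one (min, scale) per group.

    Hypothetical helper for illustration; group_size must be even.
    """
    flat = x.astype(np.float32).reshape(-1)
    # Pad with the last value so the final group's range is not distorted.
    pad = (-flat.size) % group_size
    flat = np.pad(flat, (0, pad), mode="edge")
    groups = flat.reshape(-1, group_size)

    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0          # 4 bits -> 16 levels, codes 0..15
    scale[scale == 0] = 1.0           # constant groups: avoid division by zero

    codes = np.round((groups - lo) / scale).astype(np.uint8)
    # Pack two 4-bit codes into each byte (even index in the high nibble).
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]
    return packed, lo, scale, x.shape

def dequantize_4bit(packed, lo, scale, shape):
    """Unpack the nibbles and map codes back to approximate float values."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed >> 4
    codes[:, 1::2] = packed & 0x0F
    flat = (codes.astype(np.float32) * scale + lo).reshape(-1)
    return flat[: int(np.prod(shape))].reshape(shape)

# Usage: compress a random "weight" tensor and measure the round-trip error.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 100)).astype(np.float32)
packed, lo, scale, shape = quantize_4bit(weights)
restored = dequantize_4bit(packed, lo, scale, shape)
max_error = float(np.abs(restored - weights).max())
```

In this scheme the packed codes take half a byte per value, plus one minimum and one scale per group, and the round-trip error of each value is bounded by half of its group's scale. Smaller groups tighten that bound at the cost of more per-group metadata.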