How Batching in LLM Decode Works

Maximizing speed: How continuous batching unlocks unprecedented LLM throughput

Think of continuous batching as the LLM world’s turbocharger — keeping GPUs busy nonstop and cranking out results up to 20x faster. I discussed how PagedAttention cracked the code on LLM memory chaos ...

VentureBeat

Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory ...

Semiconductor Engineering

GPU Analysis Identifying Performance Bottlenecks That Cause Throughput Plateaus In Large-Batch Inference

A new technical paper titled “Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference” was published by researchers at Barcelona Supercomputing Center, Universitat Politecnica de ...

Semiconductor Engineering

Heterogeneous System With Specialized HW For Disaggregated LLM Inference (Princeton Univ., Univ. of Washington)

A new technical paper titled “SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference” was published by researchers at Princeton University and University of Washington. “Large ...

CRN

Nvidia Says New Software Will Double LLM Inference Speed On H100 GPU

The AI chip giant says the open-source software library, TensorRT-LLM, will double the H100’s performance for running inference on leading large language models when it comes out next month. Nvidia ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results