#llm-inference

from The Register
1 week ago

DGX Spark Nvidia's desktop supercomputer: first look

But the machine is far from the fastest GPU in Nvidia's lineup. It won't beat an RTX 5090 at large language model (LLM) inference, fine-tuning, or even image generation, never mind gaming. What the DGX Spark, and the slew of GB10-based systems hitting the market tomorrow, can do is run models that the 5090, or any other consumer graphics card on the market today, simply can't.
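The capability gap comes down to memory: the teaser is about model footprint, not speed. A back-of-the-envelope sketch (the parameter counts and byte widths below are illustrative assumptions, not figures from the article) shows why large models overflow consumer VRAM:

```python
# Rough memory needed to hold model weights alone, ignoring the KV
# cache and activations. Illustrative assumptions, not official specs.
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params * bytes/param, expressed directly in gigabytes
    return params_billions * bytes_per_param

# A hypothetical 70B-parameter model at fp16 (2 bytes per parameter):
fp16_70b = weights_gb(70, 2.0)
# The same model quantized to 4-bit (~0.5 bytes per parameter):
q4_70b = weights_gb(70, 0.5)

print(f"70B fp16: {fp16_70b:.0f} GB, 70B 4-bit: {q4_70b:.0f} GB")
```

Even the 4-bit variant in this sketch exceeds the 32 GB on an RTX 5090, while a machine with a large pool of unified memory can load it, which is the trade the DGX Spark is making: capacity over raw throughput.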
Artificial intelligence
from InfoQ
3 weeks ago

Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure

Disaggregated serving separates LLM prefill and decode onto specialized hardware, improving throughput, reducing latency variance, and cutting infrastructure costs by matching each phase to the hardware that suits it.
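The split works because the two phases stress different resources: prefill processes the whole prompt in one compute-bound pass, while decode generates tokens one at a time and is bound by memory bandwidth. A minimal sketch of the handoff (all names here are illustrative, not an API from the article):

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for the attention key/value state produced by prefill.
    In a real disaggregated deployment this is what gets transferred
    between the prefill pool and the decode pool."""
    prompt: str
    tokens: list = field(default_factory=list)

def prefill(prompt: str) -> KVCache:
    # Compute-bound phase: one batched pass over the full prompt,
    # run on compute-optimized hardware in a disaggregated setup.
    return KVCache(prompt=prompt, tokens=prompt.split())

def decode(cache: KVCache, max_new_tokens: int) -> list:
    # Memory-bandwidth-bound phase: emit one token per step, reading
    # the shipped KV cache each time; runs on a separate pool sized
    # for bandwidth rather than raw FLOPs.
    out = []
    for i in range(max_new_tokens):
        token = f"tok{i}"  # placeholder for real sampling
        out.append(token)
        cache.tokens.append(token)
    return out

cache = prefill("why separate prefill and decode")
print(decode(cache, 3))
```

Because the two pools scale independently, an operator can add decode capacity for long generations without paying for idle prefill compute, which is where the cost savings in the summary come from.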
Scala
from HackerNoon
10 months ago

Related Work: vAttention in LLM Inference Optimization Landscape | HackerNoon

Optimizing LLM inference is essential for reducing latency and improving performance in AI applications.