
Flash, Fused and Fast Attention
A change of approach From ~2020 onward, the focus of attention research subtly shifted: Before: Researchers were asking, “How can we change the algorithm to avoid quadratic cost?” Thi...
What is Attention? At its core, attention is about figuring out what matters most. In machine learning models, attention helps focus on the most relevant pieces of information when making a decisi...
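The idea of weighting the most relevant pieces of information can be made concrete with a minimal sketch of scaled dot-product attention for a single query, written in plain Python. This is an illustrative toy, not any particular library's implementation; the vectors and dimensions are made up for the example.

```python
import math

def attention(query, keys, values):
    # Scaled dot-product attention for one query vector.
    # score_i = (q . k_i) / sqrt(d), weights = softmax(scores),
    # output = weighted average of the value vectors.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Toy example: the query aligns with the first key, so the first value dominates.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention(q, keys, values)
```

Because the softmax weights sum to one, the output is always a convex combination of the value vectors, pulled toward the values whose keys best match the query.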
Coming soon …
Coming soon …
Extending TensorRT with Custom Plugins TensorRT’s standard operations cover many common use cases, but there are scenarios where custom solutions become necessary: Third-party Integration ...
Coming soon …
FP8 Quantization with TensorRT Model Optimization: Calibration Process, Attention Fusion Verification, Verify Fusion in Profiler Output
TensorRT in Practice TensorRT promises significant performance improvements for deep learning inference. Though it varies from case to case, I have consistently seen minimum reductions in latency ...
Overview Modern deep learning models require efficient execution to meet production demands. While the core logic of our code defines what we want to achieve, its execution depends on numerous low...
Summary Profile a sample inference pipeline Identify the bottlenecks: CPU vs GPU When we are CPU bound: e.g. use the NeMo GPU dataloader When we are GPU bound: e.g. ?? Ensure overlap of CPU a...
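The first two summary steps, profiling a pipeline and identifying whether it is CPU- or GPU-bound, can be sketched with a simple per-stage timer. This is a hedged illustration using only the standard library: the stage names and functions below are hypothetical stand-ins for a real dataloader, model, and postprocessing step, and wall-clock timing of each stage is a rough proxy for a real profiler.

```python
import time

def profile_pipeline(stages, batches):
    # Accumulate wall-clock time per named stage across all batches,
    # so the slowest stage (the bottleneck) stands out.
    totals = {name: 0.0 for name, _ in stages}
    for batch in batches:
        x = batch
        for name, fn in stages:
            start = time.perf_counter()
            x = fn(x)
            totals[name] += time.perf_counter() - start
    return totals

# Hypothetical stages standing in for preprocessing, model execution,
# and postprocessing in a real inference pipeline.
def preprocess(x): return [v * 2 for v in x]
def model(x): return sum(x)
def postprocess(x): return x / 2

totals = profile_pipeline(
    [("preprocess", preprocess), ("model", model), ("postprocess", postprocess)],
    [[1, 2, 3]] * 100,
)
bottleneck = max(totals, key=totals.get)
```

In practice a dedicated profiler gives a far more accurate picture, but the same principle applies: once the dominant stage is known, you either speed that stage up or overlap it with the others.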