ML (26)
- FP8 Quantization - Viable for production?
- Attention Olympics - Fastest attention kernel?
- Building Custom TensorRT Plugins
- Understanding Attention Approximation
- FP8 Quantization using TensorRT
- TensorRT - From Frustration to Production
- Flash, Fused and Fast Attention
- Why attention deserves your attention
- Understanding Inference Optimization Frameworks
- Breaking Down ML Inference Bottlenecks
- GPU vs CPU - Matmul, Sine Waves, and the Myth of Speed
- How does a GPU work?
- LLMs Explained - Part 6 - Transformers
- LLMs Explained - Part 5 - Attention!
- LLMs Explained - Part 4 - seq2seq
- LLMs Explained - Part 3 - RNNs
- LLMs Explained - Part 2 - Word Embeddings
- LLMs Explained - Part 1 - Tokenizers
- Variational Autoencoders
- Deep Learning Basics
- Generative Modeling
- Understanding Feature Engineering
- Understanding Training Data
- Data Engineering Fundamentals
- Designing Machine Learning Systems
- Machine Learning Systems