Mastering LLM Techniques: Customization
Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short...
Mastering LLM Techniques: Training
Large language models (LLMs) are a class of generative AI models built using transformer networks that can recognize, summarize, translate, predict, and generate language using very large datasets....
NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma
NVIDIA is collaborating with Google as a launch partner to deliver Gemma, a newly optimized family of open models built from the same research and technology used to create the Gemini models. An...
Generate Stunning Images with Stable Diffusion XL on the NVIDIA AI Inference...
Diffusion models are transforming creative workflows across industries. These models generate stunning images based on simple text or image inputs by iteratively shaping random noise into AI-generated...
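The iterative denoising idea behind diffusion models can be illustrated with a toy sketch. This is a conceptual illustration only, not the SDXL sampler: the "image" is three numbers, and a hypothetical oracle noise predictor stands in for the trained network that would estimate the noise in practice.

```python
import random

random.seed(0)
target = [0.2, 0.8, 0.5]                       # the "clean image" (three toy pixels)
steps = 50
noise = [random.gauss(0, 1) for _ in target]
x = [t + n for t, n in zip(target, noise)]     # fully noised starting sample

def predict_noise(x_t, t):
    # Hypothetical oracle predictor; a trained network plays this role in practice.
    return noise

for t in range(steps, 0, -1):
    # Remove a 1/steps fraction of the predicted noise each iteration,
    # gradually shaping random noise back into the clean sample.
    eps = predict_noise(x, t)
    x = [xi - ei / steps for xi, ei in zip(x, eps)]

# After all steps, x has converged back to the clean target.
err = max(abs(xi - ti) for xi, ti in zip(x, target))
```

With a perfect predictor the loop removes exactly the injected noise; real samplers differ in how much noise they remove per step and re-inject stochasticity, but the shape of the loop is the same.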
Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA...
We’re excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama...
Supercharging Llama 3.1 across NVIDIA Platforms
Meta’s Llama collection of large language models is among the most popular foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are...
Revolutionizing Code Completion with Codestral Mamba, the Next-Gen Coding LLM
In the rapidly evolving field of generative AI, coding models have become indispensable tools for developers, enhancing productivity and precision in software development. They provide significant...
Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU
NVIDIA collaborated with Mistral to co-build the next-generation language model that achieves leading performance across benchmarks in its class. With a growing number of language models purpose-built...
Jamba 1.5 LLMs Leverage Hybrid Architecture to Deliver Superior Reasoning and...
AI21 Labs has unveiled its latest and most advanced Jamba 1.5 model family, a cutting-edge collection of large language models (LLMs) designed to excel in a wide array of generative AI tasks. These...
Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model...
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion...
Deploying Accelerated Llama 3.2 from the Edge to the Cloud
Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support...
Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs....
TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x
NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for...
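The core idea of speculative decoding can be sketched in a few lines. This is a toy illustration with hypothetical deterministic "models" (simple arithmetic functions), not the TensorRT-LLM API: a cheap draft model proposes several tokens, the target model verifies them in one pass, and the longest agreeing prefix is accepted, so the output is identical to ordinary greedy decoding but fewer target-model steps are paid per emitted token.

```python
def target_next(ctx):
    # Hypothetical "target model": deterministic greedy next-token rule.
    return (sum(ctx) * 3 + 1) % 10

def draft_next(ctx):
    # Hypothetical cheaper "draft model": off by one whenever the
    # last token is even, so proposals are sometimes rejected.
    guess = target_next(ctx)
    return guess if ctx[-1] % 2 else (guess + 1) % 10

def speculative_step(ctx, k=4):
    # 1. Draft proposes k tokens autoregressively.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2. Target verifies the proposals; keep the agreeing prefix and
    #    emit the target's own token at the first mismatch. If every
    #    proposal matches, the target contributes one bonus token.
    out, tmp = [], list(ctx)
    for t in proposal:
        want = target_next(tmp)
        out.append(want)
        tmp.append(want)
        if t != want:
            break
    else:
        out.append(target_next(tmp))
    return out

def generate(ctx, n, speculative=False):
    out = list(ctx)
    while len(out) < len(ctx) + n:
        if speculative:
            out.extend(speculative_step(out))
        else:
            out.append(target_next(out))
    return out[:len(ctx) + n]
```

The key property is that the speculative path produces exactly the same sequence as plain greedy decoding; the speedup comes from verifying several draft tokens with one target-model pass instead of one pass per token.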
View ArticleNVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight...
NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures,...
Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM...
Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance...
Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM
Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the previous tokens are used as historical context...
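The role of the KV cache described above can be sketched with a toy single-head attention, assuming identity projections (query, key, and value all equal the token embedding); this is a conceptual illustration, not the TensorRT-LLM implementation. Each token's key and value are stored once and reused at every later step, so generation avoids recomputing them for the whole history.

```python
import math

def attend(q, keys, values):
    # Softmax-weighted sum over all cached positions (dot-product attention).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]

class KVCache:
    """Stores each token's key/value once; later steps reuse them as context."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append only the new token's key/value, then attend over the
        # full cached history instead of recomputing it from scratch.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Usage: feeding tokens one at a time through the cache gives the same
# result as recomputing attention over the full sequence at the last step.
cache = KVCache()
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
last = None
for t in toks:
    last = cache.step(t, t, t)

full = attend(toks[-1], toks, toks)
diff = max(abs(a - b) for a, b in zip(last, full))
```

Reuse optimizations take this one step further: when two requests share a prefix (a system prompt, for example), the cached keys and values for that prefix can be shared rather than recomputed per request.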
Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding
Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with...
NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance
NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a...