Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri chimeh, NVIDIA
By AI Engineer
Key Concepts
- DGX Spark: A standalone AI development system powered by the GB10 Grace Blackwell Superchip.
- Unified Memory Architecture: A design in which the CPU and GPU share one memory pool, allowing large models (up to 200B parameters) to be handled on a local device.
- vLLM: A high-throughput, memory-efficient library for LLM inference and serving.
- NVFP4 (NVIDIA 4-bit Floating Point): A quantization format that reduces model size and memory footprint while maintaining high performance.
- Time to First Token (TTFT): The latency metric measuring how quickly a model begins generating a response, critical for perceived user responsiveness.
- Throughput: Measured in tokens per second (TPS), representing the speed of text generation after the initial response.
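As a rough check on the capacity claim above, weight storage scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement. The function names are hypothetical, the 128 GB default is the unified memory size quoted later in this summary, and the estimate ignores the KV cache, activations, quantization scale factors, and memory used by the operating system.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes). Weights only."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8.0


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float = 128.0) -> bool:
    """True if the weights alone fit in memory_gb.

    Assumption: only weights are counted, so real headroom is smaller.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) <= memory_gb


if __name__ == "__main__":
    for bits in (16, 4):
        gb = weight_footprint_gb(200, bits)
        print(f"200B parameters at {bits}-bit: ~{gb:.0f} GB, "
              f"fits in 128 GB: {fits_in_memory(200, bits)}")
```

Under these assumptions, 200B parameters need about 400 GB at 16-bit precision but about 100 GB at 4-bit, which is why the 200B figure depends on quantization.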
1. The Challenge of Modern AI Development
Mozhgan Kabiri Chimeh highlights that developers often face a "bottleneck" when moving from experimentation to production. Common issues include:
- Resource Constraints: Running out of memory or lacking the correct software stack.
- Infrastructure Dependency: Reliance on cloud or data center resources raises issues of cost predictability, data residency, and latency.
- Workflow Delays: Shared infrastructure causes scheduling conflicts, slowing down iteration speed.
The DGX Spark is positioned as a way to bring production-grade AI development to the developer's desk, so that local workflows use the same software stack as cloud and data center deployments.
2. Benchmarking Methodology
To ensure reproducibility and data-backed insights, the following protocol was established:
- Environment: Isolated via Docker containers to mirror production environments.
- Protocol: Three mandatory warm-up runs before measurement, with GPU metrics logged in the background at 1-second intervals.
- Automation: An orchestrator script generates unique, timestamped directories for every run, capturing full model responses and metadata.
- Measurement Logic: The script explicitly handles streaming responses from the vLLM server to capture the exact timestamp of the first token, rather than waiting for the full API response.
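The measurement logic described above can be sketched as follows. This is a minimal illustration, not the script used in the talk: `measure_stream` is a hypothetical helper that takes any iterable of streamed text pieces, and it counts streamed chunks as a proxy for tokens, which is an assumption because a server may pack more than one token into a chunk. The clock is injectable so the logic can be checked without a running server.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Dict[str, Optional[float]]:
    """Time a streamed response.

    ttft_s:     seconds from request start to the first non-empty chunk.
    decode_tps: chunks per second after the first chunk (proxy for tokens/sec).
    """
    start = clock()                 # taken just before consuming the stream
    first: Optional[float] = None   # timestamp of the first non-empty chunk
    last: Optional[float] = None    # timestamp of the most recent chunk
    count = 0

    for piece in chunks:
        if not piece:               # skip empty keep-alive or role-only chunks
            continue
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1

    ttft = None if first is None else first - start
    decode_tps = None
    if count > 1 and last is not None and first is not None and last > first:
        # The first chunk belongs to TTFT, so it is excluded from the decode rate.
        decode_tps = (count - 1) / (last - first)

    return {"ttft_s": ttft, "chunks": count, "decode_tps": decode_tps}


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a streaming server response.
        for piece in ["", "Local", " inference", " works"]:
            time.sleep(0.01)
            yield piece

    print(measure_stream(fake_stream()))
```

With a vLLM server exposing its OpenAI-compatible API, the iterable would be the text deltas of a request made with streaming enabled. Reading the timestamp inside the loop, rather than after the full response arrives, is what separates time to first token from total latency.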
3. Performance Data and Analysis
The experiments compared models ranging from 1.5B to 14B parameters, focusing on the impact of quantization.
| Model Size | Format | Throughput (tokens/sec) |
| :--- | :--- | :--- |
| 1.5B | Instruct | 61.73 |
| 14B | NVFP4 | 20.19 |
| 14B | Base | 8.40 |
Key Findings:
- The Engineering Sweet Spot: The 14B NVFP4 model achieves ~20 TPS, which is faster than human reading speed, showing that high-intelligence models can be run efficiently locally.
- Quantization Impact: The 14B NVFP4 model reaches the first token 3.4 times faster than the unoptimized 14B base model.
- Memory vs. Bandwidth: While the 128 GB of unified memory allows for massive models, throughput is governed by data movement efficiency. NVFP4 acts as the "hero" by increasing "intelligence per byte."
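The bandwidth point can be made concrete with a simple ceiling: if generating each token requires reading all weights from memory once, decode speed cannot exceed memory bandwidth divided by weight size. The sketch below applies this to the 14B rows of the table. It rests on assumptions not stated in this summary: a memory bandwidth of 273 GB/s (NVIDIA's published figure for DGX Spark), 2 bytes per weight for the base model, 0.5 bytes per weight for the 4-bit model, and no KV-cache or scale-factor traffic. The function name is hypothetical.

```python
ASSUMED_BANDWIDTH_GB_S = 273.0  # assumption: not given in the summary


def decode_tps_ceiling(params_billion: float, bytes_per_weight: float,
                       bandwidth_gb_s: float = ASSUMED_BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/sec if every token reads all weights once."""
    weight_gb = params_billion * bytes_per_weight  # 1e9 params * bytes, in GB
    return bandwidth_gb_s / weight_gb


# Measured throughput from the table above (tokens/sec).
MEASURED_TPS = {"14B base": 8.40, "14B 4-bit": 20.19}
# Assumed storage per weight; the summary does not state the base precision.
ASSUMED_BYTES_PER_WEIGHT = {"14B base": 2.0, "14B 4-bit": 0.5}

if __name__ == "__main__":
    for name, measured in MEASURED_TPS.items():
        ceiling = decode_tps_ceiling(14, ASSUMED_BYTES_PER_WEIGHT[name])
        print(f"{name}: ceiling ~{ceiling:.2f} TPS, measured {measured:.2f} TPS, "
              f"{measured / ceiling:.0%} of ceiling")

    speedup = MEASURED_TPS["14B 4-bit"] / MEASURED_TPS["14B base"]
    print(f"Measured throughput speedup: {speedup:.2f}x")
```

Under these assumptions the ceilings are about 9.75 and 39 tokens per second, against measured values of 8.40 and 20.19. The measured throughput speedup from the table is 20.19 / 8.40, roughly 2.4x, which is a separate figure from the 3.4x time-to-first-token improvement.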
4. Notable Quotes
- "The key idea here is not replacing the cloud, but bringing powerful AI development closer to the developer."
- "On Blackwell hardware, the choice of quantization format is just as important as the hardware itself."
- "Time to first token is the metric that defines the user's perceived performance."
5. Practical Applications and Workflow
The DGX Spark is recommended for:
- Steady-state workloads: Consistent, predictable local inference.
- Privacy-sensitive data: Keeping sensitive datasets on-premises.
- Rapid Prototyping: Iterating locally with the same software stack used in the cloud, then scaling to the data center when ready.
Conclusion
The DGX Spark bridges the gap between research and production by providing a high-performance local environment. With the GB10 Grace Blackwell architecture and NVFP4 quantization, developers can achieve production-level responsiveness and throughput. The primary takeaway is that local development is not just about hardware capacity, but about optimizing the software stack (vLLM) and precision formats (NVFP4) to make the most of the available memory bandwidth. The playbooks and software stack used in these benchmarks are available at build.nvidia.com/spark.