AirLLM
Stars: 19.7K ★ | Forks: 2.2K | Language: Jupyter Notebook | License: Apache-2.0
AirLLM: 70B inference with a single 4GB GPU.
Overview
AirLLM runs inference on 70B-parameter LLMs with a single 4GB GPU (or even 2GB of VRAM) without quantization, distillation, or pruning. Instead of keeping the entire model in VRAM, it moves layers from CPU to GPU one at a time, so only the layer currently being computed occupies GPU memory.
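A rough back-of-the-envelope estimate shows why per-layer loading can fit where the whole model cannot. The sketch below is illustrative arithmetic only, not a measurement of AirLLM: the helper `estimate_gb` is hypothetical, and the 80-layer count and 2 bytes per weight (16-bit) are assumptions chosen to resemble a typical 70B transformer. It also ignores embeddings, activations, and the KV cache.

```python
# Illustrative memory arithmetic; every constant below is an assumption
# except the 70B parameter count, which comes from the text.

def estimate_gb(num_params, bytes_per_param, num_layers):
    """Return (whole-model GB, per-layer GB), using decimal GB (1e9 bytes).

    Assumes weights are spread evenly across layers and ignores
    embeddings, activations, and the KV cache.
    """
    total_bytes = num_params * bytes_per_param
    per_layer_bytes = total_bytes / num_layers
    return total_bytes / 1e9, per_layer_bytes / 1e9

# Assumption: 16-bit weights (2 bytes each) and 80 transformer layers.
total_gb, layer_gb = estimate_gb(num_params=70e9, bytes_per_param=2, num_layers=80)

print(f"whole model: {total_gb:.1f} GB")
print(f"one layer:   {layer_gb:.2f} GB")
```

Under these assumptions the full set of weights is far larger than any consumer GPU's memory, while a single layer fits within a 4GB budget. With 4 bytes per weight the per-layer figure doubles, so the margin depends heavily on the precision actually used.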
How It Works
- Layer-by-layer inference: only one layer is loaded on the GPU at a time
- No quantization needed: full-precision inference on consumer GPUs
- 4GB VRAM minimum: works on laptops, older GPUs, and low-end cards
- Fine-tuning support: QLoRA-based fine-tuning also works with low VRAM
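The layer-by-layer idea can be sketched with a toy model in plain Python. This is a minimal illustration of the technique, not AirLLM's actual code: the function names (`save_layers`, `apply_layer`, `layered_forward`) are hypothetical, each "layer" is a small weight matrix followed by ReLU, and files in a temporary directory stand in for wherever the weights live when they are not on the GPU.

```python
import os
import pickle
import tempfile


def save_layers(layers, directory):
    """Write each layer's weights to its own file and return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def layered_forward(paths, x):
    """Run a forward pass while holding only one layer's weights at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # stand-in for copying one layer onto the GPU
        x = apply_layer(weights, x)   # only the activations carry over to the next layer
        del weights                   # stand-in for freeing that layer's VRAM
    return x


# Three tiny layers; a real model would have dozens of much larger ones.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],   # identity
    [[2.0, 0.0], [0.0, -1.0]],  # double the first entry, negate the second
    [[1.0, 1.0]],               # sum both entries
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    result = layered_forward(paths, [3.0, 4.0])

print(result)
```

The trade-off in this design is time for memory: peak memory is bounded by the largest single layer plus the activations, but every forward pass pays the cost of reloading each layer.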
Key Features
- 70B inference on a 4GB GPU (e.g., RTX 3050, laptop GPUs)
- No model modification: works with standard Hugging Face models
- Includes fine-tuning support with QLoRA
- Chinese LLM support (listed among the repository topics)