AirLLM

Stars: 19.7K ★ | Forks: 2.2K | Language: Jupyter Notebook | License: Apache-2.0

AirLLM: 70B inference on a single 4GB GPU.

Overview

AirLLM runs 70B-parameter LLMs on a single 4GB GPU (or with as little as 2GB of VRAM) without quantization, distillation, or pruning. It uses a layer-by-layer inference approach: instead of keeping the entire model in VRAM, it loads layers one at a time from CPU to GPU, runs each one, and frees it before loading the next.
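
A rough back-of-the-envelope calculation shows why a single layer can fit where the whole model cannot. The figures below are assumptions for illustration, not numbers taken from the AirLLM project: 2 bytes per parameter (16-bit weights), 80 transformer layers (as in Llama 2 70B), and parameters spread evenly across layers. Activations and embeddings are ignored.

```python
def weight_gb(num_params, bytes_per_param=2):
    """Size of the weights in decimal gigabytes (assumes 16-bit weights by default)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 70e9  # 70B parameters
NUM_LAYERS = 80      # assumption: layer count of a Llama-2-70B-style model

whole_model_gb = weight_gb(TOTAL_PARAMS)
# Simplification: treat all parameters as evenly divided among the layers.
per_layer_gb = whole_model_gb / NUM_LAYERS

print(f"whole model: {whole_model_gb:.0f} GB")
print(f"one layer:   {per_layer_gb:.2f} GB")
```

Under these assumptions the full set of weights is about 140 GB, while one layer is about 1.75 GB, which leaves room for activations on a 4GB card.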

How It Works

  • Layer-by-layer inference — only one layer loaded on GPU at a time
  • No quantization needed — full precision inference on consumer GPUs
  • 4GB VRAM minimum — works on laptops, older GPUs, low-end cards
  • Supports fine-tuning — QLoRA-based fine-tuning also works on low VRAM
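
The layer-by-layer idea above can be sketched in a few lines of dependency-free Python. This is a toy illustration, not AirLLM's actual code: the helper names (`save_layers`, `apply_layer`, `streamed_forward`) are hypothetical, each "layer" is a small weight matrix followed by ReLU, and files on disk stand in for wherever the weights live when they are not on the GPU.

```python
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weights to its own file (one shard per layer)."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def streamed_forward(layer_paths, x):
    """Run the forward pass while holding only one layer's weights at a time."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # stand-in for "load this layer onto the GPU"
        x = apply_layer(weights, x)   # only the activations are carried forward
        del weights                   # drop the layer before loading the next one
    return x


# Two tiny 2x2 layers, purely for demonstration.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, -1.0]],
]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = save_layers(toy_layers, tmp)
    streamed = streamed_forward(shard_paths, [1.0, 3.0])

# Reference result with every layer kept in memory at once.
reference = [1.0, 3.0]
for layer_weights in toy_layers:
    reference = apply_layer(layer_weights, reference)

print(streamed == reference)
```

The streamed pass gives the same result as the all-in-memory pass; the trade-off is that every layer must be reloaded on each forward pass, so peak memory drops at the cost of extra transfer time.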

Key Features

  • 70B inference on 4GB GPU (e.g., RTX 3050, laptop GPUs)
  • No model modification — works with standard HuggingFace models
  • Includes fine-tuning support with QLoRA
  • Chinese LLM support (listed among the repository topics)