AirLLM

Stars: 19.7K ★ | Forks: 2.2K | Language: Jupyter Notebook | License: Apache-2.0

AirLLM: 70B inference with a single 4GB GPU.

Overview

AirLLM runs 70B-parameter LLMs on a single 4GB GPU (or even with 2GB of VRAM) without quantization, distillation, or pruning. Instead of keeping the entire model in VRAM, it uses a single-batch, layer-by-layer inference approach: each layer is loaded from CPU to GPU, applied, and released before the next one is loaded.
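A rough back-of-the-envelope calculation shows why holding one layer at a time can fit in a few gigabytes. The figures below are illustrative assumptions, not numbers from the AirLLM project: 70 billion parameters spread evenly over 80 transformer layers and stored at 2 bytes per parameter (16-bit weights). Embeddings, activations, and any KV cache are ignored.

```python
# Back-of-the-envelope memory estimate (all constants are assumptions).
NUM_PARAMS = 70_000_000_000   # assumed total parameter count
NUM_LAYERS = 80               # assumed number of equally sized layers
BYTES_PER_PARAM = 2           # assumed 16-bit weights
GB = 1e9                      # decimal gigabytes


def model_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes needed to hold every weight at once."""
    return num_params * bytes_per_param


def per_layer_bytes(num_params: int, num_layers: int, bytes_per_param: int) -> float:
    """Bytes needed to hold a single layer, assuming equal-sized layers."""
    return model_bytes(num_params, bytes_per_param) / num_layers


full_gb = model_bytes(NUM_PARAMS, BYTES_PER_PARAM) / GB
layer_gb = per_layer_bytes(NUM_PARAMS, NUM_LAYERS, BYTES_PER_PARAM) / GB

print(f"whole model: {full_gb:.2f} GB")
print(f"one layer:   {layer_gb:.2f} GB")
```

Under these assumptions the full set of weights is far larger than any consumer GPU, while a single layer leaves headroom on a 4GB card for activations.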

How It Works

Layer-by-layer inference — only one layer loaded on GPU at a time
No quantization needed — full precision inference on consumer GPUs
4GB VRAM minimum — works on laptops, older GPUs, low-end cards
Supports fine-tuning — QLoRA-based fine-tuning also works on low VRAM
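The layer-by-layer idea can be sketched with a toy model in plain Python. This is a minimal illustration under stated assumptions, not AirLLM's actual code: the "layers" are small weight matrices, each one is saved to its own file as a stand-in for wherever the weights live outside GPU memory, and the helper names (`save_layers`, `apply_layer`, `streamed_forward`) are hypothetical.

```python
import os
import pickle
import tempfile


def save_layers(layers, directory):
    """Write each layer's weights to its own file; return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, activations):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activations)))
        for row in weights
    ]


def streamed_forward(layer_paths, activations):
    """Run the model while holding only one layer's weights at a time."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)      # load exactly one layer
        activations = apply_layer(weights, activations)
        del weights                       # release it before the next load
    return activations


def in_memory_forward(layers, activations):
    """Reference: the same computation with every layer resident at once."""
    for weights in layers:
        activations = apply_layer(weights, activations)
    return activations


toy_layers = [
    [[1, 0], [0, 1]],    # identity layer
    [[2, 0], [0, -1]],   # scales the first value, negates the second
]
inputs = [1.0, 2.0]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(toy_layers, tmp)
    streamed = streamed_forward(paths, inputs)

in_memory = in_memory_forward(toy_layers, inputs)
print(streamed, in_memory)
```

Both paths compute the same result; the streamed version trades extra load time per layer for a peak weight footprint of a single layer.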

Key Features

70B inference on 4GB GPU (e.g., RTX 3050, laptop GPUs)
No model modification — works with standard HuggingFace models
Includes fine-tuning support with QLoRA
Chinese LLMs are listed among the repository's topics

Related

LLMs from Scratch
GPU Programming Guide

Description: Run 70B LLM inference on a single 4GB GPU with no quantization, no distillation, and no pruning. Single-batch inference through layer-by-layer loading.
Tags: llm-inference, gpu, memory-optimization, 70b, open-source

Huy's Wiki
