Training Compute & FLOP Estimator
Estimate the total floating-point operations (FLOPs) required to train a neural network, based on model parameters, dataset size, and training configuration.
Model Architecture
- Parameters (N): total trainable parameters (e.g. 7B = 7,000,000,000)

Dataset & Training
- Tokens (D): total tokens seen during training (e.g. 2T = 2,000,000,000,000)
- Epochs: how many times the dataset is iterated (typically 1 for LLMs)

Hardware Configuration
- Peak throughput: peak theoretical throughput in FLOP/s per accelerator
- MFU: achieved fraction of peak; typical range 30–50% for large-scale training
Formulas Used
Total Training FLOPs (Kaplan et al. / Chinchilla):
C = 6 × N × D × G
- C — Total compute in FLOPs
- N — Number of model parameters
- D — Total tokens processed (tokens × epochs)
- G — Gradient checkpointing multiplier (1.0 or ~1.33)
- 6 — Factor accounting for: 2 FLOPs per multiply-accumulate × 3 (one forward pass, plus a backward pass costing roughly 2× the forward)
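The total-compute formula can be sketched in a few lines of Python (the function name and the 4/3 checkpointing multiplier are illustrative choices, not part of any library):

```python
def training_flops(n_params: float, tokens: float, epochs: int = 1,
                   grad_checkpoint: bool = False) -> float:
    """Total training compute: C = 6 * N * D * G."""
    # Recomputing activations during backprop adds roughly one extra
    # forward pass, hence the ~1.33x multiplier when checkpointing.
    g = 4 / 3 if grad_checkpoint else 1.0
    return 6 * n_params * tokens * epochs * g

# A 7B-parameter model trained on 2T tokens:
c = training_flops(7e9, 2e12)
print(f"{c:.2e} FLOPs")  # 8.40e+22 FLOPs
```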
Wall-Clock Time:
T = C / (FLOP/s_peak × num_accelerators × MFU)
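A minimal sketch of the wall-clock calculation. The hardware figures in the example are assumptions for illustration only (~1e15 FLOP/s peak per accelerator, 1024 accelerators, 40% MFU), applied to the ~8.4e22 FLOPs that C = 6 × N × D gives for a 7B model on 2T tokens:

```python
def wall_clock_days(total_flops: float, peak_flops_per_device: float,
                    n_devices: int, mfu: float) -> float:
    """T = C / (peak FLOP/s * num_accelerators * MFU), converted to days."""
    seconds = total_flops / (peak_flops_per_device * n_devices * mfu)
    return seconds / 86400  # seconds per day

days = wall_clock_days(8.4e22, 1e15, 1024, 0.40)
print(f"{days:.1f} days")  # 2.4 days
```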
Chinchilla-Optimal Tokens:
D_optimal ≈ 20 × N
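The compute-optimal token count is a one-liner; for the 7B example above it implies roughly 140B training tokens:

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token budget: D ≈ 20 * N (Hoffmann et al., 2022)."""
    return 20 * n_params

print(chinchilla_optimal_tokens(7e9) / 1e9)  # 140.0, i.e. ~140B tokens
```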
Memory (lower bound):
Mem = weights (N × bytes/param) + optimizer states (2 × N × 4 B for Adam's fp32 moments) + gradients (N × bytes/param)
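This lower bound can be sketched as follows, assuming a common mixed-precision setup (bf16 weights and gradients, fp32 Adam moments); note it deliberately ignores activation memory, which often dominates in practice:

```python
def min_training_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Lower bound on training memory: weights + optimizer states + gradients.

    bytes_per_param = 2 assumes bf16/fp16 weights and gradients;
    Adam keeps two fp32 moments (4 bytes each) per parameter.
    """
    weights = n_params * bytes_per_param
    optimizer = n_params * 2 * 4   # Adam: first + second moment in fp32
    gradients = n_params * bytes_per_param
    return (weights + optimizer + gradients) / 1e9

print(min_training_memory_gb(7e9))  # 84.0 GB for a 7B model in bf16
```

At 12 bytes per parameter under these assumptions, a 7B model needs at least ~84 GB before counting activations, which is why such models are typically trained sharded across multiple accelerators.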
Assumptions & References
- The factor of 6 per parameter per token is derived from Kaplan et al. (2020), "Scaling Laws for Neural Language Models" and validated in Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla).
- Model FLOP Utilization (MFU) of 30–50% is typical for large-scale distributed training; see Chowdhery et al. (2022), where PaLM reports ~46% MFU on TPU v4.
- Reference: Epoch AI "Compute Trends" (2023) and OpenAI "AI and Compute" (2018) for historical context on training compute scaling.