Training Compute & FLOP Estimator

Estimate the total floating-point operations (FLOPs) required to train a neural network, based on model parameters, dataset size, and training configuration.

Model Architecture

Total trainable parameters (e.g. 7B = 7,000,000,000)

Dataset & Training

Total tokens seen during training (e.g. 2T = 2,000,000,000,000)

Epochs: how many times the dataset is iterated (typically 1 for LLMs)

Hardware Configuration

Peak theoretical throughput in FLOP/s

Model FLOP Utilization (MFU): typical range 30–50% for large-scale training

Formulas Used

Total Training FLOPs (Kaplan et al. / Chinchilla):

C = 6 × N × D × G
  • C — Total compute in FLOPs
  • N — Number of model parameters
  • D — Total tokens processed (tokens × epochs)
  • G — Gradient checkpointing multiplier (1.0 or ~1.33)
  • 6 — Factor accounting for: 2 FLOPs per multiply-add × 3 (one forward pass plus a backward pass costing roughly twice the forward)
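As a sketch, the formula above can be computed directly. The model and token counts here are illustrative assumptions (a 7B-parameter model trained on 2T tokens), not outputs of the calculator:

```python
def training_flops(n_params, n_tokens, epochs=1, grad_checkpoint=False):
    """Total training compute C = 6 * N * D * G (Kaplan/Chinchilla approximation)."""
    d = n_tokens * epochs                  # D: total tokens processed
    g = 1.33 if grad_checkpoint else 1.0   # G: extra forward recompute for checkpointing
    return 6 * n_params * d * g

# Example: 7B parameters, 2T tokens, one epoch, no checkpointing
c = training_flops(7e9, 2e12)  # ≈ 8.4e22 FLOPs
```

With gradient checkpointing enabled, the same run would cost roughly a third more compute, since activations are recomputed during the backward pass.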

Wall-Clock Time:

T = C / (FLOP/s_peak × num_accelerators × MFU)
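A minimal sketch of the wall-clock formula; the hardware figures (1024 accelerators at ~3.12e14 BF16 FLOP/s peak each, 40% MFU) are assumptions for illustration:

```python
def wall_clock_seconds(total_flops, peak_flops_per_s, num_accelerators, mfu):
    """T = C / (peak FLOP/s × accelerators × MFU)."""
    return total_flops / (peak_flops_per_s * num_accelerators * mfu)

# Example: 8.4e22 FLOPs on 1024 accelerators at 40% MFU
seconds = wall_clock_seconds(8.4e22, 3.12e14, 1024, 0.40)
days = seconds / 86400  # ≈ 7.6 days
```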

Chinchilla-Optimal Tokens:

D_optimal ≈ 20 × N
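The rule of thumb above is a one-liner in code; the 7B example is an assumed input:

```python
def chinchilla_optimal_tokens(n_params):
    """D_optimal ≈ 20 × N (Hoffmann et al. 2022 rule of thumb)."""
    return 20 * n_params

# Example: a 7B-parameter model is compute-optimal at roughly 140B tokens
d_opt = chinchilla_optimal_tokens(7e9)  # 1.4e11
```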

Memory (minimum lower bound):

Mem = weights (N × bytes) + optimizer (2 × N × 4B for Adam) + gradients (N × bytes)
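A sketch of the memory lower bound, assuming bf16 weights and gradients (2 bytes each) and fp32 Adam moment states; real training needs more (activations, buffers, fragmentation):

```python
def min_training_memory_bytes(n_params, param_bytes=2):
    """Lower bound: weights + Adam optimizer states (2 × fp32 per param) + gradients."""
    weights = n_params * param_bytes
    optimizer = 2 * n_params * 4        # Adam first and second moments in fp32
    gradients = n_params * param_bytes
    return weights + optimizer + gradients

# Example: 7B params in bf16 → ≈ 84 GB minimum, excluding activations
gb = min_training_memory_bytes(7e9) / 1e9
```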

Assumptions & References

  • The factor of 6 per parameter per token is derived from Kaplan et al. (2020), "Scaling Laws for Neural Language Models" and validated in Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla).
  • Model FLOP Utilization (MFU) of 30–50% is typical for large-scale distributed training; see Chowdhery et al. (2022), where PaLM reports ~46% MFU on TPUs.
  • Reference: Epoch AI "Compute Trends" (2023) and OpenAI "AI and Compute" (2018) for historical context on training compute scaling.
