ModernTSF: an engineering-grade benchmark for time-series forecasting

Progress in time-series forecasting is hard to read. Papers report numbers on overlapping-but-different datasets, with different windowing, preprocessing, and training setups — so a new result is rarely comparable to the one it claims to beat. The science is fine; the benchmarking infrastructure is the bottleneck.

ModernTSF is our answer: a structured, engineering-grade forecasting benchmark where every experiment is a versionable config, and results are comparable by construction. It is open source (MIT), built on Python 3.12 and PyTorch 2.6.

The experiment is the config

ModernTSF is TOML-first. A dataset, a model, and a sweep are all declared in composable configuration files rather than buried in scripts. Running an experiment is then a single command against a config:

# install
uv sync --python 3.12
 
# a single dataset/model run
uv run modern-tsf --config configs/runs/run_single_data.toml
 
# sweeps: across models, across datasets, or multi-axis
uv run modern-tsf --config configs/runs/sweep_model.toml
uv run modern-tsf --config configs/runs/multi_sweep.toml

Sweeps compose predictably: sweep.extend is expanded first, then the remaining sweep keys, so the total number of runs is simply the product of the axis sizes. Because the experiment is the config, it is diffable, reviewable, and re-runnable by someone else. These are the properties a benchmark needs to be trustworthy.
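The expansion rule can be sketched in a few lines of Python. This is an illustration, not ModernTSF's implementation. The function name expand_sweep, the keys model, pred_len, and seed, and the reading of sweep.extend as a list of override tables are all assumptions made for the example.

```python
from itertools import product


def expand_sweep(sweep: dict) -> list[dict]:
    """Expand a sweep table into a flat list of run overrides.

    Assumption (hypothetical semantics): `extend` holds a list of override
    dicts, and every other key maps to a list of values forming one axis.
    """
    # Step 1: expand `extend` first; without it, start from one empty run.
    bases = sweep.get("extend", [{}])
    # Step 2: treat the remaining keys as independent axes.
    axes = {key: values for key, values in sweep.items() if key != "extend"}
    names = list(axes)
    runs = []
    for base in bases:
        # Cartesian product over the remaining axes for each base entry.
        for combo in product(*(axes[name] for name in names)):
            runs.append({**base, **dict(zip(names, combo))})
    return runs


# Hypothetical sweep: 2 extend entries x 3 prediction lengths x 2 seeds.
sweep = {
    "extend": [{"model": "DLinear"}, {"model": "PatchTST"}],
    "pred_len": [96, 192, 336],
    "seed": [0, 1],
}
runs = expand_sweep(sweep)
print(len(runs))  # 2 * 3 * 2 = 12
```

Under these assumptions the run count is the number of extend entries times the size of each remaining axis, which makes the cost of a sweep easy to estimate before launching it.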

Breadth, out of the box

31 models — from linear baselines (Linear, DLinear, NLinear, RLinear) through Transformer-based forecasters (PatchTST, iTransformer, Autoformer, FEDformer), MLP/patch and multi-scale mixers (TSMixer, PatchMLP, TimeMixer), a 2D time–frequency CNN (TimesNet), and a long tail of modern methods (FITS, SparseTSF, CycleNet, TiDE, and more).
60+ datasets — 9 classic benchmarks (the ETT family, electricity, weather, traffic, solar) plus full, native support for GIFT-EVAL: 53 dataset configurations spanning 23 base datasets, 10 sampling frequencies (secondly to monthly), and 7 domains including energy, traffic, weather, and finance.

Adding a new model, dataset, or metric is deliberately low-friction — each plugs in with minimal wiring, defined by a schema and a config.

From runs to understanding

A benchmark is only useful if you can read it. ModernTSF ships analysis tooling that turns raw runs into comparable results:

aggregate performance and profiling metrics across a dataset,
rank models per prediction length and seed,
and plot bubble charts (e.g. error vs. parameter count) to see accuracy/efficiency trade-offs at a glance.
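
To make the ranking step concrete, here is a minimal sketch of ranking models within each (prediction length, seed) group and then averaging the ranks. It does not use ModernTSF's analysis API. The record layout, the function names rank_per_group and mean_rank, and the error values are hypothetical and exist only to exercise the code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (model, pred_len, seed, mse). The numbers are
# synthetic placeholders, not benchmark results.
records = [
    ("DLinear", 96, 0, 0.40), ("DLinear", 96, 1, 0.42),
    ("PatchTST", 96, 0, 0.38), ("PatchTST", 96, 1, 0.37),
    ("DLinear", 192, 0, 0.45), ("DLinear", 192, 1, 0.44),
    ("PatchTST", 192, 0, 0.46), ("PatchTST", 192, 1, 0.43),
]


def rank_per_group(records):
    """Rank models (1 = lowest error) within each (pred_len, seed) group.

    Ties are not handled specially: equal errors fall back to model name order.
    """
    groups = defaultdict(list)
    for model, pred_len, seed, mse in records:
        groups[(pred_len, seed)].append((mse, model))
    ranks = defaultdict(list)
    for entries in groups.values():
        for rank, (_, model) in enumerate(sorted(entries), start=1):
            ranks[model].append(rank)
    return ranks


def mean_rank(records):
    """Average each model's rank over all (pred_len, seed) groups."""
    return {model: mean(r) for model, r in rank_per_group(records).items()}


print(mean_rank(records))
```

Ranking inside each group before averaging keeps a single dataset or horizon with large absolute errors from dominating the comparison.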

Profiling is first-class, so a result carries not just its error but also its cost: the parameter count and the compute behind the number.

Why it lives here

ModernTSF reflects what Diaugeia.AI stands for: modern engineering, reproducible by design, open by default, and friendly to agent-driven workflows — its docs-first structure makes it easy for both people and LLM agents to extend. It is exactly the kind of shared groundwork we think AI research should be able to build on.

Explore the project, run a sweep, or add a model: github.com/Diaugeia/ModernTSF.