whest run

Run local evaluation for an estimator.

whest run [options]
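When runs are scripted, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only flags documented on this page; the helper name `build_whest_run_command` and its parameters are hypothetical and not part of whestbench, and it assumes each of these flags takes its value as the next argument.

```python
import shlex


def build_whest_run_command(estimator_path, *, n_mlps=None, output_format=None,
                            seed=None, debug=False):
    """Assemble an argv list for `whest run` (hypothetical helper, not part of whestbench)."""
    argv = ["whest", "run", "--estimator", estimator_path]
    if n_mlps is not None:
        argv += ["--n-mlps", str(n_mlps)]
    if output_format is not None:
        # Documented values: rich, plain, or json.
        argv += ["--format", output_format]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if debug:
        argv.append("--debug")
    return argv


cmd = build_whest_run_command("estimator.py", n_mlps=3, output_format="json", seed=0)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, assuming the `whest` executable is on `PATH`.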
| Option | Default | Description |
| --- | --- | --- |
| `--estimator` | | Path to estimator.py (see https://github.com/AIcrowd/whest-starterkit for starter files). |
| `--class` | | Estimator class name to load from the estimator file (auto-detected if omitted). |
| `--runner` | `'local'` | Execution backend: 'local'/'inprocess' run in-process; 'subprocess'/'server' run in an isolated subprocess (default: local). |
| `--n-mlps` | | Number of MLPs to evaluate. Default: 10 when `--dataset` is not provided; otherwise the full dataset size. Clamped to the dataset size when `--dataset` is set and `--n-mlps` exceeds it. |
| `--detail` | `'raw'` | Report verbosity: 'raw' for a concise summary or 'full' for expanded per-MLP detail (default: raw). |
| `--profile` | | Collect and display per-MLP FLOP/budget profiling breakdowns in the report. |
| `--show-diagnostic-plots` | | Include diagnostic plot panes in the rendered (non-JSON) report. |
| `--format` | | Select output format: rich, plain, or json. |
| `--json` | | Alias for `--format json`. |
| `--dataset` | | Path to a baked dataset directory, or `hf://owner/repo[@revision]` for HF Hub. |
| `--streaming` | | Stream the dataset from HF instead of downloading it. Iteration-only (no random access). Data is NOT cached; subsequent runs will re-fetch. Useful for small `--n-mlps` debugging runs. See docs/guides/datasets.md#streaming-mode. |
| `--revision` | | HF Hub revision (tag or commit SHA) for `--dataset`. |
| `--split` | | For multi-split datasets, the split to evaluate. Required when the dataset is multi-split; optional when single-split (defaults to the only split). |
| `--flop-budget` | | Effective compute budget per MLP in FLOPs. Caps `C_m = F_m + lambda*R_m` (analytical FLOPs plus charged residual wall time). Always honored; any `flop_budget` stored in `--dataset`'s metadata is ignored. Default: `68_000_000_000` (6.8e10). |
| `--lambda-flops-per-second` | | Residual wall-time penalty rate lambda in `C_m = F_m + lambda*R_m` (FLOP-equivalents per second of residual wall time). Default: 1e11. |
| `--n-samples` | | Ground truth samples per MLP (default: `width * width * 256`). Lower values speed up generation at the cost of noisier scores. |
| `--debug` | | Show full Python tracebacks for errors instead of condensed messages. |
| `--fail-fast` | | Stop on the first estimator error and let the raw Python traceback propagate (combine with `--debug` to show it). |
| `--wall-time-limit` | `60.0` | Wall-clock time limit per predict call (default: 60.0 seconds). |
| `--residual-wall-time-limit` | | Time limit for non-flopscope operations per predict call (default: unlimited). |
| `--seed` | | Random seed for the run. Without `--dataset`, seeds both MLP generation and estimator setup. With `--dataset`, MLP seeds come from the dataset; this flag seeds estimator setup only. Default: omitted (`ctx.seed` defaults to 0; `run_config.seed` is null in the JSON output). |
| `--max-threads` | | Limit BLAS to at most N CPU threads. |
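
The budget rule given for `--flop-budget` and `--lambda-flops-per-second` can be illustrated with a small calculation. This is a sketch, not whestbench code: the function names are hypothetical, and only the formula `C_m = F_m + lambda*R_m` and the two default values are taken from the option descriptions. Treating a run as within budget when `C_m` is less than or equal to the budget is an assumption.

```python
# Illustrative only: these helpers are hypothetical and not part of whestbench.
DEFAULT_FLOP_BUDGET = 68_000_000_000       # documented --flop-budget default (6.8e10)
DEFAULT_LAMBDA_FLOPS_PER_SECOND = 1e11     # documented --lambda-flops-per-second default


def effective_compute(analytical_flops, residual_wall_seconds,
                      lambda_flops_per_second=DEFAULT_LAMBDA_FLOPS_PER_SECOND):
    """Return C_m = F_m + lambda * R_m, in FLOP-equivalents."""
    return analytical_flops + lambda_flops_per_second * residual_wall_seconds


def within_budget(analytical_flops, residual_wall_seconds,
                  flop_budget=DEFAULT_FLOP_BUDGET,
                  lambda_flops_per_second=DEFAULT_LAMBDA_FLOPS_PER_SECOND):
    """True when C_m <= flop_budget (the inclusive comparison is an assumption)."""
    c_m = effective_compute(analytical_flops, residual_wall_seconds,
                            lambda_flops_per_second)
    return c_m <= flop_budget


# 4e10 analytical FLOPs plus 0.25 s of residual wall time: C_m = 6.5e10
print(within_budget(4e10, 0.25))
# 5e10 analytical FLOPs plus 0.5 s of residual wall time: C_m = 1e11
print(within_budget(5e10, 0.5))
```

With the default rate, each second of residual wall time is charged as 1e11 FLOP-equivalents, so roughly 0.68 s of residual time alone would consume the default budget.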
