maple.utils.eval.BatchEvaluator.run

BatchEvaluator.run(policy_id: str, env_id: str, tasks: List[str], seeds: List[int] = None, max_steps: int = 300, timeout: int = 200, env_kwargs: Dict[str, Any] | None = {}, model_kwargs: Dict[str, Any] | None = {}, save_video: bool = False, video_dir: str | None = None, parallel: int = 1, progress_callback: Callable[[int, int, EvalResult], None] | None = None) → BatchResults

Run batch evaluation across multiple tasks and seeds.

Executes every (task, seed) combination, either sequentially or in parallel. Results are aggregated with summary statistics, and progress can be reported through an optional callback.

The total number of episodes executed is len(tasks) × len(seeds). Each episode is tracked individually with results stored in the database and aggregated in the returned BatchResults.
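The episode count above can be illustrated with a small standalone sketch. It does not call BatchEvaluator; the helper name enumerate_episodes and the task names are hypothetical, and the order in which the library actually schedules episodes is not stated here.

```python
from itertools import product
from typing import List, Optional, Tuple


def enumerate_episodes(
    tasks: List[str], seeds: Optional[List[int]] = None
) -> List[Tuple[str, int]]:
    """Return every (task, seed) pair in the tasks x seeds grid."""
    # Assumption: a missing seed list falls back to [0], as the
    # parameter description for `seeds` indicates.
    effective_seeds = [0] if seeds is None else seeds
    return list(product(tasks, effective_seeds))


# Hypothetical task names, used only for illustration.
episodes = enumerate_episodes(["pick_cube", "open_drawer"], [0, 1, 2])
print(len(episodes))  # 2 tasks x 3 seeds = 6 episodes
```

With two tasks and three seeds the grid holds six episodes, which matches len(tasks) × len(seeds).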

Parameters:

policy_id – Policy container ID to evaluate.
env_id – Environment container ID to use.
tasks – List of task specifications to evaluate.
seeds – List of random seeds. When None, [0] is used.
max_steps – Maximum steps per episode.
timeout – Timeout multiplier for HTTP requests.
env_kwargs – Env-specific parameters.
model_kwargs – Model-specific parameters.
save_video – Whether to record videos.
video_dir – Directory for video files.
parallel – Number of parallel workers (1 = sequential).
progress_callback – Optional callback, called as callback(completed, total, result), where result is an EvalResult.
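A minimal sketch of a callable that fits the documented callback(completed, total, result) shape. The ProgressTracker class is hypothetical, the result object is treated as opaque because EvalResult's fields are not described in this section, and the loop only simulates the calls that run would make.

```python
from typing import Any, List


class ProgressTracker:
    """Collects progress updates from callback(completed, total, result)."""

    def __init__(self) -> None:
        self.results: List[Any] = []
        self.last_fraction = 0.0

    def __call__(self, completed: int, total: int, result: Any) -> None:
        # `result` is stored as-is; no EvalResult attributes are assumed.
        self.results.append(result)
        # Guard against a zero total so an empty batch cannot divide by zero.
        self.last_fraction = completed / total if total else 0.0
        print(f"[{completed}/{total}] {self.last_fraction:.0%} done")


tracker = ProgressTracker()

# Simulated invocations standing in for one call per finished episode.
for done, fake_result in enumerate(["r0", "r1", "r2", "r3"], start=1):
    tracker(done, 4, fake_result)
```

An instance such as tracker could then be passed as progress_callback. Whether the callback is invoked from worker threads when parallel is greater than 1 is not stated here, so guarding shared state with a lock would be a cautious choice in that case.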

Returns:

BatchResults with all episode results and statistics.