maple.utils.eval.BatchEvaluator.run

BatchEvaluator.run(policy_id: str, env_id: str, tasks: List[str], seeds: List[int] = None, max_steps: int = 300, timeout: int = 200, env_kwargs: Dict[str, Any] | None = {}, model_kwargs: Dict[str, Any] | None = {}, save_video: bool = False, video_dir: str | None = None, parallel: int = 1, progress_callback: Callable[[int, int, EvalResult], None] | None = None) → BatchResults

Run batch evaluation across multiple tasks and seeds.

Executes every combination of tasks × seeds, either sequentially or in parallel. Aggregates the results into summary statistics and reports progress through an optional callback.

The total number of episodes executed is len(tasks) × len(seeds). Each episode is tracked individually: its result is stored in the database and also aggregated into the returned BatchResults.
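
As a rough illustration of the episode count, the task × seed grid can be enumerated with itertools.product. The task names below are hypothetical, and this sketch makes no claim about the order in which BatchEvaluator.run actually schedules episodes.

```python
from itertools import product

tasks = ["task_a", "task_b", "task_c"]  # hypothetical task specifications
seeds = [0, 1]

# Every (task, seed) pair corresponds to exactly one episode.
episodes = list(product(tasks, seeds))

# The grid size matches len(tasks) * len(seeds).
total = len(tasks) * len(seeds)
assert len(episodes) == total
```

With three tasks and two seeds, the grid holds six episodes.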

Parameters:
  • policy_id – Policy container ID to evaluate.

  • env_id – Environment container ID to use.

  • tasks – List of task specifications to evaluate.

  • seeds – List of random seeds. If None, defaults to [0].

  • max_steps – Maximum steps per episode.

  • timeout – Timeout multiplier for HTTP requests.

  • env_kwargs – Env-specific parameters.

  • model_kwargs – Model-specific parameters.

  • save_video – Whether to record videos.

  • video_dir – Directory for video files.

  • parallel – Number of parallel workers (1 = sequential).

  • progress_callback – Optional callback(completed, total, result).

Returns:

BatchResults with all episode results and statistics.
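
The callback shape described above, callback(completed, total, result), can be sketched as follows. This is a minimal illustration only: FakeEvalResult is a hypothetical stand-in, since the fields of the real EvalResult are not described here, and the loop at the end simulates the calls rather than invoking BatchEvaluator.run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEvalResult:
    # Hypothetical stand-in for EvalResult; the real fields may differ.
    task: str
    seed: int
    success: bool


progress_log: List[str] = []


def on_progress(completed: int, total: int, result: FakeEvalResult) -> None:
    """Record one line per finished episode."""
    pct = 100.0 * completed / total
    progress_log.append(
        f"[{completed}/{total}] {pct:.0f}% "
        f"{result.task} seed={result.seed} success={result.success}"
    )


# Simulate the evaluator reporting two finished episodes.
simulated = [
    FakeEvalResult("task_a", 0, True),
    FakeEvalResult("task_a", 1, False),
]
for i, res in enumerate(simulated, start=1):
    on_progress(i, len(simulated), res)
```

A function with this signature would then be passed as progress_callback=on_progress.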