maple.cmd.maple_cli.eval_cmd

maple.cmd.maple_cli.eval_cmd(policy_id: str = <typer.models.ArgumentInfo object>, env_id: str = <typer.models.ArgumentInfo object>, backend: str = <typer.models.ArgumentInfo object>, tasks: str = <typer.models.OptionInfo object>, seeds: str = <typer.models.OptionInfo object>, max_steps: int = <typer.models.OptionInfo object>, timeout: int | None = <typer.models.OptionInfo object>, env_kwargs: str = <typer.models.OptionInfo object>, model_kwargs: str = <typer.models.OptionInfo object>, save_video: bool = <typer.models.OptionInfo object>, video_dir: str | None = <typer.models.OptionInfo object>, output: Path | None = <typer.models.OptionInfo object>, format: str = <typer.models.OptionInfo object>, parallel: int = <typer.models.OptionInfo object>, port: int = <typer.models.OptionInfo object>) → None

Run batch evaluation across multiple tasks and seeds.

Orchestrates large-scale evaluations by running a policy on multiple tasks with multiple random seeds. Supports automatic task suite expansion, parallel execution, video recording, and multiple output formats.

Results are saved in JSON format by default, with optional Markdown and CSV exports for analysis and reporting.
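
As a rough illustration of the three export formats mentioned above, the following sketch serializes a small list of per-episode records to JSON, CSV, and a Markdown table using only the standard library. The record fields (`task`, `seed`, `success`, `steps`) and the helper names `to_json`, `to_csv`, and `to_markdown` are assumptions made for this example; the actual result schema and writer functions used by `eval_cmd` are not documented here.

```python
import csv
import io
import json

# Hypothetical per-episode records; the real result schema may differ.
results = [
    {"task": "task_a", "seed": 0, "success": True, "steps": 120},
    {"task": "task_a", "seed": 1, "success": False, "steps": 300},
]


def to_json(rows):
    """Serialize the records as an indented JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(rows):
    """Serialize the records as a pipe-delimited Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


print(to_json(results))
print(to_csv(results))
print(to_markdown(results))
```

Keeping one serializer per format makes an `all` setting straightforward: call each function in turn and write each result to its own file.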

Parameters:

policy_id – Identifier of the policy container to evaluate.
env_id – Identifier of the environment container to use.
backend – Name of the environment backend.
tasks – Comma-separated task list or suite name.
seeds – Comma-separated list of random seeds.
max_steps – Maximum steps per episode.
timeout – Timeout multiplier for each episode request.
env_kwargs – Environment-specific parameters.
model_kwargs – Model-specific parameters.
save_video – Whether to save videos of all episodes.
video_dir – Directory for saving videos.
output – Output directory for results files.
format – Output format (json, markdown, csv, or all).
parallel – Number of parallel evaluations to run.
port – Daemon port number.
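
To make the relationship between `tasks` and `seeds` concrete, the sketch below parses two comma-separated strings and pairs every task with every seed, which is one plausible reading of "multiple tasks with multiple random seeds". The helpers `parse_csv_list` and `expand_runs` and the task names are hypothetical and not part of maple; suite-name expansion is not modeled.

```python
from itertools import product


def parse_csv_list(raw: str) -> list[str]:
    """Split a comma-separated string, trimming spaces and dropping empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def expand_runs(tasks: str, seeds: str) -> list[tuple[str, int]]:
    """Pair every task with every seed (hypothetical helper, not maple's API)."""
    task_list = parse_csv_list(tasks)
    seed_list = [int(s) for s in parse_csv_list(seeds)]
    return list(product(task_list, seed_list))


# Two placeholder tasks and three seeds give six (task, seed) runs.
runs = expand_runs("task_a, task_b", "0,1,2")
for task, seed in runs:
    print(task, seed)
```

Under this assumption the total number of episodes is the number of tasks multiplied by the number of seeds, which is worth keeping in mind when choosing a `parallel` value.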