maple.backend.policy.smolvla

SmolVLA policy backend.

This module implements the policy backend for SmolVLA (Small Vision-Language-Action), a compact vision-language-action model for robotic manipulation. SmolVLA takes visual observations, proprioceptive state, and natural language instructions as input and outputs robot actions.

SmolVLA is transformer-based and supports multiple observation modalities, including images and robot state. Unlike OpenVLA, SmolVLA does not require explicit action unnormalization, because it outputs actions directly in the target action space. The model runs in a Docker container and exposes its inference API over HTTP.
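As a rough illustration of the HTTP interaction described above, the following sketch packs an image, a state vector, and an instruction into a JSON POST request using only the Python standard library. The URL, the route, the JSON field names, and the helper names `build_request` and `query_policy` are all assumptions made for this example; the actual request schema is defined by the container's inference API and is not specified here.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: the real host, port, and route depend on how the
# container is started and are not specified in this documentation.
DEFAULT_URL = "http://localhost:8000/act"


def build_request(image_bytes, state, instruction, url=DEFAULT_URL):
    """Pack one observation into a JSON POST request (assumed schema)."""
    payload = {
        # Raw image bytes are base64-encoded so they survive JSON transport.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        # Proprioceptive state as a plain list of floats.
        "state": [float(x) for x in state],
        # Natural language task description.
        "instruction": instruction,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_policy(request, timeout=10.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because SmolVLA outputs actions directly in the target space, a client along these lines would presumably pass the returned action to the robot or simulator without an unnormalization step.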

Available versions:

- libero: SmolVLA fine-tuned for LIBERO benchmark tasks
- base: Base SmolVLA model trained on diverse robot datasets

Classes

SmolVLAPolicy()

Backend for SmolVLA vision-language-action models.