Abstract
Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed.
FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions, requiring no architectural changes, can drop in as a general upgrade to the capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
Method
FlexLAM modifies only the latent-action bottleneck in the standard LAM pipeline. During pretraining, FlexLAM samples a retained prefix length k and replaces suffix slots with a shared null latent before decoder training, making many prefixes of the same transition code useful for reconstruction.
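The retained-prefix masking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the uniform sampling of k over {1, ..., K}, and the all-zero null vector (standing in for a learned shared null latent) are all assumptions made for the example.

```python
import random

def null_fill_prefix(tokens, k, null_token):
    """Keep the first k latent tokens; replace every suffix slot with the shared null latent."""
    if not 0 <= k <= len(tokens):
        raise ValueError("k must lie in [0, len(tokens)]")
    # Retained slots are copied; each dropped slot receives a copy of the same null vector.
    return [list(t) for t in tokens[:k]] + [list(null_token) for _ in tokens[k:]]

def sample_prefix_length(num_tokens, rng):
    """Assumed sampler: k drawn uniformly from {1, ..., num_tokens}."""
    return rng.randint(1, num_tokens)

rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 latent slots, dimension 2
null_token = [0.0, 0.0]  # hypothetical stand-in for a learned null embedding

k = sample_prefix_length(len(tokens), rng)
masked = null_fill_prefix(tokens, k, null_token)  # sequence a decoder would be conditioned on
fixed_example = null_fill_prefix(tokens, 2, null_token)
```

Because the output always has the same number of slots, a decoder in this sketch sees a fixed-shape input regardless of k, which is one way a bottleneck could change without touching the architecture.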
The same null-filled prefix representation is used for latent-to-action alignment with a small labeled set, while a fixed latent-token evaluator predicts latent-action tokens for downstream evaluation through the same translator interface. Earlier tokens receive denser training pressure, so compact transition structure is encouraged to appear first and later tokens can add residual detail.
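To make "denser training pressure" concrete, the sketch below computes how often each slot is retained under one assumed sampling scheme. The uniform distribution over k and the function name are assumptions for illustration; the source does not state which distribution is used.

```python
from fractions import Fraction

def retention_probability(slot, num_tokens):
    """P(slot is retained) when k is uniform on {1, ..., num_tokens}; slot is 1-indexed."""
    if not 1 <= slot <= num_tokens:
        raise ValueError("slot must lie in [1, num_tokens]")
    # Slot i survives exactly when k >= i, which holds for num_tokens - i + 1 values of k.
    return Fraction(num_tokens - slot + 1, num_tokens)

# With 4 slots, the first slot is always retained and the last only a quarter of the time.
probs = [retention_probability(i, 4) for i in range(1, 5)]
```

Under this assumption the retention probability falls linearly with slot index, so the first slot contributes to every reconstruction while later slots are used less often.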
Real-World Video
DMLab evaluates downstream task performance, but the fixed-capacity bottleneck trade-off is not specific to simulated environments. We also evaluate FlexLAM on visually diverse real-world video using Ego4D and robot-video reconstruction examples, with the decoder initialized from SD3 and fine-tuned under the same retained-prefix conditioning principle.
Cross-Embodiment Latent Action Transfer
We test whether the latent action z encodes scene-independent motion rather than appearance. If it does, applying z to a visually different embodiment and then re-extracting an action z' from that generated pair should, when applied back to the original frame, reconstruct the original transition:
$$z = \mathrm{Enc}(o_t, o_{t+1}) \;\rightarrow\; \hat{o}_{t+1} = \mathrm{Dec}(o_t^{\mathrm{tgt}}, z) \;\rightarrow\; z' = \mathrm{Enc}(o_t^{\mathrm{tgt}}, \hat{o}_{t+1}) \;\rightarrow\; \mathrm{Dec}(o_t^{\mathrm{src}}, z') \approx o_{t+1}$$
The recovered frame (step 3, dashed) closely matching the ground-truth next frame (solid) confirms the round trip, so z captures transition structure that is independent of scene appearance. Green frames (and the $o_t^{\mathrm{src}}$ superscript) mark the source scene, red ($o_t^{\mathrm{tgt}}$) the new scene, and dashed frames are model-generated.
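The round-trip check can be written as a short procedure over any encoder and decoder pair. The sketch below uses toy stand-ins chosen for illustration only: frames are short vectors, the encoder returns the per-coordinate change, and the decoder adds that change to a frame. These toy functions are appearance-independent by construction and are not the FlexLAM models.

```python
def toy_enc(o_t, o_next):
    # Toy encoder (assumption): the latent action is the change between two frames.
    return [b - a for a, b in zip(o_t, o_next)]

def toy_dec(o_t, z):
    # Toy decoder (assumption): apply the latent action additively to a frame.
    return [a + dz for a, dz in zip(o_t, z)]

def round_trip(enc, dec, o_src, o_src_next, o_tgt):
    """Extract z on the source pair, apply it to the target, re-extract z', apply it back."""
    z = enc(o_src, o_src_next)          # latent action from the source transition
    o_tgt_pred = dec(o_tgt, z)          # generated next frame in the target scene
    z_prime = enc(o_tgt, o_tgt_pred)    # latent action re-extracted from the generated pair
    return dec(o_src, z_prime)          # reconstruction of the source next frame

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

o_src, o_src_next = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
o_tgt = [10.0, 20.0, 30.0]  # a "visually different" scene in this toy setting

recovered = round_trip(toy_enc, toy_dec, o_src, o_src_next, o_tgt)
error = mean_abs_error(recovered, o_src_next)
```

With learned models the error would not be exactly zero; a small error relative to a baseline would be the signal that the latent carries motion rather than appearance.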
Real-World Transition Reconstruction
Compared with the released villa-X-LAM reference, FlexLAM produces more stable one-step reconstructions under camera and background changes. Increasing the retained prefix length k within the same model progressively adds visual detail, while smaller prefixes still retain coarse transition structure.
[Figure: one-step reconstruction examples. Panels: (1) Agibot, (2)-(4) Ego4D. Prefix length shown: k = 80.]
Latent Action Prediction (DMLab)
Prediction vs. Reality
A latent-token sequence model predicts the next latent-action tokens at each step. A translator converts them to executable actions, and the agent acts in DMLab environments. Observation shows the resulting trajectory; Predicted shows the LAM decoder's visualization of the same latent-action tokens. Each cell shows Observation | Predicted | Difference.
Quantitative Results
Scarce-Label Alignment and Matched-Budget Return
We pretrain the LAM and latent-token evaluator on action-free transitions, freeze them, and vary only the amount of labeled data used to train the translator. Under 0.025% labels, one FlexLAM model outperforms separately trained Fixed-K baselines at every matched token budget.
Narrow Single-Task Alignment
The translator is trained using labels from a single low-return source task (Lasertag One Opponent Large; 0.04% of the full dataset), then evaluated on the normalized multi-task suite excluding that source task.
Joint LAM-Translator Fine-Tuning
With 0.5% action-labeled data, joint alignment lets action loss update the LAM bottleneck. This strengthens bottlenecked LAMs and preserves FlexLAM's advantage over the fixed-capacity baseline.
BibTeX
@inproceedings{yoshimoto2026flexlam,
title = {FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning},
author = {Yoshimoto, Takanori and Hu, Yang and Kondo, Naruya and Matsushima, Tatsuya},
year = {2026},
}