CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Learning Systems and Robotics Lab, Technical University of Munich
*Equal contribution

Abstract

To teach robots complex manipulation tasks, it is now common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers during deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for non-exemplar continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark and five real-world tasks, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods.

Motivation

Deploying robots in dynamic real-world environments — such as homes or hospitals — requires them to continuously acquire new skills without losing previously learned capabilities. Standard fine-tuning of VLAs leads to catastrophic forgetting, where new training overwrites critical prior knowledge. Storing past data for experience replay is often impractical due to privacy or storage constraints. Furthermore, robots operating autonomously rarely have access to explicit task identifiers to tell them which skill to use. CLARE addresses all three challenges: no forgetting, no stored exemplars, no task labels.

Method

Routing and expansion mechanisms.
DiT policy architecture with CLARE adapters.

CLARE injects lightweight, trainable adapters into selected modules in the observation conditioning pathway of a frozen, pre-trained VLA. A dynamic expansion strategy monitors layer-wise feature statistics and only adds new adapter parameters when the incoming task data is sufficiently novel, preventing unnecessary capacity growth (~2% parameter increase per task in our experiments). During deployment, an autonomous routing mechanism uses autoencoder-based discriminators to analyze input features and activate the most relevant adapter without requiring task labels. This allows the robot to seamlessly switch between skills — even mid-execution — based purely on what it observes.
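As a rough illustration of the routing idea (not the paper's implementation — the autoencoder form, feature dimensions, and class names below are placeholder choices), each task can keep a small autoencoder fit on its own observation features, and inference activates the adapter whose autoencoder reconstructs the current features with the lowest error:

```python
import numpy as np

class PCAAutoencoder:
    """Per-task linear autoencoder fit by PCA: encode/decode with the
    top-k principal components of that task's feature vectors."""
    def __init__(self, n_components):
        self.k = n_components

    def fit(self, X):
        self.mean = X.mean(axis=0)
        # Right singular vectors of the centered data give the components.
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = Vt[: self.k]            # shape (k, dim)
        return self

    def recon_error(self, x):
        z = (x - self.mean) @ self.components.T   # encode
        x_hat = z @ self.components + self.mean   # decode
        return float(np.mean((x_hat - x) ** 2))

def route(feature, task_autoencoders):
    """Activate the adapter of the task whose autoencoder reconstructs
    the current feature vector best (lowest reconstruction error)."""
    errors = [ae.recon_error(feature) for ae in task_autoencoders]
    return int(np.argmin(errors))
```

Because each autoencoder sees only its own task's features, in-distribution inputs reconstruct well while inputs from other tasks do not, which is what makes reconstruction error usable as a task discriminator without explicit task labels.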

Real-World Experiments

We validate CLARE across five sequentially learned manipulation tasks, designed to cover a wide range of physical challenges: varying object weights (from 7 g Lego blocks to a 0.5 kg Moka pot), distinct interaction dynamics (contact-rich insertion, leveraging friction to straighten an angled pot), nonlinear friction profiles (plastic drawer), and multi-stage behavior with autonomous switching (pick, insert, close).

Tasks 1–5: Bowl, Stack, Moka, Drawer, Lego.

Start (top) and goal (bottom) states for each task.

  • 63.3% AUC (overall performance)
  • −2.9% NBT (zero forgetting)
  • <3 ms routing overhead per step
  • ~2% parameter growth per task
Method         AUC ↑   FWT ↑   NBT ↓
SeqFFT          23.8    68.0    80.0
SeqLoRA         22.9    64.0    76.9
ER              51.1    60.0    17.1
CLARE (ours)    63.3    62.0    −2.9

Table IV: Overall results in our hardware experiments. Bold: best.
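For reference, a simplified version of the three metrics reported above can be computed from a success-rate matrix (illustrative formulas in the spirit of standard continual-learning evaluation, not necessarily the paper's exact protocol):

```python
import numpy as np

def cl_metrics(S):
    """Simplified continual-learning metrics.

    S is a lower-triangular matrix where S[i, j] is the success rate on
    task j after training through stage i (defined for j <= i).
      FWT: mean success on each task right after it is learned.
      NBT: mean drop on old tasks at later stages (negative = improvement).
      AUC: mean success over all tasks seen so far, averaged over stages.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    fwt = np.mean([S[j, j] for j in range(n)])
    nbt = np.mean([S[j, j] - S[i, j]
                   for j in range(n - 1) for i in range(j + 1, n)])
    auc = np.mean([S[i, : i + 1].mean() for i in range(n)])
    return auc, fwt, nbt
```

Under this convention a negative NBT, as CLARE achieves on hardware, means old tasks got slightly *better* after learning new ones.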

CLARE achieves the highest overall AUC and zero forgetting across all five tasks, significantly outperforming SeqFFT, SeqLoRA, and experience replay (ER). Notably, CLARE is the only method that completely avoids catastrophic forgetting. The routing mechanism remains robust to real-world sensory noise, lighting variation, and camera drift, consistently activating the correct adapter for each new observation. CLARE also demonstrates autonomous mid-task switching: when the language command changes from "put the Lego block into the drawer" to "close the drawer", the robot correctly adapts its behavior within a single execution without any manual intervention.

Inference time and memory complexity — hardware
Inference time and memory complexity of CLARE in our hardware experiments. The routing overhead is negligible and GPU memory grows by only ~2% per learned task. Values for stages 6–10 are extrapolated.

Simulation Results

We evaluate CLARE on the LIBERO benchmark across three suites: LIBERO-Long (complex long-horizon tasks), LIBERO-Goal (varying task goals), and LIBERO-Spatial (varying object placements), each with 10 sequentially arriving tasks. CLARE achieves the highest AUC on all three suites, outperforming the best baseline (ER) by 10–14 percentage points, while maintaining zero forgetting (NBT ≈ 0) — without storing any previous data.

Method         LIBERO-Long            LIBERO-Goal            LIBERO-Spatial
               AUC ↑  FWT ↑  NBT ↓    AUC ↑  FWT ↑  NBT ↓    AUC ↑  FWT ↑  NBT ↓
SeqFFT          22.4   76.1   74.7     26.7   94.1   95.3     27.7   94.7   94.6
SeqLoRA         21.4   73.1   71.6     26.1   90.1   90.8     27.3   90.1   89.2
PackNet          4.8   37.2   41.3     10.5   60.3   67.0      8.6   54.7   60.3
ER              60.5   76.6   22.7     76.0   94.4   25.1     77.6   92.7   20.9
LOTUS           52.9   58.1   −7.2     56.0   61.0   30.0       NA     NA     NA
DMPEL          58557                  78680                  70643
MLR               NA     NA     NA     77.2   80.0    6.9       NA     NA     NA
CLARE (ours)    75.1   75.0    1.9     89.3   89.7    0.3     87.4   88.0    0.9

Table III: Baseline comparison across three LIBERO suites. Bold: best. Underline: second best.

To assess long-term scalability, we created LIBERO-40: a new suite of 40 tasks drawn from all four LIBERO suites (Long → Goal → Spatial → Object). As shown below, CLARE successfully learns and retains all 40 tasks, whereas experience replay — despite having access to past data — exhibits significant performance degradation after just a few stages.

LIBERO-40 long-term scalability
Continual learning of 40 tasks on LIBERO-40. CLARE scales to long task sequences without forgetting, whereas ER exhibits significant performance degradation.
Expansion threshold ablation
Ablation of the dynamic expansion threshold γ. Higher γ reduces the number of added adapters with a moderate AUC decrease, while NBT stays near zero — the model never forgets.
Computation analysis on LIBERO-40
Inference time and memory complexity on LIBERO-40. Routing overhead is small compared to the base policy; memory grows by ~2% per task. At 40 tasks, storing data for ER requires 5× more memory than CLARE's adapters.
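One way the threshold γ could gate adapter growth, as ablated above, is a layer-wise novelty test: expand only when the new task's features are sufficiently dissimilar from those of every earlier task. The sketch below uses mean-feature prototypes and a cosine-similarity novelty score — an illustrative criterion, not the paper's exact rule:

```python
import numpy as np

def should_expand(layer_feats, stored_prototypes, gamma):
    """Decide per layer whether to add a new adapter for the new task.

    layer_feats: (n, d) features of the new task at this layer.
    stored_prototypes: list of (d,) mean feature vectors from earlier tasks.
    gamma: novelty threshold; a higher gamma means fewer expansions,
           matching the ablation trend reported above.
    """
    proto = layer_feats.mean(axis=0)
    if not stored_prototypes:
        return True                       # first task: always add an adapter
    sims = [np.dot(proto, p) / (np.linalg.norm(proto) * np.linalg.norm(p))
            for p in stored_prototypes]
    novelty = 1.0 - max(sims)             # 0 = identical, 2 = opposite
    return novelty > gamma
```

Applied independently at each adapter-equipped layer, a rule of this shape yields the behavior seen in the ablation: raising γ prunes expansions (capacity grows more slowly) while routing to existing adapters keeps NBT near zero.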

Highlights

Key features of CLARE:
  • Real-world validated: Tested on five physically diverse manipulation tasks. CLARE achieves AUC = 63.3% and zero forgetting (NBT = −2.9%), outperforming all baselines including experience replay.
  • Non-exemplar learning: Sequentially learns new skills without storing or replaying past data, respecting privacy and storage constraints.
  • Autonomous adapter routing: Selects the correct task-specific module during inference based on feature similarity — no task IDs needed, even when switching tasks mid-execution.
  • Dynamic expansion: Adds new parameters only when necessary, achieving ~2% parameter growth per task with negligible (<3 ms) routing overhead.
  • Scales to 40 tasks: On LIBERO-40, CLARE retains all previously learned skills while ER exhibits catastrophic forgetting.
  • Outperforms exemplar-based methods: Achieves higher AUC than ER across all benchmarks, without access to any previous data.

BibTeX

@article{clare2025,
  title={CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion},
  author={Ralf R{\"o}mer and Yi Zhang and Yuming Li and Angela P. Schoellig},
  journal={arXiv preprint arXiv:2601.09512},
  year={2026}
}