
Part B

Generalization of skills across different tasks and environments using meta-learning techniques


WP3: Federated learning in diverse environments

In WP3, we teach fleets of robots to share skills without sharing secrets. When robots work in private spaces like homes, they must learn from experience without exposing personal data. We solve this by stripping camera footage down to essential, anonymized features — relevant object poses and keypoints rather than faces and furniture in the background. This method protects privacy while helping robots learn faster and recognize objects in new environments. To handle complex jobs, we break long tasks into short, reusable skills that the robots can shuffle and recombine like building blocks. Finally, because every home is different, we give the robots tools to tweak their own behavior the moment they encounter a new obstacle.

Recent Highlights from WP3

Task-Parameterized Gaussian Mixture Models (TP-GMM) offer a promising way for robots to learn by imitation. But taking this math from the simulator to the real world remains difficult. We solve three specific problems to make it work.

First, robot hands move in arcs, not straight lines. We capture this by modeling velocity on its natural curved surface, a Riemannian manifold, rather than on a flat grid. Second, we use these motion patterns to slice long tasks, like cooking a meal, into distinct skills, such as chopping or stirring. This allows the robot to shuffle and recombine these actions to solve entirely new problems. Third, we teach the robot to see what matters. When stirring, for instance, it automatically tracks the pot and ladle while ignoring the cutting board.
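To make the first point concrete, the sketch below (our illustration, not the project's implementation) shows how a rotational velocity can be computed in the tangent space of the unit-quaternion manifold instead of by flat, per-component subtraction; the helper names quat_mul, quat_conj, and quat_log are hypothetical.

```python
# A minimal sketch of velocity on a curved surface: orientations live on the
# unit-quaternion manifold, so we difference them with the manifold's log map
# instead of subtracting coordinates on a "flat grid".
import numpy as np

def quat_mul(q, r):
    """Hamilton product of two quaternions given as [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate, which inverts a unit quaternion."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_log(q):
    """Log map at the identity: unit quaternion -> rotation vector in R^3."""
    w, v = q[0], q[1:]
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return np.zeros(3)
    angle = 2.0 * np.arctan2(norm_v, w)
    return angle * v / norm_v

def angular_velocity(q_t, q_next, dt):
    """Tangent-space velocity between two orientation samples."""
    rel = quat_mul(quat_conj(q_t), q_next)   # relative rotation
    return quat_log(rel) / dt                # rotation vector per second

# Naive per-component differencing (q_next - q_t) / dt would leave the
# manifold and distort arc-like motions; the log map keeps velocities
# geometrically meaningful, which is the kind of structure the Riemannian
# model in Figure 1 exploits.
```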

The results are compelling. Our robots learn complex tasks after seeing them performed only five times—one-twentieth the data usually required. But beyond speed, they achieve a flexibility that baselines cannot match: adapting these learned skills to completely new objects and changing environments.

Figure 1: TAPAS-GMM: Task Auto-Parameterized And Skill Segmented GMM learns task-parameterized manipulation policies from only a handful of complex task demonstrations. First, we segment the full task demonstrations into the involved skills. For each segment, we then automatically select the relevant task parameters and learn a Riemannian Task-Parameterized Hidden Markov Model (TP-HMM). The skill models can be cascaded and reused flexibly. To enable modeling of the robot's end-effector velocity, we further leverage a novel action factorization and Riemannian geometry.

Robots learn slowly because raw video data is overwhelming. To speed them up, we must boil camera footage down into simple, compact summaries. The problem is that current algorithms assume they can see the whole scene. In the real world, however, objects are hidden by clutter or slip out of the camera's frame. When a robot cannot see an object, or when the object comes close enough that its apparent scale changes, the robot usually loses track of it.

We introduce our approach, Bayesian Scene Keypoints (BASK), to solve this problem. It is a probabilistic method that tracks specific points on an object, such as the handle of a cup, regardless of how far away they are. BASK resolves the confusion caused by missing information: it knows where a tool is even when it is hidden, and it can track a symmetrical cup that looks the same from different angles. We tested this using a camera mounted on the robot's moving wrist. The system mastered difficult tasks involving multiple objects, outperforming standard techniques. It proved robust to messy desks, blocked views, and a limited field of view, even handling objects it had never seen before.
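The following is a minimal sketch of the underlying filtering idea, not the published BASK code: a discrete Bayes filter that maintains a belief over candidate keypoint locations plus one explicit "not observed" state, so ambiguous or missing detections stay multimodal instead of collapsing to a single wrong guess. The grid size and transition probabilities are assumptions chosen for illustration.

```python
import numpy as np

N_CELLS = 50          # hypothetical 1-D grid of image locations
P_STAY_HIDDEN = 0.8   # assumed transition probabilities, for illustration
P_BECOME_HIDDEN = 0.1

def predict(belief):
    """Motion model: diffuse location mass, exchange mass with hidden state."""
    loc, hidden = belief[:-1], belief[-1]
    blurred = np.convolve(loc, [0.25, 0.5, 0.25], mode="same")
    new_loc = (1.0 - P_BECOME_HIDDEN) * blurred + (1.0 - P_STAY_HIDDEN) * hidden / N_CELLS
    new_hidden = P_BECOME_HIDDEN * blurred.sum() + P_STAY_HIDDEN * hidden
    out = np.append(new_loc, new_hidden)
    return out / out.sum()

def update(belief, likelihood):
    """Measurement model: per-cell detector scores; the hidden state gets a
    flat likelihood so weak, ambiguous detections do not erase it."""
    weights = np.append(likelihood, likelihood.mean())
    posterior = belief * weights
    return posterior / posterior.sum()

# Usage: start uniform, then fuse an ambiguous two-peak detection.
belief = np.full(N_CELLS + 1, 1.0 / (N_CELLS + 1))
scores = np.full(N_CELLS, 0.01)
scores[[10, 40]] = 1.0                      # symmetric object: two look-alike spots
belief = update(predict(belief), scores)    # belief stays bimodal until context
                                            # (another view or time step) resolves it
```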

Figure 2: Individual camera observations are often ambiguous. For example, from the observation on the left, the rotation of the saucepan cannot be uniquely inferred. When tracking object keypoints, this leads to multimodal localization hypotheses. We overcome this problem by considering the image in context. We find likely correspondences across image scales and then use spatial or temporal context to resolve the ambiguities. Our model further detects when a keypoint is likely not observed, enabling our approach to track occluded objects and objects outside the current field of view as shown on the right.

WP4: Meta-learning with meta-features and augmentation

WP4 focuses on developing methods to enable accurate transfer of robotic policies in real-world household environments through advanced meta-learning techniques. The goal is to make learned behaviors adapt quickly and safely to new homes, object arrangements, and interaction dynamics without requiring extensive retraining or perfectly curated demonstrations. To achieve this, WP4 investigates meta-learning across diverse simulated environments. A key innovation is the development of novel end-to-end methods to compute environment meta-features that enhance policy transfer efficiency. Additionally, WP4 explores the generation of synthetic, varied simulated environments to facilitate faster and more reliable transfer from simulation to real-world settings. This work is crucial for enabling scalable, adaptable, and safe assistive robots that can operate effectively in dynamic household environments.

Recent Highlights from WP4

WP4 worked on meta-learning for task sequencing: we created synthetic graph-based sequencing environments, defined utility scores for (partial) sequences, and successfully meta-learned a Transformer-based sequencing policy that chooses good next tasks even from suboptimal context. In parallel, we improved cross-episode meta-RL efficiency by shifting the outer loop from on-policy (PPO) to off-policy (SAC), achieving roughly 2.5× faster learning on ML1 Reach and state-of-the-art performance on ML1 Push from the MetaWorld benchmark, while noting that replay buffers become a memory bottleneck as task diversity grows. Next steps include extending the utility function to enable mid-sequence recovery and exploring how these sequencing strategies can support WP1 with less reliance on optimal trajectories.
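To illustrate the sequencing setting, the toy sketch below (our own construction, not WP4's actual benchmark) shows tasks as graph nodes, a utility score over partial sequences, and a greedy baseline for choosing the next task; the Transformer policy of Figure 3 learns this choice from context rather than computing it greedily.

```python
# Hypothetical task graph: edges are allowed "next task" transitions,
# and each task contributes a utility to the sequence that contains it.
GRAPH = {"A": {"B", "C"}, "B": {"C", "D"}, "C": {"D"}, "D": set()}
TASK_UTILITY = {"A": 0.2, "B": 0.5, "C": 0.4, "D": 1.0}

def sequence_utility(seq):
    """Utility of a (partial) sequence: discounted sum of task utilities."""
    return sum(0.9 ** t * TASK_UTILITY[task] for t, task in enumerate(seq))

def greedy_next_task(seq):
    """Pick the admissible successor that most improves sequence utility."""
    candidates = GRAPH[seq[-1]]
    if not candidates:
        return None
    return max(candidates, key=lambda task: sequence_utility(seq + [task]))

seq = ["A"]
while (nxt := greedy_next_task(seq)) is not None:
    seq.append(nxt)
print(seq, sequence_utility(seq))   # e.g. ['A', 'B', 'D'] with its score
```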

Figure 3: Task Sequencing Transformer Architecture. We train a Transformer policy to select the next task. The model uses past experience to predict which task will most improve future performance, enabling more efficient learning than fixed curricula.
Figure 4: Synthetic Task Sequencing Benchmark. A lightweight benchmark where tasks form a graph and each task has a measurable “utility.” It lets us test and compare sequencing strategies quickly before transferring insights to other settings.

WP5: Meta-learning with dynamic algorithm configuration for reinforcement learning

WP5 focuses on enhancing learning efficiency by enabling dynamic configuration of learning agents, allowing their learning parameters to be adapted on the fly to the task at hand. This is essential because deep reinforcement learning (RL) algorithms, which often underpin robot learning, are highly sensitive to their configurations. Unlike many other machine learning settings, the data we learn from changes continuously during the learning process. These changes arise both from learning to solve new tasks and from exploring different behaviors within the same task. Consequently, different stages of learning require different algorithm configurations to achieve optimal results. If this adaptation is not carefully managed, it can severely hinder or even prevent successful learning.
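As a minimal sketch of what configuring on the fly means (an illustrative heuristic, not WP5's learned configuration policy), the following re-chooses a learning rate every training interval from observed progress; the thresholds and the stand-in training loop are assumptions made for illustration.

```python
# Treat the hyperparameter setting itself as a sequential decision,
# re-choosing it every interval instead of fixing it for the whole run.
import random

def dynamic_learning_rate(lr, recent_returns, prev_returns):
    """Crude controller: grow the step size while returns improve,
    shrink it once progress stalls."""
    improving = sum(recent_returns) > sum(prev_returns)
    return min(lr * 1.1, 1e-2) if improving else max(lr * 0.5, 1e-5)

def run_interval(lr, episodes=10):
    """Stand-in for one interval of RL training; a real agent would return
    episode returns that depend on the current configuration."""
    return [random.gauss(100 * lr, 1.0) for _ in range(episodes)]

lr, prev = 1e-3, [0.0] * 10
for interval in range(5):
    recent = run_interval(lr)
    lr = dynamic_learning_rate(lr, recent, prev)   # re-configure on the fly
    prev = recent
print(f"final learning rate: {lr:.2e}")
```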

To enable efficient, scalable, and robust learning, we first focus on identifying suitable meta-features that facilitate transfer of configuration policies across varying problems and environments. Building on this, we aim to develop dynamic configuration policies that optimize reinforcement learning efficiency. The ultimate goal of this work package is to create dynamic configuration policies that are transferable even across vastly different problem environments.

Recent Highlights from WP5

WP5 (Dynamic Algorithm Configuration for RL) explored the effects of hyperparameter optimization on RL. Working with both hand-designed and inferred meta-features of environments, we showed that dynamic algorithm configuration boosts adaptability and meta-learning in parameterized environment settings with similar dynamics. The current focus is on improving zero-shot generalization to novel environments with differing dynamics.

Figure 5: CARL makes it possible to configure and modify existing environments through the use of context. This context can be made visible to agents through context features, informing them directly about the current instantiation of the environment. Learning with and across different contexts enables learning of more general behaviors. Reinforcement learning agents that observe this context during training can learn to adapt their behavior accordingly.
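The following is a generic sketch of this context mechanism using plain-Python placeholders rather than CARL's actual API: the environment's dynamics are set by a context, and the context features are appended to the observation so the agent can condition on the current instantiation.

```python
import numpy as np

class ContextualPendulum:
    """Toy dynamics whose parameters come from a context (illustrative only)."""
    def __init__(self, context):
        self.context = context                     # e.g. {"length": 1.5, "gravity": 9.8}
        self.state = np.zeros(2)                   # [angle, angular velocity]

    def step(self, torque, dt=0.05):
        length, g = self.context["length"], self.context["gravity"]
        angle, vel = self.state
        vel += (-g / length * np.sin(angle) + torque) * dt
        self.state = np.array([angle + vel * dt, vel])
        # Visible context: concatenate the context features onto the
        # observation, informing the agent about the current instantiation.
        return np.concatenate([self.state, [length, g]])

# Training across sampled contexts encourages more general behaviors:
for length in np.random.uniform(0.5, 2.0, size=3):
    env = ContextualPendulum({"length": length, "gravity": 9.8})
    obs = env.step(torque=0.0)   # an agent would act on obs, which includes context
```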