Continual Learning for Depth Estimation and Panoptic Segmentation

Teaser image

Deploying robots such as autonomous cars in urban scenarios requires a holistic understanding of the environment with a unified perception of semantics, instances, and depth. While deep learning-based state-of-the-art approaches perform well when inference is done under similar conditions as used for training, their performance can drastically decrease when the new target domain differs from the source domain. This domain gap poses a great challenge for robotic platforms that are deployed in the open world without prior knowledge about the target domain. Additionally, unlike the source domain where ground truth annotations are generally assumed to be known and can be used for the initial training, such supervision is not applicable to the target domain due to the absence of labels, rendering classical domain adaptation methods unsuitable. Unsupervised domain adaptation attempts to overcome these limitations. However, the vast majority of proposed approaches focuses on sim-to-real domain adaptation mostly in an offline manner, i.e., a directed knowledge transfer without the need to avoid catastrophic forgetting and with access to abundant target annotations. Additionally, such works often do not consider limitations on a robotic platform, e.g., limited storage capacity.

In this work, we use online continual learning to address these challenges for depth estimation and panoptic segmentation in a multi-task setup. In particular, we leverage images from an onboard camera to perform online continual learning, improving performance at inference time. While a naive approach would result in overfitting to the current scene, our method CoDEPS mitigates forgetting by leveraging experience replay of both source data and previously seen target images. We combine a classical replay buffer with generative replay in the form of a novel cross-domain mixing strategy, allowing us to apply supervised techniques to unlabeled target data as well. Unlike existing works, we explicitly address the aforementioned hardware limitations by using only a single GPU and restricting the replay buffer to a fixed size. We demonstrate that CoDEPS successfully improves on new target domains without sacrificing performance on previous domains.

Technical Approach

Online Adaptation

After the described network has been trained on a source domain using ground truth supervision for panoptic segmentation and an unsupervised training scheme for monocular depth based on the photometric error, we aim to adapt it to the target domain in a continual manner. That is, unlike other works, data from the target domain is revealed frame by frame, resembling the online stream of an onboard camera. As depicted in the figure below, every adaptation iteration consists of the following steps:

  1. Construct an update batch by combining online and replay data.
  2. Generate pseudo-labels using the proposed cross-domain mixing strategy.
  3. Perform backpropagation to update the network weights.
  4. Update the replay buffer.

Figure: Overview of our proposed CoDEPS. Unlabeled RGB images from an online camera stream are combined with samples from a replay buffer comprising both annotated source samples and previously seen target images. Cross-domain mixing enables pseudo-supervision on the target domain. The network weights are then updated via backpropagation using the constructed data batch. The additional PoseNet required for unsupervised monocular depth estimation is omitted in this visualization.
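The four adaptation steps can be sketched as a single iteration. The function below is a simplified, hypothetical skeleton, not the actual CoDEPS code: pseudo-labeling and backpropagation are stubbed out, and samples are plain Python objects rather than tensors.

```python
import random

def adaptation_step(online_image, target_buffer, source_buffer, batch_size=4):
    """One online adaptation iteration (steps 1-4), with learning stubbed out."""
    # 1. Construct the update batch: the current frame plus replayed samples
    #    from both the target buffer and the (annotated) source buffer.
    n_replay = batch_size - 1
    n_target = min(len(target_buffer), n_replay // 2)
    n_source = min(len(source_buffer), n_replay - n_target)
    batch = ([online_image]
             + random.sample(target_buffer, n_target)
             + random.sample(source_buffer, n_source))
    # 2. Generate pseudo-labels via cross-domain mixing (stubbed here).
    pseudo_labels = [f"pseudo({sample})" for sample in batch]
    # 3. Backpropagation on the network weights would happen here (omitted).
    # 4. Update the target replay buffer with the new frame.
    target_buffer.append(online_image)
    return batch, pseudo_labels
```

In the real system, step 4 is gated by a diversity criterion rather than an unconditional append, and the batch holds image tensors with their depth and panoptic targets.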

Experience Replay

Upon receiving a new image taken by the robot's onboard camera, we construct a batch that is used to perform backpropagation on the network weights. In detail, a batch consists of the current online image, previously received target images, and fully annotated source samples. By revisiting target images from the past, we increase the diversity in the loss signal on the target domain and hence mitigate overfitting to the current scene. This further accounts for situations in which the current online image suffers from visual artifacts, e.g., overexposure. Similarly, revisiting samples from the source domain addresses the problem of catastrophic forgetting by ensuring that previously acquired knowledge can be preserved. Additionally, the annotations of the source samples enable pseudo-supervision on the target domain by exploiting cross-domain mixing strategies.

While previous works do not consider limitations on the size of the replay buffer, we explicitly address this challenge to closely resemble the deployment on a robotic platform, where disk storage is an important factor. This poses two questions: First, how to sample from the source domain to construct the fixed source buffer and, second, how to update the dynamic target buffer? To construct the source buffer, we propose a refined version of rare class sampling. To update the target buffer, we propose a diversity-based measure inspired by loop closure detection of visual SLAM. For details, please consult our paper.
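To illustrate the fixed-size constraint, the class below sketches one plausible diversity-gated target buffer: a new frame is admitted only if its feature embedding is sufficiently dissimilar (in cosine distance) from every stored frame, and the oldest frame is evicted when the buffer is full. The class name, threshold, and eviction rule are illustrative assumptions; the paper's actual criteria differ.

```python
import numpy as np

class FixedSizeDiversityBuffer:
    """Fixed-capacity replay buffer that only admits sufficiently novel frames."""

    def __init__(self, capacity, min_dist=0.2):
        self.capacity = capacity
        self.min_dist = min_dist  # minimum cosine distance to any stored frame
        self.items, self.feats = [], []

    def maybe_add(self, item, feat):
        """Store `item` if its embedding `feat` is diverse enough; return success."""
        feat = np.asarray(feat, dtype=float)
        feat = feat / (np.linalg.norm(feat) + 1e-8)  # unit-normalize for cosine
        if self.feats:
            max_sim = max(float(f @ feat) for f in self.feats)
            if 1.0 - max_sim < self.min_dist:
                return False  # too similar to an already stored frame
        if len(self.items) >= self.capacity:
            self.items.pop(0)  # evict the oldest frame to respect the capacity
            self.feats.pop(0)
        self.items.append(item)
        self.feats.append(feat)
        return True
```

A near-duplicate frame (e.g., the next frame of a stopped vehicle) is rejected, while a frame from a new part of the scene is stored, keeping the buffer diverse under a hard storage limit.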

Panoptic Adaptation via Cross-Domain Mixing

Panoptic segmentation is the fused output of a semantic head and an instance head. We observe that the decrease in performance on samples from unseen domains can mostly be attributed to the semantic head, while instance predictions remain stable. Cross-domain mixing strategies allow transferring ideas from supervised training to an unsupervised setting, where ground truth annotations are unknown. In CoDEPS, we combine annotated source samples with high-confidence target predictions to artificially generate pseudo-labels for the target samples in an online fashion, supervising the semantic head.
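Keeping only high-confidence predictions is the standard way to turn the semantic head's output into pseudo-labels. The sketch below shows this thresholding step in isolation; the threshold value and ignore index are illustrative, not the paper's exact settings.

```python
import numpy as np

def confidence_pseudo_labels(probs, threshold=0.9, ignore_index=255):
    """Turn per-class softmax scores (C x H x W) into pseudo-labels.

    Pixels whose top score falls below `threshold` are marked with
    `ignore_index` so that they do not contribute to the loss.
    """
    confidence = probs.max(axis=0)          # top score per pixel
    labels = probs.argmax(axis=0)           # predicted class per pixel
    labels[confidence < threshold] = ignore_index
    return labels
```

During adaptation, such labels provide a supervision signal on the target domain that only trusts pixels the network is already certain about, limiting the propagation of errors.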

We design a mixing strategy that combines pixels from source- and target-domain images and accounts for multiple factors unique to the online continual learning scenario: (1) the robust pretraining on a dedicated source dataset, which may result in significant performance degradation on the target dataset if the pretrained weights are strongly adapted; (2) the existence of different cameras, leading to significant changes in the field of view, geometric appearance of objects, resolution, and aspect ratio of the images; and (3) the continuously evolving visual appearance of street scenes during adaptation. We visualize the result of our strategy in the figure below and refer to the paper for technical details.

Cross-domain mixing strategy
Figure: Our proposed cross-domain mixing strategy first transfers the image style from the target to the source sample. Then it augments the target image to match the appearance of the source camera. Finally, a random image patch is copied from the target to the source image. The source annotations are retained and completed by the network's estimate on the copied image patch, combining self-iterative learning with ground truth supervision.
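The final step of the caption, copying a random target patch into the source image while retaining the source annotations outside the patch, can be sketched as follows. This is a minimal sketch assuming the style transfer and camera augmentation have already been applied and both images share the same resolution; the function and parameter names are hypothetical.

```python
import numpy as np

def mix_patch(source_img, source_label, target_img, target_pseudo,
              patch_frac=0.5, rng=None):
    """Copy a random target patch into the source image (H x W arrays).

    Inside the patch, labels come from the network's pseudo-labels on the
    target image; outside, the source ground truth annotations are retained.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = source_img.shape[:2]
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    y = rng.integers(0, h - ph + 1)         # random top-left corner
    x = rng.integers(0, w - pw + 1)
    mixed_img = source_img.copy()
    mixed_lbl = source_label.copy()
    mixed_img[y:y + ph, x:x + pw] = target_img[y:y + ph, x:x + pw]
    mixed_lbl[y:y + ph, x:x + pw] = target_pseudo[y:y + ph, x:x + pw]
    return mixed_img, mixed_lbl
```

The resulting sample mixes ground truth supervision (source region) with self-iterative learning (target patch), which is what allows supervised-style losses on unlabeled target data.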



A PyTorch-based software implementation of this project, including trained models, can be found in our GitHub repository. It is released under the GPLv3 license for academic usage; for any commercial purpose, please contact the authors.


Niclas Vödisch*, Kürsat Petek*, Wolfram Burgard, and Abhinav Valada,
"CoDEPS: Online Continual Learning for Depth Estimation and Panoptic Segmentation"
arXiv preprint arXiv:2303.10147, 2023.
*Authors contributed equally.

(PDF) (BibTeX)