L-DAE: Latent Denoising Autoencoder for Self-supervised Pre-training

A potential alternative to MAE for pre-training via image modeling

Shuchen Du
5 min read · Feb 4, 2024
Figure: L-DAE [1]

Unsupervised imaging tasks such as denoising and inpainting can be used to pre-train a computer vision model and improve the performance of downstream tasks. To denoise an image or inpaint its missing parts, the model must learn both the global shape semantics and the local textures of the image, which makes it a good initialization for subsequent fine-tuning.
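
To make the idea concrete, here is a minimal sketch of denoising pre-training in PyTorch. The tiny encoder/decoder, the noise level sigma, and the training step are illustrative assumptions for exposition, not the setup used in [1]:

```python
import torch
import torch.nn as nn

# Minimal denoising pre-training sketch (illustrative, not the paper's setup).
# `encoder` and `decoder` stand in for any backbone; here they are tiny conv nets.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(64, 3, 3, padding=1)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

def pretrain_step(images, sigma=0.2):
    """One denoising step: corrupt, reconstruct, regress to the clean image."""
    noisy = images + sigma * torch.randn_like(images)  # additive Gaussian noise
    recon = decoder(encoder(noisy))
    loss = nn.functional.mse_loss(recon, images)       # target = clean image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After pre-training, only the encoder is kept as the initialization
# for downstream fine-tuning.
loss = pretrain_step(torch.randn(8, 3, 32, 32))
```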

A brief recap of MAE

Masked image modeling (MIM) is one such method: parts of the input image are masked and then reconstructed by an encoder-decoder architecture. Afterwards, the encoder is fine-tuned on downstream tasks. Masked autoencoder (MAE) [2] is a representative MIM method.
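
As a rough sketch of MAE's core mechanism, the snippet below implements random masking of patch tokens. The 75% mask ratio follows [2], while the tensor shapes and the function name are illustrative assumptions:

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    tokens: (batch, num_patches, dim) patch embeddings.
    Returns the visible tokens (fed to the encoder) and the kept indices,
    which the decoder needs to place reconstructions back for the loss.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                # one random score per patch
    ids_shuffle = noise.argsort(dim=1)      # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]    # first `num_keep` stay visible
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

# Example: 196 patches (14x14 for a 224px image, patch size 16), 75% masked.
tokens = torch.randn(2, 196, 768)
visible, ids_keep = random_masking(tokens)
print(visible.shape)  # torch.Size([2, 49, 768]) -> encoder sees 25% of patches
```

Because the encoder only processes the visible 25% of patches, pre-training is considerably cheaper than running the full sequence, which is one of MAE's key design choices [2].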

Since different downstream tasks rely on different levels of features, the reconstruction target in MIM varies accordingly. Low-level downstream tasks such as denoising and super-resolution need low-level, high-frequency features in the pre-trained model, whereas high-level downstream tasks such as object detection and image classification need high-level, low-frequency features instead. In order to adapt to these…
