In this paper, we explore this learning paradigm for 3D mesh data analysis based on Transformers. Since applying Transformer architectures to new modalities is usually non-trivial, we first adapt Vision Transformers...
3.1. Overview of TrackMAE Pretraining
As illustrated in Fig. 2, our TrackMAE pretraining method is an autoencoding paradigm that reconstructs the original signal from its partial observation. The method splits into two distinct branches: one dedicated to appearance and the other to motion...
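The excerpt describes the generic masked-autoencoding recipe: drop most tokens, encode the visible remainder, and reconstruct the dropped part. Below is a minimal PyTorch sketch of that objective; the module sizes, the feature-space reconstruction target, and the 75% mask ratio are illustrative defaults, not TrackMAE's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal masked autoencoder: encode only visible tokens, reconstruct the rest."""
    def __init__(self, num_tokens=196, dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.head = nn.Linear(dim, dim)  # predicts token features (stand-in for pixels)

    def forward(self, tokens):
        B, N, D = tokens.shape
        n_keep = int(N * (1 - self.mask_ratio))
        # Per-sample random shuffle; the first n_keep indices stay visible.
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        keep, drop = ids[:, :n_keep], ids[:, n_keep:]
        vis = torch.gather(tokens + self.pos, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(vis)               # encoder never sees mask tokens
        # Append mask tokens for the dropped positions, then restore original order.
        full = torch.cat([latent, self.mask_token.expand(B, N - n_keep, D)], dim=1)
        full = torch.gather(full, 1, ids.argsort(dim=1).unsqueeze(-1).expand(-1, -1, D))
        pred = self.head(self.decoder(full + self.pos))
        # The loss is taken only over masked positions, as in MAE-style training.
        tgt = torch.gather(tokens, 1, drop.unsqueeze(-1).expand(-1, -1, D))
        rec = torch.gather(pred, 1, drop.unsqueeze(-1).expand(-1, -1, D))
        return ((rec - tgt) ** 2).mean()

loss = TinyMAE()(torch.randn(2, 196, 64))  # toy batch of 196 pre-embedded tokens
```

Computing the loss only on masked positions is what makes the partial observation informative: the model cannot lower the loss by copying its input.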
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
State Key Laboratory for Novel Software Technology, Nanjing University, China; Shang...
To address these limitations, we present Siamese Masked Autoencoders (SiamMAE): a simple extension of MAEs for learning visual correspondence from videos. In our approach, two frames are randomly selected from a video clip, with the future frame having a significant portion (95%) of its patches randomly...
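The sampling and asymmetric masking step the excerpt describes can be sketched as follows. This is an illustration under assumed defaults: the patch-grid size, the frame-gap range, and the function name are assumptions, and the cross-attention decoder that predicts the future frame from the past one is omitted.

```python
import torch

def sample_siamese_pair(video, max_gap=48, future_mask_ratio=0.95,
                        num_patches=196):
    """Sample a (past, future) frame pair from one clip and build the
    asymmetric mask: the past frame stays fully visible while ~95% of the
    future frame's patches are marked for masking."""
    T = video.shape[0]                                  # video: (T, C, H, W)
    t1 = torch.randint(0, T - 1, (1,)).item()           # past frame index
    gap = torch.randint(1, min(max_gap, T - t1 - 1) + 1, (1,)).item()
    past, future = video[t1], video[t1 + gap]
    n_drop = int(num_patches * future_mask_ratio)
    future_mask = torch.zeros(num_patches, dtype=torch.bool)
    future_mask[torch.randperm(num_patches)[:n_drop]] = True  # True = masked
    return past, future, future_mask

past, future, mask = sample_siamese_pair(torch.randn(64, 3, 224, 224))
```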
Transformers [17] are displacing the older convolutional neural network (CNN) paradigm in medical imaging [18]. In particular, the masked autoencoder (MAE), a Transformer-based method [19], is another promising self-supervised learning approach. Introduced in the context of natural...
Inspired by it, BEiT [4] introduced the mask-then-predict paradigm to the computer vision field and demonstrated the great potential of masked image modeling (MIM) on various tasks. BEiT v2 [35] constructed a semantically rich visual tokenizer to obtain better reconstruction targets. MAE [1...
MATE: Masked Autoencoders are Online 3D Test-Time Learners
MATE is the first 3D Test-Time Training (TTT) method that makes 3D object recognition architectures robust to the distribution shifts that commonly occur in 3D point clouds. MATE follows the classical TTT paradigm of using an auxiliary...
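A generic sketch of that TTT loop is below, under the assumption that the auxiliary task is masked reconstruction (as in MATE). The toy `nn.Linear` modules stand in for the pre-trained point-cloud encoder and heads, and `random_mask` is a hypothetical helper, not MATE's actual masking routine.

```python
import copy
import torch
import torch.nn as nn

def random_mask(tokens, ratio=0.9):
    """Zero out a random subset of tokens; return input, target, bool mask."""
    B, N, _ = tokens.shape
    mask = torch.rand(B, N) < ratio
    masked = tokens.clone()
    masked[mask] = 0.0
    return masked, tokens, mask

def test_time_adapt(encoder, rec_head, cls_head, sample, steps=1, lr=1e-3):
    # Adapt a copy of the encoder so the source weights stay untouched.
    enc = copy.deepcopy(encoder)
    opt = torch.optim.SGD(enc.parameters(), lr=lr)
    for _ in range(steps):
        masked, target, mask = random_mask(sample)
        # Self-supervised auxiliary loss on the unlabeled test sample itself.
        loss = ((rec_head(enc(masked)) - target)[mask] ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                    # predict with the adapted encoder
        return cls_head(enc(sample).mean(dim=1)).argmax(-1)

pred = test_time_adapt(nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 10),
                       torch.randn(1, 128, 64))
```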
Figure 1. Comparison between the conventional MIM pre-training paradigm and our proposed HPM. In typical MIM methods, the model usually focuses on predicting some form of target for the masked patches (e.g., the discrete tokens of BEiT [1] or the pixel RGB values of MAE [2]). Moreover, because CV signals are dense, MIM methods usually require a pre-defined masking strategy to construct a challenging self-supervised...
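To make the "pre-defined masking strategy" point concrete, here are two standard strategies as short sketches: uniform random masking (MAE-style) and block-wise masking (BEiT-style). The grid size, block size, and ratios are illustrative, not any paper's exact settings.

```python
import torch

def random_masking(n_patches=196, ratio=0.75):
    """Uniform random masking: scattered single patches are dropped."""
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[torch.randperm(n_patches)[: int(n_patches * ratio)]] = True
    return mask

def block_masking(grid=14, block=4, ratio=0.4):
    """Block-wise masking: contiguous square regions are dropped until the
    target ratio is reached, which is harder to infill from neighbors than
    scattered patches."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while mask.float().mean() < ratio:
        top = torch.randint(0, grid - block + 1, (1,)).item()
        left = torch.randint(0, grid - block + 1, (1,)).item()
        mask[top:top + block, left:left + block] = True
    return mask.flatten()
```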
To address this limitation, we reintroduce contrastive learning (CL) into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as commonly used in the image domain, we randomly mask the input tokens twice...
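Masking the input twice yields two partial views of the same sample for free, which is enough to drive a contrastive objective. The sketch below pairs the two views with an InfoNCE loss; the global mean-pooled embedding, the loss form, and the function name are assumptions, and the paper's actual view construction may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_twice_contrastive(tokens, encoder, mask_ratio=0.6, tau=0.1):
    """Mask the same input twice to obtain two 'views' without any extra data
    augmentation, then pull the two global embeddings of each sample together
    with an InfoNCE loss (other samples in the batch act as negatives)."""
    B, N, D = tokens.shape
    def one_view():
        keep = torch.rand(B, N).argsort(1)[:, : int(N * (1 - mask_ratio))]
        vis = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        return encoder(vis).mean(dim=1)          # (B, D) global embedding
    z1 = F.normalize(one_view(), dim=-1)
    z2 = F.normalize(one_view(), dim=-1)
    logits = z1 @ z2.t() / tau                   # (B, B) cosine similarities
    return F.cross_entropy(logits, torch.arange(B))  # diagonal = positives

loss = mask_twice_contrastive(torch.randn(4, 196, 64), nn.Linear(64, 64))
```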
Nevertheless, it is non-trivial to perform the reconstruction task under such a newly formulated modeling paradigm. To resolve the discrepancy introduced by the newly injected masked embeddings, we design a decoupled autoencoder architecture, which learns the representations of visible (unmasked) positions ...
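The decoupling the excerpt refers to can be shown in a few lines: mask embeddings never enter the encoder, only the throw-away decoder, so the encoder's input distribution at pre-training matches downstream use. Module sizes are illustrative, and positional embeddings and index unshuffling are omitted for brevity.

```python
import torch
import torch.nn as nn

dim, N, n_vis = 64, 196, 49
make = lambda n: nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, 4, batch_first=True), n)
enc, dec = make(2), make(1)
mask_tok = nn.Parameter(torch.zeros(1, 1, dim))

x_vis = torch.randn(2, n_vis, dim)     # embeddings of the visible positions only
# Decoupled design: the encoder sees only real tokens, so it never encounters
# mask embeddings that would be absent at fine-tuning time.
latent = enc(x_vis)
# Mask embeddings enter only the lightweight decoder, which is discarded after
# pre-training, so the train/test discrepancy never touches the encoder.
recon = dec(torch.cat([latent, mask_tok.expand(2, N - n_vis, dim)], dim=1))
```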