Kaiming He's recent first-author paper; if I had to sum it up in one sentence, it would be: a simple and effective pipeline for self-supervised learning on vision tasks. The goal of the paper is to explore masked autoencoders as a way to do unsupervised pre-training for CV tasks, much as BERT did for NLP: use a masked autoencoder to perform self-supervised pre-training on raw data, then transfer the resulting transformer encoder...
Masked Autoencoders are Efficient Class Incremental Learners. Jiang-Tian Zhai (1), Xialei Liu (1,*), Andrew D. Bagdanov (2), Ke Li (3), Ming-Ming Cheng (1). (1) VCIP, CS, Nankai University; (2) MICC, University of Florence; (3) Tencent Youtu Lab. Abstract: Class Incremental Learning (CIL) ...
Masked Autoencoders Are Scalable Vision Learners. Kaiming He (*,†), Xinlei Chen (*), Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. Facebook AI Research (FAIR); *equal technical contribution, †project lead. Abstract: This paper shows that masked autoencoders (MAE) are scalable self-...
Global contrast-masked autoencoders are powerful pathological representation learners. Keywords: pathological image; representation learning; self-supervised learning. © 2024 Elsevier Ltd. Using digital pathology slide scanning technology, artificial intelligence algorithms, particularly deep learning, have achieved significant results in the...
Masked autoencoders (MAEs) are a self-supervised pretraining strategy for vision transformers (ViTs) that masks out patches in an input image and then predicts the missing regions. Although the approach is both simple and effective, the MAE ...
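To make the masking step concrete, below is a minimal PyTorch sketch of MAE-style random patch masking in which the encoder only sees the visible tokens. The 75% mask ratio follows the original MAE paper; the tiny transformer encoder, tensor shapes, and the `random_masking` helper are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens per image.

    patches: (B, N, D) sequence of patch embeddings.
    Returns the visible tokens (B, N_keep, D), a binary mask (B, N) with
    1 = masked / 0 = kept, and the indices to restore the original order.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, N, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to original patch order
    return visible, mask, ids_restore

# Toy usage: 196 patch tokens of dim 192; with 75% masking the encoder
# only processes 49 tokens, which is where the pretraining speedup comes from.
tokens = torch.randn(2, 196, 192)
visible, mask, ids_restore = random_masking(tokens)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True),
    num_layers=2,
)
latent = encoder(visible)   # (2, 49, 192); a lightweight decoder then predicts the masked pixels
```

A lightweight decoder then takes the encoded visible tokens plus mask tokens and regresses the pixels of the masked patches; after pretraining the decoder is discarded and only the encoder is transferred to the downstream task.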
The main results of our experiment are given in Fig. 5. As can be seen, the model pretrained with the trajectory masked autoencoder (with segment length \(l_S = 5\) and mask ratio \(r = 0.8\)) consistently outperforms the models without pretraining in all four data regimes. Notably, this implies...
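As a small illustration of segment-wise masking with \(l_S = 5\) and \(r = 0.8\), the sketch below masks whole contiguous segments of a trajectory rather than individual timesteps. The NumPy helper name, trajectory shape, and rounding behaviour are assumptions for illustration, not the paper's code.

```python
import numpy as np

def mask_trajectory_segments(traj, seg_len=5, mask_ratio=0.8, rng=None):
    """Mask whole segments of a trajectory instead of individual timesteps.

    traj: (T, D) array of trajectory states.
    Returns a boolean mask of shape (T,) with True = masked, covering roughly
    `mask_ratio` of the length-`seg_len` segments.
    """
    rng = rng or np.random.default_rng()
    T = traj.shape[0]
    n_segments = int(np.ceil(T / seg_len))
    n_masked = int(round(mask_ratio * n_segments))

    masked_segments = rng.choice(n_segments, size=n_masked, replace=False)
    mask = np.zeros(T, dtype=bool)
    for s in masked_segments:
        mask[s * seg_len : (s + 1) * seg_len] = True
    return mask

# Toy usage: a 50-step, 2-D trajectory with l_S = 5 and r = 0.8.
traj = np.random.randn(50, 2)
mask = mask_trajectory_segments(traj, seg_len=5, mask_ratio=0.8)
visible = traj[~mask]   # the autoencoder would reconstruct traj[mask] from these
```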
It employs a two-branch masked autoencoder method to train the ViT to capture both appearance and motion information, resulting in lower training cost and higher tracking performance. Extensive experiments demonstrate that TrackMAE is effective, achieving competitive performance in the tracking...
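The snippet below is only a conceptual sketch of what a two-branch masked-autoencoder objective can look like: a shared encoder over the visible patch tokens followed by two decoders, one reconstructing appearance (RGB patches) and one reconstructing motion (here approximated by frame-difference patches) at the masked positions. The module structure, names, and equal loss weighting are assumptions for illustration and should not be read as the TrackMAE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchMAE(nn.Module):
    """Illustrative two-branch masked autoencoder (not the TrackMAE design):
    a shared encoder over visible tokens, plus an appearance decoder and a
    motion decoder that predict pixel targets at the masked positions."""

    def __init__(self, dim=192, patch_dim=768):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.appearance_decoder = nn.Sequential(
            nn.TransformerEncoder(layer(), num_layers=1), nn.Linear(dim, patch_dim))
        self.motion_decoder = nn.Sequential(
            nn.TransformerEncoder(layer(), num_layers=1), nn.Linear(dim, patch_dim))

    def forward(self, visible, mask, app_target, motion_target):
        # visible:  (B, N_vis, dim) embeddings of the unmasked patches, in original order
        # mask:     (B, N) bool, True where a patch was masked out (same count per sample)
        # targets:  (B, N, patch_dim) pixel patches of the frame / frame difference
        B, N = mask.shape
        latent = self.encoder(visible)
        # Re-insert a learned mask token at the masked positions (order handling simplified).
        full = self.mask_token.expand(B, N, -1).clone()
        full[~mask] = latent.reshape(-1, latent.shape[-1])
        app_pred = self.appearance_decoder(full)
        motion_pred = self.motion_decoder(full)
        # Reconstruction losses are computed on the masked positions only.
        return (F.mse_loss(app_pred[mask], app_target[mask])
                + F.mse_loss(motion_pred[mask], motion_target[mask]))

# Toy usage with a fixed (non-random) 75% mask, purely to show the shapes.
B, N, dim, patch_dim = 2, 196, 192, 768
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, 49:] = True
model = TwoBranchMAE(dim, patch_dim)
loss = model(torch.randn(B, 49, dim), mask,
             torch.randn(B, N, patch_dim), torch.randn(B, N, patch_dim))
```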
Masked autoencoders are effective solution to transformer data-hungry. ViT lacks the inductive biases inherent to convolution, which makes it require a large amount of training data. As a result, ViT does not perform as well as CNNs on small datasets, such as those in medicine and science. We exp...
Action Recognition on Something-Something V2:
- VideoMAE (no extra data, ViT-L, 32x2): Top-1 Accuracy 75.4 (rank #8), Top-5 Accuracy 95.2 (#4), Parameters 305M (#17), GFLOPs 1436x3 (#7)
- VideoMAE (no extra data, ViT-L, 16frame): ...