🤖 PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+
🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on Visual Transformers, Visual Attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.1+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.
🤖 PaddleViT provides models and tools for multiple vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. Pretrained weights can also be downloaded and used for finetuning on your own datasets. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.
🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle. We also provide tutorials and projects on Paddle AI Studio, so getting started is intuitive and straightforward for new users.
PaddleViT implements model architectures and tools for multiple vision tasks. Go to the following links for detailed information.
We also provide tutorials:
State-of-the-art
Easy-to-use tools
Easily customizable to your needs
High Performance
Note: It is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors for PaddleViT training. For PaddlePaddle, please refer to this link for stable version installation and this link for develop version installation.
Create a conda virtual environment and activate it.
conda create -n paddlevit python=3.7 -y
conda activate paddlevit
Install PaddlePaddle following the official instructions, e.g.,
conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
Note: please change the PaddlePaddle version and CUDA version according to your environment.
Install dependency packages
pip install yacs pyyaml
pip install cityscapesScripts
Install the detail package:
git clone https://github.com/ccvl/detail-api
cd detail-api/PythonAPI
make
make install
pip install lmdb
Clone project from GitHub
git clone https://github.com/BR-IDL/PaddleViT.git
Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop pct | Interp | Link |
---|---|---|---|---|---|---|---|---|
vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | google/baidu(ubyr) |
vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | google/baidu(3c2f) |
vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | google/baidu(qv4n) |
vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | google/baidu(wsum) |
vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | google/baidu(1bgk) |
vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | google/baidu(5t91) |
vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic | google/baidu(ieg3) |
swin_t_224 | 81.37 | 95.54 | 28.3M | 4.4G | 224 | 0.9 | bicubic | google/baidu(h2ac) |
swin_s_224 | 83.21 | 96.32 | 49.6M | 8.6G | 224 | 0.9 | bicubic | google/baidu(ydyx) |
swin_b_224 | 83.60 | 96.46 | 87.7M | 15.3G | 224 | 0.9 | bicubic | google/baidu(h4y6) |
swin_b_384 | 84.48 | 96.89 | 87.7M | 45.5G | 384 | 1.0 | bicubic | google/baidu(7nym) |
swin_b_224_22kto1k | 85.27 | 97.56 | 87.7M | 15.3G | 224 | 0.9 | bicubic | google/baidu(6ur8) |
swin_b_384_22kto1k | 86.43 | 98.07 | 87.7M | 45.5G | 384 | 1.0 | bicubic | google/baidu(9squ) |
swin_l_224_22kto1k | 86.32 | 97.90 | 196.4M | 34.3G | 224 | 0.9 | bicubic | google/baidu(nd2f) |
swin_l_384_22kto1k | 87.14 | 98.23 | 196.4M | 100.9G | 384 | 1.0 | bicubic | google/baidu(5g5e) |
deit_tiny_distilled_224 | 74.52 | 91.90 | 5.9M | 1.1G | 224 | 0.875 | bicubic | google/baidu(rhda) |
deit_small_distilled_224 | 81.17 | 95.41 | 22.4M | 4.3G | 224 | 0.875 | bicubic | google/baidu(pv28) |
deit_base_distilled_224 | 83.32 | 96.49 | 87.2M | 17.0G | 224 | 0.875 | bicubic | google/baidu(5f2g) |
deit_base_distilled_384 | 85.43 | 97.33 | 87.2M | 49.9G | 384 | 1.0 | bicubic | google/baidu(qgj2) |
volo_d1_224 | 84.12 | 96.78 | 26.6M | 6.6G | 224 | 1.0 | bicubic | google/baidu(xaim) |
volo_d1_384 | 85.24 | 97.21 | 26.6M | 19.5G | 384 | 1.0 | bicubic | google/baidu(rr7p) |
volo_d2_224 | 85.11 | 97.19 | 58.6M | 13.7G | 224 | 1.0 | bicubic | google/baidu(d82f) |
volo_d2_384 | 86.04 | 97.57 | 58.6M | 40.7G | 384 | 1.0 | bicubic | google/baidu(9cf3) |
volo_d3_224 | 85.41 | 97.26 | 86.2M | 19.8G | 224 | 1.0 | bicubic | google/baidu(a5a4) |
volo_d3_448 | 86.50 | 97.71 | 86.2M | 80.3G | 448 | 1.0 | bicubic | google/baidu(uudu) |
volo_d4_224 | 85.89 | 97.54 | 192.8M | 42.9G | 224 | 1.0 | bicubic | google/baidu(vcf2) |
volo_d4_448 | 86.70 | 97.85 | 192.8M | 172.5G | 448 | 1.0 | bicubic | google/baidu(nd4n) |
volo_d5_224 | 86.08 | 97.58 | 295.3M | 70.6G | 224 | 1.0 | bicubic | google/baidu(ymdg) |
volo_d5_448 | 86.92 | 97.88 | 295.3M | 283.8G | 448 | 1.0 | bicubic | google/baidu(qfcc) |
volo_d5_512 | 87.05 | 97.97 | 295.3M | 371.3G | 512 | 1.15 | bicubic | google/baidu(353h) |
cswin_tiny_224 | 82.81 | 96.30 | 22.3M | 4.2G | 224 | 0.9 | bicubic | google/baidu(4q3h) |
cswin_small_224 | 83.60 | 96.58 | 34.6M | 6.5G | 224 | 0.9 | bicubic | google/baidu(gt1a) |
cswin_base_224 | 84.23 | 96.91 | 77.4M | 14.6G | 224 | 0.9 | bicubic | google/baidu(wj8p) |
cswin_base_384 | 85.51 | 97.48 | 77.4M | 43.1G | 384 | 1.0 | bicubic | google/baidu(rkf5) |
cswin_large_224 | 86.52 | 97.99 | 173.3M | 32.5G | 224 | 0.9 | bicubic | google/baidu(b5fs) |
cswin_large_384 | 87.49 | 98.35 | 173.3M | 96.1G | 384 | 1.0 | bicubic | google/baidu(6235) |
cait_xxs24_224 | 78.38 | 94.32 | 11.9M | 2.2G | 224 | 1.0 | bicubic | google/baidu(j9m8) |
cait_xxs36_224 | 79.75 | 94.88 | 17.2M | 33.1G | 224 | 1.0 | bicubic | google/baidu(nebg) |
cait_xxs24_384 | 80.97 | 95.64 | 11.9M | 6.8G | 384 | 1.0 | bicubic | google/baidu(2j95) |
cait_xxs36_384 | 82.20 | 96.15 | 17.2M | 10.1G | 384 | 1.0 | bicubic | google/baidu(wx5d) |
cait_s24_224 | 83.45 | 96.57 | 46.8M | 8.7G | 224 | 1.0 | bicubic | google/baidu(m4pn) |
cait_xs24_384 | 84.06 | 96.89 | 26.5M | 15.1G | 384 | 1.0 | bicubic | google/baidu(scsv) |
cait_s24_384 | 85.05 | 97.34 | 46.8M | 26.5G | 384 | 1.0 | bicubic | google/baidu(dnp7) |
cait_s36_384 | 85.45 | 97.48 | 68.1M | 39.5G | 384 | 1.0 | bicubic | google/baidu(e3ui) |
cait_m36_384 | 86.06 | 97.73 | 270.7M | 156.2G | 384 | 1.0 | bicubic | google/baidu(r4hu) |
cait_m48_448 | 86.49 | 97.75 | 355.8M | 287.3G | 448 | 1.0 | bicubic | google/baidu(imk5) |
pvtv2_b0 | 70.47 | 90.16 | 3.7M | 0.6G | 224 | 0.875 | bicubic | google/baidu(dxgb) |
pvtv2_b1 | 78.70 | 94.49 | 14.0M | 2.1G | 224 | 0.875 | bicubic | google/baidu(2e5m) |
pvtv2_b2 | 82.02 | 95.99 | 25.4M | 4.0G | 224 | 0.875 | bicubic | google/baidu(are2) |
pvtv2_b2_linear | 82.06 | 96.04 | 22.6M | 3.9G | 224 | 0.875 | bicubic | google/baidu(a4c8) |
pvtv2_b3 | 83.14 | 96.47 | 45.2M | 6.8G | 224 | 0.875 | bicubic | google/baidu(nc21) |
pvtv2_b4 | 83.61 | 96.69 | 62.6M | 10.0G | 224 | 0.875 | bicubic | google/baidu(tthf) |
pvtv2_b5 | 83.77 | 96.61 | 82.0M | 11.5G | 224 | 0.875 | bicubic | google/baidu(9v6n) |
shuffle_vit_tiny | 82.39 | 96.05 | 28.5M | 4.6G | 224 | 0.875 | bicubic | google/baidu(8a1i) |
shuffle_vit_small | 83.53 | 96.57 | 50.1M | 8.8G | 224 | 0.875 | bicubic | google/baidu(xwh3) |
shuffle_vit_base | 83.95 | 96.91 | 88.4M | 15.5G | 224 | 0.875 | bicubic | google/baidu(1gsr) |
t2t_vit_7 | 71.68 | 90.89 | 4.3M | 1.0G | 224 | 0.9 | bicubic | google/baidu(1hpa) |
t2t_vit_10 | 75.15 | 92.80 | 5.8M | 1.3G | 224 | 0.9 | bicubic | google/baidu(ixug) |
t2t_vit_12 | 76.48 | 93.49 | 6.9M | 1.5G | 224 | 0.9 | bicubic | google/baidu(qpbb) |
t2t_vit_14 | 81.50 | 95.67 | 21.5M | 4.4G | 224 | 0.9 | bicubic | google/baidu(c2u8) |
t2t_vit_19 | 81.93 | 95.74 | 39.1M | 7.8G | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_24 | 82.28 | 95.89 | 64.0M | 12.8G | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_14 | 81.69 | 95.85 | 21.5M | 4.4G | 224 | 0.9 | bicubic | google/baidu(4in3) |
t2t_vit_t_19 | 82.44 | 96.08 | 39.1M | 7.9G | 224 | 0.9 | bicubic | google/baidu(mier) |
t2t_vit_t_24 | 82.55 | 96.07 | 64.0M | 12.9G | 224 | 0.9 | bicubic | google/baidu(6vxc) |
t2t_vit_14_384 | 83.34 | 96.50 | 21.5M | 13.0G | 384 | 1.0 | bicubic | google/baidu(r685) |
cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | google/baidu(scvb) |
cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | google/baidu(32us) |
cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | google/baidu(jj2q) |
cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | google/baidu(mjcb) |
cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | google/baidu(n55b) |
cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | google/baidu(xese) |
cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | google/baidu(58ah) |
cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | google/baidu(qwup) |
cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | google/baidu(qtw4) |
cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | google/baidu(w71e) |
cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | google/baidu(99b6) |
beit_base_patch16_224_pt22k | 85.21 | 97.66 | 87M | 12.7G | 224 | 0.9 | bicubic | google/baidu(fshn) |
beit_base_patch16_384_pt22k | 86.81 | 98.14 | 87M | 37.3G | 384 | 1.0 | bicubic | google/baidu(arvc) |
beit_large_patch16_224_pt22k | 87.48 | 98.30 | 304M | 45.0G | 224 | 0.9 | bicubic | google/baidu(2ya2) |
beit_large_patch16_384_pt22k | 88.40 | 98.60 | 304M | 131.7G | 384 | 1.0 | bicubic | google/baidu(qtrn) |
beit_large_patch16_512_pt22k | 88.60 | 98.66 | 304M | 234.0G | 512 | 1.0 | bicubic | google/baidu(567v) |
Focal-T | 82.03 | 95.86 | 28.9M | 4.9G | 224 | 0.875 | bicubic | google/baidu(i8c2) |
Focal-T (use conv) | 82.70 | 96.14 | 30.8M | 4.9G | 224 | 0.875 | bicubic | google/baidu(smrk) |
Focal-S | 83.55 | 96.29 | 51.1M | 9.4G | 224 | 0.875 | bicubic | google/baidu(dwd8) |
Focal-S (use conv) | 83.85 | 96.47 | 53.1M | 9.4G | 224 | 0.875 | bicubic | google/baidu(nr7n) |
Focal-B | 83.98 | 96.48 | 89.8M | 16.4G | 224 | 0.875 | bicubic | google/baidu(8akn) |
Focal-B (use conv) | 84.18 | 96.61 | 93.3M | 16.4G | 224 | 0.875 | bicubic | google/baidu(5nfi) |
mobilevit_xxs | 70.31 | 89.68 | 1.32M | 0.44G | 256 | 1.0 | bicubic | google/baidu(axpc) |
mobilevit_xs | 74.47 | 92.02 | 2.33M | 0.95G | 256 | 1.0 | bicubic | google/baidu(hfhm) |
mobilevit_s | 76.74 | 93.08 | 5.59M | 1.88G | 256 | 1.0 | bicubic | google/baidu(34bg) |
mobilevit_s $\dag$ | 77.83 | 93.83 | 5.59M | 1.88G | 256 | 1.0 | bicubic | google/baidu(92ic) |
vip_s7 | 81.50 | 95.76 | 25.1M | 7.0G | 224 | 0.875 | bicubic | google/baidu(mh9b) |
vip_m7 | 82.75 | 96.05 | 55.3M | 16.4G | 224 | 0.875 | bicubic | google/baidu(hvm8) |
vip_l7 | 83.18 | 96.37 | 87.8M | 24.5G | 224 | 0.875 | bicubic | google/baidu(tjvh) |
xcit_nano_12_p16_224_dist | 72.32 | 90.86 | 3.1M | 0.6G | 224 | 1.0 | bicubic | google/baidu(7qvz) |
xcit_nano_12_p16_384_dist | 75.46 | 92.70 | 3.1M | 1.6G | 384 | 1.0 | bicubic | google/baidu(1y2j) |
xcit_large_24_p16_224_dist | 84.92 | 97.13 | 189.1M | 35.9G | 224 | 1.0 | bicubic | google/baidu(kfv8) |
xcit_large_24_p16_384_dist | 85.76 | 97.54 | 189.1M | 105.5G | 384 | 1.0 | bicubic | google/baidu(ffq3) |
xcit_nano_12_p8_224_dist | 76.33 | 93.10 | 3.0M | 2.2G | 224 | 1.0 | bicubic | google/baidu(jjs7) |
xcit_nano_12_p8_384_dist | 77.82 | 94.04 | 3.0M | 6.3G | 384 | 1.0 | bicubic | google/baidu(dmc1) |
xcit_large_24_p8_224_dist | 85.40 | 97.40 | 188.9M | 141.4G | 224 | 1.0 | bicubic | google/baidu(y7gw) |
xcit_large_24_p8_384_dist | 85.99 | 97.69 | 188.9M | 415.5G | 384 | 1.0 | bicubic | google/baidu(9xww) |
pit_ti | 72.91 | 91.40 | 4.8M | 0.5G | 224 | 0.9 | bicubic | google/baidu(ydmi) |
pit_ti_distill | 74.54 | 92.10 | 5.1M | 0.5G | 224 | 0.9 | bicubic | google/baidu(7k4s) |
pit_xs | 78.18 | 94.16 | 10.5M | 1.1G | 224 | 0.9 | bicubic | google/baidu(gytu) |
pit_xs_distill | 79.31 | 94.36 | 10.9M | 1.1G | 224 | 0.9 | bicubic | google/baidu(ie7s) |
pit_s | 81.08 | 95.33 | 23.4M | 2.4G | 224 | 0.9 | bicubic | google/baidu(kt1n) |
pit_s_distill | 81.99 | 95.79 | 24.0M | 2.5G | 224 | 0.9 | bicubic | google/baidu(hhyc) |
pit_b | 82.44 | 95.71 | 73.5M | 10.6G | 224 | 0.9 | bicubic | google/baidu(uh2v) |
pit_b_distill | 84.14 | 96.86 | 74.5M | 10.7G | 224 | 0.9 | bicubic | google/baidu(3e6g) |
halonet26t | 79.10 | 94.31 | 12.5M | 3.2G | 256 | 0.95 | bicubic | google/baidu(ednv) |
halonet50ts | 81.65 | 95.61 | 22.8M | 5.1G | 256 | 0.94 | bicubic | google/baidu(3j9e) |
poolformer_s12 | 77.24 | 93.51 | 11.9M | 1.8G | 224 | 0.9 | bicubic | google/baidu(zcv4) |
poolformer_s24 | 80.33 | 95.05 | 21.3M | 3.4G | 224 | 0.9 | bicubic | google/baidu(nedr) |
poolformer_s36 | 81.43 | 95.45 | 30.8M | 5.0G | 224 | 0.9 | bicubic | google/baidu(fvpm) |
poolformer_m36 | 82.11 | 95.69 | 56.1M | 8.9G | 224 | 0.95 | bicubic | google/baidu(whfp) |
poolformer_m48 | 82.46 | 95.96 | 73.4M | 11.8G | 224 | 0.95 | bicubic | google/baidu(374f) |
botnet50 | 77.38 | 93.56 | 20.9M | 5.3G | 224 | 0.875 | bicubic | google/baidu(wh13) |
CvT-13-224 | 81.59 | 95.67 | 20M | 4.5G | 224 | 0.875 | bicubic | google/baidu(vev9) |
CvT-21-224 | 82.46 | 96.00 | 32M | 7.1G | 224 | 0.875 | bicubic | google/baidu(t2rv) |
CvT-13-384 | 83.00 | 96.36 | 20M | 16.3G | 384 | 1.0 | bicubic | google/baidu(wswt) |
CvT-21-384 | 83.27 | 96.16 | 32M | 24.9G | 384 | 1.0 | bicubic | google/baidu(hcem) |
CvT-13-384-22k | 83.26 | 97.09 | 20M | 16.3G | 384 | 1.0 | bicubic | google/baidu(c7m9) |
CvT-21-384-22k | 84.91 | 97.62 | 32M | 24.9G | 384 | 1.0 | bicubic | google/baidu(9jxe) |
CvT-w24-384-22k | 87.58 | 98.47 | 277M | 193.2G | 384 | 1.0 | bicubic | google/baidu(bbj2) |
HVT-Ti-1 | 69.45 | 89.28 | 5.7M | 0.6G | 224 | 0.875 | bicubic | google/baidu(egds) |
HVT-S-0 | 80.30 | 95.15 | 22.0M | 4.6G | 224 | 0.875 | bicubic | google/baidu(hj7a) |
HVT-S-1 | 78.06 | 93.84 | 22.1M | 2.4G | 224 | 0.875 | bicubic | google/baidu(tva8) |
HVT-S-2 | 77.41 | 93.48 | 22.1M | 1.9G | 224 | 0.875 | bicubic | google/baidu(bajp) |
HVT-S-3 | 76.30 | 92.88 | 22.1M | 1.6G | 224 | 0.875 | bicubic | google/baidu(rjch) |
HVT-S-4 | 75.21 | 92.34 | 22.1M | 1.6G | 224 | 0.875 | bicubic | google/baidu(ki4j) |
mlp_mixer_b16_224 | 76.60 | 92.23 | 60.0M | 12.7G | 224 | 0.875 | bicubic | google/baidu(xh8x) |
mlp_mixer_l16_224 | 72.06 | 87.67 | 208.2M | 44.9G | 224 | 0.875 | bicubic | google/baidu(8q7r) |
resmlp_24_224 | 79.38 | 94.55 | 30.0M | 6.0G | 224 | 0.875 | bicubic | google/baidu(jdcx) |
resmlp_36_224 | 79.77 | 94.89 | 44.7M | 9.0G | 224 | 0.875 | bicubic | google/baidu(33w3) |
resmlp_big_24_224 | 81.04 | 95.02 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(r9kb) |
resmlp_12_distilled_224 | 77.95 | 93.56 | 15.3M | 3.0G | 224 | 0.875 | bicubic | google/baidu(ghyp) |
resmlp_24_distilled_224 | 80.76 | 95.22 | 30.0M | 6.0G | 224 | 0.875 | bicubic | google/baidu(sxnx) |
resmlp_36_distilled_224 | 81.15 | 95.48 | 44.7M | 9.0G | 224 | 0.875 | bicubic | google/baidu(vt85) |
resmlp_big_24_distilled_224 | 83.59 | 96.65 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(4jk5) |
resmlp_big_24_22k_224 | 84.40 | 97.11 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(ve7i) |
gmlp_s16_224 | 79.64 | 94.63 | 19.4M | 4.5G | 224 | 0.875 | bicubic | google/baidu(bcth) |
ff_only_tiny (linear_tiny) | 61.28 | 84.06 | | | 224 | 0.875 | bicubic | google/baidu(mjgd) |
ff_only_base (linear_base) | 74.82 | 91.71 | | | 224 | 0.875 | bicubic | google/baidu(m1jc) |
repmlp_res50_light_224 | 77.01 | 93.46 | 87.1M | 3.3G | 224 | 0.875 | bicubic | google/baidu(b4fg) |
cyclemlp_b1 | 78.85 | 94.60 | 15.1M | | 224 | 0.9 | bicubic | google/baidu(mnbr) |
cyclemlp_b2 | 81.58 | 95.81 | 26.8M | | 224 | 0.9 | bicubic | google/baidu(jwj9) |
cyclemlp_b3 | 82.42 | 96.07 | 38.3M | | 224 | 0.9 | bicubic | google/baidu(v2fy) |
cyclemlp_b4 | 82.96 | 96.33 | 51.8M | | 224 | 0.875 | bicubic | google/baidu(fnqd) |
cyclemlp_b5 | 83.25 | 96.44 | 75.7M | | 224 | 0.875 | bicubic | google/baidu(s55c) |
convmixer_1024_20 | 76.94 | 93.35 | 24.5M | 9.5G | 224 | 0.96 | bicubic | google/baidu(qpn9) |
convmixer_768_32 | 80.16 | 95.08 | 21.2M | 20.8G | 224 | 0.96 | bicubic | google/baidu(m5s5) |
convmixer_1536_20 | 81.37 | 95.62 | 51.8M | 72.4G | 224 | 0.96 | bicubic | google/baidu(xqty) |
convmlp_s | 76.76 | 93.40 | 9.0M | 2.4G | 224 | 0.875 | bicubic | google/baidu(3jz3) |
convmlp_m | 79.03 | 94.53 | 17.4M | 4.0G | 224 | 0.875 | bicubic | google/baidu(vyp1) |
convmlp_l | 80.15 | 95.00 | 42.7M | 10.0G | 224 | 0.875 | bicubic | google/baidu(ne5x) |
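The Image Size and Crop pct columns together describe the evaluation preprocessing. The sketch below assumes the common convention that an image is first resized so that its shorter side equals floor(image_size / crop_pct) and is then center-cropped to image_size. This convention is an assumption here and is not confirmed for every model in the table. The function name `eval_resize_size` is hypothetical.

```python
import math


def eval_resize_size(image_size: int, crop_pct: float) -> int:
    """Return the resize target applied before the center crop.

    Assumes the convention resize = floor(image_size / crop_pct).
    """
    if image_size <= 0 or crop_pct <= 0:
        raise ValueError("image_size and crop_pct must be positive")
    return int(math.floor(image_size / crop_pct))


# (model name, Image Size, Crop pct) taken from rows of the table above.
for name, size, pct in [
    ("vit_base_patch16_224", 224, 0.875),
    ("swin_t_224", 224, 0.9),
    ("vit_base_patch16_384", 384, 1.0),
]:
    resize = eval_resize_size(size, pct)
    print(f"{name}: resize to {resize}, then center-crop to {size}")
```

Under this assumption, a crop pct of 1.0 means the image is resized directly to the target size with no border cropped away.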
Model | Backbone | box_mAP | Link |
---|---|---|---|
DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
Mask R-CNN | Swin-T 1x | 43.7 | google/baidu(qev7) |
Mask R-CNN | Swin-T 3x | 46.0 | google/baidu(m8fg) |
Mask R-CNN | Swin-S 3x | 48.4 | google/baidu(hdw5) |
Mask R-CNN | pvtv2_b0 | 38.3 | google/baidu(3kqb) |
Mask R-CNN | pvtv2_b1 | 41.8 | google/baidu(k5aq) |
Mask R-CNN | pvtv2_b2 | 45.2 | google/baidu(jh8b) |
Mask R-CNN | pvtv2_b2_linear | 44.1 | google/baidu(8ipt) |
Mask R-CNN | pvtv2_b3 | 46.9 | google/baidu(je4y) |
Mask R-CNN | pvtv2_b4 | 47.5 | google/baidu(n3ay) |
Mask R-CNN | pvtv2_b5 | 47.4 | google/baidu(jzq1) |
Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
SETR_PUP | ViT_Large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
SETR_MLA | ViT_Large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 8 | 40k | 76.71 | 79.03 | google/baidu(owoj) | google/baidu(g7ro) | config |
SETR_Naive | ViT_Large | 8 | 80k | 77.31 | 79.43 | google/baidu(owoj) | google/baidu(wn6q) | config |
SETR_PUP | ViT_Large | 8 | 40k | 77.92 | 79.63 | google/baidu(owoj) | google/baidu(zmoi) | config |
SETR_PUP | ViT_Large | 8 | 80k | 78.81 | 80.43 | google/baidu(owoj) | baidu(f793) | config |
SETR_MLA | ViT_Large | 8 | 40k | 76.70 | 78.96 | google/baidu(owoj) | baidu(qaiw) | config |
SETR_MLA | ViT_Large | 8 | 80k | 77.26 | 79.27 | google/baidu(owoj) | baidu(6bgj) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
SETR_PUP | ViT_Large | 16 | 160k | 49.12 | 49.51 | google/baidu(owoj) | baidu(udgs) | config |
SETR_MLA | ViT_Large | 8 | 160k | 47.80 | 49.34 | google/baidu(owoj) | baidu(mrrv) | config |
DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
Segmenter | ViT_Tiny | 16 | 160k | 38.45 | - | TODO | baidu(1k97) | config |
Segmenter | ViT_Small | 16 | 160k | 46.07 | - | TODO | baidu(i8nv) | config |
Segmenter | ViT_Base | 16 | 160k | 49.08 | - | TODO | baidu(hxrl) | config |
Segmenter | ViT_Large | 16 | 160k | 51.82 | - | TODO | baidu(wdz6) | config |
Segmenter_Linear | DeiT_Base | 16 | 160k | 47.34 | - | TODO | baidu(5dpv) | config |
Segmenter | DeiT_Base | 16 | 160k | 49.27 | - | TODO | baidu(3kim) | config |
Segformer | MIT-B0 | 16 | 160k | 38.37 | - | TODO | baidu(ges9) | config |
Segformer | MIT-B1 | 16 | 160k | 42.20 | - | TODO | baidu(t4n4) | config |
Segformer | MIT-B2 | 16 | 160k | 46.38 | - | TODO | baidu(h5ar) | config |
Segformer | MIT-B3 | 16 | 160k | 48.35 | - | TODO | baidu(g9n4) | config |
Segformer | MIT-B4 | 16 | 160k | 49.01 | - | TODO | baidu(e4xw) | config |
Segformer | MIT-B5 | 16 | 160k | 49.73 | - | TODO | baidu(uczo) | config |
UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - | baidu(lkhg) | config |
UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - | baidu(vvy1) | config |
UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - | baidu(y040) | config |
UperNet | CSwin_Tiny | 16 | 160k | 49.46 | | baidu(l1cp) | baidu(y1eq) | config |
UperNet | CSwin_Small | 16 | 160k | 50.88 | | baidu(6vwk) | baidu(fz2e) | config |
UperNet | CSwin_Base | 16 | 160k | 50.64 | | baidu(0ys7) | baidu(83w3) | config |
Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
---|---|---|---|---|---|---|---|---|
Trans2seg_Medium | Resnet50c | 16 | 16k | 75.97 | - | google/baidu(4dd5) | google/baidu(w25r) | config |
Model | FID | Image Size | Crop_pct | Interpolation | Link |
---|---|---|---|---|---|
styleformer_cifar10 | 2.73 | 32 | 1.0 | lanczos | google/baidu(ztky) |
styleformer_stl10 | 15.65 | 48 | 1.0 | lanczos | google/baidu(i973) |
styleformer_celeba | 3.32 | 64 | 1.0 | lanczos | google/baidu(fh5s) |
styleformer_lsun | 9.68 | 128 | 1.0 | lanczos | google/baidu(158t) |
*The results are evaluated on the CIFAR-10, STL-10, CelebA, and LSUN-church datasets, using the fid50k_full metric.
To use a model with pretrained weights, go to the specific subfolder, e.g., /image_classification/ViT/, then download the .pdparams weight file and change the related file paths in the following Python scripts. The model config files are located in ./configs.
Assuming the downloaded weight file is stored in ./vit_base_patch16_224.pdparams, use the vit_base_patch16_224 model in Python as follows:
import paddle
from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights
model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')
model.set_dict(model_state_dict)
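When loading a checkpoint into a modified architecture, it can help to compare parameter names before applying the weights. The following is a minimal sketch and not part of PaddleViT: the helper `diff_state_dict_keys` is hypothetical, and plain dictionaries with made-up key names stand in for the model's state dict and the loaded weight file so that the example runs without PaddlePaddle installed.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str], loaded_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names, each sorted.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint that the model does not use
    """
    model_set, loaded_set = set(model_keys), set(loaded_keys)
    missing = sorted(model_set - loaded_set)
    unexpected = sorted(loaded_set - model_set)
    return missing, unexpected


# Stand-in dictionaries; the key names here are invented for illustration.
model_params = {"patch_embed.weight": 0, "head.weight": 0, "head.bias": 0}
loaded_params = {"patch_embed.weight": 0, "head.weight": 0, "extra.bias": 0}

missing, unexpected = diff_state_dict_keys(model_params, loaded_params)
print("missing from checkpoint:", missing)
print("not used by model:", unexpected)
```

With a real model, the same helper could be called on the keys of the model's state dict and of the dictionary returned by `paddle.load`.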
🤖 See the README file in each model folder for detailed usage.
To evaluate ViT model performance on ImageNet2012 with a single GPU, run the following script from the command line:
sh run_eval.sh
or
CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
-cfg=./configs/vit_base_patch16_224.yaml \
-dataset=imagenet2012 \
-batch_size=16 \
-data_path=/path/to/dataset/imagenet/val \
-eval \
-pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed
To evaluate with multiple GPUs, run:
sh run_eval_multi.sh
or
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
-cfg=./configs/vit_base_patch16_224.yaml \
-dataset=imagenet2012 \
-batch_size=16 \
-data_path=/path/to/dataset/imagenet/val \
-eval \
-pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed
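The single-GPU and multi-GPU evaluation commands differ only in the script name and the `CUDA_VISIBLE_DEVICES` value. As a rough sketch, the hypothetical helper below assembles either command from Python, using only the flags shown above. The function name and its defaults are assumptions for illustration and are not part of PaddleViT.

```python
from typing import Dict, List, Sequence, Tuple


def build_eval_command(
    cfg: str,
    data_path: str,
    pretrained: str,
    gpus: Sequence[int] = (0,),
    batch_size: int = 16,
    dataset: str = "imagenet2012",
) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for an evaluation run."""
    if not gpus:
        raise ValueError("at least one GPU id is required")
    # The -pretrained flag expects the path without the .pdparams suffix.
    if pretrained.endswith(".pdparams"):
        pretrained = pretrained[: -len(".pdparams")]
    # One visible GPU uses the single-GPU script, otherwise the multi-GPU one.
    script = "main_single_gpu.py" if len(gpus) == 1 else "main_multi_gpu.py"
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    argv = [
        "python",
        script,
        f"-cfg={cfg}",
        f"-dataset={dataset}",
        f"-batch_size={batch_size}",
        f"-data_path={data_path}",
        "-eval",
        f"-pretrained={pretrained}",
    ]
    return env, argv


env, argv = build_eval_command(
    cfg="./configs/vit_base_patch16_224.yaml",
    data_path="/path/to/dataset/imagenet/val",
    pretrained="/path/to/pretrained/model/vit_base_patch16_224.pdparams",
    gpus=(0, 1, 2, 3),
)
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

The returned pair could then be passed to `subprocess.run(argv, env={**os.environ, **env})` from inside the model folder.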
To train the ViT model on ImageNet2012 with a single GPU, run the following script from the command line:
sh run_train.sh
or
CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
-cfg=./configs/vit_base_patch16_224.yaml \
-dataset=imagenet2012 \
-batch_size=32 \
-data_path=/path/to/dataset/imagenet/train
To train with multiple GPUs, run:
sh run_train_multi.sh
or
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
-cfg=./configs/vit_base_patch16_224.yaml \
-dataset=imagenet2012 \
-batch_size=16 \
-data_path=/path/to/dataset/imagenet/train