  1. Metadata-Version: 2.1
  2. Name: timm
  3. Version: 1.0.26
  4. Summary: PyTorch Image Models
  5. Keywords: pytorch,image-classification
  6. Author-Email: Ross Wightman <ross@huggingface.co>
  7. License: Apache-2.0
  8. Classifier: Development Status :: 5 - Production/Stable
  9. Classifier: Intended Audience :: Education
  10. Classifier: Intended Audience :: Science/Research
  11. Classifier: License :: OSI Approved :: Apache Software License
  12. Classifier: Programming Language :: Python :: 3.8
  13. Classifier: Programming Language :: Python :: 3.9
  14. Classifier: Programming Language :: Python :: 3.10
  15. Classifier: Programming Language :: Python :: 3.11
  16. Classifier: Programming Language :: Python :: 3.12
  17. Classifier: Topic :: Scientific/Engineering
  18. Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
  19. Classifier: Topic :: Software Development
  20. Classifier: Topic :: Software Development :: Libraries
  21. Classifier: Topic :: Software Development :: Libraries :: Python Modules
  22. Project-URL: homepage, https://github.com/huggingface/pytorch-image-models
  23. Project-URL: documentation, https://huggingface.co/docs/timm/en/index
  24. Project-URL: repository, https://github.com/huggingface/pytorch-image-models
  25. Requires-Python: >=3.8
  26. Requires-Dist: torch
  27. Requires-Dist: torchvision
  28. Requires-Dist: pyyaml
  29. Requires-Dist: huggingface_hub
  30. Requires-Dist: safetensors
  31. Description-Content-Type: text/markdown
  32. # PyTorch Image Models
  33. - [What's New](#whats-new)
  34. - [Introduction](#introduction)
  35. - [Models](#models)
  36. - [Features](#features)
  37. - [Results](#results)
  38. - [Getting Started (Documentation)](#getting-started-documentation)
  39. - [Train, Validation, Inference Scripts](#train-validation-inference-scripts)
  40. - [Awesome PyTorch Resources](#awesome-pytorch-resources)
  41. - [Licenses](#licenses)
  42. - [Citing](#citing)
  43. ## What's New
  44. ## March 23, 2026
  45. * Improve pickle checkpoint handling security. Default all loading to `weights_only=True`, add safe_global for ArgParse.
  46. * Improve attention mask handling for core ViT/EVA models & layers. Resolve bool masks, pass `is_causal` through for SSL tasks.
  47. * Fix class & register token uses with ViT and no pos embed enabled.
  48. * Add Patch Representation Refinement (PRR) as a pooling option in ViT. Thanks Sina (https://github.com/sinahmr).
  49. * Improve consistency of output projection / MLP dimensions for attention pooling layers.
  50. * Hiera model F.SDPA optimization to allow Flash Attention kernel use.
  51. * Caution added to SGDP optimizer.
  52. * Release 1.0.26. First maintenance release since my departure from Hugging Face.
  53. ## Feb 23, 2026
  54. * Add token distillation training support to distillation task wrappers
  55. * Remove some torch.jit usage in prep for official deprecation
  56. * Caution added to AdamP optimizer
  57. * Call reset_parameters() even if meta-device init so that buffers get init w/ hacks like init_empty_weights
  58. * Tweak Muon optimizer to work with DTensor/FSDP2 (clamp_ instead of clamp_min_, alternate NS branch for DTensor)
  59. * Release 1.0.25
  60. ## Jan 21, 2026
  61. * **Compat Break**: Fix oversight w/ QKV vs MLP bias in `ParallelScalingBlock` (& `DiffParallelScalingBlock`)
  62. * Does not impact any trained `timm` models but could impact downstream use.
  63. ## Jan 5 & 6, 2026
  64. * Release 1.0.24
  65. * Add new benchmark result csv files for inference timing on all models w/ RTX Pro 6000, 5090, and 4090 cards w/ PyTorch 2.9.1
  66. * Fix moved module error in deprecated timm.models.layers import path that impacts legacy imports
  67. * Release 1.0.23
  68. ## Dec 30, 2025
  69. * Add better NAdaMuon trained `dpwee`, `dwee`, `dlittle` (differential) ViTs with a small boost over previous runs
  70. * https://huggingface.co/timm/vit_dlittle_patch16_reg1_gap_256.sbb_nadamuon_in1k (83.24% top-1)
  71. * https://huggingface.co/timm/vit_dwee_patch16_reg1_gap_256.sbb_nadamuon_in1k (81.80% top-1)
  72. * https://huggingface.co/timm/vit_dpwee_patch16_reg1_gap_256.sbb_nadamuon_in1k (81.67% top-1)
  73. * Add a ~21M param `timm` variant of the CSATv2 model at 512x512 & 640x640
  74. * https://huggingface.co/timm/csatv2_21m.sw_r640_in1k (83.13% top-1)
  75. * https://huggingface.co/timm/csatv2_21m.sw_r512_in1k (82.58% top-1)
  76. * Factor non-persistent param init out of `__init__` into a common method that can be externally called via `init_non_persistent_buffers()` after meta-device init.
  77. ## Dec 12, 2025
  78. * Add CSATV2 model (thanks https://github.com/gusdlf93) -- a lightweight but high res model with DCT stem & spatial attention. https://huggingface.co/Hyunil/CSATv2
  79. * Add AdaMuon and NAdaMuon optimizer support to existing `timm` Muon impl. Appears more competitive vs AdamW with familiar hparams for image tasks.
  80. * End of year PR cleanup, merge aspects of several long open PRs
  81. * Merge differential attention (`DiffAttention`), add corresponding `DiffParallelScalingBlock` (for ViT), train some wee vits
  82. * https://huggingface.co/timm/vit_dwee_patch16_reg1_gap_256.sbb_in1k
  83. * https://huggingface.co/timm/vit_dpwee_patch16_reg1_gap_256.sbb_in1k
  84. * Add a few pooling modules, `LsePlus` and `SimPool`
  85. * Cleanup, optimize `DropBlock2d` (also add support to ByobNet based models)
  86. * Bump unit tests to PyTorch 2.9.1 + Python 3.13 on upper end, lower still PyTorch 1.13 + Python 3.10
  87. ## Dec 1, 2025
  88. * Add lightweight task abstraction, add logits and feature distillation support to train script via new tasks.
  89. * Remove old APEX AMP support
  90. ## Nov 4, 2025
  91. * Fix LayerScale / LayerScale2d init bug (init values ignored), introduced in 1.0.21. Thanks https://github.com/Ilya-Fradlin
  92. * Release 1.0.22
  93. ## Oct 31, 2025 🎃
  94. * Update imagenet & OOD variant result csv files to include a few new models and verify correctness over several torch & timm versions
  95. * EfficientNet-X and EfficientNet-H B5 model weights added as part of a hparam search for AdamW vs Muon (still iterating on Muon runs)
  96. ## Oct 16-20, 2025
  97. * Add an impl of the Muon optimizer (based on https://github.com/KellerJordan/Muon) with customizations
  98. * extra flexibility and improved handling for conv weights and fallbacks for weight shapes not suited for orthogonalization
  99. * small speedup for NS iterations by reducing allocs and using fused (b)add(b)mm ops
  100. * by default uses AdamW (or NAdamW if `nesterov=True`) updates if muon not suitable for parameter shape (or excluded via param group flag)
  101. * like torch impl, select from several LR scale adjustment fns via `adjust_lr_fn`
  102. * select from several NS coefficient presets or specify your own via `ns_coefficients`
  103. * First 2 steps of 'meta' device model initialization supported
  104. * Fix several ops that were breaking creation under 'meta' device context
  105. * Add device & dtype factory kwarg support to all models and modules (anything inheriting from nn.Module) in `timm`
  106. * License fields added to pretrained cfgs in code
  107. * Release 1.0.21
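The NS (Newton-Schulz) iterations mentioned above approximately orthogonalize a 2D update matrix. A minimal sketch of the quintic iteration, assuming the default coefficients from the referenced KellerJordan implementation and float32 math. The function name is hypothetical, and `timm`'s version adds conv-weight handling, fallbacks, and fused ops not shown here:

```python
import torch


def newton_schulz_orthogonalize(grad, steps=5, coefficients=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration."""
    a, b, c = coefficients
    x = grad.float()
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the smaller Gram matrix
    x = x / (x.norm() + eps)  # Frobenius normalization keeps the spectral norm at or below 1
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x


torch.manual_seed(0)
update = newton_schulz_orthogonalize(torch.randn(8, 16))
# Singular values end up clustered loosely around 1 rather than exactly at 1.
print(torch.linalg.svdvals(update))
```

The `ns_coefficients` option listed above would correspond to the `coefficients` tuple in this sketch.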
  108. ## Sept 21, 2025
  109. * Remap DINOv3 ViT weight tags from `lvd_1689m` -> `lvd1689m` to match (same for `sat_493m` -> `sat493m`)
  110. * Release 1.0.20
  111. ## Sept 17, 2025
  112. * DINOv3 (https://arxiv.org/abs/2508.10104) ConvNeXt and ViT models added. ConvNeXt models were mapped to existing `timm` model. ViT support done via the EVA base model w/ a new `RotaryEmbeddingDinoV3` to match the DINOv3 specific RoPE impl
  113. * HuggingFace Hub: https://huggingface.co/collections/timm/timm-dinov3-68cb08bb0bee365973d52a4d
  114. * MobileCLIP-2 (https://arxiv.org/abs/2508.20691) vision encoders. New MCI3/MCI4 FastViT variants added and weights mapped to existing FastViT and B, L/14 ViTs.
  115. * MetaCLIP-2 Worldwide (https://arxiv.org/abs/2507.22062) ViT encoder weights added.
  116. * SigLIP-2 (https://arxiv.org/abs/2502.14786) NaFlex ViT encoder weights added via timm NaFlexViT model.
  117. * Misc fixes and contributions
  118. ## July 23, 2025
  119. * Add `set_input_size()` method to EVA models, used by OpenCLIP 3.0.0 to allow resizing for timm based encoder models.
  120. * Release 1.0.18, needed for PE-Core S & T models in OpenCLIP 3.0.0
  121. * Fix small typing issue that broke Python 3.9 compat. 1.0.19 patch release.
  122. ## July 21, 2025
  123. * ROPE support added to NaFlexViT. All models covered by the EVA base (`eva.py`) including EVA, EVA02, Meta PE ViT, `timm` SBB ViT w/ ROPE, and Naver ROPE-ViT can now be loaded in NaFlexViT when `use_naflex=True` is passed at model creation time
  124. * More Meta PE ViT encoders added, including small/tiny variants, lang variants w/ tiling, and more spatial variants.
  125. * PatchDropout fixed with NaFlexViT and also w/ EVA models (regression after adding Naver ROPE-ViT)
  126. * Fix XY order with grid_indexing='xy', impacted non-square image use in 'xy' mode (only ROPE-ViT and PE impacted).
  127. ## July 7, 2025
  128. * MobileNet-v5 backbone tweaks for improved Google Gemma 3n behaviour (to pair with updated official weights)
  129. * Add stem bias (zero'd in updated weights, compat break with old weights)
  130. * GELU -> GELU (tanh approx). A minor change to be closer to JAX
  131. * Add two arguments to layer-decay support, a min scale clamp and 'no optimization' scale threshold
  132. * Add 'Fp32' LayerNorm, RMSNorm, SimpleNorm variants that can be enabled to force computation of norm in float32
  133. * Some typing, argument cleanup for norm, norm+act layers done with above
  134. * Support Naver ROPE-ViT (https://github.com/naver-ai/rope-vit) in `eva.py`, add RotaryEmbeddingMixed module for mixed mode, weights on HuggingFace Hub
  135. |model |img_size|top1 |top5 |param_count|
  136. |--------------------------------------------------|--------|------|------|-----------|
  137. |vit_large_patch16_rope_mixed_ape_224.naver_in1k |224 |84.84 |97.122|304.4 |
  138. |vit_large_patch16_rope_mixed_224.naver_in1k |224 |84.828|97.116|304.2 |
  139. |vit_large_patch16_rope_ape_224.naver_in1k |224 |84.65 |97.154|304.37 |
  140. |vit_large_patch16_rope_224.naver_in1k |224 |84.648|97.122|304.17 |
  141. |vit_base_patch16_rope_mixed_ape_224.naver_in1k |224 |83.894|96.754|86.59 |
  142. |vit_base_patch16_rope_mixed_224.naver_in1k |224 |83.804|96.712|86.44 |
  143. |vit_base_patch16_rope_ape_224.naver_in1k |224 |83.782|96.61 |86.59 |
  144. |vit_base_patch16_rope_224.naver_in1k |224 |83.718|96.672|86.43 |
  145. |vit_small_patch16_rope_224.naver_in1k |224 |81.23 |95.022|21.98 |
  146. |vit_small_patch16_rope_mixed_224.naver_in1k |224 |81.216|95.022|21.99 |
  147. |vit_small_patch16_rope_ape_224.naver_in1k |224 |81.004|95.016|22.06 |
  148. |vit_small_patch16_rope_mixed_ape_224.naver_in1k |224 |80.986|94.976|22.06 |
  149. * Some cleanup of ROPE modules, helpers, and FX tracing leaf registration
  150. * Preparing version 1.0.17 release
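For the GELU -> GELU (tanh approx) change listed above, the two activations differ only slightly. A small standard-library sketch comparing the exact erf-based form with the tanh approximation (the function names are illustrative):

```python
import math


def gelu_exact(x):
    """GELU via the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The gap is small per activation, but it can matter when matching weights trained with one variant against a reference implementation that uses the other.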
  151. ## June 26, 2025
  152. * MobileNetV5 backbone (w/ encoder only variant) for [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters) image encoder
  153. * Version 1.0.16 released
  154. ## June 23, 2025
  155. * Add F.grid_sample based 2D and factorized pos embed resize to NaFlexViT. Faster when lots of different sizes (based on example by https://github.com/stas-sl).
  156. * Further speed up patch embed resample by replacing vmap with matmul (based on snippet by https://github.com/stas-sl).
  157. * Add 3 initial native aspect NaFlexViT checkpoints created while testing, ImageNet-1k and 3 different pos embed configs w/ same hparams.
  158. | Model | Top-1 Acc | Top-5 Acc | Params (M) | Eval Seq Len |
  159. |:---|:---:|:---:|:---:|:---:|
  160. | [naflexvit_base_patch16_par_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_par_gap.e300_s576_in1k) | 83.67 | 96.45 | 86.63 | 576 |
  161. | [naflexvit_base_patch16_parfac_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_parfac_gap.e300_s576_in1k) | 83.63 | 96.41 | 86.46 | 576 |
  162. | [naflexvit_base_patch16_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_gap.e300_s576_in1k) | 83.50 | 96.46 | 86.63 | 576 |
  163. * Support gradient checkpointing for `forward_intermediates` and fix some checkpointing bugs. Thanks https://github.com/brianhou0208
  164. * Add 'corrected weight decay' (https://arxiv.org/abs/2506.02285) as option to AdamW (legacy), Adopt, Kron, Adafactor (BV), Lamb, LaProp, Lion, NadamW, RmsPropTF, SGDW optimizers
  165. * Switch PE (perception encoder) ViT models to use native timm weights instead of remapping on the fly
  166. * Fix cuda stream bug in prefetch loader
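The 'corrected weight decay' option above can be illustrated on a single scalar weight. This sketch assumes the form described in the referenced paper, where the decay coefficient is scaled by `lr / max_lr`. The function name is hypothetical, and the exact form used inside each `timm` optimizer is not confirmed here:

```python
import math


def decay_multiplier(lr, weight_decay, max_lr=None):
    """Factor a weight is multiplied by in one decoupled weight-decay step."""
    if max_lr is None:
        step_scale = lr                  # standard decoupled (AdamW-style) decay
    else:
        step_scale = lr * (lr / max_lr)  # assumed 'corrected' form: wd scaled by lr / max_lr
    return 1.0 - step_scale * weight_decay


max_lr, weight_decay, total_steps = 1e-3, 0.05, 4
for step in range(total_steps + 1):
    # Cosine schedule from max_lr down to 0.
    lr = 0.5 * max_lr * (1.0 + math.cos(math.pi * step / total_steps))
    standard = decay_multiplier(lr, weight_decay)
    corrected = decay_multiplier(lr, weight_decay, max_lr=max_lr)
    print(f"step={step} lr={lr:.2e} standard={standard:.8f} corrected={corrected:.8f}")
```

Under this assumption the two variants agree at peak learning rate, and the corrected variant decays less as the schedule anneals.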
  167. ## June 5, 2025
  168. * Initial NaFlexVit model code. NaFlexVit is a Vision Transformer with:
  169. 1. Encapsulated embedding and position encoding in a single module
  170. 2. Support for nn.Linear patch embedding on pre-patchified (dictionary) inputs
  171. 3. Support for NaFlex variable aspect, variable resolution (SigLip-2: https://arxiv.org/abs/2502.14786)
  172. 4. Support for FlexiViT variable patch size (https://arxiv.org/abs/2212.08013)
  173. 5. Support for NaViT fractional/factorized position embedding (https://arxiv.org/abs/2307.06304)
  174. * Existing vit models in `vision_transformer.py` can be loaded into the NaFlexVit model by adding the `use_naflex=True` flag to `create_model`
  175. * Some native weights coming soon
  176. * A full NaFlex data pipeline is available that allows training / fine-tuning / evaluating with variable aspect / size images
  177. * To enable in `train.py` and `validate.py` add the `--naflex-loader` arg, must be used with a NaFlexVit
  178. * To evaluate an existing (classic) ViT loaded in NaFlexVit model w/ NaFlex data pipe:
  179. * `python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256`
  180. * Training has some extra argument features worth noting
  181. * The `--naflex-train-seq-lens` argument specifies which sequence lengths to randomly pick from per batch during training
  182. * The `--naflex-max-seq-len` argument sets the target sequence length for validation
  183. * Adding `--model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24` will enable random patch size selection per-batch w/ interpolation
  184. * The `--naflex-loss-scale` arg changes loss scaling mode per batch relative to the batch size, `timm` NaFlex loading changes the batch size for each seq len
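To give a feel for what a sequence-length budget such as `--naflex-max-seq-len 256` implies, here is a simplified sketch that picks an aspect-preserving patch grid under a token budget. The function name and the rounding strategy are assumptions for illustration, and this is not the resize logic `timm` actually uses:

```python
import math


def naflex_grid(height, width, patch_size=16, max_seq_len=256):
    """Pick a (rows, cols) patch grid that roughly keeps aspect ratio within a token budget."""
    # Scale so that the number of patches is approximately max_seq_len.
    scale = math.sqrt(max_seq_len * patch_size ** 2 / (height * width))
    # The small epsilon guards against floating point landing just below an integer.
    grid_h = max(1, math.floor(height * scale / patch_size + 1e-9))
    grid_w = max(1, math.floor(width * scale / patch_size + 1e-9))
    # Hard clamp so the budget holds even for extreme aspect ratios.
    grid_h = min(grid_h, max_seq_len)
    grid_w = min(grid_w, max_seq_len // grid_h)
    return grid_h, grid_w


for size in ((224, 224), (480, 640), (1080, 1920)):
    rows, cols = naflex_grid(*size)
    print(size, "->", (rows, cols), "tokens:", rows * cols)
```

A square 224x224 input fills the 256-token budget exactly with a 16x16 grid, while non-square inputs use a rectangular grid with at most 256 tokens.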
  185. ## May 28, 2025
  186. * Add a number of small/fast models thanks to https://github.com/brianhou0208
  187. * SwiftFormer - [(ICCV2023) SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://github.com/Amshaker/SwiftFormer)
  188. * FasterNet - [(CVPR2023) Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks](https://github.com/JierunChen/FasterNet)
  189. * SHViT - [(CVPR2024) SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design](https://github.com/ysj9909/SHViT)
  190. * StarNet - [(CVPR2024) Rewrite the Stars](https://github.com/ma-xu/Rewrite-the-Stars)
  191. * GhostNet-V3 - [GhostNetV3: Exploring the Training Strategies for Compact Models](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv3_pytorch)
  192. * Update EVA ViT (closest match) to support Perception Encoder models (https://arxiv.org/abs/2504.13181) from Meta, loading Hub weights but I still need to push dedicated `timm` weights
  193. * Add some flexibility to ROPE impl
  194. * Big increase in number of models supporting `forward_intermediates()` and some additional fixes thanks to https://github.com/brianhou0208
  195. * DaViT, EdgeNeXt, EfficientFormerV2, EfficientViT(MIT), EfficientViT(MSRA), FocalNet, GCViT, HGNet /V2, InceptionNeXt, Inception-V4, MambaOut, MetaFormer, NesT, Next-ViT, PiT, PVT V2, RepGhostNet, RepViT, ResNetV2, ReXNet, TinyViT, TResNet, VoV
  196. * TNT model updated w/ new weights and `forward_intermediates()` thanks to https://github.com/brianhou0208
  197. * Add `local-dir:` pretrained schema. Use `local-dir:/path/to/model/folder` as the model name to load a Hugging Face Hub style model (config.json + weights file, i.e. pretrained cfg & weights) from a local folder.
  198. * Fixes, improvements for onnx export
  199. ## Feb 21, 2025
  200. * SigLIP 2 ViT image encoders added (https://huggingface.co/collections/timm/siglip-2-67b8e72ba08b09dd97aecaf9)
  201. * Variable resolution / aspect NaFlex versions are a WIP
  202. * Add 'SO150M2' ViT weights trained with SBB recipes, great results, better for ImageNet than previous attempt w/ less training.
  203. * `vit_so150m2_patch16_reg1_gap_448.sbb_e200_in12k_ft_in1k` - 88.1% top-1
  204. * `vit_so150m2_patch16_reg1_gap_384.sbb_e200_in12k_ft_in1k` - 87.9% top-1
  205. * `vit_so150m2_patch16_reg1_gap_256.sbb_e200_in12k_ft_in1k` - 87.3% top-1
  206. * `vit_so150m2_patch16_reg4_gap_256.sbb_e200_in12k`
  207. * Updated InternViT-300M '2.5' weights
  208. * Release 1.0.15
  209. ## Feb 1, 2025
  210. * FYI PyTorch 2.6 & Python 3.13 are tested and working w/ current main and released version of `timm`
  211. ## Jan 27, 2025
  212. * Add Kron Optimizer (PSGD w/ Kronecker-factored preconditioner)
  213. * Code from https://github.com/evanatyourservice/kron_torch
  214. * See also https://sites.google.com/site/lixilinx/home/psgd
  215. ## Jan 19, 2025
  216. * Fix loading of LeViT safetensor weights, remove conversion code which should have been deactivated
  217. * Add 'SO150M' ViT weights trained with SBB recipes, decent results, but not optimal shape for ImageNet-12k/1k pretrain/ft
  218. * `vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k_ft_in1k` - 86.7% top-1
  219. * `vit_so150m_patch16_reg4_gap_384.sbb_e250_in12k_ft_in1k` - 87.4% top-1
  220. * `vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k`
  221. * Misc typing, typo, etc. cleanup
  222. * 1.0.14 release to get above LeViT fix out
  223. ## Jan 9, 2025
  224. * Add support to train and validate in pure `bfloat16` or `float16`
  225. * `wandb` project name arg added by https://github.com/caojiaolong, use arg.experiment for name
  226. * Fix old issue w/ checkpoint saving not working on filesystem w/o hard-link support (e.g. FUSE fs mounts)
  227. * 1.0.13 release
  228. ## Jan 6, 2025
  229. * Add `torch.utils.checkpoint.checkpoint()` wrapper in `timm.models` that defaults `use_reentrant=False`, unless `TIMM_REENTRANT_CKPT=1` is set in env.
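A rough sketch of what such a wrapper could look like. The environment variable name comes from the note above; how the variable is parsed here (only the literal `1` enables reentrant mode) is an assumption, and this is not `timm`'s actual code:

```python
import os

import torch
from torch.utils.checkpoint import checkpoint as torch_checkpoint


def checkpoint(function, *args, use_reentrant=None, **kwargs):
    """Gradient checkpointing that defaults to the non-reentrant implementation."""
    if use_reentrant is None:
        # Assumed parsing: only the literal "1" opts back into reentrant mode.
        use_reentrant = os.environ.get("TIMM_REENTRANT_CKPT", "0") == "1"
    return torch_checkpoint(function, *args, use_reentrant=use_reentrant, **kwargs)


x = torch.randn(4, 8, requires_grad=True)
out = checkpoint(lambda t: torch.relu(t * 2.0), x)  # activations are recomputed in backward
out.sum().backward()
print(x.grad.shape)
```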
  230. ## Dec 31, 2024
  231. * `convnext_nano` 384x384 ImageNet-12k pretrain & fine-tune. https://huggingface.co/models?search=convnext_nano%20r384
  232. * Add AIM-v2 encoders from https://github.com/apple/ml-aim, see on Hub: https://huggingface.co/models?search=timm%20aimv2
  233. * Add PaliGemma2 encoders from https://github.com/google-research/big_vision to existing PaliGemma, see on Hub: https://huggingface.co/models?search=timm%20pali2
  234. * Add missing L/14 DFN2B 39B CLIP ViT, `vit_large_patch14_clip_224.dfn2b_s39b`
  235. * Fix existing `RmsNorm` layer & fn to match standard formulation, use PT 2.5 impl when possible. Move old impl to `SimpleNorm` layer, it's LN w/o centering or bias. There were only two `timm` models using it, and they have been updated.
  236. * Allow override of `cache_dir` arg for model creation
  237. * Pass through `trust_remote_code` for HF datasets wrapper
  238. * `inception_next_atto` model added by creator
  239. * Adan optimizer caution, and Lamb decoupled weight decay options
  240. * Some feature_info metadata fixed by https://github.com/brianhou0208
  241. * All OpenCLIP and JAX (CLIP, SigLIP, Pali, etc) model weights that used load time remapping were given their own HF Hub instances so that they work with `hf-hub:` based loading, and thus will work with new Transformers `TimmWrapperModel`
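The `RmsNorm` / `SimpleNorm` distinction noted above can be shown on a plain list of numbers. This sketch reads "LN w/o centering or bias" as dividing by the standard deviation without subtracting the mean; it omits the learnable scale and uses population variance, both of which are simplifying assumptions:

```python
import math
import statistics


def rms_norm(values, eps=1e-6):
    """Standard RMSNorm: divide by the root mean square of the inputs."""
    mean_square = sum(v * v for v in values) / len(values)
    return [v / math.sqrt(mean_square + eps) for v in values]


def simple_norm(values, eps=1e-6):
    """LayerNorm-style scaling by the standard deviation, without centering."""
    variance = statistics.pvariance(values)  # variance around the mean
    return [v / math.sqrt(variance + eps) for v in values]


shifted = [1.0, 2.0, 3.0, 4.0]
print(rms_norm(shifted))
print(simple_norm(shifted))
```

The two agree on zero-mean inputs and diverge as the mean moves away from zero, which is why weights trained with one formulation may not transfer cleanly to the other.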
  242. ## Introduction
  243. Py**T**orch **Im**age **M**odels (`timm`) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aim to pull together a wide variety of SOTA models with the ability to reproduce ImageNet training results.
  244. The work of many others is present here. I've tried to make sure all source material is acknowledged via links to github, arxiv papers, etc in the README, documentation, and code docstrings. Please let me know if I missed anything.
  245. ## Features
  246. ### Models
  247. All model architecture families include variants with pretrained weights. Some specific model variants have no weights; this is NOT a bug. Help training new or better weights is always appreciated.
  248. * Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723
  249. * BEiT - https://arxiv.org/abs/2106.08254
  250. * BEiT-V2 - https://arxiv.org/abs/2208.06366
  251. * BEiT3 - https://arxiv.org/abs/2208.10442
  252. * Big Transfer ResNetV2 (BiT) - https://arxiv.org/abs/1912.11370
  253. * Bottleneck Transformers - https://arxiv.org/abs/2101.11605
  254. * CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239
  255. * CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399
  256. * CoAtNet (Convolution and Attention) - https://arxiv.org/abs/2106.04803
  257. * ConvNeXt - https://arxiv.org/abs/2201.03545
  258. * ConvNeXt-V2 - http://arxiv.org/abs/2301.00808
  259. * ConViT (Soft Convolutional Inductive Biases Vision Transformers) - https://arxiv.org/abs/2103.10697
  260. * CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929
  261. * DeiT - https://arxiv.org/abs/2012.12877
  262. * DeiT-III - https://arxiv.org/pdf/2204.07118.pdf
  263. * DenseNet - https://arxiv.org/abs/1608.06993
  264. * DLA - https://arxiv.org/abs/1707.06484
  265. * DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629
  266. * EdgeNeXt - https://arxiv.org/abs/2206.10589
  267. * EfficientFormer - https://arxiv.org/abs/2206.01191
  268. * EfficientFormer-V2 - https://arxiv.org/abs/2212.08059
  269. * EfficientNet (MBConvNet Family)
  270. * EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252
  271. * EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665
  272. * EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946
  273. * EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html
  274. * EfficientNet V2 - https://arxiv.org/abs/2104.00298
  275. * FBNet-C - https://arxiv.org/abs/1812.03443
  276. * MixNet - https://arxiv.org/abs/1907.09595
  277. * MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626
  278. * MobileNet-V2 - https://arxiv.org/abs/1801.04381
  279. * Single-Path NAS - https://arxiv.org/abs/1904.02877
  280. * TinyNet - https://arxiv.org/abs/2010.14819
  281. * EfficientViT (MIT) - https://arxiv.org/abs/2205.14756
  282. * EfficientViT (MSRA) - https://arxiv.org/abs/2305.07027
  283. * EVA - https://arxiv.org/abs/2211.07636
  284. * EVA-02 - https://arxiv.org/abs/2303.11331
  285. * FasterNet - https://arxiv.org/abs/2303.03667
  286. * FastViT - https://arxiv.org/abs/2303.14189
  287. * FlexiViT - https://arxiv.org/abs/2212.08013
  288. * FocalNet (Focal Modulation Networks) - https://arxiv.org/abs/2203.11926
  289. * GCViT (Global Context Vision Transformer) - https://arxiv.org/abs/2206.09959
  290. * GhostNet - https://arxiv.org/abs/1911.11907
  291. * GhostNet-V2 - https://arxiv.org/abs/2211.12905
  292. * GhostNet-V3 - https://arxiv.org/abs/2404.11202
  293. * gMLP - https://arxiv.org/abs/2105.08050
  294. * GPU-Efficient Networks - https://arxiv.org/abs/2006.14090
  295. * Halo Nets - https://arxiv.org/abs/2103.12731
  296. * HGNet / HGNet-V2 - TBD
  297. * HRNet - https://arxiv.org/abs/1908.07919
  298. * InceptionNeXt - https://arxiv.org/abs/2303.16900
  299. * Inception-V3 - https://arxiv.org/abs/1512.00567
  300. * Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261
  301. * Lambda Networks - https://arxiv.org/abs/2102.08602
  302. * LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136
  303. * MambaOut - https://arxiv.org/abs/2405.07992
  304. * MaxViT (Multi-Axis Vision Transformer) - https://arxiv.org/abs/2204.01697
  305. * MetaFormer (PoolFormer-v2, ConvFormer, CAFormer) - https://arxiv.org/abs/2210.13452
  306. * MLP-Mixer - https://arxiv.org/abs/2105.01601
  307. * MobileCLIP - https://arxiv.org/abs/2311.17049
  308. * MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244
  309. * FBNet-V3 - https://arxiv.org/abs/2006.02049
  310. * HardCoRe-NAS - https://arxiv.org/abs/2102.11646
  311. * LCNet - https://arxiv.org/abs/2109.15099
  312. * MobileNetV4 - https://arxiv.org/abs/2404.10518
  313. * MobileOne - https://arxiv.org/abs/2206.04040
  314. * MobileViT - https://arxiv.org/abs/2110.02178
  315. * MobileViT-V2 - https://arxiv.org/abs/2206.02680
  316. * MViT-V2 (Improved Multiscale Vision Transformer) - https://arxiv.org/abs/2112.01526
  317. * NASNet-A - https://arxiv.org/abs/1707.07012
  318. * NesT - https://arxiv.org/abs/2105.12723
  319. * Next-ViT - https://arxiv.org/abs/2207.05501
  320. * NFNet-F - https://arxiv.org/abs/2102.06171
  321. * NF-RegNet / NF-ResNet - https://arxiv.org/abs/2101.08692
  322. * PE (Perception Encoder) - https://arxiv.org/abs/2504.13181
  323. * PNasNet - https://arxiv.org/abs/1712.00559
  324. * PoolFormer (MetaFormer) - https://arxiv.org/abs/2111.11418
  325. * Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302
  326. * PVT-V2 (Improved Pyramid Vision Transformer) - https://arxiv.org/abs/2106.13797
  327. * RDNet (DenseNets Reloaded) - https://arxiv.org/abs/2403.19588
  328. * RegNet - https://arxiv.org/abs/2003.13678
  329. * RegNetZ - https://arxiv.org/abs/2103.06877
  330. * RepVGG - https://arxiv.org/abs/2101.03697
  331. * RepGhostNet - https://arxiv.org/abs/2211.06088
  332. * RepViT - https://arxiv.org/abs/2307.09283
  333. * ResMLP - https://arxiv.org/abs/2105.03404
  334. * ResNet/ResNeXt
  335. * ResNet (v1b/v1.5) - https://arxiv.org/abs/1512.03385
  336. * ResNeXt - https://arxiv.org/abs/1611.05431
  337. * 'Bag of Tricks' / Gluon C, D, E, S variations - https://arxiv.org/abs/1812.01187
  338. * Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://arxiv.org/abs/1805.00932
  339. * Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://arxiv.org/abs/1905.00546
  340. * ECA-Net (ECAResNet) - https://arxiv.org/abs/1910.03151v4
  341. * Squeeze-and-Excitation Networks (SEResNet) - https://arxiv.org/abs/1709.01507
  342. * ResNet-RS - https://arxiv.org/abs/2103.07579
  343. * Res2Net - https://arxiv.org/abs/1904.01169
  344. * ResNeSt - https://arxiv.org/abs/2004.08955
  345. * ReXNet - https://arxiv.org/abs/2007.00992
  346. * ROPE-ViT - https://arxiv.org/abs/2403.13298
  347. * SelecSLS - https://arxiv.org/abs/1907.00837
  348. * Selective Kernel Networks - https://arxiv.org/abs/1903.06586
  349. * Sequencer2D - https://arxiv.org/abs/2205.01972
  350. * SHViT - https://arxiv.org/abs/2401.16456
  351. * SigLIP (image encoder) - https://arxiv.org/abs/2303.15343
  352. * SigLIP 2 (image encoder) - https://arxiv.org/abs/2502.14786
  353. * StarNet - https://arxiv.org/abs/2403.19967
  354. * SwiftFormer - https://arxiv.org/pdf/2303.15446
  355. * Swin S3 (AutoFormerV2) - https://arxiv.org/abs/2111.14725
  356. * Swin Transformer - https://arxiv.org/abs/2103.14030
  357. * Swin Transformer V2 - https://arxiv.org/abs/2111.09883
  358. * TinyViT - https://arxiv.org/abs/2207.10666
  359. * Transformer-iN-Transformer (TNT) - https://arxiv.org/abs/2103.00112
  360. * TResNet - https://arxiv.org/abs/2003.13630
  361. * Twins (Spatial Attention in Vision Transformers) - https://arxiv.org/pdf/2104.13840.pdf
  362. * VGG - https://arxiv.org/abs/1409.1556
  363. * Visformer - https://arxiv.org/abs/2104.12533
  364. * Vision Transformer - https://arxiv.org/abs/2010.11929
  365. * ViTamin - https://arxiv.org/abs/2404.02132
  366. * VOLO (Vision Outlooker) - https://arxiv.org/abs/2106.13112
  367. * VovNet V2 and V1 - https://arxiv.org/abs/1911.06667
  368. * Xception - https://arxiv.org/abs/1610.02357
  369. * Xception (Modified Aligned, Gluon) - https://arxiv.org/abs/1802.02611
  370. * Xception (Modified Aligned, TF) - https://arxiv.org/abs/1802.02611
  371. * XCiT (Cross-Covariance Image Transformers) - https://arxiv.org/abs/2106.09681
  372. ### Optimizers
  373. To see the full list of optimizers w/ descriptions: `timm.optim.list_optimizers(with_description=True)`
  374. Included optimizers available via `timm.optim.create_optimizer_v2` factory method:
  375. * `adabelief` an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://arxiv.org/abs/2010.07468
  376. * `adafactor` adapted from [FAIRSeq impl](https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py) - https://arxiv.org/abs/1804.04235
  377. * `adafactorbv` adapted from [Big Vision](https://github.com/google-research/big_vision/blob/main/big_vision/optax.py) - https://arxiv.org/abs/2106.04560
  378. * `adahessian` by [David Samuel](https://github.com/davda54/ada-hessian) - https://arxiv.org/abs/2006.00719
  379. * `adamp` and `sgdp` by [Naver ClovAI](https://github.com/clovaai) - https://arxiv.org/abs/2006.08217
  380. * `adamuon` and `nadamuon` as per https://github.com/Chongjie-Si/AdaMuon - https://arxiv.org/abs/2507.11005
  381. * `adan` an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://arxiv.org/abs/2208.06677
  382. * `adopt` ADOPT adapted from https://github.com/iShohei220/adopt - https://arxiv.org/abs/2411.02853
  383. * `kron` PSGD w/ Kronecker-factored preconditioner from https://github.com/evanatyourservice/kron_torch - https://sites.google.com/site/lixilinx/home/psgd
  384. * `lamb` an implementation of Lamb and LambC (w/ trust-clipping) cleaned up and modified to support use with XLA - https://arxiv.org/abs/1904.00962
  385. * `laprop` optimizer from https://github.com/Z-T-WANG/LaProp-Optimizer - https://arxiv.org/abs/2002.04839
  386. * `lars` an implementation of LARS and LARC (w/ trust-clipping) - https://arxiv.org/abs/1708.03888
  387. * `lion` an implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://arxiv.org/abs/2302.06675
  388. * `lookahead` adapted from impl by [Liam](https://github.com/alphadl/lookahead.pytorch) - https://arxiv.org/abs/1907.08610
  389. * `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075
  390. * `mars` MARS optimizer from https://github.com/AGI-Arena/MARS - https://arxiv.org/abs/2411.10438
  391. * `muon` MUON optimizer from https://github.com/KellerJordan/Muon with numerous additions and improved non-transformer behaviour
  392. * `nadam` an implementation of Adam w/ Nesterov momentum
  393. * `nadamw` an implementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency
  394. * `novograd` by [Masashi Kimura](https://github.com/convergence-lab/novograd) - https://arxiv.org/abs/1905.11286
  395. * `radam` by [Liyuan Liu](https://github.com/LiyuanLucasLiu/RAdam) - https://arxiv.org/abs/1908.03265
  396. * `rmsprop_tf` adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour
  397. * `sgdw` an implementation of SGD w/ decoupled weight-decay
  398. * `fused<name>` optimizers by name with [NVIDIA Apex](https://github.com/NVIDIA/apex/tree/master/apex/optimizers) installed
  399. * `bnb<name>` optimizers by name with [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) installed
  400. * `cadamw`, `clion`, and more 'Cautious' optimizers from https://github.com/kyleliang919/C-Optim - https://arxiv.org/abs/2411.16085
  401. * `adam`, `adamw`, `rmsprop`, `adadelta`, `adagrad`, and `sgd` pass through to `torch.optim` implementations
  402. * `c` suffix (e.g. `adamc`, `nadamc`) to implement 'corrected weight decay' from https://arxiv.org/abs/2506.02285
  403. ### Augmentations
  404. * Random Erasing from [Zhun Zhong](https://github.com/zhunzhong07/Random-Erasing/blob/master/transforms.py) - https://arxiv.org/abs/1708.04896
  405. * Mixup - https://arxiv.org/abs/1710.09412
  406. * CutMix - https://arxiv.org/abs/1905.04899
  407. * AutoAugment (https://arxiv.org/abs/1805.09501) and RandAugment (https://arxiv.org/abs/1909.13719) ImageNet configurations modeled after impl for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py)
  408. * AugMix w/ JSD loss, JSD w/ clean + augmented mixing support works with AutoAugment and RandAugment as well - https://arxiv.org/abs/1912.02781
  409. * SplitBatchNorm - allows splitting batch norm layers between clean and augmented (auxiliary batch norm) data
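
To make the Mixup entry above more concrete, here is a minimal sketch of the idea in plain PyTorch. It is not the `timm` implementation; the function name `mixup_batch` and the default `alpha` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def mixup_batch(x, y, num_classes, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    # Mixing coefficient drawn from Beta(alpha, alpha), as in the Mixup paper.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Random pairing of samples within the batch.
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    # Inputs and one-hot targets are mixed with the same coefficient.
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed


images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
mixed_images, soft_targets = mixup_batch(images, labels, num_classes=10)
```

The soft targets would then be used with a loss that accepts class probabilities rather than integer labels.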
  410. ### Regularization
  411. * DropPath aka "Stochastic Depth" - https://arxiv.org/abs/1603.09382
  412. * DropBlock - https://arxiv.org/abs/1810.12890
  413. * Blur Pooling - https://arxiv.org/abs/1904.11486
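
The DropPath entry above can be sketched in a few lines of plain PyTorch. This is an illustrative version under stated assumptions, not the layer shipped with `timm`; the class name `DropPathSketch` is hypothetical.

```python
import torch
import torch.nn as nn


class DropPathSketch(nn.Module):
    """Randomly zero whole samples of a residual branch during training."""

    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        # Identity at inference time or when dropping is disabled.
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = torch.bernoulli(
            torch.full(mask_shape, keep_prob, dtype=x.dtype, device=x.device)
        )
        # Rescale kept samples so the expected value is unchanged.
        return x * mask / keep_prob


drop_path = DropPathSketch(drop_prob=0.5)
x = torch.ones(4, 3, 8, 8)
drop_path.train()
y_train = drop_path(x)
drop_path.eval()
y_eval = drop_path(x)
```

In a residual block this would typically wrap the branch output, as in `x + drop_path(branch(x))`, so that a dropped sample falls back to the skip connection.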
  414. ### Other
  415. Several (less common) features that I often utilize in my projects are included. Many of these additions are the reason why I maintain my own set of models, instead of using others' via PIP:
  416. * All models have a common default configuration interface and API for
  417.     * accessing/changing the classifier - `get_classifier` and `reset_classifier`
  418.     * doing a forward pass on just the features - `forward_features` (see [documentation](https://huggingface.co/docs/timm/feature_extraction))
  419.     * these make it easy to write consistent network wrappers that work with any of the models
  420. * All models support multi-scale feature map extraction (feature pyramids) via create_model (see [documentation](https://huggingface.co/docs/timm/feature_extraction))
  421.     * `create_model(name, features_only=True, out_indices=..., output_stride=...)`
  422.     * `out_indices` creation arg specifies which feature maps to return, these indices are 0 based and generally correspond to the `C(i + 1)` feature level.
  423.     * `output_stride` creation arg controls output stride of the network by using dilated convolutions. Most networks are stride 32 by default. Not all networks support this.
  424.     * feature map channel counts, reduction level (stride) can be queried AFTER model creation via the `.feature_info` member
  425. * All models have a consistent pretrained weight loader that adapts last linear if necessary, and from 3 to 1 channel input if desired
  426. * High performance [reference training, validation, and inference scripts](https://huggingface.co/docs/timm/training_script) that work in several process/GPU modes:
  427.     * NVIDIA DDP w/ a single GPU per process, multiple processes with APEX present (AMP mixed-precision optional)
  428.     * PyTorch DistributedDataParallel w/ multi-gpu, single process (AMP disabled as it crashes when enabled)
  429.     * PyTorch w/ single GPU single process (AMP optional)
  430. * A dynamic global pool implementation that allows selecting from average pooling, max pooling, average + max, or concat([average, max]) at model creation. All global pooling is adaptive average by default and compatible with pretrained weights.
  431. * A 'Test Time Pool' wrapper that can wrap any of the included models and usually provides improved performance doing inference with input images larger than the training size. Idea adapted from the original DPN implementation when I ported it (https://github.com/cypw/DPNs)
  432. * Learning rate schedulers
  433.     * Ideas adopted from
  434.         * [AllenNLP schedulers](https://github.com/allenai/allennlp/tree/master/allennlp/training/learning_rate_schedulers)
  435.         * [FAIRseq lr_scheduler](https://github.com/pytorch/fairseq/tree/master/fairseq/optim/lr_scheduler)
  436.         * SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983)
  437.     * Schedulers include `step`, `cosine` w/ restarts, `tanh` w/ restarts, `plateau`
  438. * Space-to-Depth by [mrT23](https://github.com/mrT23/TResNet/blob/master/src/models/tresnet/layers/space_to_depth.py) (https://arxiv.org/abs/1801.04590)
  439. * Adaptive Gradient Clipping (https://arxiv.org/abs/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets)
  440. * An extensive selection of channel and/or spatial attention modules:
  441.     * Bottleneck Transformer - https://arxiv.org/abs/2101.11605
  442.     * CBAM - https://arxiv.org/abs/1807.06521
  443.     * Effective Squeeze-Excitation (ESE) - https://arxiv.org/abs/1911.06667
  444.     * Efficient Channel Attention (ECA) - https://arxiv.org/abs/1910.03151
  445.     * Gather-Excite (GE) - https://arxiv.org/abs/1810.12348
  446.     * Global Context (GC) - https://arxiv.org/abs/1904.11492
  447.     * Halo - https://arxiv.org/abs/2103.12731
  448.     * Involution - https://arxiv.org/abs/2103.06255
  449.     * Lambda Layer - https://arxiv.org/abs/2102.08602
  450.     * Non-Local (NL) - https://arxiv.org/abs/1711.07971
  451.     * Squeeze-and-Excitation (SE) - https://arxiv.org/abs/1709.01507
  452.     * Selective Kernel (SK) - https://arxiv.org/abs/1903.06586
  453.     * Split (SPLAT) - https://arxiv.org/abs/2004.08955
  454.     * Shifted Window (SWIN) - https://arxiv.org/abs/2103.14030
  455. ## Results
  456. Model validation results can be found in the [results tables](results/README.md)
  457. ## Getting Started (Documentation)
  458. The official documentation can be found at https://huggingface.co/docs/hub/timm. Documentation contributions are welcome.
  459. [Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide](https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055-2/) by [Chris Hughes](https://github.com/Chris-hughes10) is an extensive blog post covering many aspects of `timm` in detail.
  460. [timmdocs](http://timm.fast.ai/) is an alternate set of documentation for `timm`. A big thanks to [Aman Arora](https://github.com/amaarora) for his efforts creating timmdocs.
  461. [paperswithcode](https://paperswithcode.com/lib/timm) is a good resource for browsing the models within `timm`.
  462. ## Train, Validation, Inference Scripts
  463. The root folder of the repository contains reference train, validation, and inference scripts that work with the included models and other features of this repository. They are adaptable for other datasets and use cases with a little hacking. See [documentation](https://huggingface.co/docs/timm/training_script).
  464. ## Awesome PyTorch Resources
  465. One of the greatest assets of PyTorch is the community and their contributions. A few of my favourite resources that pair well with the models and components here are listed below.
  466. ### Object Detection, Instance and Semantic Segmentation
  467. * Detectron2 - https://github.com/facebookresearch/detectron2
  468. * Segmentation Models (Semantic) - https://github.com/qubvel/segmentation_models.pytorch
  469. * EfficientDet (Obj Det, Semantic soon) - https://github.com/rwightman/efficientdet-pytorch
  470. ### Computer Vision / Image Augmentation
  471. * Albumentations - https://github.com/albumentations-team/albumentations
  472. * Kornia - https://github.com/kornia/kornia
  473. ### Knowledge Distillation
  474. * RepDistiller - https://github.com/HobbitLong/RepDistiller
  475. * torchdistill - https://github.com/yoshitomo-matsubara/torchdistill
  476. ### Metric Learning
  477. * PyTorch Metric Learning - https://github.com/KevinMusgrave/pytorch-metric-learning
  478. ### Training / Frameworks
  479. * fastai - https://github.com/fastai/fastai
  480. * lightly_train - https://github.com/lightly-ai/lightly-train
  481. ### Deployment
  482. * timmx (Export timm models to ONNX, CoreML, LiteRT, TensorRT, and more) - https://github.com/Boulaouaney/timmx
  483. ## Licenses
  484. ### Code
  485. The code here is licensed Apache 2.0. I've taken care to make sure any third party code included or adapted has compatible (permissive) licenses such as MIT, BSD, etc. I've made an effort to avoid any GPL / LGPL conflicts. That said, it is your responsibility to ensure you comply with licenses here and conditions of any dependent licenses. Where applicable, I've linked the sources/references for various components in docstrings. If you think I've missed anything please create an issue.
  486. ### Pretrained Weights
  487. So far all of the pretrained weights available here are pretrained on ImageNet with a select few that have some additional pretraining (see extra note below). ImageNet was released for non-commercial research purposes only (https://image-net.org/download). It's not clear what the implications of that are for the use of pretrained weights from that dataset. Any models I have trained with ImageNet are done for research purposes and one should assume that the original dataset license applies to the weights. It's best to seek legal advice if you intend to use the pretrained weights in a commercial product.
  488. #### Pretrained on more than ImageNet
  489. Several weights included or referenced here were pretrained with proprietary datasets that I do not have access to. These include the Facebook WSL, SSL, SWSL ResNe(Xt) and the Google Noisy Student EfficientNet models. The Facebook models have an explicit non-commercial license (CC-BY-NC 4.0, https://github.com/facebookresearch/semi-supervised-ImageNet1K-models, https://github.com/facebookresearch/WSL-Images). The Google models do not appear to have any restriction beyond the Apache 2.0 license (and ImageNet concerns). In either case, you should contact Facebook or Google with any questions.
  490. ## Citing
  491. ### BibTeX
  492. ```bibtex
  493. @misc{rw2019timm,
  494. author = {Ross Wightman},
  495. title = {PyTorch Image Models},
  496. year = {2019},
  497. publisher = {GitHub},
  498. journal = {GitHub repository},
  499. doi = {10.5281/zenodo.4414861},
  500. howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
  501. }
  502. ```
  503. ### Latest DOI
  504. [![DOI](https://zenodo.org/badge/168799526.svg)](https://zenodo.org/badge/latestdoi/168799526)