How the Vision Transformer (ViT) works in 10 minutes: an image is worth 16×16 words
In this article, you will learn how the Vision Transformer works for image classification. We distill the important details you need to grasp, along with the reasons it can work very well given enough pretraining data.