How did transformers take over computer vision? CNNs' struggle with long-range dependencies
Description:
Why do we need transformers for vision? To answer this, we first revisit Convolutional Neural Networks (CNNs) – the models that powered computer vision breakthroughs for almost a decade. CNNs have been the backbone of image classification, segmentation, and detection, driving successes in models like AlexNet, VGG, ResNet, and beyond.

📌 In this lecture you will learn:
- How CNNs work, using convolution operations, filters, and feature maps.
- Why convolutions are so powerful for extracting local patterns in images.
- The intuition behind kernels, stride, and receptive fields (see the convolution sketch after this description).
- The limitations of CNNs: difficulty modeling global context, reliance on local patterns, and inefficiency when scaling to larger images.
- Why these shortcomings created the need for a new architecture.

We then discuss the motivation for transformers in vision. Unlike CNNs, transformers capture long-range dependencies and global context effectively, making them a natural fit for tasks where relationships across the entire image matter.

From there, we introduce the Vision Transformer (ViT) – the groundbreaking architecture from 2020 that reshaped how we think about computer vision. You will see how images can be split into patches, treated like tokens, and processed with self-attention, just like text sequences in large language models (a minimal ViT sketch also follows this description).

📖 We also look at the original ViT paper, its reception in the research community, and the massive impact it has had, with 65k+ citations.

By the end of this lecture, you will have:
- A clear understanding of how CNNs operate.
- A structured view of their strengths and weaknesses.
- A solid motivation for why transformers are now essential for vision.
- A first look at the Vision Transformer architecture, which will be explored in detail in upcoming lectures.

This lecture sets the stage for the entire bootcamp, where we will move step by step from CNNs, to transformers, to advanced architectures for detection, segmentation, video understanding, multimodal learning, and generative vision models.

🔥 Two Versions of the Bootcamp

Free Version (YouTube playlist) – follow along with every lecture, uploaded sequentially, in a dedicated playlist.

Pro Version (https://vision-transformer.vizuara.ai) – includes everything from the free version plus:
- Detailed handwritten notes (Miro)
- Private GitHub repository with all code
- Private Discord community for collaboration and doubt clearance
- PDF e-book on Transformers for Vision & Multimodal LLMs
- Hands-on assignments with grading
- Official course certificate
- Email support from Team Vizuara

👉 Join the Pro Bootcamp here: http://vision-transformer.vizuara.ai/
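To make the convolution mechanics above concrete, here is a minimal sketch of a single 2D convolution (cross-correlation) pass in plain NumPy. This is not code from the lecture: the `conv2d` helper, the 6×6 input, and the vertical-edge kernel are illustrative assumptions, but they show how a kernel, a stride, and a local receptive field combine to produce a feature map.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height (no padding)
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value sees only a kh x kw window of the input:
            # that window is this output neuron's receptive field.
            window = image[i * stride : i * stride + kh,
                           j * stride : j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])          # vertical-edge detector
print(conv2d(image, edge_kernel).shape)             # (4, 4) feature map
```

Note how purely local the computation is: for an output pixel to "know about" a distant part of the image, many such layers must be stacked, which is exactly the long-range-dependency limitation this lecture highlights.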
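The patches-as-tokens idea can likewise be sketched in a few lines of PyTorch. This is an assumption-laden illustration, not the lecture's implementation: the 224×224 image, 16×16 patches, 64-dim embeddings, and 4 attention heads are arbitrary choices. The point is that after patch embedding, a single self-attention layer lets every patch attend to every other patch, giving global context in one step.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)
patch, dim = 16, 64

# Extract non-overlapping 16x16 patches: (1, 3*16*16, 196) = 196 patch vectors
tokens = nn.functional.unfold(img, kernel_size=patch, stride=patch)
tokens = tokens.transpose(1, 2)             # (1, 196, 768): one row per patch

embed = nn.Linear(3 * patch * patch, dim)   # learned linear patch embedding
x = embed(tokens)                           # (1, 196, 64) patch tokens

# Self-attention over the token sequence: every patch attends to every
# other patch, so distant image regions interact in a single layer.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([1, 196, 64])
print(weights.shape)  # torch.Size([1, 196, 196]): all patch-to-patch scores
```

(A real ViT adds a class token, positional embeddings, and a stack of such layers; this sketch isolates only the tokenization and attention steps covered above.)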
YouTube URL:
https://youtu.be/P1pqJ3NlTdU?si=kI_wpXbvVKCsqHLI
Created:
2. 10. 2025 05:31:56