
Contribution Details

Type Master's Thesis
Scope Discipline-based scholarship
Title Improving Vision Transformers by Incorporating Spatial Priors and Sparse Computation
Organization Unit
Authors
  • Yifei Liu
Supervisors
  • Mathias Gehrig
  • Nico Messikommer
  • Davide Scaramuzza
Language
  • English
Institution University of Zurich
Faculty Faculty of Business, Economics and Informatics
Date 2023
Abstract Text Vision Transformers (ViTs) are powerful deep learning models that have recently made impressive strides in computer vision. However, ViTs are not data efficient, and their high computational cost, quadratic in the number of tokens, currently limits their adoption in power- and computation-constrained applications. To improve the data and inference efficiency of ViTs, we explore two different paths. First, we observe that the tokens in ViTs do not incorporate any inductive bias. We extract more fine-grained tokens (dubbed subtokens) from each token by expanding its channel dimension into spatial dimensions, and introduce convolutions or shifting on the subtokens to insert intra-token spatial priors. The subtoken convolution improves the classification accuracy of ViTs trained from scratch by 2.21% on small datasets (CIFAR-100) and 1.14% on larger datasets (ImageNet-1K), and also converges faster. Second, recent studies have shown that not all tokens are helpful for the final task, and ViTs can be made more efficient by pruning redundant tokens. However, active research has mostly focused on high-level tasks such as image classification. To extend token pruning to more complex downstream tasks, we revisit the design of token pruning and identify three key components that lead to better performance: (1) token selection should not be based on the class token, (2) a dynamic pruning rate is better than a static pruning rate, and (3) preserving the feature map of all tokens is better than dropping tokens for all later layers. To this end, we propose SViT, a simple yet effective dynamic token selection scheme that selects and processes highly informative tokens while preserving a structured feature map, thus maintaining compatibility with downstream tasks. On the image classification task (ImageNet-1K), we improve the throughput of DeiT-S by 49% with only a 0.4% accuracy drop.
On object detection and instance segmentation tasks (COCO), we improve the inference speed by 32.5% with only a 0.3-point drop in box AP and no drop in mask AP.
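The first path, inserting intra-token spatial priors by reshaping a token's channel dimension into a subtoken grid and shifting it, can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: the function name, the 4-way shift pattern, and the channel grouping are all assumptions for exposition.

```python
import numpy as np

def subtoken_shift(tokens, grid=4, shift=1):
    """Hypothetical sketch of intra-token spatial priors via subtoken shifting.

    Each token's C channels are reshaped into a (grid x grid) map of
    subtokens with C // grid**2 channels each; four channel groups are
    then shifted up/down/left/right across the subtoken grid (a
    zero-parameter spatial prior). Boundary subtokens keep their
    original values.
    """
    n, c = tokens.shape
    sub_c = c // grid**2                      # channels per subtoken
    x = tokens.reshape(n, grid, grid, sub_c)  # channel dim -> spatial grid
    out = x.copy()
    q = sub_c // 4                            # one channel group per direction
    out[:, :-shift, :, 0*q:1*q] = x[:, shift:, :, 0*q:1*q]   # shift up
    out[:, shift:, :, 1*q:2*q] = x[:, :-shift, :, 1*q:2*q]   # shift down
    out[:, :, :-shift, 2*q:3*q] = x[:, :, shift:, 2*q:3*q]   # shift left
    out[:, :, shift:, 3*q:4*q] = x[:, :, :-shift, 3*q:4*q]   # shift right
    return out.reshape(n, c)                  # restore token layout
```

A subtoken convolution would replace the fixed shifts with a small learned depthwise convolution over the same (grid x grid) layout; the shift variant is shown because it is parameter-free and easiest to follow.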
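The second path, dynamic token selection with a preserved structured feature map, can be sketched as below. This is a hedged illustration of the three design findings from the abstract, not SViT itself: the L2-norm scorer, the fixed `keep_ratio`, and the placeholder "heavy" computation are stand-ins for the learned components.

```python
import numpy as np

def select_and_restore(tokens, keep_ratio=0.5):
    """Hypothetical sketch of dynamic token selection for dense tasks.

    (1) Tokens are scored without a class token (here: L2 norm as a
        stand-in for a learned scorer).
    (2) The number of kept tokens is a fraction of the input, standing
        in for a dynamic, input-dependent pruning rate.
    (3) Only the kept tokens go through the expensive computation
        (here: a placeholder doubling), and the results are scattered
        back so the full, structured feature map survives for
        downstream detection/segmentation heads.
    """
    n, c = tokens.shape
    scores = np.linalg.norm(tokens, axis=1)  # class-token-free scoring
    k = max(1, int(n * keep_ratio))          # pruning rate
    keep = np.argsort(scores)[::-1][:k]      # indices of informative tokens
    processed = tokens[keep] * 2.0           # placeholder heavy compute
    out = tokens.copy()                      # preserve all token positions
    out[keep] = processed                    # scatter processed tokens back
    return out, keep
```

Because unselected tokens are carried through rather than dropped, the output keeps the same shape as the input, which is what keeps the scheme compatible with structured downstream tasks.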