Preliminary Application: Short-video Face Parsing

We applied Shuffle Transformer to the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021 and took first place. In this competition we used Shuffle Transformer as the encoder, paired with an improved AlignSeg [8] decoder, and obtained strong segmentation results. We also compared against the widely used HRNet [9]; the experiments show that our proposed network achieves higher accuracy, especially on labels whose region boundaries are hard to delineate, such as eye shadow and other skin regions (examples are shown below). For more details on the face parsing model, see arXiv (https://arxiv.org/abs/2106.08650). Going forward, we will apply Shuffle Transformer to more downstream vision tasks.
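The core operation behind the encoder is the spatial shuffle, which adapts the channel-shuffle idea of ShuffleNet [6] to spatial tokens: tokens laid out in groups are interleaved so that each new window mixes tokens from every original window. The following is a minimal pure-Python sketch of that rearrangement; the function name `spatial_shuffle` and the flat-list representation are illustrative assumptions, not the paper's actual implementation (which operates on batched feature maps).

```python
def spatial_shuffle(tokens, groups):
    """Interleave a flat token sequence across `groups` contiguous blocks.

    Conceptually: reshape the sequence to (groups, n // groups),
    transpose to (n // groups, groups), then flatten. After the shuffle,
    any window of `groups` consecutive tokens contains exactly one token
    from each original block, so window attention can mix information
    across windows. (Illustrative sketch only.)
    """
    n = len(tokens)
    assert n % groups == 0, "sequence length must be divisible by groups"
    per = n // groups  # tokens per block
    # Read the (groups, per) layout column-major and flatten.
    return [tokens[g * per + i] for i in range(per) for g in range(groups)]
```

For example, `spatial_shuffle([0, 1, 2, 3, 4, 5], 2)` yields `[0, 3, 1, 4, 2, 5]`: the two blocks `[0, 1, 2]` and `[3, 4, 5]` are interleaved. Shuffling with `groups=g` is undone by shuffling again with `groups=n // g`, which is why alternating shuffle and unshuffle blocks restores the original spatial layout.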
References

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. NIPS, 2017.
[2] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[3] Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[4] Vaswani A, Ramachandran P, Srinivas A, et al. Scaling local self-attention for parameter efficient visual backbones. CVPR, 2021.
[5] Han Q, Fan Z, Dai Q, et al. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv preprint arXiv:2106.04263, 2021.
[6] Zhang X, Zhou X, Lin M, et al. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018: 6848-6856.
[7] Ting Z, Guo-Jun Q, Bin X, et al. Interleaved group convolutions for deep neural networks. ICCV, 2017.
[8] Huang Z, Wei Y, Wang X, et al. AlignSeg: Feature-aligned segmentation networks. T-PAMI, 2021.
[9] Wang J, Sun K, Cheng T, et al. Deep high-resolution representation learning for visual recognition. T-PAMI, 2020.
[10] Huang Z, Ben Y, Luo G, et al. Shuffle Transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650, 2021.