CSIG-Guangdong CVPR 2022 Paper Sharing Session
Background
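The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is an authoritative annual academic conference sponsored by IEEE and one of the world's top computer vision venues. It covers frontier theory and technology in computer vision, pattern recognition, image processing, artificial intelligence, and related areas. In the China Computer Federation (CCF) list of recommended international academic conferences, CVPR is a Class A conference in the artificial intelligence category. In the Google Scholar Metrics ranking, CVPR places 4th among all academic publications worldwide (behind only Nature, Science, and The New England Journal of Medicine), and 1st among publications in the Engineering and Computer Science category and in the Computer Vision and Pattern Recognition subcategory.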
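To introduce researchers, developers, and graduate students to some of the frontier theories, methods, and latest advances in computer vision, we have invited 18 young researchers from some of Guangdong Province's outstanding teams in this field to present their work accepted at CVPR 2022. The Guangdong CVPR 2022 Paper Sharing Session will be held online on Saturday, May 7, 2022.
Event Information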
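School of Computer Science and Engineering, Sun Yat-sen University
3. Livestream support: 极市平台, 蔻享学术 (Koushare)
4. Time and how to watch:
Time: Saturday, May 7, 2022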
Livestream links:
https://live.bilibili.com/3344545
https://m.koushare.com/lives/room/333717
Schedule
Talk Titles, Speakers, and Abstracts
recognition. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter. Using a transformer encoder with dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism to explicitly guide text localization through recognition loss. The straightforward design results in a concise framework that requires neither an additional rectification module nor character-level annotation for arbitrarily-shaped text. Qualitative and quantitative experiments on multi-oriented datasets RoIC13 and ICDAR 2015, arbitrarily-shaped datasets Total-Text and CTW1500, and multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese) demonstrate that SwinTextSpotter significantly outperforms existing methods. Code is available at https://github.com/mxin262/SwinTextSpotter.

Talk 3. Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism
Speaker: 杨斌斌
Abstract: Continual learning is a challenging real-world problem for constructing a mature AI system when data are provided in a streaming fashion. Despite recent progress in continual classification, research on continual object detection is impeded by the diverse sizes and numbers of objects in each image. Different from previous works that tune the whole network for all tasks, in this work, we present a simple and flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA). Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks. In this way, various knowledge can be successively memorized by storing the corresponding sub-model weights in this system.
To make ROSETTA automatically determine which experience is available and useful, a prototypical task correlation guided Gating Diversity Controller (GDC) is introduced to adaptively adjust the diversity of gates for the new task based on class-specific prototypes. The GDC module computes a class-to-class correlation matrix to depict the cross-task correlation, and thereby activates more exclusive gates for the new task if a significant domain gap is observed. Comprehensive experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC, and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance on both task-based and class-based continual object detection.

Talk 8. C2AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation
Speaker: 谢金衡
Abstract: While the class activation map (CAM) generated by an image classification network has been widely used for weakly supervised object localization (WSOL) and semantic segmentation (WSSS), such classifiers usually focus on discriminative object regions. In this paper, we propose Contrastive learning for Class-agnostic Activation Map (C²AM) generation using only unlabeled image data, without the involvement of image-level supervision. The core idea comes from the observation that i) the semantic information of foreground objects usually differs from that of their backgrounds; ii) foreground objects with similar appearance or backgrounds with similar color/texture have similar representations in the feature space. We form positive and negative pairs based on the above relations and force the network to disentangle foreground and background with a class-agnostic activation map using a novel contrastive loss. As the network is guided to discriminate cross-image foreground-background, the class-agnostic activation maps learned by our approach generate more complete object regions. From C²AM, we successfully extract class-agnostic object bounding boxes for object localization, as well as background cues to refine the CAM generated by a classification network for semantic segmentation. Extensive experiments on CUB-200-2011, ImageNet-1K, and PASCAL VOC2012 datasets show that both WSOL and WSSS can benefit from the proposed C²AM. Code will be available.

Talk 9. Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data
Speaker: 汤裕明
Abstract: Deep neural networks (DNNs) suffer from catastrophic forgetting when learning incrementally, which greatly limits their applications.
Although maintaining a handful of samples (called “exemplars”) of each task could alleviate forgetting to some extent, existing methods are still limited by the small number of exemplars, since these exemplars are too few to carry enough task-specific knowledge, and therefore the forgetting remains. To overcome this problem, we propose to “imagine” diverse counterparts of given exemplars by referring to the abundant semantic-irrelevant information from unlabeled data. Specifically, we develop a learnable feature generator to diversify exemplars by adaptively generating diverse counterparts of exemplars based on semantic information from exemplars and semantically-irrelevant information from unlabeled data. We introduce semantic contrastive learning to enforce that the generated samples are semantically consistent with exemplars, and perform semantic-decoupling contrastive learning to encourage diversity of the generated samples. The diverse generated samples could effectively prevent the DNN from forgetting when learning new tasks. Our method does not bring any extra inference cost and outperforms state-of-the-art methods on two benchmarks, CIFAR-100 and ImageNet-Subset, by a clear margin.

Talk 10. Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator
Speaker: 孔宇昕
Abstract: Automatic font generation remains a challenging research issue due to the large number of characters with complicated structures. Typically, only a few samples can serve as the style/content reference (termed few-shot learning), which further increases the difficulty of preserving local style patterns or detailed glyph structures. We investigate the drawbacks of previous studies and find that a coarse-grained discriminator is insufficient for supervising a font generator. To this end, we propose a novel Component-Aware Module (CAM), which supervises the generator to decouple content and style at a more fine-grained level, i.e., the component level.
Different from previous studies that struggle to increase the complexity of generators, we aim to perform more effective supervision for a relatively simple generator to achieve its full potential, which is a brand new perspective for font generation. The whole framework achieves remarkable results by coupling component-level supervision with adversarial learning, hence we call it Component-Guided GAN, or CG-GAN for short. Extensive experiments show that our approach outperforms state-of-the-art one-shot font generation methods. Furthermore, it can be applied to handwritten word synthesis and scene text image editing, suggesting the generalization of our approach.

Talk 11. Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Speaker: 韩鸣飞
Abstract: Learning spatial-temporal relations among multiple actors is crucial for group activity recognition. Different group activities often show diversified interactions between actors in the video. Hence, it is often difficult to model complex group activities from a single view of spatial-temporal actor evolution. To tackle this problem, we propose a distinct Dual-path Actor Interaction (Dual-AI) framework, which flexibly arranges spatial and temporal transformers in two complementary orders, enhancing actor relations by integrating merits from different spatio-temporal paths. Moreover, we introduce a novel Multi-scale Actor Contrastive Loss (MAC-Loss) between the two interactive paths of Dual-AI. Via self-supervised actor consistency at both the frame and video levels, MAC-Loss can effectively distinguish individual actor representations to reduce action confusion among different actors. Consequently, our Dual-AI can boost group activity recognition by fusing such discriminative features of different actors. To evaluate the proposed approach, we conduct extensive experiments on widely used benchmarks, including the Volleyball, Collective Activity, and NBA datasets.
The proposed Dual-AI achieves state-of-the-art performance on all these datasets. It is worth noting that the proposed Dual-AI with 50% of the training data outperforms a number of recent approaches with 100% of the training data. This confirms the generalization power of Dual-AI for group activity recognition, even under the challenging scenario of limited supervision.

Talk 12. M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Speaker: 董晓
Abstract: Despite the potential of multi-modal pre-training to learn highly discriminative feature representations from complementary data modalities, current progress is being slowed by the lack of large-scale modality-diverse datasets. By leveraging the natural suitability of E-commerce, where different modalities capture complementary semantic information, we contribute a large-scale multi-modal pre-training dataset, M5Product. The dataset comprises 5 modalities (image, text, table, video, and audio), covers over 6,000 categories and 5,000 attributes, and is 500 times larger than the largest publicly available dataset with a similar number of modalities. Furthermore, M5Product contains incomplete modality pairs and noise while also having a long-tailed distribution, resembling most real-world problems. We further propose Self-harmonized ContrAstive LEarning (SCALE), a novel pretraining framework that integrates the different modalities into a unified model through an adaptive feature fusion mechanism, where the importance of each modality is learned directly from the modality embeddings and impacts the inter-modality contrastive learning and masked tasks within a multi-modal transformer model. We evaluate the current state-of-the-art multi-modal pre-training approaches and benchmark their ability to learn from unlabeled data when faced with the large number of modalities in the M5Product dataset.
We conduct extensive experiments on four downstream tasks and demonstrate the superiority of our SCALE model, providing insights into the importance of dataset scale and diversity.

Talk 13. CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation
Speaker: 谢金衡
Abstract: It is widely known that the CAM (Class Activation Map) usually activates only discriminative object regions and falsely includes many object-related backgrounds. As only a fixed set of image-level object labels is available to the WSSS (weakly supervised semantic segmentation) model, it can be very difficult to suppress those diverse background regions consisting of open-set objects. In this paper, we propose a novel Cross Language Image Matching (CLIMS) framework, based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress closely-related open background regions. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions for the CAM of each category. In addition, we design a co-occurring background suppression loss to prevent the model from activating closely-related background regions, with a predefined set of class-related background text descriptions. These designs enable the proposed CLIMS to generate a more complete and compact activation map for the target objects. Extensive experiments on the PASCAL VOC2012 dataset show that our CLIMS significantly outperforms the previous state-of-the-art methods.

Talk 14. HL-Net: Heterophily Learning Network for Scene Graph Generation
Speaker: 林鑫
Abstract: Scene graph generation (SGG) aims to detect objects and predict their pairwise relationships within an image. Current SGG methods typically utilize graph neural networks (GNNs) to acquire context information between objects/relationships.
Despite their effectiveness, however, current SGG methods only assume scene graph homophily while ignoring heterophily. Accordingly, in this paper, we propose a novel Heterophily Learning Network (HL-Net) to comprehensively explore the homophily and heterophily between objects/relationships in scene graphs. More specifically, HL-Net comprises the following: 1) an adaptive reweighting transformer module, which adaptively integrates the information from different layers to exploit both the heterophily and homophily in objects; 2) a relationship feature propagation module that efficiently explores the connections between relationships by considering heterophily in order to refine the relationship representation; 3) a heterophily-aware message-passing scheme to further distinguish the heterophily and homophily between objects/relationships, thereby facilitating improved message passing in graphs. We conducted extensive experiments on two public datasets: Visual Genome (VG) and Open Images (OI). The experimental results demonstrate the superiority of our proposed HL-Net over existing state-of-the-art approaches. In more detail, HL-Net outperforms the second-best competitors by 2.1% on the VG dataset for scene graph classification and 1.2% on the OI dataset for the final score.

Talk 15. Deep Generalized Unfolding Networks for Image Restoration
Speaker: 牟冲
Abstract: Deep neural networks (DNNs) have achieved great success in image restoration. However, most DNN methods are designed as a black box, lacking transparency and interpretability. Although some methods have been proposed to combine traditional optimization algorithms with DNNs, they usually demand pre-defined degradation processes or handcrafted assumptions, making it difficult to deal with complex and real-world applications. In this paper, we propose a Deep Generalized Unfolding Network (DGUNet) for image restoration.
Concretely, without loss of interpretability, we integrate a gradient estimation strategy into the gradient descent step of the Proximal Gradient Descent (PGD) algorithm, driving it to deal with complex and real-world image degradation. In addition, we design inter-stage information pathways across proximal mapping in different PGD iterations to rectify the intrinsic information loss in most deep unfolding networks (DUNs) in a multi-scale and spatially adaptive way. By integrating the flexible gradient descent and informative proximal mapping, we unfold the iterative PGD algorithm into a trainable DNN. Extensive experiments on various image restoration tasks demonstrate the superiority of our method in terms of state-of-the-art performance, interpretability, and generalizability.

Talk 16. Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection
Speaker: 屈贤
Abstract: Transformer-based methods have achieved great success in the field of human-object interaction (HOI) detection. However, these models tend to adopt semantically ambiguous queries, which lowers the transformer's representation learning power. Moreover, there are a very limited number of labeled human-object pairs for most images in existing datasets, which constrains the transformer's set prediction power. To handle the first problem, we propose an efficient knowledge distillation model, named Distillation using Oracle Queries (DOQ), which shares parameters between teacher and student networks. The teacher network adopts oracle queries that are semantically clear and generates high-quality decoder embeddings. By mimicking both the attention maps and decoder embeddings of the teacher network, the representation learning power of the student network is significantly promoted. To address the second problem, we introduce an efficient data augmentation method, named Context-Consistent Stitching (CCS), which generates complicated images online. Each new image is obtained by stitching labeled human-object pairs cropped from multiple training images. By selecting source images with similar context, the new synthesized image is made visually realistic. Our methods significantly promote both the accuracy and training efficiency of transformer-based HOI detection models. Experimental results show that our proposed approach consistently outperforms state-of-the-art methods on three benchmarks: HICO-DET, HOI-A, and V-COCO. Code is available at https://github.com/SherlockHolmes221/DOQ.

Talk 17. Shunted Self-Attention via Multi-Scale Token Aggregation
Speaker: 任苏成
Abstract: Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually assign similar receptive fields to each token feature within each layer.
Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model attention at hybrid scales per attention layer. The key idea of SSA is to inject heterogeneous receptive field sizes into tokens: before computing the self-attention matrix, it selectively merges tokens to represent larger object features while keeping certain tokens to preserve fine-grained features. This novel merging scheme enables the self-attention to learn relationships between objects of different sizes, and simultaneously reduces the number of tokens and the computational cost. Extensive experiments across various tasks demonstrate the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet with only half of the model size and computation cost, and surpasses Focal Transformer by 1.3 mAP on COCO and 2.9 mIOU on ADE20K under similar parameter and computation cost. Code has been released at https://github.com/OliverRensu/Shunted-Transformer.

Talk 18. Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations
Speaker: 钟智炫
Abstract: This paper presents a Generative prior ReciprocAted Invertible rescaling Network (GRAIN) for generating faithful high-resolution (HR) images from low-resolution (LR) invertible images with an extreme upscaling factor (64×). Previous research has leveraged the prior knowledge of a pretrained GAN model to generate high-quality upscaling results. However, these methods fail to produce pixel-accurate results due to the highly ambiguous extreme mapping process.
We remedy this problem by introducing a reciprocated invertible image rescaling process, in which high-resolution information can be delicately embedded into an invertible low-resolution image and generative prior for a faithful HR reconstruction. In particular, the invertible LR features not only carry significant HR semantics, but are also trained to predict scale-specific latent codes, yielding a preferable utilization of generative features. On the other hand, the enhanced generative prior is re-injected into the rescaling process, compensating for the lost details of the invertible rescaling. Our reciprocal mechanism perfectly integrates the advantages of invertible encoding and generative prior, leading to the first feasible extreme rescaling solution. Extensive experiments demonstrate superior performance against state-of-the-art upscaling methods.
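1. This online event is free to attend. There is no registration fee, and no advance sign-up is required.
2. General attendees should join through the livestream links.
3. Invited speakers and guests, CSIG council members, GDSIG council members, members of the CSIG Technical Committee on Document Image Analysis and Recognition, and members of the CSIG Technical Committee on Machine Vision may join through Tencent Meeting; the meeting ID will be announced separately.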