Model Performance

From a modeling standpoint, CLIP4Clip is not especially novel. Instead, the authors run extensive experiments to implement and validate transferring, i.e. fine-tuning, a pretrained image-text CLIP model to the video retrieval task, achieving state-of-the-art results. The paper contains a large number of comparison experiments worth reading in detail; the main conclusions are:

1) One single image is far from enough for video encoding for video-text retrieval.

2) Post-pretraining the CLIP4Clip model on a large-scale video-text dataset is required and improves performance, especially for zero-shot prediction, by a large margin.

3) With the powerful pretrained CLIP, it is better not to introduce new parameters and to adopt a mean-pooling mechanism over video frames for small datasets. At the same time, it is better to introduce more parameters, e.g. a self-attention layer, to learn temporal dependencies for large datasets (see the sketch after this list).

4) The authors carefully study the hyper-parameters and report the best setting.
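To make conclusion 3 concrete, here is a minimal PyTorch sketch contrasting the two frame-aggregation options, roughly in the spirit of the paper's meanP and seqTransf similarity calculators. This is not the paper's code: the class name, mode strings, and hyper-parameter values below are illustrative assumptions, and it presumes per-frame CLIP features have already been extracted.

```python
import torch
import torch.nn as nn

class FrameAggregator(nn.Module):
    """Aggregate per-frame CLIP embeddings [B, T, D] into one video embedding [B, D].

    Hypothetical module, not CLIP4Clip's released code:
      mode="meanP"    -> parameter-free mean pooling (better on small datasets, per the paper)
      mode="seqTransf"-> shallow self-attention over frames to model temporal
                         dependency (better on large datasets, per the paper)
    """
    def __init__(self, dim=512, mode="meanP", num_layers=4, num_heads=8, max_frames=12):
        super().__init__()
        self.mode = mode
        if mode == "seqTransf":
            # learned frame-position embeddings plus a small Transformer encoder
            self.pos_emb = nn.Parameter(torch.zeros(max_frames, dim))
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=num_heads, batch_first=True
            )
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):  # frame_feats: [B, T, D]
        if self.mode == "seqTransf":
            x = frame_feats + self.pos_emb[: frame_feats.size(1)]
            # residual connection keeps the pretrained CLIP features dominant
            frame_feats = frame_feats + self.encoder(x)
        return frame_feats.mean(dim=1)  # [B, D]

# usage: 12 frames of 512-d CLIP features per clip
feats = torch.randn(2, 12, 512)
video_emb = FrameAggregator(mode="seqTransf")(feats)
print(video_emb.shape)  # torch.Size([2, 512])
```

The residual connection around the self-attention block reflects the paper's observation that the new parameters should perturb, not replace, the pretrained CLIP representations; with mode="meanP" the module reduces to plain averaging with no learned parameters at all.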