The details are as follows:
Paper link: https://arxiv.org/abs/2205.13313
Project link: https://github.com/guoshengcv/CACL
In this paper, the authors propose a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video Transformer, which are used in parallel to generate diverse positive pairs for contrastive learning. This enables the model to learn strong representations from these different but meaningful pairs.
In addition, the authors introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences in temporal order, which enables the model to learn rich temporal representations. The authors evaluate the method on video retrieval and action recognition on the UCF101 and HMDB51 datasets. The results show that it achieves excellent performance, substantially outperforming state-of-the-art methods such as VideoMoCo and MoCo+BE.
Video representation learning is a fundamental task in video understanding because it plays an important role in tasks such as action recognition and video retrieval. Recent work has focused on improving the performance of deep neural networks through supervised learning, which usually requires large-scale video datasets with very expensive human annotations, such as Sports1M, Kinetics, HACS, and MultiSports. The huge annotation cost inevitably limits the potential of deep networks for learning video representations. It is therefore important to improve this task using unlabeled videos, which are easy to obtain at scale.
In recent years, self-supervised learning has made great progress in learning strong image representations. It has also been extended to the video domain, where contrastive learning has been widely used. For example, recent work introduces contrastive learning to capture the differences between two video instances, which lets the model learn a representation of each video instance. However, contrastive learning in these methods mainly focuses on learning a global spatio-temporal representation of the video and struggles to capture meaningful temporal details, which usually provide important cues for distinguishing different video instances. Unlike learning image representations, therefore, modeling temporal information is essential for video representation. In this work, the authors propose a new self-supervised video representation method that performs video-level contrastive learning and temporal modeling simultaneously in a single framework.
By exploiting the sequential nature of video, a supervision signal for learning temporal information can be created, which makes self-supervised learning possible. Some recent methods follow this line of research and create pretext tasks for self-supervised temporal prediction. In this work, the authors measure the edit distance between a video and its temporal shuffle. This enables the model to explicitly quantify the degree of temporal difference, whereas existing self-supervised methods are usually limited to estimating an approximate difference between two videos in the temporal domain. For example, previous methods often create a pretext task that predicts whether two video sequences have the same pace or playback speed, but this ignores the details of the temporal difference.
Although most self-supervised contrastive learning methods use a variety of data augmentations, which provide different views of an instance, to generate positive pairs, the authors develop a new method that obtains stronger representations from different architectures through contrastive learning. The 3D CNN family, including C3D, R3D, and R(2+1)D, has achieved remarkable performance on various video tasks. Because of the inherent properties of CNNs, they can capture local correlations in the temporal domain. However, the effective receptive field of a CNN may limit its ability to model long-range dependencies.
On the other hand, the Transformer architecture can naturally capture such long-range dependencies through its self-attention mechanism, in which each token can learn to attend to the whole sequence, so that meaningful contextual information is encoded into the video representation. In addition, when training on sufficiently large data, the inductive bias of CNNs may limit their performance; thanks to the dynamic weighting of self-attention, this limitation may not arise in Transformers.
The authors argue that modeling both local and global dependencies is crucial for video understanding, and that the inductive bias of CNNs and the capacity of Transformers can complement each other. In this work, the authors propose a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL can generate diverse and more meaningful positive pairs from a 3D CNN and a video Transformer for contrastive learning. The authors show that the video Transformer can greatly enhance the video representations generated by the 3D CNN: it produces rich high-level contextual features and encourages the 3D CNN to capture more details. This allows the two architectures to work together, which is the key to improving performance.
The main contributions of this paper are summarized as follows:
The authors design a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL uses a 3D CNN and a Transformer to jointly generate diverse but meaningful positive pairs, which makes contrastive representation learning more effective.
A new self-supervised temporal learning method is introduced that explicitly measures the edit distance between a video and its temporal self-shuffle. This helps the model learn rich temporal information that complements the representations learned by CACL.
The authors validate the method on two downstream video tasks: action recognition and video retrieval. Experiments on UCF101 and HMDB51 show that the proposed CACL significantly outperforms existing methods such as VideoMoCo and MoCo+BE.
The authors tackle video representation learning in a self-supervised manner. This section first introduces the general framework of the proposed method, then describes in detail the proposed contrastive learning method and the self-supervised temporal learning based on frame-level shuffle prediction.
The figure above shows the overall framework of the approach. The framework consists of two pathways: a Transformer video encoder and a 3D CNN video encoder. The self-supervised learning signal is computed from two tasks: clip-level contrastive learning and frame-level temporal prediction.
In this work, a 3D CNN serves as the main video encoder and is also the one used for inference. Any 3D CNN architecture can be plugged into the framework. The output features of the original clip and the shuffled clip are concatenated and then fed into the contrastive head and the classification head. Both heads are fully connected feed-forward networks.
The Transformer encoder is composed of a 2D CNN and a Transformer architecture, as shown in the figure above. First, each frame of the video clip is passed through the 2D CNN, which performs feature extraction to obtain a sequence of frame-level tokens. The CNN output features are then projected into 768-D frame tokens by a fully connected layer. The frame tokens are then concatenated in temporal order, and a learnable embedding is added to the frame token sequence.
Finally, a 6-layer, 6-head Transformer takes the clip-level feature sequence as input, and the output at the learnable embedding serves as the video representation. Note that the feature extraction network is a ResNet50 pre-trained with the self-supervised method MoCo on video frames from the UCF101 training set; its weights are frozen during self-supervised video representation learning.
The goal of self-supervised contrastive learning here is to maximize the similarity between video clips that share the same context while minimizing the similarity between clips from different videos. Unlike previous contrastive learning methods, the proposed CACL uses cross-architecture contrastive learning signals to better capture local and long-range dependencies jointly.
The fundamental problem in contrastive learning is the design of positive and negative samples. Previous work on self-supervised contrastive learning usually applies various data augmentations to generate different versions of a given instance, which then form positive pairs. In this work, the authors enrich the positive pairs from two perspectives: the embedding level (using different network architectures) and the data level.
At the network level, CACL exploits the strengths of both the 3D CNN and the Transformer. Given an input video clip, each encoder generates a video representation, which doubles the number of positive samples compared with previous methods. At the data level, the authors randomly shuffle the original clip x along the temporal dimension to obtain a shuffled video clip. The two clips are then concatenated.
As shown in Figure 1, the method maximizes the similarity of the four positive pairs generated from each video clip by using different data augmentations and encoders. With two different data augmentations, the Transformer encoder, and the 3D CNN encoder, four feature representations can be generated for each video clip.
Clips from different videos are treated as negative samples. The authors use the momentum encoder and memory dictionary queue proposed in MoCo to further strengthen contrastive learning, since they provide more meaningful negative samples and improve contrastive learning performance.
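The two MoCo components mentioned here can be sketched in a few lines. This is only an illustration on plain Python lists: the names `momentum_update` and `NegativeQueue` are hypothetical, and the momentum value 0.999 is the default from the MoCo paper, assumed here rather than taken from CACL.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move the key (momentum) encoder slowly toward the query encoder.

    Each parameter follows theta_k <- m * theta_k + (1 - m) * theta_q.
    The value m = 0.999 is an assumption borrowed from MoCo.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO dictionary of key features used as negatives."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest entries when full.
        self._items = deque(maxlen=max_size)

    def enqueue(self, keys):
        self._items.extend(keys)

    def negatives(self):
        return list(self._items)


queue = NegativeQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])  # "k1" is evicted because the queue holds 3 keys
print(queue.negatives())
```

Because the key encoder changes slowly, keys stored in the queue several steps ago remain roughly consistent with the newest ones, which is what makes a large queue of negatives usable.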
The authors apply data augmentation in both the spatial and temporal domains of the input video clip. Note that spatial augmentation is applied consistently to all frames within a clip. The authors therefore maximize three kinds of similarity: (1) the similarity between clips processed by the same network but with different data augmentations; (2) the similarity between clips with the same data augmentation but processed by different networks; (3) the similarity between clips processed by different networks with different data augmentations.
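The three kinds of similarity can be made concrete by enumerating the pairings among the four views of a clip (two encoders times two augmentations). The sketch below is only a bookkeeping aid with hypothetical names; it lists all six pairings, while the text above speaks of four positive pairs per clip, so the actual loss may use a subset that is not specified here.

```python
from itertools import combinations

# Hypothetical labels for the two encoders and the two augmentations.
ENCODERS = ("cnn3d", "transformer")
AUGMENTATIONS = ("aug_a", "aug_b")


def positive_pair_kinds():
    """Group every pairing of the four views by the kind of similarity it tests."""
    views = [(enc, aug) for enc in ENCODERS for aug in AUGMENTATIONS]
    kinds = {}
    for v1, v2 in combinations(views, 2):
        if v1[0] == v2[0]:
            kind = "same network, different augmentation"
        elif v1[1] == v2[1]:
            kind = "different network, same augmentation"
        else:
            kind = "different network, different augmentation"
        kinds.setdefault(kind, []).append((v1, v2))
    return kinds


for kind, pairs in positive_pair_kinds().items():
    print(kind, pairs)
```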
Formally, the authors consider a batch of N randomly sampled, distinct video instances and extract one clip from each video, which yields N clips in the batch, denoted C. The authors then randomly shuffle the frame order of each clip to create a new set of N clips. Each clip is concatenated with its shuffled version and then further processed with data augmentation.
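The batch construction described above can be sketched as follows. Frame indices stand in for actual frames, the function names are hypothetical, and the data augmentation step is left out; the point is only the order of operations: sample a clip, shuffle a copy, concatenate.

```python
import random


def shuffle_and_concat(clip, rng):
    """Return the clip followed by a temporally shuffled copy of itself."""
    shuffled = list(clip)
    rng.shuffle(shuffled)
    return list(clip) + shuffled


def build_batch(videos, clip_len, rng):
    """Extract one random clip per video and pair it with its shuffled version."""
    batch = []
    for frames in videos:
        start = rng.randrange(len(frames) - clip_len + 1)
        clip = frames[start:start + clip_len]
        batch.append(shuffle_and_concat(clip, rng))
    return batch


rng = random.Random(0)
videos = [list(range(40)), list(range(100, 130))]  # two toy "videos"
batch = build_batch(videos, clip_len=16, rng=rng)
print(len(batch), len(batch[0]))  # N clips, each with 2 * clip_len frames
```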
This produces two concatenated video clips with different data augmentations. The generated clips are processed by two different video encoders: the 3D-CNN-based video encoder and the Transformer-based encoder. The authors thus obtain four clip-level video representations for each video instance, which are used to construct positive pairs during contrastive learning. The loss builds on the instance discrimination idea of the InfoNCE contrastive loss.
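For reference, the standard InfoNCE loss for a query feature, one positive key, and m negative keys can be written as below. The symbols q, k^{+}, k_i^{-}, and sim are notation assumed for this sketch and may differ from the paper's own.

```latex
\mathcal{L}_{\mathrm{NCE}}(q, k^{+}) =
  -\log \frac{\exp\big(\mathrm{sim}(q, k^{+}) / \tau\big)}
             {\exp\big(\mathrm{sim}(q, k^{+}) / \tau\big)
              + \sum_{i=1}^{m} \exp\big(\mathrm{sim}(q, k_i^{-}) / \tau\big)}
```

Minimizing this quantity pulls the query toward its positive key and pushes it away from the m negatives; a smaller temperature sharpens the contrast.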
The InfoNCE loss uses a similarity measure between two feature vectors, and τ is an adjustable temperature parameter. In this work, the authors extend it from a single positive pair to the positive pairs produced across augmentations and architectures. The negative samples are taken from a memory dictionary queue of size m. Under this formulation, CACL generates more positive pairs than standard contrastive learning.
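A minimal sketch of this kind of loss in plain Python is given below. It assumes cosine similarity and a temperature of 0.07 (the MoCo default); neither choice is confirmed by the text above, and `sum_over_positive_pairs` simply adds the loss over whichever positive pairs the caller supplies instead of reproducing the paper's exact combination.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query, one positive key and a list of negative keys."""
    pos = cosine(query, positive) / tau
    logits = [pos] + [cosine(query, n) / tau for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - pos


def sum_over_positive_pairs(pairs, negatives, tau=0.07):
    """Add the loss over caller-supplied (query, positive) pairs of one clip."""
    return sum(info_nce(q, k, negatives, tau) for q, k in pairs)


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)  # a well-matched positive gives the lower loss
```

In a full system the negatives would come from the memory queue and the pairs from the different encoder and augmentation combinations.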
The goal is to learn temporally sensitive video representations. To this end, the authors train the network to predict the temporal difference between a video clip and its shuffled version. The authors argue that this temporal prediction task requires both motion and appearance cues, which enables the model to learn meaningful temporal details that benefit downstream tasks. In this work, the authors propose using the minimum edit distance (MED) to measure the degree of temporal difference between a video clip and its shuffled version.
MED measures the difference between two strings (for example, words) by counting the minimum number of operations required to convert one string into the other. Mathematically, the Levenshtein distance between two strings a and b is defined recursively over their prefixes.
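The standard recursive definition of the Levenshtein distance is reproduced here for reference; the notation lev_{a,b}(i, j), the distance between the first i characters of a and the first j characters of b, is assumed rather than taken from the paper. The full distance is lev_{a,b}(|a|, |b|).

```latex
\operatorname{lev}_{a,b}(i, j) =
\begin{cases}
  \max(i, j) & \text{if } \min(i, j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1, j) + 1 \\
    \operatorname{lev}_{a,b}(i, j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```

The three branches of the inner minimum correspond to deleting a character, inserting a character, and matching or substituting a character.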
In this definition, the indicator function equals 0 when the two compared characters are equal and 1 otherwise. In this work, shuffle prediction is formulated as a classification problem, and the cross-entropy loss is used to train the 3D CNN model. Given a video clip and its shuffled version, the loss is computed over all shuffled videos, where m denotes their number.
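The edit distance between a clip and its shuffled version can be computed with the usual dynamic program. In the sketch below, frame indices stand in for frames; `attainable_meds` is a brute-force check that is only practical for very short clips, and `med_to_class` is a hypothetical label mapping, not one given in the article.

```python
from itertools import permutations


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning the first i items of a into an empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # the indicator term
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def attainable_meds(n_frames):
    """Brute force: every distance between a clip and one of its reorderings."""
    original = tuple(range(n_frames))
    return {levenshtein(original, p) for p in permutations(original)}


def med_to_class(med):
    """Hypothetical mapping: MED 0 -> class 0, MED k >= 2 -> class k - 1."""
    if med < 0 or med == 1:
        raise ValueError("a reordering cannot have this edit distance")
    return 0 if med == 0 else med - 1


print(levenshtein([0, 1, 2, 3], [1, 0, 2, 3]))
print(sorted(attainable_meds(4)))
```

A distance of 1 never occurs between two orderings of the same distinct frames: a single insertion or deletion changes the length, and a single substitution would duplicate one frame and drop another.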
Given a 16-frame video clip, the authors apply a random shuffle and compute the MED between the original clip and the shuffled clip. In this case the MED is a discrete integer from 0 to 16, excluding 1, which lets the authors reformulate the regression problem of MED prediction as a classification task. However, these discrete integers are not uniformly distributed, which may cause class imbalance and make training unstable. In practice, the authors first sample a MED value from a uniform distribution and then randomly shuffle the video clip until its MED matches the sampled value. This keeps the label distribution balanced for classification, which is very important for temporal modeling and joint learning.
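The balanced sampling procedure can be sketched as a small rejection sampler. One detail is an assumption added here, not something stated above: instead of shuffling all frames on every attempt, the sketch shuffles only as many randomly chosen positions as the target MED, because fully random shuffles of 16 frames almost never land on small distances. All function names are hypothetical.

```python
import random


def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sample_target_med(n_frames, rng):
    """Draw the label uniformly from 0 and 2..n_frames (1 is not attainable)."""
    return rng.choice([0] + list(range(2, n_frames + 1)))


def shuffle_to_med(n_frames, target, rng, max_tries=10000):
    """Reshuffle until the frame order has exactly the target edit distance."""
    original = list(range(n_frames))
    if target == 0:
        return original
    for _ in range(max_tries):
        # Assumption: permute only `target` positions so that small
        # distances are reachable within a reasonable number of attempts.
        positions = sorted(rng.sample(range(n_frames), target))
        values = list(positions)
        rng.shuffle(values)
        candidate = list(original)
        for pos, val in zip(positions, values):
            candidate[pos] = val
        if levenshtein(original, candidate) == target:
            return candidate
    raise RuntimeError("no shuffle with the requested MED was found")


rng = random.Random(0)
target = sample_target_med(16, rng)
order = shuffle_to_med(16, target, rng)
print(target, order)
```

Since the label is drawn before the shuffle, every class is equally likely regardless of how rare its orderings are among all possible shuffles.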
Compared with earlier methods such as Shuffle&Learn, OPN, and VCOP, this method focuses on perceiving the degree of shuffling rather than on order prediction or verification, which naturally leads to the following properties. First, it can learn more meaningful temporal information by increasing the number of frames, whereas previous methods are usually limited to a very small number of frames because the number of possible orderings grows rapidly as the number of frames or clips increases. Second, it can capture more detailed and meaningful differences between video clips, which enables the model to learn richer temporal features.
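A quick calculation shows why order prediction does not scale: an order classifier over n items needs one class per permutation, that is n! classes, whereas the shuffle-degree labels for a 16-frame clip are just the 16 values 0 and 2 to 16 mentioned earlier. The helper name below is hypothetical.

```python
import math

# MED labels for a 16-frame clip as described in the text: 0 and 2..16.
SDP_LABELS_16 = [0] + list(range(2, 17))


def order_prediction_classes(n_items):
    """Number of distinct orderings an order-prediction task must separate."""
    return math.factorial(n_items)


for n in (3, 4, 5, 16):
    print(n, order_prediction_classes(n))
print("shuffle-degree classes for 16 frames:", len(SDP_LABELS_16))
```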
The authors study the ability of shuffle degree prediction (SDP) to learn temporal information from video and compare it with the recently developed VCOP and PRP. The table above compares the results: SDP is significantly better than VCOP and achieves results comparable to PRP.
As shown in the table above, using only the positive pairs produced by the 3D CNN under different data augmentations is equivalent to running the original MoCo on video together with SDP. The comparison also includes using all possible positive pairs.
To further study the influence of different positive samples on self-supervised contrastive learning, the authors compute the average similarity score of positive sample pairs on UCF101 test split 1, as shown in the table above.
In the table above, the authors compare the retrieval results of this method with those of other self-supervised learning methods on the video retrieval task. The method shows a clear advantage.
In the table above, the authors compare the results of this method with those of other self-supervised learning methods on the action recognition task.
This paper proposes CACL, a new self-supervised video representation learning framework. By introducing a Transformer video encoder, the authors design a contrastive learning framework that provides rich contrastive samples for the contrastive learning of the 3D CNN. The authors also introduce a new pretext task that trains a model to predict the degree of video shuffling. To verify the effectiveness of the method, the authors conduct extensive experiments on two different downstream tasks across three network architectures. The experimental results show that the proposed shuffle degree prediction and the Transformer video encoder encourage the model to learn transferable video representations, and that the learned features are heterogeneous with respect to those of methods based on contrastive learning.