
Two photos can be turned into a video! Google proposes FILM, a frame interpolation model

Doctor of Artificial Intelligence · 2022-03-20 14:29:15


Reprinted from: Xinzhiyuan (新智元)

Frame interpolation is a key task in computer vision: given two frames, a model must predict and synthesize a smooth intermediate image. The technique also has substantial application value in the real world.


A common application of frame interpolation is smoothing videos shot at an insufficient frame rate. Some devices ship with dedicated hardware that upsamples the frame rate of the input video, so that low-frame-rate footage plays back smoothly on a high-refresh-rate display, with no need to "blink to fill in the frames" yourself.

As deep learning models grow more powerful, frame interpolation can also synthesize slow-motion video from footage shot at a normal frame rate, that is, synthesize many more intermediate images.

With the growing popularity of smartphones, digital photography has created new demand for frame interpolation as well.

Typically, we snap several photos within a few seconds and then pick the most flattering shot from the batch.

These pictures share one characteristic: the scene is nearly identical, and only the subject's pose or expression changes slightly.

Interpolating frames between such pictures produces a magical effect: the picture starts to move and becomes a video! And in general, a video feels more immersive and captures the moment better than a photo.

It has a bit of a "Live Photos" feel to it.


However, a major limitation of frame interpolation is that it cannot effectively handle large scene motion.

Traditional frame interpolation upsamples the frame rate, which essentially means interpolating between nearly duplicate photos. If the time between two pictures is a second or more, the interpolation model must genuinely understand how objects move, and that is the main focus of current frame-interpolation research.


Recently, the Google Research team proposed FILM, a new frame interpolation model that can interpolate between two pictures separated by large motion.


Previous frame interpolation models were often very complex, requiring multiple networks to estimate optical flow or depth, plus a separate network for frame synthesis. FILM, in contrast, is a single unified network: it uses a multi-scale feature extractor, shares trainable weights across all scales, and can be trained from frame triplets alone, with no optical-flow or depth ground truth required.

The experimental results show that FILM outperforms prior work, synthesizing high-quality images and producing more temporally coherent video. Both the code and the pretrained model are open source.


Paper: https://arxiv.org/pdf/2202.04901

Code: https://github.com/google-research/frame-interpolation

Model architecture

The FILM model architecture consists of three main stages.


1. Scale-agnostic feature extraction

The defining property of FILM's feature extractor is weight sharing with respect to the flow prediction stage: the same weights serve both coarse-grained and fine-grained resolutions.

First, an image pyramid is built for each of the two input images. A shared UNet encoder then constructs a feature pyramid at every pyramid level, with convolution layers extracting features at 4 scales.

Note that at pyramid levels of equal depth, the same convolution weights are applied, which creates compatible multiscale features.

The feature extractor's final step concatenates feature maps that have different depths but the same spatial dimensions, constructing a scale-agnostic feature pyramid. The finest level aggregates only one feature map, the second-finest aggregates two, and every remaining level aggregates three shared feature maps.
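To make the sharing concrete, here is a minimal PyTorch sketch of the idea. The layer widths, block structure, and helper names (`SharedEncoder`, `scale_agnostic_pyramid`) are illustrative assumptions, not the official implementation, which lives in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """A small conv encoder whose weights are reused at every image-pyramid
    level, so features at equal depths stay compatible across levels."""
    def __init__(self, channels=(32, 64, 96)):
        super().__init__()
        ins = (3,) + tuple(channels[:-1])
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(i, o, 3, padding=1), nn.LeakyReLU(0.2))
            for i, o in zip(ins, channels))

    def forward(self, image):
        feats, x = [], image
        for block in self.blocks:
            x = block(x)
            feats.append(x)            # feature map at this encoder depth
            x = F.avg_pool2d(x, 2)     # go one depth deeper
        return feats                   # resolutions: /1, /2, /4

def scale_agnostic_pyramid(image, encoder, levels=4):
    """Run the shared encoder on each image-pyramid level, then concatenate
    feature maps that share a spatial resolution (1, 2, or 3 of them)."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], 2))
    per_level = [encoder(img) for img in pyramid]   # same weights everywhere
    fused = []
    for lvl in range(levels):
        # pyramid level i, encoder depth d: resolution /2**(i+d); gather all
        # maps with i + d == lvl, capped at the three finest contributions
        same_res = [per_level[i][lvl - i] for i in range(max(0, lvl - 2), lvl + 1)]
        fused.append(torch.cat(same_res, dim=1))
    return fused
```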

2. Motion/flow estimation

Once the feature pyramids are extracted, they are used to compute bidirectional motion at each pyramid level. As in previous work, motion estimation starts at the coarsest level. Unlike other methods, however, FILM directly predicts task-oriented flows from the middle frame to the inputs.

Under a conventional training scheme this would be impossible to supervise with ground-truth optical flow, because the flows would have to originate from the very intermediate frame that is still to be computed. In an end-to-end frame interpolation system, however, the network learns to predict these flows well from the input frames and their feature pyramids.

The task-oriented flow at each level is therefore computed as the sum of a predicted residual and the upsampled flow from the next-coarser level.
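Below is a hedged sketch of this coarse-to-fine scheme for one direction (middle frame toward input A; the other direction is symmetric). `backward_warp` is a standard bilinear warp, and `predict_residual` stands in for the flow-prediction convnet, whose exact layers are an assumption here.

```python
import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Bilinearly sample `feat` at positions displaced by `flow` (N, 2, H, W)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    base = torch.stack((xs, ys)).float()          # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow             # (N, 2, H, W)
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0       # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def estimate_flow(feats_a, feats_b, predict_residual):
    """Coarse-to-fine flow from the (unknown) middle frame toward input A:
    each level adds a predicted residual to the upsampled coarser flow."""
    flow = None
    for fa, fb in zip(reversed(feats_a), reversed(feats_b)):  # coarse -> fine
        if flow is None:
            flow = fa.new_zeros(fa.shape[0], 2, fa.shape[2], fa.shape[3])
        else:
            # doubling the resolution doubles the pixel displacement
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode='bilinear',
                                       align_corners=False)
        fb_w = backward_warp(fb, flow)            # pre-align B's features
        flow = flow + predict_residual(torch.cat([fa, fb_w, flow], dim=1))
    return flow
```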

Finally, FILM uses these flows to build a feature pyramid at the intermediate time t.

3. Fusion: producing the output image

In FILM's final stage, the scale-agnostic feature maps at time t are concatenated with the bidirectional motions at each pyramid level and fed to a UNet-like decoder, which synthesizes the final intermediate frame.
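A sketch of how the pieces connect in this final stage, reusing `backward_warp` from the flow sketch above; `decoder` stands in for the UNet-like decoder and is an assumption:

```python
import torch

def fuse(feats_a, feats_b, flows_t_to_a, flows_t_to_b, decoder):
    """Warp both feature pyramids to time t, concatenate each level with
    the bidirectional flows, and decode into the middle frame."""
    per_level = []
    for fa, fb, wa, wb in zip(feats_a, feats_b, flows_t_to_a, flows_t_to_b):
        fa_t = backward_warp(fa, wa)   # input A's features as seen at time t
        fb_t = backward_warp(fb, wb)   # input B's features as seen at time t
        per_level.append(torch.cat([fa_t, fb_t, wa, wb], dim=1))
    return decoder(per_level)          # UNet-like decoder -> intermediate frame
```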

For the loss function, FILM supervises only the final output with image synthesis losses; no auxiliary loss terms are used at intermediate stages.

The first is an L1 reconstruction loss that minimizes the pixel-wise RGB difference between the interpolated frame and the ground-truth frame. With the L1 loss alone, however, the generated frames are usually blurry, and training with other similar per-pixel losses gives similar results.

FILM therefore adds a second loss, a perceptual loss, to restore image detail, expressed as the L1 norm between high-level VGG-19 features. Because each layer has a finite receptive field, the perceptual loss enforces structural similarity within a small neighborhood of each output pixel, and experiments show that perceptual losses help reduce blurry artifacts across a range of image synthesis tasks.


The third loss is a style loss, also known as a Gram matrix loss, which extends the advantages of the VGG loss further.


FILM is the first work to apply the Gram matrix loss to frame interpolation. The researchers found that this loss effectively improves image sharpness, preserves detail in occluded regions, and eliminates a large amount of artifacting across the sequence.

To achieve both high benchmark scores and high-quality intermediate frames, the final loss is a weighted sum of the three losses, with the weights set empirically by the researchers: (1, 1, 0) for the first 1.5 million iterations and (1, 0.25, 40) thereafter, the hyperparameters having been tuned automatically via grid search.
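A sketch of the combined objective under these reported weights; the VGG layer choice and helper names are illustrative assumptions, not the official code:

```python
import torch
import torchvision

# Frozen VGG-19 feature extractor for the perceptual and style terms.
_vgg = torchvision.models.vgg19(weights='IMAGENET1K_V1').features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x, layers=(3, 8, 17, 26, 35)):
    """Collect activations at a few depths (the layer choice is illustrative);
    assumes `x` is already normalized for VGG."""
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in layers:
            feats.append(h)
    return feats

def gram(f):
    """Normalized Gram matrix of a (N, C, H, W) feature map."""
    n, c, hw = f.shape[0], f.shape[1], f.shape[2] * f.shape[3]
    f = f.reshape(n, c, hw)
    return (f @ f.transpose(1, 2)) / (c * hw)

def film_loss(pred, target, step):
    l1 = (pred - target).abs().mean()
    fp, ft = vgg_features(pred), vgg_features(target)
    perceptual = sum((a - b).abs().mean() for a, b in zip(fp, ft))
    style = sum((gram(a) - gram(b)).pow(2).mean() for a, b in zip(fp, ft))
    # reported schedule: (1, 1, 0) for the first 1.5M steps, then (1, 0.25, 40)
    w_l1, w_vgg, w_gram = (1.0, 1.0, 0.0) if step < 1_500_000 else (1.0, 0.25, 40.0)
    return w_l1 * l1 + w_vgg * perceptual + w_gram * style
```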


Experiments

The researchers evaluated the FILM network in two ways: quantitative metrics and generation quality.

The datasets include Vimeo-90K, UCF101, Middlebury, and the recently proposed large-motion dataset Xiph. Vimeo-90K serves as the training set.

The quantitative metrics are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM); for both, higher scores mean better results.
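For reference, PSNR is straightforward to compute for images scaled to [0, 1]; SSIM compares local luminance, contrast, and structure statistics and is usually taken from an image-quality library. A minimal sketch:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio for images in [0, max_val]; higher is better."""
    mse = ((pred - target) ** 2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)
```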


The perception-distortion tradeoff shows that minimizing distortion metrics such as PSNR or SSIM adversely affects perceptual quality. Frame interpolation research pursues multiple goals at once: low distortion, high perceptual quality, and temporally coherent video. The researchers therefore optimized the model with the proposed Gram-matrix-based loss L_S, which serves both distortion and perceptual quality well.

When the perceptually sensitive losses are included, FILM outperforms the state-of-the-art SoftSplat on Vimeo-90K, and it also achieves the highest scores on Middlebury and UCF101.


For the quality comparison, consider sharpness first. To evaluate how effectively the Gram-matrix-based loss preserves image clarity, FILM's generated results were visually compared with images rendered by other methods. Compared with the alternatives, FILM's synthesized results are excellent: facial details remain clear, and the knuckles of the fingers are preserved.


In frame interpolation, most occluded pixels should still be visible in one of the input frames, but some pixels, depending on the complexity of the motion, may not be recoverable from the inputs at all. To inpaint such pixels effectively, the model must learn the appropriate motion or generate new pixels. The results show that, compared with other methods, FILM paints in these pixels correctly while maintaining sharpness. It also preserves the structure of objects such as the red toy car, whereas SoftSplat deforms it and ABME produces a blurry picture-in-picture.


Large motion is one of the hardest challenges in frame interpolation. To widen the motion search range, models usually adopt multi-scale approaches or dense feature maps to increase network capacity; other methods get there by training on large-motion datasets. The experiments show that SoftSplat and ABME can capture the motion near the dog's nose but create heavy artifacts on the ground, while FILM's strength is capturing the motion well while preserving the background details.


References:

https://film-net.github.io/
