Multi-Resolution Multi-Reference Talking Head Synthesis via Implicit Warping

Abstract

Generating high-resolution frames of an individual's face from a low-resolution version is useful in applications such as low-bitrate video conferencing, video enhancement, and deblurring. A common technique for improving fidelity to high-frequency content is to use one or more example (reference) images at higher resolution. However, typical optical-flow-based models do not scale easily to multiple reference images and are constrained to generating good frames only for poses close to the reference pose. We propose a novel multi-resolution attention architecture that encodes information from the reference images into sets of key-value pairs at multiple resolutions, which are attended to during frame synthesis. Notably, while our model requires only a single reference frame during training, once trained it adapts to varying numbers of reference images at inference time without modification. On a highly curated dataset, we show that even with a single reference frame, our multi-resolution architecture improves the PSNR, SSIM, and LPIPS of synthesized images on average by 1.71 dB, 1.13 dB, and 0.06, respectively, over a single-resolution architecture. Using multiple references provides a further improvement of 0.15 dB in PSNR, 0.07 dB in SSIM, and 0.01 in LPIPS. Moreover, on a more diverse dataset, our approach yields a substantial boost in reconstruction quality (∼1 dB in PSNR and SSIM, and 20% in LPIPS) when using ten references instead of a single reference frame.
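
To make the mechanism concrete, the following is a minimal PyTorch sketch of the core idea: each reference frame contributes key-value pairs at every level of a feature pyramid, and queries derived from the low-resolution input attend over the keys and values of all references concatenated along the token axis. All module and parameter names here are illustrative assumptions, not the paper's released implementation; the sketch only shows why attention over concatenated reference tokens lets a model trained with a single reference consume any number of references at inference time.

```python
import torch
import torch.nn as nn


class MultiRefCrossAttention(nn.Module):
    """Cross-attention from driving-frame queries to reference key-value
    pairs at one resolution (an illustrative sketch, not the paper's code)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_kv = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, driving_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # driving_feat: (B, C, H, W) features of the low-resolution input frame.
        # ref_feats:    (B, N, C, H, W) features of N high-resolution references.
        B, C, H, W = driving_feat.shape
        q = self.to_q(driving_feat).flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = self.to_kv(ref_feats.flatten(0, 1))                  # (B*N, 2C, H, W)
        k, v = kv.chunk(2, dim=1)
        # Concatenating tokens from all references along the sequence axis is
        # what makes the module agnostic to the number of references: more
        # references simply mean a longer key-value sequence.
        k = k.flatten(2).transpose(1, 2).reshape(B, -1, C)        # (B, N*H*W, C)
        v = v.flatten(2).transpose(1, 2).reshape(B, -1, C)
        out, _ = self.attn(q, k, v)                               # (B, H*W, C)
        return out.transpose(1, 2).reshape(B, C, H, W)


class MultiResolutionReferenceAttention(nn.Module):
    """Applies reference attention independently at each pyramid level."""

    def __init__(self, channels_per_level):
        super().__init__()
        self.levels = nn.ModuleList(
            MultiRefCrossAttention(c) for c in channels_per_level)

    def forward(self, driving_pyramid, ref_pyramids):
        # driving_pyramid: list of (B, C_l, H_l, W_l) tensors, coarse to fine.
        # ref_pyramids:    list of (B, N, C_l, H_l, W_l) reference tensors.
        return [level(d, r) for level, d, r in
                zip(self.levels, driving_pyramid, ref_pyramids)]


# Usage: trained with N=1, the same weights accept N=3 (or N=10) at inference.
module = MultiResolutionReferenceAttention([64, 128])
driving = [torch.randn(2, 64, 32, 32), torch.randn(2, 128, 16, 16)]
refs = [torch.randn(2, 3, 64, 32, 32), torch.randn(2, 3, 128, 16, 16)]
fused = module(driving, refs)  # list of per-level attended feature maps
```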