6 Papers to Watch at CVPR 2022
CVPR 2022 is happening soon. The previous CVPR saw groundbreaking discoveries that pushed the boundaries of computer vision. This year is no exception. With topics running the gamut from deep learning architecture to image synthesis, CVPR 2022 will be a culmination of cutting-edge ideas in all areas of computer vision.
Here, we present six papers to watch from CVPR 2022:
1. SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
Recent advancements in Generative Adversarial Networks (GANs) gave rise to impressive progress in image synthesis. Acting as powerful generative priors, GANs lay a solid foundation for downstream synthesis tasks.
One well-known work is the StyleGAN series, where each generated image is conditioned on a set of latent codes. These latent codes are highly informative but also difficult to interpret. Yet, researchers remain steadfast in their exploration of these codes, in hopes of discovering methods of improving the controllability of GANs.
Such exploration yielded methods to control the global attributes of the generated images from StyleGANs. Others found that training GANs from scratch with additional supervision resulted in better control.
Despite these refinements, practitioners lamented the limited control they had over the local features generated by GANs. This is where SemanticStyleGAN shines.
SemanticStyleGAN provides more controllable image synthesis because its latent space is factorized by semantic parts (like face, hair, or nose). Put simply, SemanticStyleGAN’s local latent codes can be used to control the structure and texture of individual features such as the face, hair, and nose (Figure 1).
Figure 1. A comparison of (a) StyleGAN against (b) SemanticStyleGAN. Notice the latent codes of SemanticStyleGAN are factorized over different regions while those of StyleGAN are associated with different granularities. (Source)
SemanticStyleGAN achieved exceptional disentanglement between different spatial areas. In the context of generating a face, spatial areas include the face, eyes, mouth, and hair (Figures 2 and 3). Users of SemanticStyleGAN could also disentangle foreground and background, allowing for precise control over the location and size of the foreground.
Figure 2. SemanticStyleGAN achieves exceptional disentanglement between different spatial areas.
Figure 3. SemanticStyleGAN holds precise control over the location and size of the foreground (Source)
StyleGAN could mix styles between generated images at different granularities. SemanticStyleGAN one-upped StyleGAN by allowing the transfer of styles of different local areas, instead of granularity (Figure 4).
Figure 4. SemanticStyleGAN could transfer styles locally. The styles of specific features (e.g. face, eyes, eyebrows, mouth, and hair) in the images from the first row are transferred to the images in the first column. Notice how only the style of a specific feature (not the entire image) is transferred.
SemanticStyleGAN represents a step toward more controllable image synthesis. Having discovered the means to have greater control over image synthesis tasks, we are inching towards a fully customizable and scalable image generator.
Meet us at booth #813 at CVPR 2022. Schedule a meeting here.
2. StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
The realm of image generation has seen unprecedented growth within the short span of a few years. Unfortunately, the video generation community saw far fewer advancements. The challenge lies with the complexity of the underlying video dataset. It also does not help that modern video generators treat videos as discrete images and thus have intensive resource demands.
StyleGAN-V challenges that status quo by treating videos as time-continuous signals. The researchers represented continuous motion with positional embeddings that have time-varying parameters. With that, they argued (and demonstrated) that as few as two frames per video are sufficient to build a meaningful video generator.
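To illustrate what "time-continuous" buys, here is a minimal sketch of a sinusoidal embedding that can be queried at any real-valued timestamp. The function name `time_embedding` and the fixed frequencies and phases are our own assumptions for illustration. In the paper these parameters vary over time, so this is a simplification and not the authors' motion representation.

```python
import math

def time_embedding(t, frequencies, phases):
    """Embed a continuous timestamp t (in seconds) as sin/cos features.

    Hypothetical helper: frequencies and phases are fixed here, whereas
    the paper describes parameters that vary over time.
    """
    features = []
    for freq, phase in zip(frequencies, phases):
        angle = 2.0 * math.pi * freq * t + phase
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

# Illustrative values only.
frequencies = [0.5, 1.0, 2.0]
phases = [0.0, 0.0, 0.0]

# Because t is continuous, frames can be sampled at any rate.
at_start = time_embedding(0.0, frequencies, phases)
between_frames = time_embedding(1.0 / 3.0, frequencies, phases)
```

Since the embedding is defined for every timestamp, a generator conditioned on it is not tied to a fixed frame rate or clip length.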
Built on top of the image-based StyleGAN2, StyleGAN-V can produce long videos at a high frame rate efficiently. In particular, StyleGAN-V is only 5% more expensive to train than StyleGAN2 with only slightly worse (~10%) image quality. As a result, StyleGAN-V could be trained directly on 1024×1024 videos, a feat unparalleled by previous state-of-the-art video generators.
The authors compared their results with a few benchmarks. StyleGAN-V managed to produce a realistic hour-long video from the RainbowJelly dataset. On the other hand, other video generators like VideoGPT, DIGAN, and MoCoGAN-HD struggle to produce meaningful results after one hour. (Figure 5)
Figure 5. A snapshot of the videos generated by different video generators at the 1-hour mark using the RainbowJelly 256×256 benchmark
On the MEAD 1024² dataset, StyleGAN-V could preserve the identity of a speaker, while MoCoGAN-HD produced noisy results after 1 second (Figure 6).
Figure 6. Given the face image of a speaker from MEAD 1024², StyleGAN-V synthesized an hour-long video that preserved the identity of the speaker (bottom row). MoCoGAN-HD could not do the same (top row) [Source]
The output of StyleGAN-V can be controlled by tuning the latent space. To the delight of the researchers, StyleGAN-V inherits the perks of StyleGAN2, including the ability to perform semantic manipulation. To demonstrate this, the researchers edited the first frame of a video with CLIP and animated it with StyleGAN-V (Figure 7).
Figure 7. StyleGAN-V, like StyleGAN2, supports semantic manipulation.
Admittedly, StyleGAN-V still has some limitations to overcome. In particular, the model might produce repeated periodic motions at times and handle certain new content poorly. That said, StyleGAN-V is still a solid foundation for even more powerful video generators in the future.
3. Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data
Many facial recognition techniques assume that the input facial images do not have occlusions. When that assumption is not met, the performance of the model degrades. One such occlusion is eyeglasses, and being able to remove eyeglasses from portraits would help in such cases (Figure 8).
Figure 8. Portrait Eyeglasses and Shadow Removal [Source]
The task of removing eyeglasses (and their shadows) is not trivial. Not only do the facial features have to be recovered, but the lighting effects (like shadows, reflections, and distortions) associated with the glasses also have to be removed. Conceivably, modeling these effects is computationally expensive and difficult.
Thus, the authors proposed a detect-then-remove algorithm that first detects the presence of the eyeglasses before removing them.
Due to a lack of real-world data, the authors trained both the detection and removal algorithm with a synthetic portrait dataset. The benefit of using a synthetic dataset is the ability to generate a diverse dataset spanning various identities, expressions, eyeglasses, and illuminations. As a result, the model can generalize across different lighting conditions, head poses, and eyeglasses (Figure 9).
Figure 9. Removing glasses across various settings [Source]
Notably, the authors also used synthetic images that cannot be captured in the real world for intermediate supervision. In particular, portraits with eyeglasses but no shadows (and vice versa) were used to train the de-glassing and de-shadow networks.
To bridge the gap between synthetic and real data, the authors use a domain adaptation network. It converts the input images, whether synthetic or real, into a uniform feature map.
The results from this paper are promising. A quick comparison of the authors’ results with state-of-the-art eyeglasses removal methods (Figure 10) reveals the limitations of current methods. For example, existing methods like CycleGAN, StarGAN, and pix2pix cannot completely remove the eyeglasses or their shadows in some images.
Figure 10. Qualitative comparisons of different methods across the multiple images. [Source]
The authors achieved state-of-the-art results despite the lack of real-world data. We would even argue that the variety and customizability of synthetic data is an important ingredient in addressing the weakness of previous models. For many tasks beyond eyeglasses removal, synthetic data might be the key to pushing the boundary of other computer vision tasks.
4. Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data
Many computer vision models rely on pre-trained models trained on large real-world datasets. As demonstrated by the paper Exploring the Limits of Weakly Supervised Pretraining, the downstream performance of such pre-trained models increases with the size of their training sets. It is no surprise then that researchers are training models with ever-larger datasets in hopes of pushing the boundary of the state-of-the-art.
Yet, real-world data comes with strings attached. It is expensive and laced with privacy and ethical issues. An alternative to real-world data, synthetic data could be used to pre-train vision models in a scalable and privacy-preserving manner.
One question remains: what kind of synthetic data is best suited for a specific computer vision task? That was the question that a group of researchers from Boston University, Watson AI Lab, and Cornell University aimed to answer.
These researchers found that each downstream task has a favorable configuration of simulation parameters (like lighting, object pose, and backgrounds), and no one-size-fits-all solution exists. Not only that, they proposed the Task2Sim model, which is a unified model that finds the optimal simulation parameters for generating pre-training data for specific tasks.
The results from Task2Sim are impressive. Vision models trained on task-adaptive pre-training data generated by Task2Sim performed significantly better than those trained on non-adaptive data at 20 distinct classification tasks.
The researchers demonstrated that Task2Sim is essentially a “one-shot learner”. Specifically, once trained on a set of seen tasks, Task2Sim can then predict parameters for novel unseen tasks. Looking ahead, Task2Sim could be used as part of a pipeline to automatically generate pre-training data when given some downstream examples.
We have covered the technical implementation of Task2Sim in detail in another blog post. There, we described its architecture, training process, and performance on various tasks.
5. BANMo: Building Animatable 3D Neural Models from Many Casual Videos
3D shape reconstruction often relies on specialized sensors. This is especially true for deformable objects, like humans or animals.
With the rise of virtual reality comes the burgeoning demand for creating animatable assets from images and videos captured by a regular camera. Doing so for humans was difficult without the help of specialized sensors. The same can be said for animals and any other 3D deformable objects (as opposed to rigid scenes like buildings).
Researchers from Meta AI and Carnegie Mellon University pondered that problem and proposed BANMo, a Builder of Animatable 3D Neural Models. Instead of using specialized sensors to animate humans and animals, BANMo requires only casual videos of an object (Figure 11).
Figure 11. Input videos and BANMo’s reconstructions (Source)
When given multiple videos of a deformable object (e.g. a cat), BANMo could reconstruct an animatable 3D model without the need for any predefined shapes (Figure 12). The object can then be reconstructed with different poses.
Figure 12. An overview of BANMo’s reconstruction process
More concretely, BANMo consolidates 2D cues from thousands of images into a fixed canonical space. The output of BANMo is a neural blend skinning model, which can be animated by manipulating bone transformations. According to the authors, neural blend skinning handles pose variations and deformations better than dynamic NeRF approaches.
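For readers unfamiliar with blend skinning, the idea of animating a shape by manipulating bone transformations can be sketched with classic linear blend skinning: each point is moved by every bone's rigid transform, and the results are averaged using per-bone weights. This is a minimal sketch of the classic technique, not BANMo's neural variant. The names `apply_rigid` and `blend_skinning` and the hand-picked weights are our own assumptions.

```python
def apply_rigid(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def blend_skinning(point, weights, bones):
    """Classic linear blend skinning for a single point.

    weights: one skinning weight per bone, summing to 1 (hand-picked here;
             a neural variant would predict them).
    bones:   list of (rotation, translation) pairs, one per bone.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    moved = [apply_rigid(rot, trans, point) for rot, trans in bones]
    return tuple(
        sum(w * m[i] for w, m in zip(weights, moved)) for i in range(3)
    )

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Bone 0 stays put; bone 1 shifts by 2 units along x.
bones = [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (2.0, 0.0, 0.0))]

# A point mostly attached to bone 0 moves a quarter of the way.
posed = blend_skinning((1.0, 1.0, 0.0), [0.75, 0.25], bones)
```

Changing only the bone transforms re-poses every point of the shape, which is what makes a skinned model animatable.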
From the experiments, the researchers found that BANMo could reconstruct humans and animals from real and synthetic datasets with high fidelity. By taking better advantage of the large number of frames from multiple videos, BANMo produced finer 3D shape details and motions than state-of-the-art approaches like ViSER. (Figure 13)
Figure 13. A comparison of BANMo with the state-of-the-art (ViSER)
6. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
Two years since its inception in 2020, Neural Radiance Fields (NeRF) is still one of the leading methods to synthesize novel views of complex scenes. Given a set of sparse input images of a scene, NeRF optimizes a volumetric representation of the scene. With that, it achieved state-of-the-art results in generating complicated geometry and appearance.
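As a rough illustration of how a volumetric representation turns into pixels, the sketch below composites density and color samples along one camera ray using the standard volume rendering quadrature. The function name `render_ray` and the sample values are our own assumptions. A real NeRF predicts densities and colors with a neural network and renders many rays in parallel.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered front to back.

    densities: non-negative volume density at each sample
    colors:    (r, g, b) color at each sample
    deltas:    distance covered by each sample along the ray
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel)

# Empty space followed by a nearly opaque red surface and a hidden blue one.
pixel = render_ray(
    densities=[0.0, 1e6, 1e6],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

The blue sample contributes nothing because the red surface in front of it has already absorbed all the light.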
That said, NeRF is not perfect. In particular, it struggles with “unbounded scenes” where the scene could extend very far into the distance from any angle. Under such settings, the renderings are often blurry. Such a challenge arises due to three critical issues – parameterization, efficiency, and ambiguity.
Thus, Google and Harvard researchers present an extension of mip-NeRF (a variant of NeRF) called mip-NeRF 360. It uses a combination of techniques including non-linear scene parameterization, online distillation, and a distortion-based regulariser to overcome such challenges. Overall, mip-NeRF 360 can synthesize unbounded scenes with high fidelity.
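To make the non-linear scene parameterization concrete, here is a minimal sketch, assuming the contraction takes the form contract(x) = x when ‖x‖ ≤ 1 and (2 − 1/‖x‖)(x/‖x‖) otherwise. The function name `contract` and the plain-tuple point representation are our own choices for illustration. The paper applies the warp to Gaussians rather than single points, so this is not the authors' implementation.

```python
import math

def contract(point):
    """Squash an unbounded 3D point into a ball of radius 2.

    Points inside the unit ball are left unchanged; distant points are
    compressed so that infinity maps to the sphere of radius 2.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(scale * c for c in point)

near = contract((0.5, 0.0, 0.0))     # inside the unit ball: unchanged
far = contract((1000.0, 0.0, 0.0))   # very distant: lands just inside radius 2
```

Nearby content keeps its original scale, while the far background is compressed into a thin bounded shell, so a model of finite capacity can cover the whole scene.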
Kubric is an open-source Python package that uses Blender and PyBullet to generate realistic scenes and rich annotations. It is scalable as it can distribute the computational load over thousands of machines. The Kubric paper demonstrated its ability to generate datasets for a wide range of tasks, including optical flow estimation and studying 3D NeRF models (Figure 14). Interested practitioners can use the Kubric package, which is currently available on GitHub.
Figure 14. An example of a scene and annotations generated by Kubric
Real-world datasets often fail to capture edge cases. Thus, vision models trained on such datasets might fail to recognize such rare occurrences and cause harm to their users. For example, an autonomous vehicle that fails to recognize occluded stop signs might fail to react appropriately.
Synthetic datasets could address such robustness issues. That is the motivation of the researchers behind SHIFT, which is the largest synthetic dataset for training self-driving cars. (Figure 15)
Figure 15. The SHIFT dataset
The dataset includes scenarios spanning multiple weather conditions (cloudy, rainy, or foggy), times of day, and pedestrian densities. It also includes a comprehensive sensor suite and annotations, which makes it suitable for a range of perception tasks. Further, SHIFT’s videos are also longer than those in real-world datasets, making them an ideal candidate for test-time learning.
CVPR 2022 will not disappoint
If there is one thing we can learn from CVPR 2022, it is that the state of synthetic image generation is rapidly advancing. In particular, we see the refinement of technologies like NeRF and GANs in producing higher-fidelity synthetic images. We also witness an influx of papers that use synthetic data to produce state-of-the-art results on a multitude of vision tasks. Looking forward, we foresee these advancements translating into tangible, practical use cases of synthetic data in real-world applications.
What we’ve covered here is only the tip of the iceberg of what CVPR 2022 has to offer. There are dozens of papers to peruse, hundreds of workshops and tutorials to engage with, and plenty of computer vision experts to learn from. If you’re eager to learn more, you can study the conference proceedings, which will be available via CVF and IEEE Xplore after the conference.