## 1 Introduction

Single image super-resolution (SR) aims at estimating the mapping from the low-resolution (LR) to the high-resolution (HR) space [super02, low01, joint03]. Since HR images lose many details in the high-to-low degradation process, the SR problem is naturally underdetermined: multiple HR images correspond to one input LR image. Although these HR images may share the same low-frequency information in the LR space, their high-frequency information, including textures and details, can differ significantly. This ill-posed nature makes SR a challenging problem.

Recently, learning-based approaches have made great progress due to their strong ability to recover details [real01, accelerate01, deep03, real02, image02, deep02].

While early deep-learning methods focused on improving computational metrics such as PSNR and SSIM [wide01, accurate01, enhanced01], later methods pay more attention to real-world SR applications. CARN [2018Fast] proposed a lightweight network to speed up training and inference, and Meta-SR [2020Meta] developed an up-sampling module capable of handling arbitrary scale factors. In 2017, SRGAN [photo01] introduced an adversarial training strategy into super-resolution; since then, many GAN-based SR methods have aimed to obtain SR images with better perceptual quality [2018ESRGAN]. However, these SR methods only use LR-HR image pairs to approximate a deterministic mapping, thus ignoring the ill-posed nature of the SR problem.

The development of SR methods from deterministic to stochastic mappings lies in the transition from fitting a single HR output to fitting the conditional distribution of HR images given the LR input. To explore the relationship between an LR image and its diverse corresponding HR images, recently published stochastic super-resolution methods [lugmayr2020srflow, BuhlerRT20DeepSEE, BahatM20Explorable, VarSR] reformulate SR as the challenging goal of learning this conditional distribution. Since the HR candidates share the same low-frequency information, current stochastic SR methods introduce an additional latent variable that affects the high-frequency information of the HR image [lugmayr2020srflow, BuhlerRT20DeepSEE, BahatM20Explorable, VarSR]; by sampling different latent variables, these methods can generate diverse HR images with interpretability.

The NTIRE 2021 workshop (https://data.vision.ee.ethz.ch/cvl/ntire21/) raised a challenge on learning the super-resolution space. The difficulty of this challenge comes from three aspects. First, each individual SR prediction should reach high perceptual quality. Second, the proposed method should be able to sample an arbitrary number of SR images and fully explore the uncertainty induced by the ill-posed nature. Third, each individual SR prediction should be consistent with the input in the LR space, which restricts the performance of many GAN-based SR methods.

In this work, we develop a Variational Sparse framework for Super-Resolution (VSpSR) via neural networks to solve the problems in the NTIRE 2021 challenge on learning the super-resolution space. Overall, we assume that an HR image contains a deterministic part and a stochastic part. The deterministic part can be obtained from the LR input by any deterministic SR method. For the stochastic part, we bring sparse representation, which has been widely utilized in traditional SR methods, into deep learning, using the diversity of the representation coefficients to control the diversity of the HR image.

Specifically, we design a two-branch module named VSpM to capture the stochastic mapping of details in HR images. Taking the LR image as input, the basis branch of VSpM outputs a patch-level basis in the SR space, and the coefficients branch infers pixel-wise variational distributions over the sparse coefficients. Therefore, by repeatedly sampling coefficients, we can obtain infinitely many sparse representations and thus generate diverse HR images. Our experiments show that the variational sparse framework leads to a larger SR space, and that the VSpM module has the potential to cooperate with other deterministic SR methods to enhance their exploration ability. Our method ranked 7th in the NTIRE 2021 challenge on learning the super-resolution space according to the preliminary results [lugmayr2021ntire].

The rest of this work is organized as follows. Section 2 reviews related work on deterministic and stochastic SR. Section 3 develops the variational sparse framework for explorable super-resolution (VSpSR) via neural networks and details the training strategies. Our experiments are presented in Section 4. We discuss the proposed method in Section 5 and conclude in Section 6.

## 2 Related works

Single image SR: The SR problem is naturally underdetermined due to the information loss in the high-to-low degradation process. Many traditional SR methods have noticed this fundamental fact, but they tend to further regularize the problem and finally output a single SR prediction [2008Image, 2014Anchored]. Recently, DNNs have been widely applied to image SR due to their ability to simulate complex mappings. Dong et al. [image02] first proposed to approximate the LR-to-HR mapping using a three-layer convolutional neural network. Since then, other architectures, such as RNNs [conv01, under01, recurrent01], ResNets [deep04, 2018Fast], and GANs [gan01, 2018ESRGAN], have been applied to image SR. However, previous deep-learning super-resolution methods often use a deterministic mapping to model the recovery of the HR image from a given LR image [enhanced01, 2018RCAN], neglecting the ill-posed nature of this problem.

Stochastic SR: To explore the relationship between an LR image and its diverse corresponding HR images, recently published stochastic super-resolution methods [lugmayr2020srflow, BuhlerRT20DeepSEE, BahatM20Explorable, VarSR] reformulate SR as the challenging goal of learning the conditional distribution. Considering that the low-frequency information of a generated HR image should be consistent with that of the LR image, current stochastic SR methods introduce an additional latent variable that affects the high-frequency information of the HR image [lugmayr2020srflow, BuhlerRT20DeepSEE, BahatM20Explorable, VarSR]. In this formulation, exploring the latent variable controls the diversity of the HR image, even though the mapping itself can be deterministic. These stochastic SR methods adopt different strategies to enlarge the SR space. SRFlow [lugmayr2020srflow] adopts the framework of conditional normalizing flows, using an invertible network to restrict the distribution of the latent variable. Bahat and Michaeli [BahatM20Explorable] propose a structure loss and a map loss to enhance the effect of the control signal, and DeepSEE [BuhlerRT20DeepSEE] generates the latent variable from the semantic information of other high-resolution face images, thus providing guidance for the generated HR image.

Motivated by the idea of the Conditional Variational AutoEncoder (CVAE) [2015Learning] and sparse representation [2008Image], we propose a variational sparse representation framework, whose details are presented in Section 3.

## 3 Methodology

Exploring the SR space is significant since multiple HR images can be degraded to the same LR image. However, much attention has been paid to estimating the deterministic mapping from LR to HR images, while few works explore the SR space. To remedy this weakness, we propose a variational sparse representation framework, i.e., VSpSR, as Figure 1 shows, to estimate stochastic mappings from a single LR image to multiple SR images. Concretely, we first assume that an HR patch can be decomposed into the sum of a low-frequency part and a high-frequency part, where the low-frequency part can be deterministic given an LR patch, but the high-frequency part is often stochastic. Then, we sparsely represent the high-frequency part via a set of deterministic basis and a group of stochastic coefficients. Moreover, we place a sparse prior on the coefficients and infer their distribution from the LR input via variational Bayesian inference. Finally, we repeatedly sample the sparse coefficients from the variational distribution and thus can generate diverse SR patches. The statistical model of VSpSR is described in Section 3.1, and the detailed network architecture is presented in Section 3.2.

### 3.1 Variational sparse representation

Inspired by CVAE [2015Learning], our method first extracts latent variables representing the parameters of the corresponding HR image distribution from the LR image itself, and then samples super-resolutions from this conditional distribution. However, due to the information loss during degradation, it is difficult to directly infer the pixel-level HR distribution from a single LR image, especially when the scale factor is large. While VarSR [VarSR] extracts latent variables in the LR space to ease this problem, our method works from another perspective. To enhance the expressive ability of the network, we exploit the non-local self-similarity of natural images, which indicates that every patch in an HR image can be well approximated by a sparse combination of atoms in an over-complete dictionary; this property has been widely utilized in traditional SR methods [2008Image, 2014Anchored]. In other words, when the atoms (basis) are fixed, sampling different coefficients fully explores the diversity of HR patches and thus generates different HR images.
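To make the dictionary view concrete, the following toy sketch sparsely approximates a patch over an over-complete dictionary. The data is synthetic, and greedy matching pursuit is only one classical way to obtain a sparse code, not the method proposed in this paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 64))        # over-complete dictionary: 64 atoms for 16-dim patches
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms

# A patch synthesized from 3 atoms, to be recovered sparsely.
true_idx = [3, 17, 40]
patch = D[:, true_idx] @ np.array([1.5, -2.0, 0.7])

# Greedy matching pursuit: pick the best-correlated atom, subtract, repeat.
residual, coeffs = patch.copy(), np.zeros(64)
for _ in range(3):
    k = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
    w = D[:, k] @ residual                        # its weight
    coeffs[k] += w
    residual -= w * D[:, k]
```

With the dictionary fixed, varying the coefficient vector spans a whole family of patches, which is exactly the degree of freedom the proposed framework makes stochastic.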

Suppose the HR and LR images are related by a known scale factor. We assume that the HR image follows a Gaussian distribution. As Figure 1 shows, the deterministic part can be obtained through a deterministic mapping, for example bicubic up-sampling or any other deterministic SR method such as EDSR [enhanced01] or RCAN [2018RCAN]. As for the stochastic part, we formulate it as the aggregation of small patches, where each small patch is represented by coefficients $\alpha$ under a patch-level basis. To hold the sparsity of the coefficients, we place a gamma prior on each coefficient:

$$p(\alpha)=\mathrm{Gamma}(\alpha;a,b)=\frac{b^{a}}{\Gamma(a)}\,\alpha^{a-1}e^{-b\alpha}, \qquad (1)$$

where $a$ and $b$ denote the shape and rate parameters of the gamma distribution.
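As an illustration of how the gamma prior induces diverse yet sparse representations, the sketch below samples two sets of coefficients under a fixed basis and reconstructs two different patches. All sizes and prior parameters here are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C basis atoms, each a flattened s x s HR patch.
C, s = 256, 4
basis = rng.standard_normal((s * s, C))          # deterministic patch-level basis

# Gamma prior over coefficients encourages sparsity: with shape a << 1,
# most sampled coefficients are close to zero.  a, b are assumed values.
a, b = 0.1, 0.1
coeffs = rng.gamma(shape=a, scale=1.0 / b, size=C)   # one sample of the coefficients

patch = basis @ coeffs                            # one stochastic HR patch
# Re-sampling the coefficients yields a different patch from the same basis:
patch2 = basis @ rng.gamma(shape=a, scale=1.0 / b, size=C)
```

Note that NumPy parameterizes the gamma distribution by shape and scale, so the rate $b$ enters as `scale=1/b`.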

### 3.2 Network architecture

We design a variational-sparse-representation-guided explorable module, VSpM, with two branches, i.e., the basis branch and the coefficients branch, as shown in Figure 2. The basis branch outputs the patch-level basis, whose number of atoms is a hyper-parameter. The coefficients branch outputs the mean and variance parameters, inferring pixel-wise variational distributions over the sparse coefficients.
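A minimal PyTorch sketch of such a two-branch module is given below. All layer counts, channel widths, and the reparameterized Gaussian sampling details are our assumptions for illustration, not the exact VSpM architecture; the sketch also produces a single output channel, whereas the paper runs the module in parallel for the RGB channels:

```python
import torch
import torch.nn as nn

class VSpM(nn.Module):
    """Two-branch sketch: a basis branch and a coefficients branch (sizes assumed)."""
    def __init__(self, n_basis=64, patch=4, feat=32):
        super().__init__()
        # Shared feature extractor over the LR input.
        self.feats = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Basis branch: global pooling, then deconvolution up to the patch size.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.deconv = nn.ConvTranspose2d(feat, n_basis, patch, stride=patch)
        # Coefficients branch: per-pixel mean and log-variance of the coefficients.
        self.mu = nn.Conv2d(feat, n_basis, 1)
        self.logvar = nn.Conv2d(feat, n_basis, 1)

    def forward(self, lr):
        f = self.feats(lr)
        basis = self.deconv(self.pool(f))            # (B, C, patch, patch)
        mu, logvar = self.mu(f), self.logvar(f)      # (B, C, h, w) each
        # Reparameterized sample of the coefficients (stochastic at every call).
        alpha = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Each LR pixel's C coefficients weight the C patch-level basis atoms.
        B, C, h, w = alpha.shape
        p = basis.shape[-1]
        hf = torch.einsum('bchw,bcij->bhiwj', alpha, basis)
        return hf.reshape(B, 1, h * p, w * p)        # stochastic high-frequency map
```

Calling the module twice on the same LR input yields two different high-frequency maps, which is the mechanism behind sampling diverse SR predictions.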

The basis branch mainly consists of three parts. First, the LR input goes through a stack of convolutional blocks to generate features:

$$F=H_{B}\big(H_{B-1}(\cdots H_{1}(I_{\mathrm{LR}}))\big), \qquad (2)$$

where $H_{i}$ in (2) represents the operation of the $i$-th block. Then we use global pooling to aggregate the global information of the image. After that, a deconvolution up-samples the pooled features to the size of the patch-level basis:

$$D=\mathrm{Deconv}\big(\mathrm{GP}(F)\big), \qquad (3)$$

where GP denotes the global pooling operation and $D$ the output basis.

The coefficients branch is simple, consisting of only a few convolutional layers with layer normalization and ReLU activation. We first infer the mean and variance parameters to estimate the pixel-wise variational distributions over the sparse coefficients. Then we sample coefficients from the resulting Gaussian distribution at both training and inference stages. Finally, the stochastic part is restored as the matrix product of the basis and the coefficients. Note that we perform VSpM in parallel for the RGB channels. We also adopt the consistency enforcing module (CEM) [BahatM20Explorable] to further enhance LR-consistency.

### 3.3 Training strategies

We train the network by minimizing the negative log-likelihood of the HR image. Under the Gaussian assumption in Section 3.1, this reduces to the reconstruction loss

$$\mathcal{L}_{\mathrm{rec}}=\big\|x-\hat{x}\big\|_{2}^{2}, \qquad (4)$$

where $\hat{x}$ denotes the reconstructed SR image.

To restrict the distance between the distribution of the sampled coefficients and the prior distribution in (1), we minimize the following KL divergence:

$$\mathcal{L}_{\mathrm{KL}}=\sum_{i} D_{\mathrm{KL}}\big(q(\alpha_{i})\,\big\|\,p(\alpha_{i})\big), \qquad (5)$$

where $q(\alpha_{i})$ denotes the variational distribution of the $i$-th coefficient $\alpha_{i}$. Finally, we introduce an adversarial loss and a perceptual loss to enhance the visual quality of the SR outputs. Therefore, our total loss function is:

$$\mathcal{L}=\mathcal{L}_{\mathrm{rec}}+\lambda_{\mathrm{KL}}\mathcal{L}_{\mathrm{KL}}+\lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}}+\lambda_{\mathrm{per}}\mathcal{L}_{\mathrm{per}}. \qquad (6)$$

Note that any weight on the reconstruction term in (6) can be absorbed into the remaining three weights.
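Because the KL term between a Gaussian variational posterior and a gamma prior has no simple closed form, a Monte-Carlo estimate is one practical option. The sketch below is our assumption of such an estimator; in particular, the softplus mapping that keeps samples inside the gamma support is not a detail specified in the paper:

```python
import torch
from torch.distributions import Gamma, Normal

def kl_mc(mu, logvar, a=0.1, b=0.1, n_samples=256):
    """Monte-Carlo estimate of KL(q || prior) with Gaussian q and Gamma prior.

    Samples are mapped through softplus so they lie in the Gamma support
    (an assumption; a = shape, b = rate are placeholder values).
    """
    q = Normal(mu, (0.5 * logvar).exp())
    z = q.rsample((n_samples,))                       # (n, *mu.shape), reparameterized
    z_pos = torch.nn.functional.softplus(z) + 1e-6    # keep coefficients positive
    prior = Gamma(torch.tensor(a), torch.tensor(b))
    return (q.log_prob(z) - prior.log_prob(z_pos)).mean()
```

Using `rsample` keeps the estimate differentiable with respect to the variational parameters, so this term can be minimized jointly with the other losses.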

## 4 Experiments

In this section, we first describe the datasets and metrics used for training and evaluating VSpSR. Then, we study the effect of different settings on VSpSR. Finally, we test the performance of VSpSR on two SISR tracks and discuss the advantages and limitations of VSpSR.

| Model | #Basis | Upsampling | Gamma param. | Stochastic basis | Stochastic coeff. | Adv. weight | LPIPS | LR PSNR | Div. Score |
|---|---|---|---|---|---|---|---|---|---|
| #1 | 256 | Bilinear | 0.01 | No | No | 0.5 | 0.223 | 47.87 | 0 |
| #2 | | | | No | Yes | | 0.239 | 47.68 | 9.161 |
| #3 | | | | Yes | Yes | | 0.309 | 48.25 | 9.136 |
| #4 | 256 | None | 0.01 | No | Yes | 0.5 | 0.320 | 47.60 | 1.847 |
| #5 | 64 | Bilinear | | | | | 0.308 | 47.92 | 11.612 |
| #6 | 256 | Bilinear | 0.01 | No | Yes | 1 | 0.237 | 47.92 | 13.251 |
| #7 | | | | | | 0.1 | 0.280 | 47.47 | 17.895 |
| #8 | | | | | | 1 | 0.254 | 47.05 | 13.325 |
| #9 | 256 | Bilinear | 0.1 | No | Yes | 0.1 | 0.220 | 47.75 | 11.350 |

### 4.1 Dataset and metrics

The DIV2K dataset is composed of 800 training images, 100 validation images, and 100 testing images. We test the performance of our method on the validation set, since the ground truth of the testing set is not public. To better measure the comprehensive performance of SR methods, the NTIRE 2021 challenge on learning the super-resolution space proposes three metrics covering three aspects. Before evaluation, we first generate 10 SR predictions for each LR input in the DIV2K validation set.

LPIPS. Automatically assessing perceptual image quality is very difficult. To assess photo-realism, the challenge performs a human study on the test set for the final submission. As the challenge suggests, in our experiments we instead use the Learned Perceptual Image Patch Similarity (LPIPS) [2018LIPIS] distance to roughly measure perceptual quality.

Diversity score. As described in the NTIRE 2021 challenge on learning the super-resolution space (https://github.com/andreas128/NTIRE21_Learning_SR_Space), we use the diversity score to measure the spanning of the SR space:

(7)

where the local best is obtained by first selecting the pixel-wise best LPIPS score over the 10 SR predictions and then averaging, and the global best is obtained by averaging the pixel scores of each prediction and selecting the best.
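The local-best/global-best computation can be sketched as follows; the final normalization is our assumption and may differ from the official scaling constant used by the challenge:

```python
import numpy as np

def diversity_score(lpips_maps):
    """lpips_maps: (N, H, W) per-pixel LPIPS of N samples vs. the ground truth.

    Local best: pixel-wise best score over samples, then averaged.
    Global best: per-sample averaged score, then the best sample.
    The (global - local) / global normalization is an assumption here.
    """
    local_best = lpips_maps.min(axis=0).mean()
    global_best = lpips_maps.mean(axis=(1, 2)).min()
    return (global_best - local_best) / global_best
```

A deterministic method produces identical samples, so local best equals global best and the score is zero; the more the samples differ, the larger the gap.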

LR PSNR. This metric measures the similarity between the SR prediction and the LR image in the low-resolution space, which reflects how much information is preserved during super-resolution. To compute LR PSNR, we first down-sample the SR prediction and then calculate the PSNR against the LR input. In the NTIRE 2021 challenge on learning the super-resolution space, the goal for this metric is to reach 45 dB.
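The computation can be sketched as below; a box average stands in for the challenge's actual down-sampling kernel, and images are assumed to lie in [0, 1]:

```python
import numpy as np

def lr_psnr(sr, lr, scale=4):
    """Down-sample the SR prediction (box average here, as a stand-in for the
    challenge's kernel) and compute PSNR against the LR input in [0, 1]."""
    h, w = lr.shape
    down = sr.reshape(h, scale, w, scale).mean(axis=(1, 3))  # block average
    mse = np.mean((down - lr) ** 2)
    return 10 * np.log10(1.0 / max(mse, 1e-12))              # clamp to avoid log(0)
```

A perfectly LR-consistent prediction reproduces the LR input exactly after down-sampling, which drives the PSNR toward the clamp-limited maximum.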

### 4.2 Implementation details

We pre-process the data before training. Specifically, we crop small patches from the LR images in the DIV2K training set and extract the corresponding patches from the HR images, with the patch sizes determined by the scale factor. To demonstrate the advantages of the VSpM module, we only use interpolation (bicubic/bilinear) to generate the deterministic part of the SR predictions. As for the stochastic part, we set the number of basis atoms to 256, making the patch-level basis compose a dictionary that is as over-complete as possible. Figure 3 shows that the chosen gamma prior parameters restrict the sparsity of the coefficients well.

| Method | Bicubic | EDSR [enhanced01] | RCAN [2018RCAN] | SRGAN [photo01] | ESRGAN [2018ESRGAN] | VSpSR |
|---|---|---|---|---|---|---|
| LPIPS | 0.409 | 0.257 | 0.254 | 0.158 | 0.115 | 0.277 |
| LR PSNR | 38.70 | 54.11 | 54.24 | 35.49 | 42.61 | 47.15 |
| Div. Score | 0 | 0 | 0 | 0 | 0 | 16.120 |

All models are trained with the ADAM optimizer. The baseline model is trained for up to 300 epochs, and the learning rate decays to 10 percent of its value every 100 epochs. After that, we fine-tune the baseline for an extra 100 epochs to further improve performance. We implement our networks in PyTorch and train the models on a device with 40 Intel Xeon 2.20 GHz CPUs and 4 GTX 1080 Ti GPUs; the whole training of VSpSR costs about 30 hours on a single GPU. During testing, the metrics described in Section 4.1 are used.

### 4.3 Ablation study

In this section, we study the effect of different settings on the performance of VSpSR, including the number of basis atoms (#Basis), the upsampling manner, the gamma prior parameter, whether to set the basis and the coefficients to be stochastic, and the weight of the adversarial loss. Note that since evaluation on the whole validation set is computationally expensive, we compute the metrics on the first 20 images of the validation set.

Baseline comparison. We trained five models for comparison (#1 to #5), the first three of which study the effect of making the basis and the coefficients stochastic. The Div. score of model #1 is zero, since both the basis and the coefficients of VSpSR are deterministic. The comparison between models #2 and #3 shows that making the basis stochastic does enlarge the spanning of the SR space, but keeping the basis deterministic and only the coefficients stochastic achieves a lower LPIPS value, which results in a better Div. score. We therefore adopt this setting in the following studies.

Besides, we trained models #4 and #5 to respectively study the effect of estimating the deterministic part and of the number of basis atoms. The comparison between #2 and #4 shows that separately estimating the deterministic part greatly improves the Div. score: without the low-frequency information provided, the VSpM module has to spend more capacity capturing the coarse information from the LR input. Nevertheless, model #4 shows that the VSpM module alone is still capable of learning the SR mapping. Moreover, the comparison between models #2 and #5 shows that a larger number of basis atoms improves the LPIPS, at the cost of increasing the total number of parameters in VSpSR from 0.27M to 4.4M.

Fine-tuning comparison. Using model #2 as the baseline, we fine-tuned four models (#6 to #9) for an extra 100 epochs to study the effect of the adversarial loss weight and of the gamma prior parameter. The comparisons between models #6, #7, #8, and #9 show that the setting of model #7 is the most appropriate choice in terms of the Div. score.

| Method | Bicubic | RCAN [2018RCAN] | VSpSR |
|---|---|---|---|
| LPIPS | 0.584 | 0.404 | 0.508 |
| LR PSNR | 37.16 | 48.65 | 46.64 |
| Div. Score | 0 | 0 | 13.708 |

### 4.4 Learning SR space

In this section, we evaluate the performance of VSpSR on the first SISR track. First, we adopt the same settings as model #7 in Table 1 and crop paired patches from the training images. Then, we augment the training patches via flipping and rotation, and minimize the loss function in (6) to train VSpSR. Finally, we evaluate VSpSR on the DIV2K validation set by computing LPIPS, LR PSNR, and the Div. score, and compare VSpSR with four state-of-the-art SR methods: two PSNR-oriented, i.e., EDSR and RCAN, and two perceptual-quality-oriented, namely SRGAN and ESRGAN.

Table 2 shows the quantitative results of the compared methods. Since RCAN is PSNR-oriented while ESRGAN is perceptual-quality-oriented, they achieve the best LR PSNR and LPIPS among all methods, respectively. However, EDSR [enhanced01], RCAN [2018RCAN], SRGAN [photo01], and ESRGAN [2018ESRGAN] are deterministic models, and their diversity scores are zero. Different from these methods, our VSpSR can generate diverse SR images from a single LR image, since the coefficients are made stochastic by the variational sparse representation. To qualitatively evaluate the performance of VSpSR, we visualize three typical examples in Figure 4. The figure shows that the perceptual-quality-oriented methods, i.e., SRGAN and ESRGAN, generate more details, which is consistent with the quantitative results.

### 4.5 Learning SR space

In this section, we evaluate the performance of VSpSR on the second, larger-scale SISR track. First, we adopt the same settings as model #7 in Table 1 and crop paired patches from the training images. Then, we augment the training patches via flipping and rotation, and minimize the loss function in (6) to train VSpSR. Finally, we evaluate VSpSR on the DIV2K validation set by computing LPIPS, LR PSNR, and the Div. score. Since RCAN [2018RCAN] released a model for this track, we compare it with our VSpSR.

Table 3 shows the quantitative results of the compared methods, including bicubic, RCAN, and our VSpSR. Although RCAN achieves the best LPIPS and LR PSNR, its diversity score is zero, since RCAN is a deterministic model. Different from RCAN, our VSpSR can reconstruct diverse SR images, since the coefficients of the proposed variational sparse representation are stochastic. To qualitatively evaluate VSpSR, we visualize three typical examples from the validation set in Figure 5. The figure shows that RCAN generates higher-quality images than VSpSR. Besides, VSpSR can introduce a "patch effect", since we do not explicitly consider the dependency among patches; this is further discussed in Section 5. Although VSpSR cannot reconstruct details as robustly as RCAN, it has the advantage of generating diverse SR images that are consistent with a single LR image, which is one of the keys to learning the SR space.

## 5 Discussion

The advantage of VSpSR is that it greatly expands the SR space compared with deterministic models, but a "patch effect" is introduced due to the patch-level sparse representation. Concretely, conventional sparse representation aims at building a dictionary such that each small patch can be sparsely represented by it. The space spanned by the sparse representation is determined by the dictionary, and therefore an over-complete dictionary is required. However, such a representation is deterministic and computationally expensive, and thus cannot be directly applied to learning the SR space. To tackle this difficulty, we proposed the variational sparse representation framework, i.e., VSpSR, whose coefficients follow a sparse prior and can be repeatedly sampled from a variational distribution. To further understand VSpSR, we show the distribution of the coefficients and visualize the basis inferred from a typical LR image in Figure 6. For VSpSR, the basis determines the expanded SR space, while each sample of the sparse coefficients corresponds to one SR image in that space. This means that increasing the number of basis atoms can raise the diversity of the SR space, but it also increases the computational complexity. Therefore, more efficient methods of increasing the diversity of the SR space are worth exploring. Besides, we only study the patch-wise sparse representation and do not explicitly model the dependency among patches, which introduces the "patch effect" shown in Figure 5 for large scale factors. In reality, different patches may be highly similar, and thus explicitly modeling such dependency is appealing.

| Team | LPIPS | LR PSNR | Div. Score |
|---|---|---|---|
| svnit_ntnu | 0.355 | 27.52 | 1.871 |
| SYSU-FVL | 0.244 | 49.33 | 8.735 |
| nanbeihuishi | 0.161 | 50.46 | 12.447 |
| SSS | 0.110 | 44.70 | 13.285 |
| Ours | 0.273 | 47.20 | 16.450 |
| FutureReference | 0.165 | 37.51 | 19.636 |
| SR_DL | 0.234 | 39.80 | 20.508 |
| CIPLAB | 0.121 | 50.70 | 23.091 |
| BeWater | 0.137 | 49.59 | 23.948 |
| Deepest | 0.117 | 50.54 | 26.041 |
| njtech&seu | 0.149 | 46.74 | 26.924 |

| Team | LPIPS | LR PSNR | Div. Score |
|---|---|---|---|
| svnit_ntnu | 0.481 | 25.55 | 4.516 |
| SYSU-FVL | 0.415 | 47.27 | 8.778 |
| SSS | 0.237 | 37.43 | 13.548 |
| Ours | 0.496 | 46.78 | 14.287 |
| SR_DL | 0.311 | 42.28 | 14.817 |
| FutureReference | 0.291 | 36.51 | 17.985 |
| CIPLAB | 0.266 | 50.86 | 23.320 |
| BeWater | 0.297 | 49.63 | 23.700 |
| Deepest | 0.259 | 48.64 | 26.941 |
| njtech&seu | 0.366 | 29.65 | 28.193 |

## 6 Conclusion

The NTIRE 2021 challenge on learning the super-resolution space is difficult, since inferring an SR space instead of a single SR prediction increases the amount of detail to be restored from a single LR input. Besides, it is more difficult to hold the balance between the spanning of the SR space and the consistency in the LR space while promoting visual quality as much as possible. To tackle these difficulties, we have proposed a variational sparse framework, implemented via neural networks, to solve the SR challenge raised in NTIRE 2021. Specifically, we design a two-branch module, i.e., VSpM, to explore the SR space. The basis branch of VSpM extracts the patch-level basis from the LR input, and the coefficients branch infers pixel-wise variational distributions over the sparse coefficients. Therefore, we can obtain different sparse representations by repeatedly sampling coefficients, and thus generate diverse HR images. Finally, we have tested the performance of VSpSR in Section 4 to show its effectiveness in explorable super-resolution, and discussed its advantages and limitations in Section 5. According to the preliminary results in Tables 4 and 5, our team ranks 7th in terms of the released Div. scores [lugmayr2021ntire].

Acknowledgement. This work was funded by the National Natural Science Foundation of China (grants no. 61971142 and 62011540404) and the Development Fund for Shanghai Talents (no. 2020015).
