The structural similarity (SSIM) index is a method for automatically predicting the perceived quality of images and videos. It has become extremely popular in the past few years. In academia, the original SSIM paper, published in 2004, has received over 35,000 Google Scholar citations so far, perhaps more than any other paper in the video engineering literature. It also received an IEEE Signal Processing Society Best Paper Award, one of the most prestigious paper awards in signal processing. In industry, the SSIM algorithm received the Primetime Engineering Emmy Award, one of the most prestigious technology awards in the TV industry. The Television Academy's citation reads:

“SSIM is now a widely used perceptual video quality measure, used to test and refine video quality throughout the global cable and satellite TV industry, and directly affects the viewing experiences of tens of millions of viewers daily.”

So what is the science behind SSIM? What made it spread so quickly and so widely?

One good reason is that there is a very strong demand for image/video quality measures, perhaps much stronger than most people realize. No matter what image/video processing problem you are working on, the same questions come up repeatedly: How should I evaluate the images generated by my algorithms/systems? How do I know whether my algorithm/system improves the output image over the input, and by how much? How can I compare the performance of two algorithms/systems that produce different output images? What should my algorithms/systems optimize for? With the rapidly increasing volume of image/video data, these questions can no longer be answered promptly by subjective visual testing. Only a trustworthy objective image/video quality measure that can be computed instantly can resolve them.

While the reasoning above is sound, it does not explain what’s special about SSIM, because any good objective quality measure would do the job. To understand it better, we need to go back to the first few years of the new millennium, when the SSIM idea was conceived. At that time, significant work had already been done in the area, but the common belief was that predicting the visual perception of image quality was an extremely complicated problem. To achieve the goal, one would need a comprehensive understanding of the computational mechanisms in the visual pathway, a subject to which a vast psychophysical and physiological vision literature has been dedicated, yet about which very little was understood (and still is, even now). In the engineering world, the computational models used to assess video quality were so complicated that people needed to build ASIC chips to perform the computation. The equipment could cost tens or hundreds of thousands of dollars, yet could assess only small samples of video segments, and not in real time. Moreover, even for these complicated vision-based models, many people still questioned whether they provided valuable visual quality predictions at all. A good example is the first independent test conducted by the Video Quality Experts Group (VQEG) in 2000, in which all of the advanced models performed equivalently to MSE/PSNR (mean squared error and peak signal-to-noise ratio). At that time, it seemed that the only ways to improve were either to make the models even more complicated, so as to capture more visual features more precisely, or to reduce the problem to specific applications, so that a number of objective models could be developed, each targeting only a specific type of distortion.

With the above in mind, it becomes much easier to understand why SSIM surprised people when it was first published. First, SSIM is not constructed to directly implement any psychological or physiological vision model. Instead, it makes a simple assumption about the overall function of the visual system, namely that it extracts structural information from the visual scene. It then attempts to capture structural and non-structural distortions separately before combining them. Before SSIM, very few efforts had been made to challenge the general principles behind the design of image quality models, and most people did not believe that predicting image quality would ever be possible without knowing how the neurons work. Second, the SSIM formula looked quite different from any image quality assessment method or biological vision model of the time, and the computation is simple and fast, much faster than the state-of-the-art approaches back then. Third, the SSIM algorithm (together with its earlier version, the universal image quality index) was presented with striking demonstrations, in which images undergoing very different types of distortion but with the same MSE/PSNR value have drastically different visual quality, and such quality variations are well predicted by SSIM. Fourth, despite its simplicity, SSIM gave much better image quality predictions than other, more complicated methods when tested on the subject-rated image databases available back then. All of the above made SSIM very special. More importantly, with SSIM, we suddenly found ourselves much closer to deploying highly efficient and highly effective automated image quality assessment systems in the real world.
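
To make the "same MSE, different quality" demonstration concrete, here is a minimal Python sketch. It is a simplification rather than the published algorithm: it evaluates the SSIM formula once over the whole signal, whereas the 2004 paper computes it in local sliding windows and averages the results. The constants K1 = 0.01 and K2 = 0.03 follow the paper; the function names, the synthetic low-contrast "image" (a flat list of pixel values), and the distortion strength are assumptions made purely for illustration.

```python
import random


def mse(x, y):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def global_ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated once over the whole signal (no local windows)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Small constants that stabilise the ratios when means/variances are near 0.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    # Non-structural part: compares mean intensity (luminance).
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # Combined contrast and structure part: compares variances and covariance.
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure


random.seed(0)
strength = 20  # hypothetical distortion amplitude, in gray levels

# Synthetic low-contrast reference: 4096 pixel values in [100, 156].
reference = [random.randint(100, 156) for _ in range(4096)]

# Distortion A: uniform brightness shift (structure untouched).
shifted = [p + strength for p in reference]

# Distortion B: random +/- noise of the same magnitude (structure damaged).
noisy = [p + random.choice((-strength, strength)) for p in reference]

print("MSE  shift:", mse(reference, shifted), " noise:", mse(reference, noisy))
print("SSIM shift:", round(global_ssim(reference, shifted), 3),
      " noise:", round(global_ssim(reference, noisy), 3))
```

Both distortions change every pixel by exactly 20 gray levels, so their MSE values are identical. The brightness shift leaves the covariance with the reference intact and scores close to 1, while the noise inflates the variance of the distorted signal without adding covariance and scores substantially lower.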

The success of SSIM played an important role in stimulating a large body of research on image quality assessment over the past ten years. Like SSIM, many of the newly proposed approaches do not strictly follow biological vision models. A large number of researchers with diverse backgrounds have been attracted to the field, and more and more Ph.D. theses are dedicated to image quality problems. As a result, the diversity of design methodologies for image quality assessment models has been greatly enriched.

To summarize the main point of this blog in one sentence: SSIM became popular because, for the first time, it made people believe that a shortcut to a seemingly extremely complicated problem may indeed exist.

Having said the above, we do not mean that SSIM is the only shortcut, or that SSIM is the ultimate solution in practice. Rather, SSIM is a highly visible milestone in the middle of a long journey, and great effort is still needed to develop more advanced models that better fit practical needs. To achieve that goal, we first need a clear understanding of the limitations of SSIM, which will be a topic of our future blogs.