SSIM: The paradigm shift in video quality measurement.

The structural similarity index measure (SSIM) is a method for automatically predicting the perceived quality of images and videos. When it was introduced in 2004 in the paper "Image Quality Assessment: From Error Visibility to Structural Similarity," it ushered video quality measurement into a new era. Before SSIM, perceptual video quality measurement was a long, tedious process that produced unreliable results. SSIM showed for the first time that simple algorithms can make accurate predictions of viewers' quality of experience. Since 2004, the paper has received close to 35,000 Google Scholar citations, among the highest in the video engineering literature. It also received the prestigious IEEE Signal Processing Society Best Paper and Sustained Impact Paper Awards, and the inventors of the SSIM algorithm received the Primetime Engineering Emmy® Award in 2015, one of the highest technology honors in the TV industry.

The Television Academy had this to say: “SSIM is now a widely used perceptual video quality measure, used to test and refine video quality throughout the global cable and satellite TV industry, and directly affects the viewing experiences of tens of millions of viewers daily.”

Now, almost 20 years since Professor Zhou Wang, SSIMWAVE® co-founder and Chief Science Officer, first co-created SSIM, it has become a stepping stone to some of the major developments in video quality assessment in the 21st century.

So what is the science behind SSIM? What made it spread so quickly and so widely?

One good reason is that video creation and consumption have exploded since SSIM was first published. Video is growing by double-digit percentages every year, and US streaming revenue more than tripled from $7 billion in 2015 to $24 billion in 2020. YouTube launched in 2005, but streaming has truly taken off in the past two to three years, with consumers spending more time glued to screens due to COVID. (By one estimate, people in the US spent an average of 8 hours a day streaming video in 2020.)

These developments have led to strong demand for image/video quality measures, perhaps much stronger than most people realize. One big reason is that testing videos manually, or testing only 10-second clips, is no longer viable, especially as the top streaming services scale across the globe to more than 100 million subscribers. No matter what image/video processing problem CTOs and video engineers are working on, the same questions come up repeatedly:

  • How should I evaluate the images generated from my algorithms/systems? 
  • How do I know my algorithm/system is producing an improvement from input to output images, and by how much? 
  • How can I compare the performance of two algorithms/systems which produce different outcome images? 
  • What should my algorithms/systems optimize for? 

With the rapidly increasing volume of image/video data, these questions have become impossible to address promptly through subjective visual testing. Only a trustworthy objective image/video quality measure that can be computed instantly can resolve them.

While the reasoning above is sound, it does not explain what is special about SSIM. To understand that, we need to go back to the first few years after the new millennium, when the SSIM idea was first proposed. Significant work had already been done in the area, but a common belief was that predicting the visual perception of image quality was too complicated to achieve: to do so, one would need a comprehensive understanding of the computational mechanisms of the visual pathway. While much of the psychophysical and physiological vision literature has been dedicated to this topic, those mechanisms were, and still are, poorly understood. In the engineering world, the computational models used to assess video quality were so complicated that dedicated ASIC chips had to be built just to perform the computation. The equipment could cost hundreds of thousands of dollars, yet could assess only small, 10-second samples of video, and not in real time.

Moreover, many people questioned whether these complicated vision-based models provided valuable visual quality predictions at all. A telling example is the first independent test conducted by the Video Quality Experts Group (VQEG) in 2000, in which all of the advanced models performed no better than MSE/PSNR. At that time, it seemed the only ways forward were either to make the models even more complicated, so as to capture more visual features in more precise ways, or to narrow the problem to specific applications, developing a number of objective models that each targeted only one type of distortion.

With the above in mind, it becomes much easier to understand why SSIM was a surprise when it was first published. 

  1. SSIM is not constructed to directly implement any psychophysical or physiological vision model. Instead, it makes a simple assumption about the overall function of the visual system: that it extracts structural information from the visual scene. SSIM then attempts to capture structural and non-structural distortions separately before combining them. Before SSIM, very few efforts had challenged this general principle in the design of image quality models, and most people did not believe predicting image quality was possible without knowing how the neurons work. 
  2. The SSIM formula looked quite different from any image quality assessment method or biological vision model of the time, and its simple formulation made it much faster to compute than the state-of-the-art approaches. 
  3. The SSIM algorithm (together with its earlier version, the Universal Image Quality Index (UQI)) was presented with striking demonstrations, in which images undergoing very different types of distortion had drastically different visual quality despite identical MSE/PSNR values, and SSIM predicted these quality variations well. 
  4. Despite its simplicity, SSIM gave much better image quality predictions than far more complicated methods when tested on the subject-rated image databases available at the time. 
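To make points 2 and 3 concrete, here is a minimal sketch of the SSIM formula in Python. It is a simplified, whole-image version: the published algorithm computes the same statistics inside a sliding local window and averages the local scores. The function name, the random test images, and the distortion magnitudes below are our own illustrative choices, not part of the original paper.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Simplified single-window SSIM (illustrative sketch only).

    The published algorithm evaluates these statistics locally with a
    sliding window and averages the local scores; here they are taken
    over the whole image for brevity.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants from the paper: C1=(K1*L)^2, C2=(K2*L)^2,
    # with K1=0.01, K2=0.03 and L the dynamic range of pixel values.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Demonstration in the spirit of point 3: two distortions with roughly
# the same MSE can receive very different SSIM scores.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
shifted = img + 40.0                            # luminance shift, MSE = 1600
noisy = img + rng.normal(0.0, 40.0, img.shape)  # additive noise, MSE ~ 1600
print(ssim_global(img, shifted))  # structure preserved: SSIM stays high
print(ssim_global(img, noisy))    # structure degraded: SSIM drops
```

A uniform luminance shift leaves the image structure intact, so only SSIM's luminance term is penalized; independent noise disturbs the structure term as well, so at equal MSE the noisy image scores noticeably lower, which matches how viewers rank the two.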

All of the above made SSIM very special. More importantly, SSIM suddenly brought us much closer to deploying highly efficient and highly effective automated image quality assessment systems in the real world.

The success of SSIM has played an important role in stimulating a large body of research on image quality assessment over the past 17 years. Like SSIM, many of the newly proposed approaches do not strictly follow biological vision models. Researchers with diverse backgrounds have been attracted to the field, and more and more PhD students dedicate their work to image quality problems. As a result, the design methodologies of image quality assessment models have been greatly enriched.

To summarize the main point of this blog in one sentence: SSIM became popular because, for the first time, it made people believe that a simpler solution to a seemingly extremely complicated problem might indeed exist.

SSIM remains a breakthrough milestone in the long journey toward measuring viewer experience, but since its introduction in 2004, great advances have been made that reinforce its value. Professor Wang and his team at SSIMWAVE evolved the algorithm into the more sophisticated SSIMPLUS®, which is optimized for streaming: it not only measures quality but also helps build more efficient video workflow systems that operate at scale.

To understand why the original SSIM falls short for the latest developments in streaming, and why more advanced algorithms are needed, you need to know SSIM's limitations, which is the topic of another blog.