What you should know about Apple’s Advanced Video Quality Tool

By November 30, 2021

Advances in video technology continue to drive the need for new ways of understanding, measuring, and improving the quality of what your viewers see. Regardless of the device they’re watching on or what service they’re watching, today’s viewers continue to demand the highest quality streams — even if they’re bingeing their favorite show for the twentieth time.

In this post, we’re going to take a deep look at Apple’s recently announced Advanced Video Quality Tool (AVQT). We’ll break down what you need to know about the tool, how it compares to SSIMWAVE’s SSIMPLUS® Viewer Score, and what it means for video providers.

What is Apple’s Advanced Video Quality Tool?

Apple announced AVQT at its Worldwide Developers Conference (WWDC) this past June. AVQT is a command-line tool available for macOS only.

It provides a full-reference metric, similar to PSNR and VMAF: both the input (reference) and output (test) videos are needed, and it doesn't work as a no-reference algorithm. The tool computes frame-level and segment-level scores and supports a variety of codecs and formats. AVQT also supports the comparison of videos with different resolutions and takes viewing device resolution and viewing distance into account.
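To make "full-reference" concrete, here is a minimal sketch of PSNR, the simplest of the metrics named above. It is an illustration only and not how AVQT computes its score; the flat-list frame representation and the sample pixel values are assumptions made for the example.

```python
import math


def psnr(reference, test, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel values (a simplification for this
    sketch). Both the reference and the test frame are required, which is
    what makes the metric full-reference.
    """
    if len(reference) != len(test):
        raise ValueError("frames must have the same number of pixels")
    if not reference:
        raise ValueError("frames must not be empty")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no measurable distortion
    return 10 * math.log10(max_value ** 2 / mse)


# Made-up 8-pixel frames, for illustration only.
reference_frame = [52, 55, 61, 66, 70, 61, 64, 73]
test_frame = [54, 55, 60, 66, 71, 59, 64, 75]

print(psnr(reference_frame, test_frame))
print(psnr(reference_frame, reference_frame))  # inf
```

Note that the function has nothing to say about a video on its own: without a reference there is nothing to compare against.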

The scores range from one (1) to five (5), with one being the worst quality and five excellent. AVQT offers mid-point/floating scores like 4.5, for example, but there is no definition of how to map such scores to different ACR (Absolute Category Rating) categories. It is not obvious to a user whether 4.5 falls in the excellent or good category.
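The ambiguity is easy to reproduce. The sketch below maps a fractional score to the five standard ACR labels using two equally plausible rules. Both rules are assumptions made for illustration; neither is defined by AVQT.

```python
import math

# Standard five-point ACR labels.
ACR_LABELS = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}


def acr_round_half_up(score):
    """Hypothetical rule A: round to the nearest category, halves go up."""
    return ACR_LABELS[min(5, max(1, math.floor(score + 0.5)))]


def acr_floor(score):
    """Hypothetical rule B: a category counts only once it is fully reached."""
    return ACR_LABELS[min(5, max(1, math.floor(score)))]


for score in (4.5, 4.71, 4.75):
    print(score, acr_round_half_up(score), acr_floor(score))
```

Under rule A a score of 4.5 reads as "Excellent"; under rule B the same score reads as "Good". Without a published mapping, two teams can report the same number and mean different things.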

AVQT is a command-line tool without a GUI. Test and reference videos are passed along with a selection of options to begin the measurement process. Again, since the tool is macOS-only, it cannot be run on a Windows or Linux system.

Hits and misses — evaluating AVQT’s performance

There are many parts to an effective quality measurement. We’re going to use nine of the most important success criteria for a viewer experience metric to evaluate AVQT. 

Consistent across content types

In our review, we provided AVQT with two types of video: one simple frame without objects and one sports frame. These two videos had different perceptual qualities. AVQT gave both frames similar scores, but the viewing experience didn't reflect that. The perceptual quality of the second sample was far higher. If the score range is zero to 100, a 70 for animation should perceptually match a 70 for a sports or news broadcast.

Miss: AVQT does not fulfill this criterion. To be effective and valuable, a quality metric needs to be consistent across content types. If it's not, the content creator won't be able to truly understand what the perceptual quality of the content is.

[Figure: left, water frame 0 (CRF 40): AVQT 4.71, PSNR 46.1, SVS 89, banding 83. Right, motor frame 62 (CRF 33): AVQT 4.75, PSNR 39.9, SVS 90, banding 0.]
See the full comparison images and download the content and scores.

Consistent across impairments

Impairments can range from blockiness, to blurriness, to banding, to frame breakage due to packet loss. A reliable metric should be sensitive to all of them, not just to one category of impairments. Quality measurement must be checked across all content impairments. In our test, we evaluated whether AVQT detects banding (a contouring effect), which is very common in video nowadays, and AVQT did not detect this impairment.

Miss: AVQT does not fulfill this criterion. Our tests found AVQT to be reliable on some impairments but to skip many others altogether.

[Figure: frame 60, reference (left) compared with the test frame scored AVQT 4.60, PSNR 45.04, banding 81, SVS 83.]
See the full comparison images and download the content and scores.

Measuring source quality 

The adage “garbage in, garbage out” means a great deal when we’re talking about quality measurements. The score provided by AVQT after encoding is deemed to be reflective of the perceptual quality of the encoded content. The advantage of SSIMPLUS is that it measures the perceived quality of the encoder's outputs by considering the source quality as well as the encoder performance. AVQT cannot measure source video quality since it is only a full-reference metric.

Miss: AVQT does not fulfill this criterion. A terrible quality source is still terrible in people's eyes, even if the encoder does a perfect job.

[Figure: train frame 65 (CRF 50), full frame at 1080: AVQT 5, PSNR 100, SVS 15.]
See the full comparison images and download the content and scores.

Consistent across content attributes

Attributes like color play an essential role in the perceptual quality of the content, especially for HDR content. Again in our testing, AVQT wasn't consistent with its scores. The structure of the color in the source might be good while the perceptual quality is low, and AVQT failed to take into account issues with color and other attributes.

Miss: AVQT does not fulfill this criterion. AVQT appears to be color blind, as its scores did not reflect the major differences in color between the source video and test video. It also failed to catch issues with the HDR frames we tested and scored them as excellent quality.

[Figure: color bar frame 0, reference (left) compared with the color-distorted version.]
See the full comparison images and download the content and scores.

Support for devices

Viewers may watch one piece of content on their smartphones and then the same piece of content on their 65" HDR TVs. The experience will be different, so the quality measurement must account for the different device types. AVQT has some presets for viewing distance and allows users to set the resolution, but it does not allow you to set the size of the device, which is a gap.

We did two tests: 

  • First, we passed a 720p video as both source and output, set the display resolution to UHD (3840x2160), and varied the viewing distance from 1.5 times to 6 times the height of the device. The AVQT score remained the same across all four options. Watching content on a display with a higher resolution and bigger size always softens the rendition, and viewers may notice that more as they get closer to the display device.
  • Second, using several 640x360 videos and a display resolution of 3840x2160, we set two viewing distances of 1.5H and 3H. AVQT gave higher scores for the scenarios with the 1.5H distance, which is counterintuitive: as viewers get closer to the display, the impairments become more visible, and thus the video quality gets lower.
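To see why viewing distance should matter in both tests, it helps to work out how many pixels fall within one degree of the viewer's visual field. The sketch below is a geometric illustration only, not how AVQT or SSIMPLUS models viewing conditions. The function name is hypothetical, and it assumes the video is scaled to fill the full display height.

```python
import math


def pixels_per_degree(vertical_pixels, distance_in_heights):
    """Angular pixel density for a viewer sitting N display heights away.

    One pixel is H / vertical_pixels tall and the viewer is
    distance_in_heights * H away, so the physical height H cancels out.
    """
    pixel_angle_rad = 2 * math.atan(1 / (2 * distance_in_heights * vertical_pixels))
    return 1 / math.degrees(pixel_angle_rad)


for distance in (1.5, 3.0, 6.0):
    display = pixels_per_degree(2160, distance)  # native UHD display pixels
    source = pixels_per_degree(360, distance)    # 640x360 source upscaled to full height
    print(f"{distance}H: display {display:.1f} px/deg, 360-line source {source:.1f} px/deg")
```

At 1.5H the 360-line source works out to fewer than 10 source pixels per degree, and doubling the distance roughly doubles that density. Upscaling artifacts should therefore be easier to see up close, not harder.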

Miss: AVQT does not fulfill this criterion. Our tests showed that AVQT did not consider device size and did not penalize upscaling operations adequately. It doesn't provide an accurate or consistent quality measurement for different devices.

Support for viewer modes

An effective quality measurement needs to consider the different viewer types — from content creators to people of varying age groups and viewing habits. AVQT provides only one type of quality score that may not be a good representation of the variations of human subjective opinion. To capture these variances, SSIMPLUS has three different modes:

  • Typical - for regular end-viewers/streaming services customers who don’t tolerate watching their content with subpar quality.
  • Expert - for video experts who work in the media industry but are not golden eyes. They act as proxies for the end-viewers and are more critical of impairments compared to typical viewers.
  • Studio - for content creators and golden eyes who care about the preservation of creative intent and are very critical of any deviations from it. 

A reliable video quality metric has to consider the subjective assessment of all viewer types.

Miss: AVQT does not fulfill this criterion. Different viewers have different expectations and definitions of video quality.

Consistent across live and VOD

The measurement needs to be consistent with both live broadcasts and video-on-demand or streaming services. A common language for assessing quality is needed across workflows to report and monitor KPIs.

Miss: AVQT does not fulfill this criterion, as it does not have the functionality to score live content.

Temporal alignment

Another important attribute, especially for live video, is the ability to do temporal alignment, that is, matching each test frame to its corresponding reference frame before scoring.
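As a rough illustration of what temporal alignment involves, the sketch below searches for the frame offset that best lines up two sequences of per-frame signatures (for example, mean luma per frame). This is a generic brute-force approach under simplifying assumptions, not the method used by SSIMPLUS or any other product. The function name, parameters, and sample values are hypothetical.

```python
def best_offset(reference, test, max_offset, min_overlap=None):
    """Return the offset k such that test[i + k] best matches reference[i].

    reference and test are lists of per-frame signatures. The cost of an
    offset is the mean absolute difference over the overlapping frames.
    """
    if min_overlap is None:
        min_overlap = min(len(reference), len(test)) // 2
    best = None
    for offset in range(-max_offset, max_offset + 1):
        # Pair reference[i] with test[i + offset] wherever both frames exist.
        pairs = [
            (reference[i], test[i + offset])
            for i in range(len(reference))
            if 0 <= i + offset < len(test)
        ]
        if not pairs or len(pairs) < min_overlap:
            continue  # too little overlap to trust this offset
        cost = sum(abs(r - t) for r, t in pairs) / len(pairs)
        if best is None or cost < best[0]:
            best = (cost, offset)
    if best is None:
        raise ValueError("no offset leaves enough overlapping frames")
    return best[1]


# Made-up signatures: the test stream carries two extra leading frames.
reference = [10, 12, 30, 45, 44, 20, 15, 18, 40, 41]
test = [99, 98] + reference

print(best_offset(reference, test, max_offset=3))  # 2
```

A full-reference score computed without this step compares mismatched frames, so even a flawless encode can appear badly degraded.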

Miss: AVQT does not fulfill this criterion. AVQT does not provide this function, whereas it is embedded into SSIMPLUS.

Playback viewer experience

Lastly, AVQT doesn't take playback experience into account in its measurement. Whether it's startup delays, interruptions, or profile switching, playback experience can significantly impact viewer experience.
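For a sense of what playback-side measurement looks like, here is a minimal sketch that summarizes one viewing session. The data structure, field names, and the rebuffering-ratio definition (stall time divided by total watch time) are assumptions made for illustration; real players and analytics products define these quantities in their own ways.

```python
from dataclasses import dataclass


@dataclass
class PlaybackSession:
    """Hypothetical per-session playback log."""
    startup_delay_s: float    # time from play request to first rendered frame
    stall_durations_s: list   # length of each rebuffering interruption
    played_s: float           # seconds of content actually rendered
    profile_switches: int     # number of bitrate/resolution profile changes


def summarize(session):
    """Reduce a session to a few playback-experience indicators."""
    stalled = sum(session.stall_durations_s)
    watch_time = session.played_s + stalled
    minutes_played = session.played_s / 60
    return {
        "startup_delay_s": session.startup_delay_s,
        "stall_count": len(session.stall_durations_s),
        "rebuffer_ratio": stalled / watch_time if watch_time else 0.0,
        "switches_per_minute": (
            session.profile_switches / minutes_played if minutes_played else 0.0
        ),
    }


# Made-up session, for illustration only.
session = PlaybackSession(
    startup_delay_s=2.0,
    stall_durations_s=[1.5, 0.5],
    played_s=598.0,
    profile_switches=4,
)
print(summarize(session))
```

None of these indicators can be derived from a reference video and a test video alone, which is why a purely file-based score leaves them out.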

Miss: AVQT does not fulfill this criterion. A reliable quality metric should measure every point in the video delivery chain to ensure the best quality viewing experience.

In conclusion

Consistency is critical when it comes to measuring video quality — both in encoding and viewer experience. AVQT shares many of the same shortcomings as VMAF but adds in the inability to be used on anything other than a macOS-powered workstation. We spoke in more detail about our thoughts on AVQT at the Kitchener Waterloo Video Technology Meetup. You can watch the video below:

https://www.youtube.com/watch?v=dCKAygKwqiU&t=2033s