
HeadSpin Video Quality MOS

Reference-free Subjective Video Quality Mean Opinion Score (MOS): What is it?

A mean opinion score (MOS) is the average of the subjective scores provided by a set of users in a study of perceived content quality. The MOS quantifies the subjective perception of video content with a single number, where the alternative is tracking potentially dozens of metrics that only loosely or occasionally correlate with perceived video quality. The metric is computed by averaging the subjective scores of a large number of users while accounting for user bias; increasing the number of users and removing as much bias as possible is critical to making the MOS as representative of the content as possible. The HeadSpin Video Quality MOS provides an estimate of the true mean subjective quality score of the video content.
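As a simple illustration of the basic computation, the following sketch averages a set of hypothetical viewer ratings (the actual HeadSpin bias-correction and aggregation techniques are considerably more sophisticated):

```python
def mean_opinion_score(scores):
    """Average a list of subjective quality scores on a 1-5 scale."""
    return sum(scores) / len(scores)

# Hypothetical ratings from five viewers for one video clip
ratings = [4, 5, 3, 4, 4]
mos = mean_opinion_score(ratings)
print(mos)  # 4.0
```

With more raters, the estimate converges toward the true mean subjective quality score, which is why a large user pool matters.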

Traditionally, video quality estimation techniques have relied on comparing an unaltered source video to a transformed version of that video, where the transformation usually involves some lossy process that introduces spatial or temporal defects into the content. In many cases, a reference video cannot be obtained (for example, with streaming content providers) or simply does not exist (as with gaming applications or live-streaming video). The HeadSpin Video Quality MOS implements a reference-free video quality algorithm that does not rely on any comparison to a source video. Instead, the algorithm computes a MOS time series for an input video that best estimates the MOS actual humans would provide.

How can you use it?

The patent-pending HeadSpin Video Quality MOS time series can be used in a variety of ways. For example, when paired with the HeadSpin Poor Video Quality Issue Card, the UI surfaces perceptual video quality issue regions on the time series. These poor video quality regions can help reveal correlations between video quality issues and other application-related metrics. The MOS time series can also be used in conjunction with the range of HeadSpin video quality metrics to help understand which metrics have the largest impact on the user's quality of experience. Additionally, if a reference video or session is available, the MOS of a subsequent or parallel test can be compared against the reference to estimate a change in MOS, providing a comparison of video quality. This can be done exactly (from video playback start to end) or more coarsely by comparing average MOS values or MOS distributions between sessions.
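The coarse session-comparison idea can be sketched as follows; the per-frame MOS values here are hypothetical, and this is only one of several reasonable ways to compare sessions:

```python
from statistics import mean

def compare_sessions(reference_mos, test_mos):
    """Compare the average MOS of a test session against a reference
    session. Returns the change in mean MOS (positive = improvement)."""
    return mean(test_mos) - mean(reference_mos)

# Hypothetical per-frame MOS time series from two sessions
reference = [4.1, 4.0, 3.9, 4.2]
test = [3.5, 3.4, 3.6, 3.7]
delta = compare_sessions(reference, test)
print(round(delta, 2))  # -0.5
```

A negative delta here would suggest the test session was perceived as lower quality than the reference; comparing full MOS distributions (rather than means) gives a more detailed picture.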

How it works

As simple as video in → HeadSpin Video Quality MOS time series out.

The HeadSpin Video Quality MOS time series can be generated for videos captured directly on our platform or for videos supplied to our server through an API. The algorithm delivers an estimate of the mean opinion score for each frame in the supplied video with a minimum MOS value of 1 and maximum MOS value of 5, corresponding to perceptual video qualities of Very Poor and Excellent, respectively. This estimated MOS is computed by considering an aggregated region of video data around the frame index as the analysis steps through the frames in the video.
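The per-frame windowed computation described above can be sketched like this; `score_fn` is a hypothetical stand-in for the actual model, and the windowing shown is illustrative rather than the production implementation:

```python
def mos_time_series(frames, window=5, score_fn=None):
    """Sketch of per-frame MOS estimation: score each frame from an
    aggregated region of frames around its index, clamped to [1, 5]."""
    if score_fn is None:
        score_fn = lambda region: 3.0  # placeholder for the real model
    half = window // 2
    series = []
    for i in range(len(frames)):
        # Aggregated region of video data centered on the frame index
        region = frames[max(0, i - half): i + half + 1]
        mos = min(5.0, max(1.0, score_fn(region)))
        series.append(mos)
    return series
```

The result is one MOS value per frame, bounded between 1 (Very Poor) and 5 (Excellent), matching the output range described above.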

[Image: HeadSpin UI example]

Building our training data.

Human-evaluated subjective quality score labels are a critical component of the AI-based HeadSpin Video Quality MOS algorithm. Without these labels, the algorithm would be unable to accurately predict the video quality score.

[Image: annotation framework]

To gather these labels, HeadSpin developed a mobile annotation application that allows users to score the quality of videos displayed on their mobile device. Developing a dataset that incorporates a wide variety of video content is critical to ensuring that the AI generalizes well. The videos that comprise our dataset are collected from recordings of real video streams on real devices on the HeadSpin device cloud. When videos are shown to users for annotation in the HeadSpin mobile annotation app, users view them exactly as they were recorded under real-world conditions.

Additionally, ensuring the quality of the labels provided by users is critical to ensuring that the AI makes accurate predictions. Validation of the labels produced by users of the mobile annotation application is done in multiple ways. By randomly injecting videos with known ground-truth MOS scores into user label sessions, we can estimate a bias for each user on our annotation platform. Furthermore, reintroducing previously labeled videos allows us to estimate the self-consistency of any given user. HeadSpin developed a set of statistical techniques to produce an accurate estimate of the MOS from the labels collected on our annotation platform.
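The ground-truth-injection idea can be illustrated with a minimal sketch; the labels and the simple mean-difference bias estimate below are hypothetical stand-ins for HeadSpin's actual statistical techniques:

```python
from statistics import mean

def annotator_bias(user_labels, ground_truth):
    """Estimate a user's bias as the mean difference between their
    labels and the known ground-truth MOS of injected videos."""
    return mean(l - g for l, g in zip(user_labels, ground_truth))

# Hypothetical labels for three injected ground-truth videos
user_labels = [4.0, 3.5, 5.0]
ground_truth = [3.5, 3.0, 4.5]
bias = annotator_bias(user_labels, ground_truth)
print(bias)  # 0.5

# A bias-corrected label would subtract this offset before averaging
corrected = [l - bias for l in user_labels]
```

A consistently positive bias like this would indicate a user who rates videos more generously than the ground truth, and their labels can be adjusted accordingly.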

Here at HeadSpin, we have gone to great lengths to gather video content and develop the experimental tooling and techniques that allow our AI both to generalize to video content it has never seen before and to perform accurately and consistently on the content it was trained on.

Using convolutional neural networks to extract layers of knowledge.

Convolutional neural networks are a class of machine learning algorithms widely used in the field of computer vision. They are a tried and tested family of models relied on in both production systems and cutting-edge research. These networks are fundamentally powered by sets of filters applied in various ways to an image. The magic of convolutional neural networks lies in using optimization algorithms to learn what each filter at each stage actually filters from the image. This is done by supplying the model with millions of images and allowing the algorithm to modify the filters so that they work well across as many images as possible.

This type of algorithm is typically used to classify an image as containing, say, a dog, a cat, or a stop sign. However, due to the layered nature of convolutional neural networks, we can "peel off" layers, such as the classification layer, until we reach deeper, more abstract representations of the input image. These abstract representations can encode interesting high-level features, such as the content of the image (dog, cat, etc.), as well as low-level features, such as geometric details. Convolutional neural networks develop these representations by passing an input image through a set of filters and combining the filter outputs in multiple ways. During training, the network observes millions of images and adjusts each filter to best fit the images it has seen. For example, a network may learn that vertical lines are an important detail to extract from an image, so it will learn a filter that is sensitive to that feature. For high-level features, it may learn that dog and cat tails are an important feature, so it will learn a filter that is sensitive to bushy appendages at the back of a body. These learned filters are an efficient way of extracting both high- and low-level detail from an image. By using convolutional neural networks, we derive generic abstract representations of our videos, making use of large datasets to let our AI determine the best way to use those abstractions.
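To make the filter idea concrete, here is a minimal pure-Python 2D convolution with a hand-written vertical-edge filter; in a real CNN such filters are learned from data rather than specified by hand, and this toy example is not part of the HeadSpin model:

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation) in pure Python."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Hand-written vertical-edge filter; a CNN learns filters like this
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Tiny image with a vertical edge: dark left half, bright right half
image = [[0, 0, 0, 1, 1, 1]] * 3

response = convolve2d(image, vertical_edge)
print(response)  # [[0, 3, 3, 0]] -- strongest response at the edge
```

The filter responds strongly exactly where the vertical edge lies, which is the kind of localized feature extraction that, stacked over many layers, yields the abstract representations described above.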

[Image: network diagram]

Tree-based algorithm to capture complicated, non-linear relationships.

Similar to neural networks, tree-based algorithms are able to capture non-linear relationships that exist in a set of data. This class of algorithms has many advantages, including robustness and strong performance even on limited training data. Tree-based models form the foundation of many sophisticated machine learning models used in industry today and are trusted widely across many fields, from finance to physics.

For the HeadSpin Video Quality MOS algorithm, we use a tree-based model to map convolutional neural network-encoded video streams to labels provided by users of our video annotation mobile application. This multi-model process allows us to leverage the power of convolutional neural networks to extract deep information while relying on robust tree-based models to predict the MOS given the abstract representations of the video.
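A toy illustration of the tree-based stage follows: a one-split regression tree (decision stump) fit on a single hypothetical "embedding" value per video. The production pipeline maps full CNN encodings through a much richer tree model; none of the data or names below come from it:

```python
def fit_stump(features, targets):
    """Fit a one-split regression tree on a single feature by picking
    the threshold that minimises total squared error."""
    best = None
    for t in sorted(set(features)):
        left = [y for x, y in zip(features, targets) if x <= t]
        right = [y for x, y in zip(features, targets) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

# Hypothetical 1-D embedding value per video vs. its annotated MOS
embeddings = [0.1, 0.2, 0.8, 0.9]
mos_labels = [2.0, 2.2, 4.5, 4.7]
predict = fit_stump(embeddings, mos_labels)
print(round(predict(0.15), 2), round(predict(0.85), 2))  # 2.1 4.6
```

Even this single split captures a non-linear step relationship between the embedding and the MOS; ensembles of deeper trees extend the same idea to high-dimensional CNN representations.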

HeadSpin Video Quality MOS Overview

Expectations for the HeadSpin MOS algorithm.

The HeadSpin Video Quality MOS is a deterministic algorithm, meaning that any given input is guaranteed to map to exactly one output. This guarantee allows users to perform absolute comparisons with the HeadSpin Video Quality MOS: videos captured on different devices with identical frame content will have identical MOS time series results.

Furthermore, the HeadSpin Video Quality MOS algorithm is independent of the input video frame dimensions to the extent that those dimensions do not affect the perception of the video. For example, a video stream that is likely to be perceived as low quality but is captured at high resolution will still produce correspondingly low values in its HeadSpin Video Quality MOS time series.

[Image: HeadSpin MOS example]

Why use a machine learning approach?

Cutting-edge machine learning approaches can estimate the human-perceived MOS more accurately than the traditional approaches used by the industry. In addition, a machine learning approach is well suited to working without a reference video: the model learns how a human would perceive the quality of the content. As discussed above, complicated spatial or temporal structure can be more easily exposed via a convolutional neural network, while highly non-linear relationships between these structural elements can be learned via tree-based algorithms. Furthermore, this is ultimately a computer vision model, and convolutional neural networks are generally regarded as the best tool currently available for computer vision tasks. They allow us to leverage the value of large datasets rather than spending time developing heuristic models with limited scope and application. This class of models generalizes better and can build on previous research and development to attain higher accuracy even in data-limited environments.

How to achieve the most value from the HeadSpin MOS?

To gain the most value from the HeadSpin Video Quality MOS algorithm, we recommend considering the following:

  • Develop distinct experiments

Create sessions or upload videos so that each session or upload encapsulates a distinct, independent test case. This will allow for easier comparison across test cases and will ensure that the statistical aggregate values of the MOS time series for each session or upload are meaningful.

  • Supply the algorithm with rich media content

Focus on capturing sessions or uploading videos that display some rich media content, such as streaming video or video gaming content. The HeadSpin Video Quality MOS algorithm generalizes best to the type of content it was trained on, which includes but is not limited to streaming video, live-streaming video, mobile gaming, and console gaming.