Google has proposed new metrics for evaluating the quality of AI-generated audio and video, in a bid to establish a more widely adopted way of measuring how realistic synthesized content is.
The tech behemoth said in a Tuesday blog post that it has made strides in establishing a more accurate way of measuring AI-generated content against what the machine was trained on.
To explain this, Google researchers Kevin Kilgour and Thomas Unterthiner used the example of a model that generates videos of StarCraft video game sequences.
“Clearly some of the videos shown below look more realistic than others, but can the differences between them be quantified?” the researchers wrote. “Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist.”
Which is best?
To better quantify the accuracy of machine-generated content, Google proposes two new metrics: the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD).
“We document our large-scale human evaluations using 10k video and 69k audio clip pairwise comparisons that demonstrate high correlations between our metrics and human perception,” the researchers said.
Building on the Fréchet Inception Distance
The two metrics build on the principles of the Fréchet Inception Distance (FID), a similar metric designed specifically for images. FID takes a large number of images from both the target distribution and the generative model, and uses the Inception object-recognition network to embed each image into a lower-dimensional space that captures its important features.
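Once both sets of samples are embedded, the Fréchet distance compares them by fitting a Gaussian to each set and measuring how far apart the two Gaussians are. The sketch below illustrates that final step with NumPy and SciPy; the random arrays stand in for real embeddings, and the function name and data are illustrative, not Google's actual implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fit to two sets of embeddings.

    Each input is an (n_samples, dim) array of feature embeddings, e.g.
    produced by an Inception-style network. The distance is
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 * sqrt(S_a @ S_b)).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny
    # imaginary parts that numerical error can introduce.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

# Toy embeddings: a "real" distribution, one close to it, one far away.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
close = rng.normal(0.05, 1.0, size=(2000, 8))
far = rng.normal(2.0, 1.5, size=(2000, 8))
print(frechet_distance(real, close) < frechet_distance(real, far))  # True
```

A lower score means the generated distribution sits closer to the real one, which is why the metric can rank models without needing a per-sample ground truth.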
Unlike other popular metrics, FVD evaluates entire videos, avoiding the drawbacks of frame-by-frame comparisons. FAD, meanwhile, is reference-free and can be applied to any type of audio; existing metrics either require a time-aligned ground-truth signal or target a specific domain such as speech quality, Google said.
Since human judgement is the gold standard for what looks and sounds realistic, Google’s team of researchers conducted a large-scale human study to determine how well the proposed metrics align with human judgement of AI-generated audio and video.
Humans examined 10,000 video pairs and 69,000 five-second audio clips. For FAD, they compared the effect of two different distortions on the same audio clip, randomizing both the pair and the order in which they appeared.
Testers were asked which clip in each pair sounded more like a studio-produced recording, and the study found that FAD “correlates quite well” with human judgement.
“We are currently making great strides in generative models. FAD and FVD will help us keep this progress measurable, and will hopefully lead us to improve our models for audio and video generation,” the team said.