HeadSpin's custom Bluetooth board and analysis API allow you to test voice assistants, validate streaming media, and work with voice calls on real devices.
Previously, we looked at how to capture audio playback during testing. Now, it's time to look at how to verify that the audio matches our expectations! Those expectations themselves need to be expressed in audio form. In other words, what we want to do is take the audio we've captured from a particular test run, and assert that it is in some way similar to another audio file that we already have available. We can call this latter audio file the "baseline" or "gold standard", against which we will be running our tests.
In the first part, the audio we care about verifying is a sound snippet from one of my band's old songs. So, what we need to do is save a snippet we know to be "good", so that future versions of the tests can be compared against this. I've gone ahead and copied such a snippet into the resources directory of the Appium Pro project. The state of our test as we left it in the previous part was that we had captured audio from an Android emulator, and asserted the audio had been saved, but had not done anything with it. Here's where we are starting out from now, with our new test class that inherits from the old test class:
What we now need to do is assert that our audioCapture File in some sense matches our gold standard. But how on earth would we do that?
Audio file similarity
As a naive approach, we could assume that similar WAV files might be similar on a byte level. We could try to use something like an MD5 hash of the file and compare it with our gold standard. This, however, will not work. Unless the WAV files are exactly the same, the MD5 hashes will be completely unrelated. We could get slightly more complicated, and actually read the WAV file as a stream of bytes and compare each byte of our captured WAV with the corresponding byte of the baseline WAV. This approach, unfortunately, is also doomed to fail! Tiny differences in sound would lead to huge differences on a byte level. Also, if the start of the two recordings is offset by even a single sample period (and there are many thousands of samples per second), the bytes no longer line up, nearly every position will differ, and our comparison will be utter garbage.
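To make the problem concrete, here is a minimal sketch of both naive approaches, using only the Java standard library. The class name, method names, and the synthetic "samples" are all made up for illustration; real WAV data would behave the same way, but this is not code from the Appium Pro project.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class NaiveAudioComparison {

    /** Returns the MD5 digest of the data as a 32-character hex string. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Counts the positions (up to the shorter length) where the arrays differ. */
    static int countMismatches(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int mismatches = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /** Hypothetical stand-in for audio data: a deterministic byte pattern. */
    static byte[] syntheticSamples(int n) {
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) ((i * 37) % 251);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] baseline = syntheticSamples(1000);

        // Flip a single bit: the two hashes share nothing recognizable.
        byte[] oneBitOff = baseline.clone();
        oneBitOff[500] ^= 1;
        System.out.println(md5Hex(baseline));
        System.out.println(md5Hex(oneBitOff));

        // Drop the first byte, as if the recording started slightly late:
        // the byte-by-byte comparison now disagrees almost everywhere.
        byte[] shifted = Arrays.copyOfRange(baseline, 1, baseline.length);
        System.out.println(countMismatches(baseline, shifted) + " mismatching positions");
    }
}
```

Neither result tells us anything about how similar the two clips actually sound, which is why we need a different approach.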
What we will do instead is take advantage of work that has been done in the world of audio fingerprinting. Fingerprinting is what lies behind services like Shazam or last.fm that can detect what song is being played even though it might be recorded through your phone's microphone. Fingerprinting is a complex algorithm that takes into account various acoustic properties of WAV file segments and produces what is essentially a hash of a piece of audio. The important thing is that similar audio files will produce more similar hashes, so they can actually be fruitfully compared with one another.
The fingerprinting library we will use is called Chromaprint, and you will need to download the appropriate version for your system. Just like with ffmpeg, we will run the Chromaprint binary as a Java subprocess. The way we'd run it outside of Java, on the command line, would be like this:
This will produce output that corresponds to the fingerprint of the audio file. Using the -raw flag means we get the raw numeric output rather than the base64-encoded output (which is nice and small but makes comparisons between fingerprints less discriminating). Running from the terminal, the output will look something like:
But we want to run this from Java, so we need a handy class that encapsulates all this fingerprinting business, including running Chromaprint's fpcalc binary. It will also be responsible for parsing the response and storing it in a way that makes comparison easy:
Basically, what's going on here is that we are setting a path to the fpcalc binary, and then using the ProcessExecutor Java library (from the good folks at ZeroTurnaround) to make executing fpcalc very easy. We then use regular expression matching on the output to extract a fingerprint from an audio file. Most of the code here is simply Java class boilerplate and regular expression logic!
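For readers who want to see the shape of the fingerprinting step without pulling in an extra dependency, here is a rough sketch that swaps ProcessExecutor for the standard library's ProcessBuilder. It is not the class from the Appium Pro project. The class name, the fpcalc install path, and the assumption that fpcalc prints a line beginning with FINGERPRINT= followed by comma-separated integers are all mine, so check them against the output of your own fpcalc build.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FingerprintSketch {

    // Hypothetical install location; adjust for your system.
    private static final String FPCALC = "/usr/local/bin/fpcalc";

    // Assumed output format: a line such as "FINGERPRINT=123,456,789".
    private static final Pattern FINGERPRINT_LINE =
            Pattern.compile("FINGERPRINT=([-0-9,]+)");

    /** Extracts the comma-separated raw fingerprint from fpcalc's output. */
    static String parseFingerprint(String fpcalcOutput) {
        Matcher m = FINGERPRINT_LINE.matcher(fpcalcOutput);
        if (!m.find()) {
            throw new IllegalArgumentException("No FINGERPRINT= line in fpcalc output");
        }
        return m.group(1);
    }

    /** Runs "fpcalc -raw <file>" as a subprocess and returns the raw fingerprint. */
    static String fingerprintOf(File wav) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(FPCALC, "-raw", wav.getAbsolutePath())
                .redirectErrorStream(true) // fold stderr into stdout for error messages
                .start();
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("fpcalc exited with " + exitCode + ": " + output);
        }
        return parseFingerprint(output);
    }
}
```

Keeping the parsing in its own method means the regular expression can be unit tested against a canned string, without needing fpcalc installed on the machine running the tests.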
The most important bit is the compare method, where we are making use of something called the Levenshtein distance between strings to figure out how similar two audio fingerprints really are. To this end I'm using a library called JavaWuzzy (a port of the useful Python library FuzzyWuzzy), which contains the important algorithms so I don't need to worry about implementing them. The response of my call to the partialRatio method is a number between 0 and 100, where 100 is a perfect match and 0 signifies no matching segments at all.
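If you're curious what a Levenshtein-based score looks like under the hood, here is a deliberately simplified sketch. It is not JavaWuzzy's partialRatio, which (as I understand it) also searches for the best-matching substring of the longer input. This version just normalizes the plain edit distance onto the same 0 to 100 scale, and the class and method names are made up for illustration.

```java
final class FingerprintSimilarity {

    /** Classic dynamic-programming Levenshtein distance, keeping two rows in memory. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the edit distance onto 0..100, where 100 means identical strings. */
    static int similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100; // two empty strings are trivially identical
        }
        return (int) Math.round(100.0 * (longest - levenshtein(a, b)) / longest);
    }
}
```

Calling similarity on two raw fingerprint strings gives a rough feel for how the score responds to small differences, though for real tests the library's tuned algorithms are likely the better choice.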
All we need to do, then, is hook this class up into our test so that we can fingerprint both our newly captured audio and the baseline audio, and then run the comparison. In my experiments, I was able to achieve a value of about 75 for a correct comparison, whereas other song snippets came in at an appropriately lower value, say 45. Of course, you'll want to determine through experimentation what your similarity threshold should be, based on the particular audio domain, clip length, and so on.
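One simple way to turn such experiments into a number is to collect scores for clips that should match and clips that should not, and put the threshold halfway between the two groups. The helper below is only a sketch of that idea, with a made-up class and method name; it is not something from the Appium Pro project, and a midpoint is just one reasonable starting point.

```java
import java.util.Arrays;

final class ThresholdSketch {

    /**
     * Returns a threshold halfway between the lowest score seen for a
     * matching clip and the highest score seen for a non-matching clip.
     */
    static int midpointThreshold(int[] matchingScores, int[] nonMatchingScores) {
        int lowestMatch = Arrays.stream(matchingScores).min()
                .orElseThrow(() -> new IllegalArgumentException("No matching scores given"));
        int highestNonMatch = Arrays.stream(nonMatchingScores).max()
                .orElseThrow(() -> new IllegalArgumentException("No non-matching scores given"));
        if (lowestMatch <= highestNonMatch) {
            // The two groups overlap, so no single cut-off separates them cleanly.
            throw new IllegalStateException("Scores overlap; no clean threshold exists");
        }
        return (lowestMatch + highestNonMatch) / 2;
    }
}
```

With the two figures mentioned above, midpointThreshold(new int[] {75}, new int[] {45}) returns 60. In practice you would want many more samples in each group before trusting the result.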
Hooking in the new code is relatively easy (starting from the point in the test method where we have the audioCapture file populated with the new audio):
Here, I've added a helper method called getReferenceAudio() to get me the baseline audio File object from the resources directory. And notice the assertion in the final line, which turns this bit of automation into a bona fide test of audio similarity!
So, when all is said and done, it is possible to test audio with Appium and Java (and since we are using ffmpeg and Chromaprint as subprocesses, the same technique can be used in any other programming language as well). This is relatively unexplored territory, though, so I would expect there to be a certain amount of potential flakiness for this kind of testing. That being said, the Chromaprint fingerprinting algorithm is used commercially and appears to be quite good, so at the end of the day the quality of the test will depend on the quality, length, and genre of your audio. Please do let me know if you put this into practice as I'd love to hear any case studies of this technique. And don't forget to check out the full code sample on GitHub, to see everything in context. Happy testing, and happy listening! Oh, and in case you really wanted to know: yes, my band will be coming out with a new studio album very soon!