How does music recognition work with Shazam app?
It’s a typical scenario: You listen to a song at a restaurant or an event, and then that song stays with you, haunting you, forcing you to find its source and singer.
Earlier, the only option was to hum that song, or a few verses of it, and ask friends and family to help hunt down the source.
But in 1999, this changed, because that year a magical app called Shazam was born.
What Exactly Is Shazam?
Shazam is an app that can recognize music, movies, advertising, and television shows, and showcase the source and other details about that content.
It seems magical, right?
In this blog, we will decode the inner workings of Shazam and see exactly how it recognizes music.
But first, let’s have an overview of Shazam, and find out some startling facts about this app.
Shazam: A brief history
Shazam was founded in 1999 by Chris Barton, Philip Inghelbrecht, Avery Wang, and Dhiraj Mukherjee, who teamed up to create a system that could recognize music and other related content and reveal everything about it.
In 2018, Apple acquired Shazam for $400 million, making it one of the biggest acquisitions of any mobile app at that time.
As of now, Shazam is available not only on iOS via the Apple App Store, but also on Android, macOS, watchOS, Wear OS, and as a Google Chrome extension.
Stunning facts about Shazam
As of September 2022, Shazam has more than 225 million global monthly users, and it’s expanding at a rapid pace.
In order to recognize songs and TV/video content, Shazam has acquired 200 patents as of now. The app also holds more than 12 billion tags, which categorize music and video content based on user inputs.
In 2015, it was found that 5% of all music downloads across the world originated from Shazam, making it one of the biggest drivers of music discovery anywhere in the world.
And one big statistic for businesses, looking for a solid platform to advertise their products: Advertisers on the Shazam app receive an average of 1 million clicks per day!
And one interesting piece of trivia: the most searched (Shazamed) song ever is Dance Monkey, with a record 41 million searches, and Drake is the most Shazamed artist ever, with 350 million hits.
An incredible achievement for a tech-powered mobile app!
How Shazam works: Understanding music recognition algorithms & fingerprinting
The operational model of the Shazam app is simple: the app listens to at most 20 seconds of a song or of video content from TV, a movie, an ad, etc. The sample can be a chorus, a verse, or a mere intro; the app then instantly recognizes that content and shows the results.
An important thing to note: no matter how long that song or content is, the Shazam app only reads the first 20 seconds of the recording.
Now, once that sample is fed into the Shazam app, it will:
- Convert the recorded analog signal into digital form
- Create an audio fingerprint of that sample
- Deploy music recognition algorithms to tell you exactly which song or content it is.
Now, this process of recognizing music is not that simple!
There are tons of processes and algorithms that work in tandem to reveal the exact source of the music and content.
In 2003, one of the inventors of the Shazam app, Avery Li-Chung Wang, shared the magic behind Shazam in a research paper, revealing for the first time how the app works.
Understanding the elements of sound
First, let’s understand what sound is.
As per science, sound is a vibration that propagates as a mechanical wave of pressure and displacement, usually through air, or through water in some cases.
The three main components of sound are frequency, time, and amplitude.
Amplitude is the loudness of the sound, which is actually the size of the vibration.
Frequency, measured in Hertz (Hz), is the rate at which the vibration occurs. A human being can only hear sounds whose frequency lies between 20 Hz and 20,000 Hz.
And time is important because it shows at which time interval a sound has occurred, in relation to other sounds.
This is important to know because when a song is produced, it contains sounds from different instruments that vary in frequency and amplitude as they move through time in relation to one another.
This is why two different versions of the same song will still generate distinct fingerprints: the interplay of frequency, amplitude, and time is unique to each recording.
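To make frequency and amplitude concrete, here is a minimal Python sketch (a toy illustration, not Shazam's code) that generates a pure 440 Hz tone and recovers its frequency with a naive discrete Fourier transform:

```python
import cmath, math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); fine for a short demo."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

sample_rate = 8000                     # samples per second
n = 800                                # 0.1 s of audio
freq, amplitude = 440.0, 0.5           # an A4 tone at half loudness
signal = [amplitude * math.sin(2 * math.pi * freq * t / sample_rate)
          for t in range(n)]

spectrum = dft(signal)
# Only the first half of the bins is meaningful for a real-valued signal.
peak_bin = max(range(n // 2), key=lambda k: abs(spectrum[k]))
print(peak_bin * sample_rate / n)      # → 440.0
```

The peak bin lands exactly at 440 Hz because the tone completes a whole number of cycles in the window; a real recording mixes many such components, each with its own frequency and amplitude, changing over time.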
Creating unique audio fingerprint
Once the Shazam app records the first few seconds of a song or any audio content (at most 20 seconds), it will create a unique audio fingerprint of that recording.
For that, the recorded sound is converted into a spectrogram, wherein the X-axis represents time, the Y-axis represents frequency, and the density of the shading represents amplitude.
For each section of the audio file, the algorithm picks the strongest peaks, gradually reducing the spectrogram to a scatter plot of points. At that point, the amplitude itself is no longer needed: only the time and frequency of each peak are kept.
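Here is a minimal sketch of this peak-picking step, assuming a naive DFT-based spectrogram and a fixed number of peaks per frame (the function names and parameters are illustrative, not Shazam's actual implementation):

```python
import cmath, math

def spectrogram(samples, window=256, hop=128):
    """Magnitude spectrogram via a naive DFT on each window (O(n^2), demo only)."""
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        mags = [abs(sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / window)
                        for t in range(window)))
                for k in range(window // 2)]
        frames.append(mags)
    return frames  # frames[frame_index][frequency_bin]

def constellation(frames, peaks_per_frame=3):
    """Keep only the strongest peaks per frame as (frame, bin) points;
    once the peaks are chosen, the amplitude itself is discarded."""
    points = []
    for i, mags in enumerate(frames):
        top = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
        points.extend((i, k) for k in sorted(top[:peaks_per_frame]))
    return points

# A pure 500 Hz tone at an 8000 Hz sample rate lands exactly in
# bin 16 (500 / 8000 * 256 = 16), so every frame's peak is bin 16.
tone = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(1024)]
points = constellation(spectrogram(tone), peaks_per_frame=1)
print(points[:2])   # → [(0, 16), (1, 16)]
```

The resulting (time, frequency) points are the scatter plot the text describes: a compact skeleton of the audio that survives background noise far better than the raw signal.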
And this is the crux of Shazam’s operations.
The fingerprint of the user’s recording is then matched against the fingerprints stored in the database to find the exact song that was fed into the system.
There is an advanced process called combinatorial hashing, which is deployed to create exact and unique fingerprints of the audio files and to ensure that the matching is accurate.
This is how it works:
- Each anchor-point pair is first stored in a table containing the frequency of the anchor, the frequency of the point, and the time delta between the anchor and the point; together, these values form the hash. (Table #1)
- There is another table that contains the time between the anchor and the beginning of the audio file. (Table #2)
- The hash is then linked with Table #2.
- The files in the database also have unique IDs, which are used to extract more information about the song, such as the singer’s name, the song’s title, and more.
- The anchor-point pairs from the user’s recording are first sent to Shazam’s database to search for exact matches of those hashes.
- This search returns the audio fingerprints of all songs that contain a given hash (formulated via combinatorial hashing).
- Once all the possible matches for the user’s recording are located, the time offset between the beginning of the user’s recording and the beginning of each possible match is computed.
- If a significant number of matching hashes share the same time offset, then bingo! It’s definitely the same song.
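The two tables and the matching step above can be sketched as follows. This is a toy illustration: the hash layout, the `fan_out` parameter, and the in-memory `database` dict are all assumptions, not Shazam's production design.

```python
from collections import defaultdict

def hashes(points, fan_out=5):
    """points: time-sorted (time, freq_bin) constellation points.
    Each anchor is paired with the next few points; the hash packs
    (anchor freq, point freq, time delta) as in Table #1, and the
    anchor time plays the role of Table #2."""
    out = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:i + 1 + fan_out]:
            out.append(((f1, f2, t2 - t1), t1))
    return out

database = defaultdict(list)          # hash -> [(song_id, anchor_time)]

def index_song(song_id, points):
    for h, t in hashes(points):
        database[h].append((song_id, t))

def match(sample_points):
    """Vote per (song, time offset); many votes at one offset
    mean the sample lines up with that song."""
    votes = defaultdict(int)
    for h, t_sample in hashes(sample_points):
        for song_id, t_db in database.get(h, []):
            votes[(song_id, t_db - t_sample)] += 1
    (song_id, offset), score = max(votes.items(), key=lambda kv: kv[1])
    return song_id, offset, score

index_song("song_a", [(0, 10), (1, 20), (2, 30), (3, 40), (4, 50)])
index_song("song_b", [(0, 11), (1, 22), (2, 33), (3, 44), (4, 55)])

# A snippet of song_a starting one time step in, re-zeroed to t = 0:
snippet = [(0, 20), (1, 30), (2, 40), (3, 50)]
print(match(snippet))   # → ('song_a', 1, 6)
```

Note how the snippet still matches even though its timestamps start at zero: only the *relative* offsets inside each hash and the agreement on a single time offset matter.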
Here is the end-to-end flow, in brief:
- The user records a song via the Shazam app
- This analog signal is converted into digital form
- The digital signal is converted into the frequency domain via the Fourier Transform
- A unique audio fingerprint is created from the spectrogram
- This fingerprint is compared with candidate matches in the Shazam database
- Via combinatorial hashing, the exact match is found
- The user gets the details of the audio content
How Shazam matches the songs & provides the results
At this point, we have the unique fingerprints of both audio files. Now, the actual process of matching the songs starts.
This is how it works:
Now, if we plot the matching hashes on a scatter plot, where the Y-axis represents the time at which a hash occurs in the user’s recording and the X-axis represents the time at which it occurs in the database’s recording, then the matching hashes will form a diagonal line.
At the same time, if the time offsets are plotted on a histogram, there will be a spike at the correct offset.
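The offset histogram can be illustrated in a few lines; the matched time pairs below are made-up values, not real data:

```python
from collections import Counter

# Hypothetical matched hashes: (time_in_sample, time_in_db) pairs.
# Four of them agree on a db-minus-sample offset of 30; two are
# spurious collisions at random offsets.
matched = [(0, 30), (1, 31), (3, 33), (5, 35), (2, 10), (4, 50)]

offsets = Counter(t_db - t_sample for t_sample, t_db in matched)
best_offset, count = offsets.most_common(1)[0]
print(best_offset, count)   # → 30 4
```

The spike (four votes at offset 30) is the histogram peak the text describes; random collisions scatter across other offsets and never accumulate.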
And this entire process of finding the exact match takes only a few seconds.
If you wish to know more about how Shazam’s highly advanced algorithm for recognizing songs and other audio content works, or if you wish to use the same logic and process to create your own music recognition app, our System Architects and Mobile App Engineers at TechAhead can help you.