Automatic Speech Recognition - Transcribing Lyrics

Understand singing content. Written in 2018.

Photo by Dmitry Demidov

Spoken content (speech) can now be retrieved, browsed, summarized, and comprehended successfully, but songs, which also appear in much multimedia content, have not been handled nearly as well. Like speech, songs are human voice carrying plenty of semantics; for example, the lyrics capture the core idea of a song. However, lyrics are difficult to recognize because singing has much more flexible prosody (pitch, duration, pauses, energy) than speech.

Challenges in Understanding Singing Voice

We treat lyrics transcription as a harder version of automatic speech recognition (ASR) and take the following characteristics of songs into account:

  • Word Repetition, Meaningless Words & Pauses

    "Lady Gaga - Bad Romance"

  • Varying Pitch

    "Bruno Mars - The Lazy Song"

    • 324.5Hz
    • 374.2Hz
    • 423.6Hz
  • Prolonged Phoneme Duration

    "Charlie Puth - See You Again" : the phoneme "ao" in the word "long"

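To put the pitch values listed under "Varying Pitch" in musical terms, the interval between two frequencies can be expressed in semitones. The following is a minimal sketch, assuming the listed values are fundamental frequencies in Hz; the helper name `semitone_interval` is ours and does not come from the paper.

```python
import math

def semitone_interval(f_low_hz, f_high_hz):
    """Interval from f_low_hz to f_high_hz in equal-tempered semitones."""
    if f_low_hz <= 0 or f_high_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f_high_hz / f_low_hz)

# Pitch values quoted in the list above (assumed to be in Hz).
pitches_hz = [324.5, 374.2, 423.6]
base = pitches_hz[0]
for f in pitches_hz[1:]:
    print(f"{base} Hz -> {f} Hz: {semitone_interval(base, f):.2f} semitones")
```

By this measure the three values span about four and a half semitones, a far wider range than a speaker typically covers within a few words.
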
Method

We address the above three characteristics by:

  • training a language model on lyrics
  • performing feature adaptation for every segment
  • extending the lexicon and increasing the HMM self-loop probabilities
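
The last item can be illustrated with a small sketch. It assumes that the lexicon is extended by adding pronunciation variants in which each vowel is repeated, and it uses the fact that an HMM state with self-loop probability p is occupied for 1 / (1 - p) frames on average. The vowel set, the repeat limit, and both function names are hypothetical choices made for illustration, not details taken from the paper.

```python
# ARPAbet-style vowel symbols, lower-cased to match the "ao" example above.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

def extend_pronunciation(phones, max_repeat=3):
    """Return variants in which every vowel is repeated 1..max_repeat times."""
    variants = []
    for n in range(1, max_repeat + 1):
        variant = []
        for phone in phones:
            # Only vowels are stretched; consonants are kept as they are.
            variant.extend([phone] * (n if phone in VOWELS else 1))
        variants.append(variant)
    return variants

def expected_state_frames(self_loop_prob):
    """Mean number of frames spent in a state with the given self-loop probability."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self-loop probability must be in [0, 1)")
    return 1.0 / (1.0 - self_loop_prob)

# Lexicon entries for "long" with the vowel "ao" repeated up to three times.
for variant in extend_pronunciation(["l", "ao", "ng"]):
    print("long", " ".join(variant))

# Raising the self-loop probability lengthens the expected stay in a state.
print(expected_state_frames(0.5), expected_state_frames(0.9))
```

Both mechanisms let the model absorb a sustained vowel: the extra lexicon entries add more states for it, while a larger self-loop probability lets each state last longer.
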

Combining these methods, we trained both GMM-HMM and deep learning models. The best performance came from a TDNN-LSTM trained with 3-fold speed perturbation.
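
Speed perturbation augments the training data by resampling each utterance at several speed factors. The sketch below assumes that "3-fold" refers to the commonly used factors 0.9, 1.0, and 1.1 (the post does not list them), and it uses plain linear interpolation instead of a proper resampling filter, so it illustrates the idea rather than the actual pipeline.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so that it plays `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring input samples.
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out

# Assumed 3-fold setup: one slower copy, the original, and one faster copy.
waveform = [float(x) for x in range(100)]     # stand-in for real audio samples
for factor in (0.9, 1.0, 1.1):
    print(factor, len(speed_perturb(waveform, factor)))
```

Because the waveform is resampled, both duration and pitch shift with the factor.
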

Check the full results in the paper [link].

Demo

Good Results
Reference Hypothesis Original Audio

REF: WHAT DOESN'T KILL YOU MAKES YOU STRONGER STAND A LITTLE TALLER DOESN'T MEAN I'M LONELY WHEN I'M ALONE WHAT DOESN'T KILL YOU MAKES A FIGHTER FOOTSTEPS EVEN LIGHTER DOESN'T MEAN I'M OVER CAUSE YOU'RE GONE WHAT DOESN'T KILL YOU MAKES YOU STRONGER STRONGER JUST ME MYSELF AND I WHAT DOESN'T KILL YOU MAKES YOU STRONGER STAND A LITTLE TALLER DOESN'T MEAN I'M LONELY WHEN I'M ALONE

HYP: WHAT DOESN'T KILL YOU MAKES YOU STRONGER THAN A LITTLE TALLER DOESN'T MEAN I'M ONLY WHEN I'M LONG GONE KILL YOU MAKES A FIERY AND SNUFFY THE LIGHTER DOESN'T MEAN I'M OVER CAUSE YOUR GONE FUCKING KILL YOU MAKES YOU STRONGER SOLD OUR JUST ME MYSELF AND I WHAT DOESN'T KILL YOU MAKES YOU STRONGER THAN A LITTLE TALLER DOESN'T MEAN TO LOVE ME WHEN I'M ALONE

REF: YEAH MAN YOU SAY YOU'RE SEARCHING FOR SOMEBODY THAT WILL TAKE YOU OUT AND DO YOU RIGHT WELL COME HERE BABY AND LET DADDY SHOW YOU WHAT IT FEELS LIKE YOU KNOW ALL YOU GOTTA DO IS TELL ME WHAT YOU'RE SIPPING ON AND I PROMISE THAT I'M GONNA KEEP IT COMING ALL NIGHT LONG

HYP: YEAH MAN SHE'S SEARCHING FOR SOMEBODY THAT _ YOU _ _ DO UNTO YOU ALONE AND BABY I LIKE THAT YOU EVER FEEL US ARE YOU KNOW YOU GOTTA DO IS TO BE WHAT YOU SIPPING KNOW YOU ARE THE AND A PROMISE THAT I'M GONNA KEEP IT COMING ON AND ON
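
A REF/HYP pair like the ones above can be scored with the word error rate (WER), the standard ASR metric. This post does not say how its results were scored, so the following is only a minimal sketch. It assumes that "_" in a hypothesis marks a deleted word and should be ignored; the function names are ours.

```python
def word_errors(ref_words, hyp_words):
    """Minimum number of substitutions, insertions, and deletions (Levenshtein)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1          # reference word missing from hypothesis
            insertion = curr[j - 1] + 1     # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate of a hypothesis string against a reference string."""
    ref_words = ref.split()
    # Assumption: "_" is an alignment placeholder, not a recognized word.
    hyp_words = [w for w in hyp.split() if w != "_"]
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hyp_words) / len(ref_words)

# First line of the first example: one substitution (STAND -> THAN) in 11 words.
ref = "WHAT DOESN'T KILL YOU MAKES YOU STRONGER STAND A LITTLE TALLER"
hyp = "WHAT DOESN'T KILL YOU MAKES YOU STRONGER THAN A LITTLE TALLER"
print(f"WER = {wer(ref, hyp):.3f}")
```
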

Noisy and Bad Results
Hypothesis Original Audio

HYP: WE'LL WRITE A I'M YOU MIGHT THINK I'D SIT AND CRY HOLY ONLY LINE BEFORE YOUR EYES

HYP: LIKE MUSLIM TURN TO CRY TO SPILL MY DRINK SOME LITTLE GIRLS I'M SCREAMING THE SILVER SCREEN PROTECTION IF YOU WANT THIS TIME AGAIN THE TELEPHONE KEISHA

Error Analysis

Most errors fall into three categories: falsetto, high pitch with harmony, and prolonged vowels.

  • Falsetto

    HYP: I'M FEELING SEXY AND THE

  • High Pitch & Harmony

    HYP: WHAT DOESN'T KILL YOU MAKES YOU FORGOT SOUL ARE JUST ME MYSELF

  • Prolonged Vowel

    HYP: WHEN THE WHOLE SONG MOAN