Transcribing Lyrics from Commercial Song Audio: the First Step towards Singing Content Processing

Che-Ping Tsai*, Yi-Lin Tuan*, and Lin-shan Lee (*co-first authors)

BibTeX

@inproceedings{TsaiTuan2018Lyrics,
  title={Transcribing Lyrics from Commercial Song Audio: the First Step towards Singing Content Processing},
  author={Che-Ping Tsai and Yi-Lin Tuan and Lin-shan Lee},
  booktitle={Proceedings of ICASSP},
  year={2018}
}

Why Transcribe Lyrics?

Spoken content (speech) can now be retrieved, browsed, summarized, and comprehended successfully, but songs, which also appear in much multimedia content, have not been handled nearly as well. Like speech, songs are human voice carrying rich semantics; for example, the core idea of a song can be captured from its lyrics. However, lyrics are difficult to recognize because singing has much more flexible prosody (pitch, duration, pauses, energy) than speech.

Datasets

How to Transcribe Lyrics?

We treat lyrics transcription as a more difficult version of automatic speech recognition (ASR), while taking the characteristics of songs into account:

These three characteristics can be tackled, respectively, by:

Combining the above methods, we trained both GMM-HMM models and deep learning models. The best performance was obtained with a TDNN-LSTM trained with 3-fold speed perturbation.
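
For readers unfamiliar with speed perturbation, the idea can be sketched roughly as follows. This is a minimal illustration, not the code used in the paper: the function names `speed_perturb` and `three_fold` are hypothetical, the factors 0.9, 1.0, and 1.1 are the values commonly used in ASR recipes and are assumed here (this page does not state which factors were used), and plain linear interpolation stands in for a proper resampler.

```python
import math


def speed_perturb(samples, factor):
    """Return `samples` resampled so that it plays `factor` times faster.

    Linear interpolation is used for simplicity, so both the duration and
    the pitch change (factor > 1 shortens the signal and raises the pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(round(n_in / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                   # fractional index into the input
        lo = min(int(pos), n_in - 1)       # left neighbour, clamped to range
        hi = min(lo + 1, n_in - 1)         # right neighbour, clamped to range
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def three_fold(samples, factors=(0.9, 1.0, 1.1)):
    """Build one copy of the signal per factor (factors are an assumption)."""
    return {f: speed_perturb(samples, f) for f in factors}


# Example: a one-second 440 Hz tone at an assumed 16 kHz sampling rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
augmented = three_fold(tone)
```

Training on all three copies triples the amount of training audio; the slowed-down and sped-up versions expose the acoustic model to tempo and pitch variation it would not otherwise see.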

See our paper for the full results.

Demo