`webrtcvad` and Shorter Sections

March 15, 2021 1 minute read

Cheryl and I took a bit of a break from training models to try to figure out what the problem is. One big factor could be long sentences. While digging around on some forums, we found that other people have run into problems with their custom datasets when longer sentences keep popping up.

Because manually labelling the data is infeasible for this project, not to mention that it takes away from the spirit of a “mostly automatically generated dataset”, I also decided to look up good ways to cut off the silence at the start of audio segments. `webrtcvad` looked like the most promising option, as I was unable to configure Aeneas to do it automatically. I also did not have much luck with other forced alignment programs, so `webrtcvad` it is.
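As a rough illustration of the idea (not the code used in this project), trimming the start of a clip with a voice activity detector could look like the sketch below. The function name, the 30 ms frame size, and the "three voiced frames in a row" rule are all assumptions made for the example. The detector is passed in as a callable so that anything shaped like `webrtcvad`'s `Vad.is_speech(frame, sample_rate)` can be plugged in.

```python
from typing import Callable


def trim_leading_silence(
    pcm: bytes,
    sample_rate: int,
    is_speech: Callable[[bytes, int], bool],
    frame_ms: int = 30,
    min_speech_frames: int = 3,
) -> bytes:
    """Drop everything before the first run of `min_speech_frames` voiced frames.

    `pcm` is raw 16-bit mono PCM. `is_speech(frame, sample_rate)` is any
    callable with the same shape as webrtcvad's `Vad.is_speech`.
    """
    # 16-bit audio means 2 bytes per sample
    frame_bytes = sample_rate * frame_ms // 1000 * 2
    run_start = 0   # byte offset where the current voiced run began
    run_length = 0  # number of consecutive voiced frames seen so far

    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset:offset + frame_bytes]
        if is_speech(frame, sample_rate):
            if run_length == 0:
                run_start = offset
            run_length += 1
            if run_length >= min_speech_frames:
                return pcm[run_start:]
        else:
            run_length = 0  # the run was too short to count; start over

    # No sustained speech found: leave the clip untouched rather than empty it
    return pcm
```

With the real library this would be called as `trim_leading_silence(pcm, 16000, webrtcvad.Vad(2).is_speech)`. `webrtcvad` expects 10, 20, or 30 ms frames of 16-bit mono PCM at 8000, 16000, 32000, or 48000 Hz, and the `Vad` aggressiveness mode runs from 0 to 3. Raising `min_speech_frames` makes the trim more conservative.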

Attempt 1 at removing starting silence and breath intake can be found here. A second try, which filtered out audio segments longer than 10 seconds, can be found here.
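For readers without the linked code, a length filter of this kind might be sketched as follows. This assumes the segments are stored as WAV files; the function names are hypothetical and only the 10 second threshold comes from the post.

```python
import wave
from pathlib import Path
from typing import Iterable, List


def wav_duration_seconds(path: Path) -> float:
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def keep_short_segments(paths: Iterable[Path], max_seconds: float = 10.0) -> List[Path]:
    """Return only the segments whose audio is at most `max_seconds` long."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```

The matching transcript lines would need to be dropped alongside the audio so the two stay aligned.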

Other modifications include:

- added breaking up segments using – in preprocessing
  - made sure to filter out stuttering
- increased minimum length of each segment to 5 words
- manual efforts to remove overly short segments
- filtered single word before comma -> added to next fragment
- previous attempts at filtering . . . found to be faulty -> fixed
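Two of the modifications above, breaking on dashes and enforcing a minimum word count, could be sketched roughly like this. It is an illustrative guess rather than the project's actual preprocessing: the function names are hypothetical, and folding a short fragment into the one after it is an assumption about how the minimum length was enforced.

```python
import re
from typing import List


def split_on_dashes(sentence: str) -> List[str]:
    """Split a sentence at spaced en dashes."""
    parts = re.split(r"\s+–\s+", sentence)
    return [p.strip() for p in parts if p.strip()]


def merge_short_fragments(fragments: List[str], min_words: int = 5) -> List[str]:
    """Fold any fragment with fewer than `min_words` words into its neighbour."""
    merged: List[str] = []
    carry = ""  # short text waiting to be prepended to the next fragment
    for fragment in fragments:
        candidate = f"{carry} {fragment}".strip()
        if len(candidate.split()) < min_words:
            carry = candidate
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        # A trailing short fragment has no successor, so attach it backwards
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged
```

The same merge step covers the single-word-before-a-comma case: a fragment like `However,` is too short on its own, so it is carried into the fragment that follows.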

Notes:

- some abbreviations are still treated as the end of a sentence
  - this comes from nltk punkt; I can’t really do much beyond manual fixing when I see it
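One possible way to cut down on the manual fixing, sketched here purely as an assumption and not something used in this project, is a small post-processing pass over the tokenizer output that glues a sentence back onto the next one whenever it ends in a known abbreviation. The abbreviation list is hypothetical and would have to be grown by hand as bad splits turn up.

```python
from typing import Iterable, List

# Hypothetical abbreviation list; extend it as bad splits turn up.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "vs."}


def rejoin_after_abbreviations(
    sentences: Iterable[str],
    abbreviations: set = ABBREVIATIONS,
) -> List[str]:
    """Glue a sentence onto the next one when it ends in a known abbreviation."""
    fixed: List[str] = []
    pending = ""  # text held back because it ended in an abbreviation
    for sentence in sentences:
        current = f"{pending} {sentence}".strip()
        words = current.split()
        if words and words[-1].lower() in abbreviations:
            pending = current  # probably a false sentence boundary
        else:
            fixed.append(current)
            pending = ""
    if pending:
        fixed.append(pending)
    return fixed
```

This will wrongly merge sentences that really do end in one of the listed abbreviations, so it trades one kind of manual check for another.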

Audio Samples

The first run was actually very weird. I ran it through the Harvard sentences again, and while most sentences remained entirely silent, it would sometimes randomly start speaking from the middle of a sentence. I have included those examples below:

Glue the sheet to the dark blue background.

however creating audiobooks can be expensive.

The training run that filtered out sentences longer than 10 seconds had the same issue, but different sentences produced sound this time. An example is found below:

The birch canoe slid on the smooth planks.

It’s possible that I have been too aggressive in removing silences.


Emily Zeng
