webrtcvad and Shorter Sections
Cheryl and I spent some time taking a break from training models to try to figure out what is going wrong. One big possibility is an issue with long sentences. Digging around on some forums, we found that other people have had problems with their custom datasets when longer sentences keep popping up.
Because manually labelling data is entirely unfeasible for this project, not to mention takes away from the spirit of a "mostly automatically generated dataset", I decided to also look up good ways to cut off the silence at the start of audio segments. webrtcvad looked like the most promising option, as I was unable to configure Aeneas to do it automatically. I also did not have much luck with other forced alignment programs, so webrtcvad it is.
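The core idea behind the trimming is simple: walk the audio in fixed-size frames and drop everything before the first frame the VAD flags as speech. The sketch below shows that loop with a crude energy gate standing in for the actual classifier (webrtcvad's real call is `Vad.is_speech(frame, sample_rate)` on 10/20/30 ms frames of 16-bit mono PCM); the threshold value and function names here are hypothetical, not what my script actually uses.

```python
import struct

SAMPLE_RATE = 16000   # assumed: 16 kHz, 16-bit mono PCM
FRAME_MS = 30         # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per sample

def frame_is_speech(frame: bytes, threshold: int = 500) -> bool:
    """Stand-in for webrtcvad.Vad.is_speech(frame, SAMPLE_RATE):
    flag a frame as speech when its mean absolute amplitude
    exceeds a fixed (hypothetical) threshold."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold

def trim_leading_silence(pcm: bytes) -> bytes:
    """Drop whole frames from the start until the first speech frame."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        if frame_is_speech(pcm[start:start + FRAME_BYTES]):
            return pcm[start:]
    return b""  # no speech found anywhere in the clip
```

Swapping `frame_is_speech` for a real `webrtcvad.Vad(3)` instance keeps the same loop; only the per-frame decision changes. A breath intake is quieter than speech, which is presumably why the VAD can skip past it where a naive silence cut cannot.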
Attempt 1 at removing starting silence and breath intake can be found here. A second try, which filtered out audio segments longer than 10 seconds, can be found here.
Other modifications include:
- added breaking up segments using – in preprocessing
- made sure to filter out stuttering
- increased minimum length of each segment to 5 words
- manual efforts to remove overly short segments
- filtered single word before comma → added to next fragment
- previous attempts at filtering . . . found to be faulty → fixed
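Two of the filters above (folding a lone word before a comma into the next fragment, and the 5-word minimum) can be sketched together in a few lines. This is a simplified stand-in for the real preprocessing, with hypothetical names, assuming segments come from splitting a sentence on commas:

```python
def comma_segments(sentence: str, min_words: int = 5) -> list[str]:
    """Split a sentence on commas, fold any single-word fragment into
    the fragment that follows it, then drop segments shorter than
    min_words words (5, per the filters above)."""
    segs: list[str] = []
    carry = ""
    for frag in (p.strip() for p in sentence.split(",")):
        if carry:  # prepend the held single word to this fragment
            frag = f"{carry} {frag}".strip()
            carry = ""
        if len(frag.split()) == 1:
            carry = frag  # lone word: attach it to the next fragment
        else:
            segs.append(frag)
    if carry:  # trailing lone word with nothing to attach to
        segs.append(carry)
    return [s for s in segs if len(s.split()) >= min_words]
```

For example, `comma_segments("Well, the quick brown fox jumps over the lazy dog")` keeps one segment with "Well" glued to the front, instead of emitting "Well" as its own useless training example.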
notes:
- some abbreviations still treated as the end of a sentence
- this is from nltk punkt, can't really do much beyond manual fixing when I see it
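One small mitigation short of fully manual fixing is a post-pass that re-joins a sentence with its successor whenever it ends in a known problem abbreviation. The abbreviation set below is purely illustrative; it is not a list punkt actually uses:

```python
# Hypothetical abbreviations that punkt might mistake for sentence ends.
ABBREVIATIONS = {"approx.", "fig.", "no.", "vs."}

def rejoin_abbrev_splits(sentences: list[str]) -> list[str]:
    """Merge a sentence back with the next one when it ends in an
    abbreviation that was mistaken for a sentence boundary."""
    out: list[str] = []
    for sent in sentences:
        if out and out[-1].split()[-1].lower() in ABBREVIATIONS:
            out[-1] = out[-1] + " " + sent  # undo the false split
        else:
            out.append(sent)
    return out
```

nltk also lets you seed the tokenizer itself: `PunktSentenceTokenizer` accepts a `PunktParameters` object whose `abbrev_types` set tells it which tokens to treat as abbreviations, which avoids the false splits up front rather than patching them afterwards.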
Audio Samples
The first run was actually very weird. I ran it through the Harvard sentences again, and while most sentences remained entirely silent, the model would sometimes start speaking from the middle of a sentence. I have included those examples below:
The training run that filtered out sentences longer than 10 seconds had the same issue, but different sentences produced sound this time. An example is found below:
It’s possible that I have been too aggressive in removing silences.