To narrow down the problem further, I trained another model using the LJSpeech dataset but with my own text preprocessing pipeline. Annoyingly, the model worked. This result suggests the dataset itself may be the problem, which is the hardest component to fix. The worst-case scenario would be that Aeneas simply makes too many alignment errors and only hand-labeled data would work.
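As a sanity check on that theory, a quick script can flag alignments whose timing looks implausible. This is only a sketch, assuming Aeneas was run to produce JSON sync maps; the duration and speaking-rate thresholds are made-up starting points, not tuned values:

```python
import json

def flag_suspect_fragments(syncmap_path, min_dur=0.3, max_chars_per_sec=30.0):
    """Flag Aeneas fragments whose duration looks implausible for their text."""
    with open(syncmap_path) as f:
        fragments = json.load(f)["fragments"]
    suspects = []
    for frag in fragments:
        duration = float(frag["end"]) - float(frag["begin"])
        text = " ".join(frag["lines"]).strip()
        # Zero-length fragments or absurd speaking rates usually mean a bad alignment.
        if duration < min_dur or (duration > 0 and len(text) / duration > max_chars_per_sec):
            suspects.append((frag["id"], duration, text))
    return suspects

for frag_id, dur, text in flag_suspect_fragments("syncmap.json"):
    print(f"{frag_id}: {dur:.2f}s  {text[:60]}")
```

Listening to a handful of the flagged clips would give a rough sense of how bad the alignment error rate actually is.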
The tensorboard results can be found here. The first training run started overfitting too quickly, so I reduced the learning rate and ran it again. However, both runs performed well on the Harvard sentences I used for testing, so I will postpone hyperparameter tuning until the model shows some sign of working on the custom dataset.
The audio samples below are from the first run at 20k steps:
I also compared how the model performs with and without different punctuation. The results can be hit or miss, but the model does show some sensitivity to the changes, as heard in the audio samples below:
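For anyone wanting to reproduce the comparison, this is roughly how it can be scripted. The `synthesize` helper is a hypothetical stand-in for the actual inference call, not part of the real pipeline:

```python
def synthesize(text: str, out_path: str) -> None:
    # Hypothetical stand-in: the real version would run the trained
    # model's inference step and write a wav file to out_path.
    print(f"would synthesize {out_path!r}: {text}")

sentence = "The birch canoe slid on the smooth planks"  # Harvard sentence 1.1
variants = {
    "period": sentence + ".",
    "question": sentence + "?",
    "exclaim": sentence + "!",
    "comma": sentence.replace("slid", "slid,") + ".",
    "bare": sentence,
}
for name, text in variants.items():
    synthesize(text, f"punct_{name}.wav")
```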
Tensorboard data can be found here. This version used even less aggressive silence filtering (aggressiveness set to 0 this time) as well as data fro...
I modified the aggressiveness from 3 to 1. Training history can be found here for a run that used the LJSpeech model I trained, and here for a model trained ...
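For context, the aggressiveness here is the mode parameter of the voice activity detector doing the silence filtering. A minimal sketch, assuming the `webrtcvad` package is behind it (mode 0 keeps the most audio, mode 3 trims the most borderline frames):

```python
import webrtcvad

SAMPLE_RATE = 16000  # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30        # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def speech_frames(pcm: bytes, aggressiveness: int = 1):
    """Yield only the frames the VAD classifies as speech.

    aggressiveness ranges from 0 (least aggressive, keeps more audio)
    to 3 (most aggressive, filters out more borderline frames).
    """
    vad = webrtcvad.Vad(aggressiveness)
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```

Lowering the mode means fewer quiet-but-voiced frames get thrown away, at the cost of keeping more genuine silence.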
Cheryl and I took a break from training models to try to figure out what the issue is. One big possibility is a problem with long sentenc...
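If long sentences do turn out to be the problem, one cheap experiment would be to cap utterance length in the training metadata. A minimal sketch, assuming LJSpeech-style pipe-separated metadata; the 120-character cap is a made-up starting point to tune against where the model starts failing:

```python
MAX_CHARS = 120  # hypothetical cap, not a tuned value

with open("metadata.csv", encoding="utf-8") as src, \
     open("metadata_short.csv", "w", encoding="utf-8") as dst:
    for line in src:
        # LJSpeech-style rows: file_id|raw text|normalized text
        text = line.rstrip("\n").split("|")[-1]
        if len(text) <= MAX_CHARS:
            dst.write(line)
```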
I decided to try yet another dataset, this time the Blizzard dataset. The tensorboard results can be seen here. With this dataset, it is again successful at ...