LJSpeech


To narrow down the problem further, I trained another model using the LJSpeech dataset but my own text preprocessing pipeline. Annoyingly enough, the model worked. Unfortunately, this result means the dataset itself may be problematic, which is the hardest component to fix. The worst-case scenario would be that Aeneas simply makes too many alignment errors and only hand-labeled data would work.
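For context, here is a minimal sketch of the kind of text normalization such a preprocessing pipeline typically performs (lowercasing, expanding a few abbreviations, whitelisting characters). The function and abbreviation table are illustrative only, not the actual pipeline used for these runs:

```python
import re

# Illustrative abbreviation table; the real pipeline's rules are not shown in this post.
_ABBREVIATIONS = {"mr.": "mister", "mrs.": "missus", "dr.": "doctor"}

def normalize_text(text: str) -> str:
    """Lowercase, expand a few abbreviations, and collapse whitespace,
    while keeping punctuation so the model can learn prosody cues."""
    text = text.lower()
    for abbrev, expansion in _ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Drop characters outside a small whitelist (letters, digits, basic punctuation).
    text = re.sub(r"[^a-z0-9 ,.!?';:-]", "", text)
    # Collapse any repeated whitespace left over from the removals.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("The  birch canoe slid on the smooth planks."))
# -> "the birch canoe slid on the smooth planks."
```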

The TensorBoard results can be found here. The first training run started overfitting too quickly, so I reduced the learning rate and ran it again. Both runs nevertheless performed well on the Harvard sentences I used for testing, so I will leave hyperparameter tuning until I see some semblance of the model starting to work on the custom dataset.

The audio samples below are from the first run at 20k steps:

The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
Large size in stockings is hard to sell.

I also compared how the model performs with and without different punctuation. It can be hit or miss, but the model does show some sensitivity to the changes, as heard in the audio samples below (a sketch of how this comparison could be scripted follows the samples):

however creating audiobooks can be expensive
however, creating audiobooks can be expensive
however, creating audiobooks can be expensive!
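A comparison like this is easy to script. The sketch below is only illustrative: `synthesize` stands in for whatever inference entry point the trained model exposes and is a hypothetical name, not an actual API from this project.

```python
from typing import Callable

def compare_punctuation(synthesize: Callable[[str], bytes], out_prefix: str) -> None:
    """Run the same sentence through the model with different punctuation
    and write each result to its own WAV file for listening tests."""
    variants = [
        "however creating audiobooks can be expensive",
        "however, creating audiobooks can be expensive",
        "however, creating audiobooks can be expensive!",
    ]
    for i, text in enumerate(variants):
        audio = synthesize(text)  # hypothetical model inference call returning WAV bytes
        with open(f"{out_prefix}_{i}.wav", "wb") as f:
            f.write(audio)
```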