coqui_tts is text-to-speech software that comes in several "flavors", including an extension for the OobaBooga text generation WebUI.

This post is going to look very geeky, but it is actually pretty cool to be able to listen to AI-generated voices reading AI-generated answers...


If you don't already have coqui_tts as an extension in OobaBooga (see the "Session" menu), you should first use the "update_windows.bat" script (for Windows) to update your interface.

If it is still not available after this, please follow the instructions at this URL:
https://github.com/kanttouchthis/text_generation_webui_xtts

Note that in this case, the extension will likely be called "text_generation_webui_xtts".

If it is still a problem, just copy this file structure:
https://github.com/oobabooga/text-generation-webui/tree/main/extensions/coqui_tts
into your "extensions" directory.


How to install Coqui TTS?

So let's suppose you have downloaded all the starting files.
You then need to run "cmd_windows.bat" (or the equivalent for your system) to open the OobaBooga environment.

From there, do:

pip install -r extensions\coqui_tts\requirements.txt

Of course, adapt the path if the extension directory is not named "coqui_tts" on your machine.
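
If you are not sure which directory name you ended up with, here is a minimal Python sketch that looks for the requirements file under the two names mentioned in this post and builds the matching pip command. The function names (find_requirements, pip_command) and the list of candidate names are my own assumptions for illustration; they are not part of OobaBooga or of the extension.

```python
from pathlib import Path
from typing import List, Optional, Sequence

# Directory names mentioned in this post; your install may use another one.
CANDIDATE_NAMES = ("coqui_tts", "text_generation_webui_xtts")


def find_requirements(webui_root: Path,
                      names: Sequence[str] = CANDIDATE_NAMES) -> Optional[Path]:
    """Return the first extensions/<name>/requirements.txt that exists, else None."""
    for name in names:
        candidate = webui_root / "extensions" / name / "requirements.txt"
        if candidate.is_file():
            return candidate
    return None


def pip_command(requirements: Path) -> List[str]:
    """Build the same 'pip install -r <path>' command as above, as an argument list."""
    return ["pip", "install", "-r", str(requirements)]
```

You would call find_requirements(Path(".")) from the text-generation-webui directory, inside the environment opened by "cmd_windows.bat", and then run the resulting command yourself.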

Then do:
start_windows.bat

Check in the "Session" menu that the extension is enabled.

If you open OobaBooga again another day, "coqui_tts" doesn't seem to stay enabled.
So you may have to enable it again, and then restart the interface with the button:
"Apply flags/extensions and restart".

At some point you have to answer "yes" to accept the terms of use of the extension.

Don't forget to select and load a model, from the "model" menu before you test the extension!

Once the extension is loaded and the model is loaded too, instead of getting written answers, you will actually get spoken answers from OobaBooga...
Now, when I test the results that I get with this extension, I am honestly amazed.

Because, besides the model that the extension downloads the first time, the voices are actually derived from very basic WAV files (reference samples that the model imitates) located in the directory:
text-generation-webui\extensions\coqui_tts\voices

You can take any WAV file in the correct format and emulate the voice it contains.
I tried with WAV files larger than 100 MB without any problem.
The only requirement seems to be that the WAV file uses a sample rate of 22050 Hz.
It is also advisable to remove any noise that is not related to the voice of the character. I also warn you that you may get issues if the voice in your sample speaks very fast.
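
To check a file before dropping it into the "voices" directory, here is a small sketch using Python's standard wave module. The helper names (wav_info, has_expected_rate) are mine, and the 22050 value is simply the rate that worked for me, not something I found documented. Note that the wave module only reads plain PCM WAV files, so it may refuse more exotic ones.

```python
import wave

# Sample rate that worked for me with the extension (an observation, not a documented rule).
EXPECTED_RATE = 22050


def wav_info(path):
    """Read basic properties from the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def has_expected_rate(path, expected=EXPECTED_RATE):
    """True if the file's sample rate matches the expected one."""
    return wav_info(path)["sample_rate"] == expected
```

For example, has_expected_rate("my_voice.wav") (a made-up file name) tells you whether the file needs to be resampled first.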

Personally, I used the audio of a YouTube video from my favorite creator, extracted with this command line, which you have to adapt:


yt-dlp --extract-audio --audio-format wav [video url] -o - | ffmpeg -i pipe: -acodec pcm_u8 -ar 22050 [name of the wav file]
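
If you prefer to script this, the same pipe can be reproduced from Python with the standard subprocess module. This is only a sketch: it assumes yt-dlp and ffmpeg are installed and on your PATH, the function names (build_commands, download_voice_sample) are my own, and the options are copied as-is from the command line above.

```python
import subprocess


def build_commands(video_url, wav_name, sample_rate=22050):
    """Return the two argument lists of the pipe: yt-dlp first, then ffmpeg."""
    downloader = ["yt-dlp", "--extract-audio", "--audio-format", "wav",
                  video_url, "-o", "-"]
    converter = ["ffmpeg", "-i", "pipe:", "-acodec", "pcm_u8",
                 "-ar", str(sample_rate), wav_name]
    return downloader, converter


def download_voice_sample(video_url, wav_name):
    """Run yt-dlp and feed its standard output to ffmpeg, like the shell pipe."""
    downloader, converter = build_commands(video_url, wav_name)
    dl = subprocess.Popen(downloader, stdout=subprocess.PIPE)
    conv = subprocess.Popen(converter, stdin=dl.stdout)
    dl.stdout.close()  # so yt-dlp notices if ffmpeg exits early
    conv_code = conv.wait()
    dl_code = dl.wait()
    return dl_code == 0 and conv_code == 0
```

Passing the arguments as lists avoids quoting problems with URLs that contain characters like "&".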

Then I cleaned the file with this free online service (which, so far, requires no registration):
https://www.lalal.ai/

I was able to reproduce the voice of the YouTube creator quite decently.
However, I also tried to reproduce some ASMR content, and I wasn't able to reproduce the tone of the voice very well... but the voice itself was OK...