Coqui_tts is a Text-To-Speech (TTS) project.
Today I don't want to make it work with the "OobaBooga text generation web UI", but to use coqui_tts for some other coding projects.
With it, I would like to have it read the content of a text file, or just say something, so I can hear it.
Why a tutorial for OobaBooga users?
Because in a previous tutorial (just below), I already installed everything I need, and it works well with the OobaBooga web UI.
So if you have done the same thing yourself and have coqui_tts installed as an extension, this will save you some time on the installation and some space on your hard drive.
So, today, I don't want to create a new environment and download everything AGAIN from scratch to install coqui_tts, but I'll certainly do that in another tutorial (the link is here).
How to install Coqui_tts if you already have it as an extension for Oobabooga?
So, since I am using a Windows system, I first go to the "text-generation-webui" directory of "OobaBooga", then I click on "cmd_windows.bat".
Then I follow the installation instructions from https://github.com/coqui-ai/TTS/tree/dev#installation
and run:
pip install TTS
I expected it not to update anything, but it seems that it still needed to downgrade pandas to version 1.5.3 after uninstalling version 2.1.4. This doesn't seem to affect the way OobaBooga works.
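If you want to double-check that the installation went through, this quick snippet should print the installed version (assuming the TTS package exposes a __version__ attribute, which it does in the versions I have seen):
***
import TTS
# Print the installed Coqui TTS version to confirm that "pip install TTS" worked
print(TTS.__version__)
***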
Now let's test if I can run the snippet of code given by Coqui_TTS!
source: https://github.com/coqui-ai/TTS/tree/dev#installation
***
import torch
from TTS.api import TTS
# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
# List available TTS models
print(TTS().list_models())
# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Run TTS
# Since this model is multilingual voice cloning model, we must set the target speaker_wav and language
# Text to speech list of amplitude values as output
wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en")
# Text to speech to a file
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
***
It returns some errors, but let's look at that later.
I see that it first loads a TTS model, and we need to know whether we can load a different one.
Here is how the models are loaded.
Loading a TTS model:
The one from the source code above is actually located in this directory:
C:\Users\[username]\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2
You can recognize that path in this line:
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
If you try this other piece of code:
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False).to(device)
You will see that it explicitly says that it will download the model here:
> Downloading model to C:\Users\[username]\AppData\Local\tts\tts_models--de--thorsten--tacotron2-DDC
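If you want to predict where a given model will land on your disk, the pattern seems to be simply the model name with "/" replaced by "--" under the local tts cache. Here is a minimal sketch of that mapping (the cache location is an assumption based on the download paths above, and it is Windows-specific):
***
import os

def model_cache_dir(model_name):
    # Observed pattern: "tts_models/de/thorsten/tacotron2-DDC" is stored as
    # "tts_models--de--thorsten--tacotron2-DDC" in the local tts cache
    folder = model_name.replace("/", "--")
    return os.path.join(os.environ["LOCALAPPDATA"], "tts", folder)

print(model_cache_dir("tts_models/multilingual/multi-dataset/xtts_v2"))
***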
The command line below is from here: https://github.com/coqui-ai/TTS
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
It explains more about the syntax of the model names.
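For example, for the XTTS v2 model used above, that gives:
tts --model_info_by_name "tts_models/multilingual/multi-dataset/xtts_v2"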
So, let's say you want a list of all the available models:
You use this command line:
tts --list_models
And you copy the line that you want into your code.
For everything that is command-line related, refer to the bottom of this page: https://github.com/coqui-ai/TTS
However, let me add how to mimic the voice of a wav file using just a command line with Coqui_tts:
tts --text "Get to the choppa!" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --language_idx en --out_path speech.wav --speaker_wav [path to]\oobabooga-windows\text-generation-webui\extensions\coqui_tts\voices\arnold.wav
Explanation of the command line:
tts: the name of the program
--text "Get to the choppa!": the text to read
--model_name "tts_models/multilingual/multi-dataset/xtts_v2": it SEEMS to work only with multilingual models
--language_idx en: needed for multilingual models
--out_path speech.wav: the output path
--speaker_wav [path to]\oobabooga-windows\text-generation-webui\extensions\coqui_tts\voices\arnold.wav: the wav file to be used as a reference
If the path is wrong, you will get this error: "(wrong path of the wav file): System error."
The result is not as iconic as the real Arnold Schwarzenegger, but you can recognize him...
Let's circle back to our code above and correct it to make it work:
*****
import torch
from TTS.api import TTS
# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
# List available TTS models
print(TTS().list_models())
# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Run TTS
# Since this model is multilingual voice cloning model, we must set the target speaker_wav and language
# Text to speech list of amplitude values as output
wav = tts.tts(text="Get to the choppa!", speaker_wav="arnold.wav", language="en")
# Text to speech to a file
tts.tts_to_file(text="Get to the choppa!", speaker_wav="arnold.wav", language="en", file_path="output123.wav")
*****
Make a copy of the "arnold.wav" file that is provided with the coqui_tts extension (path: text-generation-webui\extensions\coqui_tts\voices\arnold.wav) in the same directory as the Python file with the source code.
Again, make sure to run the code from the environment where coqui_tts is installed (for this tutorial, that was the one used by the OobaBooga interface).
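As a small sanity check (my own addition, not part of the Coqui snippet), you can verify that the reference wav is actually found before synthesizing, instead of waiting for the "System error" mentioned above:
***
import os

speaker_wav = "arnold.wav"  # the local copy described above
# Fail early with a clear message if the reference voice is missing
if not os.path.isfile(speaker_wav):
    raise FileNotFoundError("Reference voice not found: " + os.path.abspath(speaker_wav))
***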
It seems like the sound is better if you keep this line:
wav = tts.tts(text="Get to the choppa!", speaker_wav="arnold.wav", language="en")
Now let's modify the code a little bit to read a text file, filetoread.txt, which has to be in the same directory as the script below:
***
import torch
from TTS.api import TTS
def read_file(filename):
    with open(filename, 'r') as f:
        data = f.read()
    return data
text_string = read_file('filetoread.txt')
# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
# List available TTS models
print(TTS().list_models())
# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Run TTS
# Since this model is multilingual voice cloning model, we must set the target speaker_wav and language
# Text to speech list of amplitude values as output
wav = tts.tts(text=text_string, speaker_wav="arnold.wav", language="en")
# Text to speech to a file
tts.tts_to_file(text=text_string, speaker_wav="arnold.wav", language="en", file_path="output123.wav")
***
The problem with this source code is that it is limited to 250 characters in English...
... but when I tried to read a page from Yahoo News, it returned a 6'30'' long wav file.
While the text was not perfectly read, the result was pretty convincing and pretty much usable.
In case you need to read the text 250 characters at a time, you can use this code to merge the resulting wav files:
***
import wave

def append_wav(input_file1, input_file2, output_file):
    # Read the parameters and all frames of both input files into memory first
    infiles = [input_file1, input_file2]
    data = []
    for infile in infiles:
        w = wave.open(infile, 'rb')
        data.append([w.getparams(), w.readframes(w.getnframes())])
        w.close()
    # Then write them back to back, reusing the parameters of the first file
    out = wave.open(output_file, 'wb')
    out.setparams(data[0][0])
    for i in range(len(data)):
        out.writeframes(data[i][1])
    out.close()

append_wav("arnold.wav", "arnold.wav", "aaa.wav")
***
It saves arnold.wav followed by arnold.wav into aaa.wav.
(adapted from: https://stackoverflow.com/questions/2890703/how-to-join-two-wav-files-using-python)
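To tie the two parts together, here is a minimal sketch (my own, with a naive whitespace splitter) that cuts a long text into chunks of at most 250 characters, synthesizes each chunk to its own file, and merges everything with append_wav. It assumes the read_file and append_wav functions from the snippets above are in the same script:
***
import torch
from TTS.api import TTS

def split_text(text, limit=250):
    # Naive splitter: cut on whitespace so that no chunk exceeds the limit
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + len(word) + 1 > limit:
            chunks.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        chunks.append(current)
    return chunks

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Synthesize each chunk to its own wav file
parts = []
for i, chunk in enumerate(split_text(read_file('filetoread.txt'))):
    part = "part" + str(i) + ".wav"
    tts.tts_to_file(text=chunk, speaker_wav="arnold.wav", language="en", file_path=part)
    parts.append(part)

# Merge everything into merged.wav, two files at a time.
# append_wav reads both inputs fully before writing, so reusing
# "merged.wav" as both input and output is safe here.
merged = parts[0]
for part in parts[1:]:
    append_wav(merged, part, "merged.wav")
    merged = "merged.wav"
***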
So far, I haven't found a way to change the pace of the speech with this voice.
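A crude post-processing workaround (my own idea, not a Coqui feature) is to rewrite the wav file with a higher or lower frame rate using the standard wave module; be warned that this shifts the pitch too, so Arnold will sound less like Arnold:
***
import wave

def change_speed(input_file, output_file, factor):
    # factor > 1.0 speeds the audio up, factor < 1.0 slows it down.
    # Warning: changing the frame rate also shifts the pitch.
    with wave.open(input_file, 'rb') as w:
        params = w.getparams()
        frames = w.readframes(w.getnframes())
    out = wave.open(output_file, 'wb')
    out.setparams(params)
    out.setframerate(int(params.framerate * factor))
    out.writeframes(frames)
    out.close()

change_speed("output123.wav", "faster.wav", 1.25)
***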