Generative AI can be unpredictable at times, because the output depends on how the AI interprets the input. We have tried to minimize this unpredictability as much as possible, and we keep adding features and improvements that make the output more predictable and controllable. However, there are still a few things you need to be mindful of, and this applies to all generative AI.

In this section, we will go through some of the issues we’ve encountered and what has been reported by users. Just because they are mentioned here doesn’t mean that you will experience these issues, and in many cases, there are things you can do to prevent them from happening.

Multilingual v2

The multilingual v2 model was a huge leap forward in accuracy, predictability and consistency compared to the experimental multilingual v1 model. We hope to achieve a similar improvement once we release our next iteration of the models. The multilingual v1 model suffered from a lot of issues and was never released outside of its experimental phase. You can read more about those issues below, but the multilingual v2 model seems to have solved most of them.

However, there are still a few issues that we have observed and heard reports of from our users.

Inconsistency

Although the v2 model is a significant step forward compared to the v1 model, we are still dealing with AI, which can be unpredictable at times. We are, of course, continuously working on reducing this unpredictability at the model level, but there are already things you can do to minimize it.

We’ve heard reports from users of inconsistency between generations, so takes don’t always fit together perfectly. This issue is a lot less prominent in the multilingual v2 model than in the multilingual v1 model. It also seems to be even less of an issue with Projects, as that feature was made specifically for long-form content and is a bit more advanced than Speech Synthesis in that respect. Still, it can happen.

Generally, what we’ve found is that this happens more with certain voices than others, and more often when the samples for those voices are suboptimal, which usually means they do not follow the guidelines. The current suggestion is to try using a cloned voice or, if you are already using a cloned voice and encounter a lot of variability, to clone the voice again with different samples. We recommend that you read the cloning guide in the documentation, as it goes through a lot of best practices.

Ensuring you use high-quality samples that are very consistent will help tremendously in maintaining consistency.

In general, for consistency, you should use around 1 to 2 minutes of audio when using Instant Voice Cloning. If you want the output to be consistent, this audio should be very consistent across all aspects, such as tonality, performance, accent, and quality. Using more audio than that can make the AI too variable, which can cause inconsistency between generations.

For Professional Voice Cloning, we recommend at the very least 30 minutes of audio and suggest between 2 and 3 hours of audio for the best results. Keep in mind that the audio needs to be consistent throughout for the best result. You can find more in-depth information in the guide specifically about cloning.

If you clone a voice properly, using audio that is consistent throughout, of high quality, and properly recorded without background noise or multiple speakers, and if you follow the guidelines, you should be able to get a very good and consistent clone. It might require a bit of experimenting if you don’t get it right the first time.
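
The duration guidance above can be turned into a quick pre-upload check. The following is a minimal sketch, not part of the product: the function name, the mode strings, and the returned labels are hypothetical, and only the thresholds (1 to 2 minutes for Instant Voice Cloning, at least 30 minutes and ideally 2 to 3 hours for Professional Voice Cloning) come from the text. It assumes you already know the total length of your samples in seconds.

```python
# Thresholds taken from the guidance above; all names here are illustrative.
IVC_RANGE_S = (60, 120)                    # Instant Voice Cloning: about 1 to 2 minutes
PVC_MIN_S = 30 * 60                        # Professional Voice Cloning: at least 30 minutes
PVC_BEST_RANGE_S = (2 * 3600, 3 * 3600)    # Professional Voice Cloning: 2 to 3 hours suggested


def assess_sample_length(total_seconds: float, mode: str) -> str:
    """Compare the total sample duration with the documented guidance."""
    if mode == "instant":
        low, high = IVC_RANGE_S
        if total_seconds < low:
            return "below guideline"
        if total_seconds > high:
            # More audio can make the output more variable.
            return "above guideline"
        return "within guideline"
    if mode == "professional":
        if total_seconds < PVC_MIN_S:
            return "below minimum"
        low, high = PVC_BEST_RANGE_S
        if low <= total_seconds <= high:
            return "within suggested range"
        return "meets minimum"
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `assess_sample_length(90, "instant")` reports that 90 seconds is within the guideline, while `assess_sample_length(20 * 60, "professional")` reports that 20 minutes is below the minimum. The check covers length only; it says nothing about how consistent or clean the audio is.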

Mispronunciation

The multilingual v2 model does have a fairly rare issue where the AI mispronounces certain words, even in English. So far, the trigger seems somewhat arbitrary, but it appears to be voice- and text-dependent. It seems to happen more often with certain voices and texts than others, especially if you use words that also appear in other languages.

The best way to deal with this is to use the Projects feature. The issue is more prevalent across longer sections of text in Speech Synthesis, and Projects seems to minimize it. It will not completely remove the issue, but it should help you avoid it and make it easier to regenerate only the affected section without redoing the whole text.

As with the inconsistency issue above, this issue also seems to be minimized by using a properly cloned voice, cloned in the languages you want the AI to speak.

Both stability and similarity can change how the voice behaves as well as how prominent the artifacts are. Hovering over the ! next to each side of the sliders will reveal more information. The multilingual models may also mispronounce certain numbers and symbols. For instance, 1, 2, 3 might be pronounced as “one,” “two,” “three” in English even when the surrounding text is in another language. Therefore, if you need them to be pronounced in another language, we recommend writing them out fully in that language.
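
Writing numbers out can be done by hand, but for longer texts a small preprocessing step may help. The sketch below is an illustration under stated assumptions, not an official tool: the function name and the word table are hypothetical, Spanish is used only as an example language, and only standalone numbers found in the table are replaced. Larger numbers, decimals, and grammatical agreement are deliberately left untouched and still need manual review.

```python
import re

# Example word table for Spanish digits; supply your own table for other languages.
SPANISH_DIGITS = {
    "0": "cero", "1": "uno", "2": "dos", "3": "tres", "4": "cuatro",
    "5": "cinco", "6": "seis", "7": "siete", "8": "ocho", "9": "nueve",
}

# A numeric token: digits, optionally joined by "." or "," (so "3.5" is one token).
_NUMBER_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def spell_out_numbers(text: str, words: dict[str, str]) -> str:
    """Replace numeric tokens that appear in `words`; leave all others unchanged."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        return words.get(token, token)

    return _NUMBER_TOKEN.sub(replace, text)
```

With this table, `spell_out_numbers("Tengo 3 gatos y 12 perros.", SPANISH_DIGITS)` rewrites the `3` but leaves `12` as it is, because `12` is not in the table.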

Language Switching and Accent Drift

The AI can sometimes switch languages or accents within a single generation, especially if that generation is long, much like the mispronunciation issue above. This is also something we’re working on fixing, hopefully with the next iteration, as there’s not much you can do about it right now. Using Projects together with a proper clone, made from samples in the language you want the AI to speak, should again help mitigate most of this.

The most important thing to remember is that the pre-made (default) voices and the generated voices are all in English and might have an English accent in other languages. This means that they may not have the proper pronunciation and might be more prone to switching languages and accents. The best approach is to clone a voice speaking the language you want the AI to speak, with the accent you want. This will provide the most context for the AI to understand how to perform a passage and should minimize language switching.

There is currently no way to select the language you want the AI to speak. Instead, the way you “select” the language is by writing in the language you want the AI to speak. If you are using a voice that is not native to the language - for example, one of the pre-made voices since they are in English - the AI might have a slight English accent when speaking other languages.

To get optimal results, we recommend cloning a voice that speaks the original language with the correct accent. This is especially important when dealing with languages that are very similar and share a lot of common words. This ensures that the AI has the most information to understand which pronunciation and language it should choose.

Another important point to note is that the AI usually begins with one accent and can gradually shift over longer segments of text, which generally means text longer than a few hundred characters. We highly recommend using the “Projects” feature to avoid many of these issues.

Corrupt Speech

This is a very rare issue, but some users have encountered it. It seems somewhat arbitrary when it happens, but sometimes the AI produces speech that is warped, sounding very muffled and strange, as if some sort of effect had been applied to it. Unfortunately, we do not have any suggestions for preventing it, as we have not been able to replicate the issue or find a cause for it. If this happens, the best course of action is to regenerate the section. Because the issue is so rare, that should resolve it.

Projects

Projects is one of the world’s most advanced workflows for creating long-form content using AI. Despite its high complexity, there are very few issues with Projects, and in general, it works fantastically well if you use a proper voice paired with the appropriate model.

Import Function

The import function will do its best to try and import the file you give it to the website. However, since there are so many variables related to websites and how a book can be formatted, including the presence of images, you should always double-check to ensure that everything is imported correctly.

One such issue you might encounter is when importing a book where each chapter starts with an image as the first letter. This can be very confusing for the AI, as it will not be able to extrapolate the letter. Therefore, you will have to add that letter to each chapter.
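
If a book has many chapters, it may help to check the imported text for this problem before generating audio. The following is a rough heuristic sketch with hypothetical helper names: it assumes you have each chapter’s body text as a string without its heading, and it flags chapters whose first letter is lowercase, which is often what is left when a decorative first letter was an image. It can produce false positives, so treat the result as a list of chapters to inspect.

```python
def chapters_missing_first_letter(chapters: list[str]) -> list[int]:
    """Return indexes of chapters whose first alphabetic character is lowercase."""
    flagged = []
    for index, chapter in enumerate(chapters):
        first_alpha = next((ch for ch in chapter if ch.isalpha()), None)
        if first_alpha is not None and first_alpha.islower():
            flagged.append(index)
    return flagged


def restore_first_letter(chapter: str, letter: str) -> str:
    """Put the missing first letter back at the start of the chapter text."""
    return letter + chapter.lstrip()
```

For instance, a chapter imported as `"nce upon a time"` would be flagged, and `restore_first_letter("nce upon a time", "O")` gives back `"Once upon a time"`. The letter itself still has to come from the original book.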

If something is imported as a single long paragraph instead of being split where a new line break starts, something is wrong, and it might not work properly. It should follow the same structure as the original book. If that doesn’t work, you can try copying and pasting. If that also doesn’t work, there might be something wrong with how the text is presented, and this book might not work without first converting it to another format or rewriting it fully. This is very unusual, but it’s essential to keep in mind.
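
One way to spot this kind of collapsed import is to compare the number of paragraphs in the original text with the number in the imported text. This is a minimal sketch that assumes you can get both as plain strings; the function names are hypothetical, and a paragraph is simply taken to be any non-empty line.

```python
def count_paragraphs(text: str) -> int:
    """Count non-empty lines, treating each one as a paragraph."""
    return sum(1 for line in text.splitlines() if line.strip())


def import_looks_collapsed(original: str, imported: str) -> bool:
    """True when the original had several paragraphs but the import has at most one."""
    return count_paragraphs(original) > 1 and count_paragraphs(imported) <= 1
```

If `import_looks_collapsed` returns `True`, the line breaks were probably lost during import, and copying and pasting the text or converting the book to another format is worth trying.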

Glitches between paragraphs

On rare occasions, you might encounter certain forms of glitches or sharp breaths between paragraphs, which you might not experience with Speech Synthesis, as the two operate differently. Generally, this issue is not very disruptive and is relatively rare, but we are actively working on resolving it. At the moment, there is no straightforward way to avoid it completely. If you do encounter an issue like this, we recommend regenerating the preceding paragraph. These issues tend to occur at the end of a paragraph rather than at the beginning, so if you hear a problem between two paragraphs, it is usually the first of the two that is the cause.

Multilingual v1

The multilingual v1 model is generally only recommended for old PVC clones that require it, as it exhibits a few issues that are not present in the newer models.

During generation, the audio may change in tone or quality, pick up noise, or become distorted, and the voice may shift from male to female, start whispering, and more. The prominence of these issues largely depends on the model and voice used. Currently, the monolingual model handles longer generations better, but we are continuously working on both models to improve this in the future.

We are aware that the voices have a tendency to degrade during longer audio generations, and our team is working hard to develop the technology to improve upon this. As stated above, this issue is more prominent in the experimental multilingual model.

To help mitigate these problems, we recommend breaking down the text into shorter sections, preferably below 800 characters, as this can help maintain better quality. Additionally, if you are using English voices, it is advisable to stick with the monolingual model for now, as it tends to exhibit more stability.
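
Breaking the text into shorter sections can be automated. Below is a minimal sketch that assumes plain text with ordinary sentence punctuation; the function name and the `limit` parameter are hypothetical, and only the 800-character figure comes from the recommendation above. It splits at sentence boundaries and keeps every chunk strictly below the limit, cutting a sentence mid-way only when that single sentence is itself too long.

```python
import re


def split_into_chunks(text: str, limit: int = 800) -> list[str]:
    """Group sentences into chunks that are each shorter than `limit` characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single overlong sentence is cut into pieces just under the limit.
        while len(sentence) >= limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[: limit - 1])
            sentence = sentence[limit - 1:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) < limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be generated separately. Splitting at sentence boundaries is a design choice intended to avoid cutting a sentence in half, which would likely sound unnatural when the pieces are joined.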

There are a few other factors that could contribute to these issues, and we’d like to highlight some of the key ones:

How long is the text chunk?

The voices do have a tendency to degrade over time. The experimental multilingual model tends to degrade quicker than the monolingual model. The team is currently working hard on finding solutions to these problems.

Pre-made, voice-designed voices, or cloned voices?

Some of the pre-made voices have a tendency to start whispering during longer generations when using the multilingual v1 model. Similar problems have been observed in the voice-designed voices as well, but it is dependent on the voice itself. If you’re using cloned voices, the quality of the samples used is very important to the final output. Noise and other artifacts tend to be amplified during long generations.

What settings are you using?

Both stability and similarity can change how the voice acts as well as how prominent the artifacts are. Hovering over the ! next to each side of the sliders will reveal some more information. The multilingual models may also mispronounce certain numbers and symbols. For instance, 1, 2, 3 might be pronounced as “one,” “two,” “three.” Therefore, if you need them to be pronounced in another language, it is recommended to write them out.

We acknowledge that these solutions are temporary measures and may not address all concerns perfectly. However, we believe they can be beneficial in specific situations.