This application is client software for real-time voice conversion that supports various voice conversion models, including RVC, MMVCv13, MMVCv15, and So-vits-svcv40. This document, however, uses RVC (Retrieval-based-Voice-Conversion) as the tutorial material. The basic operations for each model are essentially the same.
Hereafter, the original Retrieval-based-Voice-Conversion-WebUI is referred to as original-RVC, and the RVC-WebUI created by ddPn08 is referred to as ddPn08-RVC.
- Model training must be done separately.
- If you want to train a model yourself, please go to original-RVC or ddPn08-RVC.
- The recording app on GitHub Pages is convenient for preparing training voice data in the browser.
- [Commentary video](https://youtu.be/s_GirFEGvaA)
- Tips for training have been published, so please refer to them.
On Windows, unzip the downloaded zip file and run start_http.bat.
If you have the old version, be sure to unzip it into a separate folder.
On Mac, it is launched as follows.
- Unzip the downloaded file.
- Next, run MMVCServerSIO by holding down the Control key and clicking it (or right-click to run it). If a message appears stating that the developer cannot be verified, run it again the same way. The terminal will open and the process will finish within a few seconds.
- Next, run startHTTP.command by holding down the Control key and clicking it (or right-click to run it). If a message appears stating that the developer cannot be verified, run it again the same way. A terminal will open, and the launch process will begin.
- In other words, the key is to run both MMVCServerSIO and startHTTP.command, and MMVCServerSIO must be run first.
If you have the old version, be sure to unzip it into a separate folder.
When connecting remotely, please use the .bat file (Windows) or .command file (Mac) whose name has http replaced with https.
Access the client with a browser (currently only Chrome is supported) to see the GUI.
When you run a .bat file (Windows) or .command file (Mac), a screen like the following is displayed and various data is downloaded from the Internet at the first start-up. Depending on your environment, this usually takes 1-2 minutes.
Once the download of the required data is complete, a dialog like the one below will be displayed. If you wish, press the yellow icon to reward the developer with a cup of coffee. Pressing the Start button will make the dialog disappear.
Use this screen to operate.
You can immediately perform voice conversion using the data downloaded at startup.
(1) To get started, click on the Model Selection area to select the model you would like to use. Once the model is loaded, the images of the characters will be displayed on the screen.
(2) Select the microphone (input) and speaker (output) you wish to use. If you are unsure, we recommend selecting client and then choosing your microphone and speaker. (The difference between client and server is explained later.)
(3) When you press the start button, the audio conversion will start after a few seconds of data loading. Try saying something into the microphone. You should be able to hear the converted audio from the speaker.
Q1. The audio is becoming choppy and stuttering.
A1. Your PC's performance may not be adequate. Try increasing the CHUNK value (marked A in the figure), for example to 1024. Also try setting F0 Det to dio (marked B in the figure).
Q2. The voice is not being converted.
A2. Refer to this and identify where the problem lies, and consider a solution.
Q3. The pitch is off.
A3. Although it wasn't explained in the Quick Start, if the model is pitch-changeable, you can change it with TUNE. Please refer to the more detailed explanation below.
Q4. The window doesn't show up, or the window shows up but the contents are not displayed. A console error such as "electron: Failed to load URL: http://localhost:18888/ with error: ERR_CONNECTION_REFUSED" is displayed.
A4. A virus checker may be running. Please wait, or exclude the folder from scanning at your own risk.
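For Q4, it can help to check whether anything is listening on the port named in the error message yet. The following is a minimal sketch and not part of the application: the helper name `server_is_listening` is hypothetical, and the host and port are simply taken from the error text above.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 18888,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (for example a refused
        # connection or a timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("The server is accepting connections.")
    else:
        print("Nothing is listening yet; the server may still be starting.")
```

If the check keeps failing long after launch, the virus checker explanation in A4 becomes more likely.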
Q5. "[4716:0429/213736.103:ERROR:gpu_init.cc(523)] Passthrough is not supported, GL is disabled, ANGLE is" is displayed.
A5. This error comes from a library used by this application. It has no effect, so please ignore it.
Q6. My AMD GPU isn't being used.
A6. Please use the DirectML version. Additionally, AMD GPUs are only enabled for ONNX models. You can confirm that the GPU is in use by checking that its utilization rises in the Performance Monitor (see here).
Q7. onnxruntime is not launching and it's producing an error.
A7. An error appears to occur if the folder path contains non-ASCII characters. Please extract to a path that uses only alphanumeric characters. (Reference: w-okada#528)
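For Q7, you can check a candidate folder path before extracting. This is a minimal sketch, assuming that "unicode" in the report means any non-ASCII character; the function name and the example paths are hypothetical.

```python
def has_non_ascii(path: str) -> bool:
    """Return True if the path contains any character outside ASCII."""
    # str.isascii() is available from Python 3.7 onward.
    return not path.isascii()


if __name__ == "__main__":
    # Hypothetical example paths, used only for illustration.
    for candidate in ["C:/tools/vcclient", "C:/ユーザー/vcclient"]:
        verdict = "avoid" if has_non_ascii(candidate) else "ok"
        print(f"{candidate}: {verdict}")
```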
Icons are links.
Icon | To |
---|---|
Octocat | GitHub repository |
question | manual |
spanner | tools |
coffee | donation |
Initialize configuration.
Select the model you wish to use.
By pressing the "edit" button, you can edit the list of models (model slots). Please refer to the model slots editing screen for more details.
The character image of the loaded model is displayed on the left side. The status of the real-time voice changer is overlaid on the top left of the character image.
You can use the buttons and sliders on the right side to control various settings.
The lag from speaking to conversion is the sum of buf and res. When adjusting, keep the buf time longer than the res time.
This is the volume after voice conversion.
buf is the length of each chunk in milliseconds when capturing audio. Shortening CHUNK will decrease this number.
res is the measured time it takes to convert the data with CHUNK and EXTRA added. Decreasing either CHUNK or EXTRA will reduce this number.
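The relationship between buf, res, and the lag can be sketched as below. This is only an illustration: it assumes both values are read off the screen in milliseconds, the function names are hypothetical, and the sample readings are made up.

```python
def estimated_lag_ms(buf_ms: float, res_ms: float) -> float:
    """Approximate delay from speaking to hearing the converted voice."""
    return buf_ms + res_ms


def keeps_up(buf_ms: float, res_ms: float) -> bool:
    """True when a chunk is converted in less time than the chunk lasts."""
    return buf_ms > res_ms


if __name__ == "__main__":
    # Made-up example readings of (buf, res) in milliseconds.
    for buf, res in [(128.0, 60.0), (64.0, 90.0)]:
        lag = estimated_lag_ms(buf, res)
        status = "ok" if keeps_up(buf, res) else "res exceeds buf"
        print(f"buf={buf} res={res} lag={lag} {status}")
```

When res is longer than buf, each chunk takes longer to convert than it takes to capture, which is a likely reason for the advice to keep buf longer than res.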
Press "start" to begin voice conversion and "stop" to end it.
When this button is pressed, the input sound is output as is. By default, a confirmation dialog appears when it is activated, but you can skip this dialog through the Advanced Settings.
- in: Change the volume of the audio input to the model.
- out: Change the volume of the converted audio.
Enter how much to shift the pitch of the voice. The value can also be changed during inference. Below are some guidelines for settings.
- +12 for male voice to female voice conversion
- -12 for female voice to male voice conversion
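The guideline values above make sense if the pitch setting is expressed in semitones, which is an assumption here: 12 semitones are one octave, so the frequency doubles or halves. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz: float, semitones: float) -> float:
    """Apply a semitone shift to a fundamental frequency in Hz."""
    return f0_hz * semitones_to_ratio(semitones)


if __name__ == "__main__":
    # 110 Hz is used only as an illustrative starting pitch.
    for shift in (12, -12, 0):
        print(shift, semitones_to_ratio(shift), shift_f0(110.0, shift))
```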
You can specify how much weight to give to the features used in training. This is only valid for models which have an index file registered. A value of 0 uses HuBERT's output as-is, and 1 assigns all the weight to the original features. If the index ratio is greater than 0, the search may take longer.
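One way to read the index ratio is as a linear blend between the two feature vectors. The sketch below is an assumption about how such a blend could look, not the application's actual code; `blend_features` and its arguments are hypothetical names, and plain Python lists stand in for feature vectors.

```python
from typing import List


def blend_features(hubert: List[float], retrieved: List[float],
                   index_ratio: float) -> List[float]:
    """Mix HuBERT output with features retrieved from the index.

    index_ratio = 0 returns the HuBERT output unchanged;
    index_ratio = 1 returns the retrieved training features only.
    """
    if not 0.0 <= index_ratio <= 1.0:
        raise ValueError("index_ratio must be between 0 and 1")
    if len(hubert) != len(retrieved):
        raise ValueError("feature vectors must have the same length")
    return [index_ratio * r + (1.0 - index_ratio) * h
            for h, r in zip(hubert, retrieved)]


if __name__ == "__main__":
    # Illustrative values only.
    print(blend_features([1.0, 2.0], [3.0, 4.0], 0.5))
```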
Set the speaker of the audio conversion.
Save the specified settings. When the model is loaded again, the saved settings are applied (with some exceptions).
This converts the PyTorch model to ONNX and outputs it. It is only valid if the loaded model is an RVC PyTorch model.
The items that can be configured vary depending on the AI model used. Please check the features and other information on the model provider's website.
You can review the action settings and transformation processes.
You can switch the noise cancellation features on and off, but they are only available in client device mode.
- Echo: Echo Cancellation Function
- Sup1, Sup2: This is a noise suppression feature.
Choose an algorithm for extracting the pitch from the following options. On AMD GPUs, only the onnx extractors are available.
F0 Extractor | type | description |
---|---|---|
dio | cpu | lightweight |
harvest | cpu | high-precision |
crepe | torch | GPU-enabled, high-precision |
crepe full | onnx | GPU-enabled, high-precision |
crepe tiny | onnx | GPU-enabled, lightweight |
rmvpe | torch | GPU-enabled, high-precision |
This is the volume threshold for performing voice conversion. When the RMS of the input is smaller than this value, conversion is skipped and silence is returned instead. (Because the conversion process is skipped, the processing load stays small.)
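The threshold behaviour can be sketched as follows. This is an illustration under stated assumptions: samples are floats in the range -1 to 1, and `convert` is a hypothetical stand-in for the real conversion step.

```python
import math
from typing import Callable, List


def rms(samples: List[float]) -> float:
    """Root mean square of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def convert_or_silence(samples: List[float], threshold: float,
                       convert: Callable[[List[float]], List[float]]
                       ) -> List[float]:
    """Skip conversion and return silence when the block is too quiet."""
    if rms(samples) < threshold:
        # The conversion step is never called, so it costs almost nothing.
        return [0.0] * len(samples)
    return convert(samples)


if __name__ == "__main__":
    def passthrough(block: List[float]) -> List[float]:
        return list(block)  # placeholder for the real conversion

    print(convert_or_silence([0.001, -0.001], 0.01, passthrough))
    print(convert_or_silence([0.5, -0.5], 0.01, passthrough))
```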
Decide how much audio to cut out and convert in one conversion. The higher the value, the more efficient the conversion, but the buf value also grows, which lengthens the maximum delay before conversion starts. The approximate time is displayed in buff:.
Determines how much past audio to include in the input when converting. The longer the past audio, the more accurate the conversion, but res also gets longer because the calculation takes more time. (Probably because the Transformer is the bottleneck, the calculation time grows with the square of this length.)
Details are here.
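If the calculation time really does grow with the square of the input length, which the text above only states as probable, the effect of changing that length can be estimated as below. The function name and the sample lengths are hypothetical.

```python
def relative_cost(old_length: float, new_length: float) -> float:
    """Estimated cost multiplier under a quadratic-cost assumption."""
    if old_length <= 0 or new_length <= 0:
        raise ValueError("lengths must be positive")
    return (new_length / old_length) ** 2


if __name__ == "__main__":
    # Illustrative lengths only: doubling the input length.
    print(relative_cost(4096, 8192))
```

Under this assumption, halving the included past audio would cut the calculation time to about a quarter, so reducing EXTRA is worth trying first when res is too long.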
You can select the GPU to use in the onnxgpu version.
In the onnxdirectML version, you can switch the GPU ON/OFF.
In the DirectML version, these buttons are displayed.
- cpu: use cpu
- gpu0: use gpu0
- gpu1: use gpu1
- gpu2: use gpu2
- gpu3: use gpu3
Even if a GPU is not detected, gpu0 - gpu3 will still be displayed. If you specify a GPU that doesn't exist, the CPU will be used instead. (reference)
Choose the type of audio device you want to use. For more information, please refer to the document.
- Client: You can make use of the microphone input and speaker output with the GUI functions such as noise cancellation.
- Server: VCClient can directly control the microphone and speaker to minimize latency.
You can select a sound input device such as a microphone input. It's also possible to input from audio files (size limit applies).
On Windows, system sound is also available as input. Please note that if you also set the system sound as the output, a sound loop occurs.
You can select an audio output device such as speakers.
In monitor mode, you can select audio output devices such as speaker output. This is only available in server device mode.
Please refer to this document for an overview of the idea.
It will output the converted audio to a file.
You can record and check the input audio to the voice conversion AI and the output audio from it.
Please refer to this document for an overview of the idea.
Start or stop recording both the audio input to the voice conversion AI and the audio output from it.
The AI will play back any audio that is input into it.
Play the audio input to the voice conversion AI.
Play the audio output from the voice conversion AI.
You can do more advanced operations.
You can synthesize models.
You can set up more advanced settings.
You can check the configuration of the current server.
By pressing the edit button in the Model Slot Selection Area, you can edit the model slot.
You can change the image by clicking on the icon.
You can download the file by clicking on the file name.
You can upload the model.
In the upload screen, you can select the voice changer type to upload.
You can go back to the Model Slot Edit Screen by pressing the back button.
You can download a sample.
You can go back to the Model Slot Edit Screen by pressing the back button.
You can edit the details of the model slot.