Voice Conversion with Inference

Inference is the process of using a trained voice model to convert an audio file from one voice to another. Applio makes this process simple and provides a wealth of advanced options for fine-tuning your results.

The main inference interface in Applio, showing the model selection, audio input, and conversion button.

You can perform inference on a single audio file or on a batch of files at once.

  1. Select Your Model: Choose the .pth and .index files for your model. If you don’t see your model in the list, click the refresh button.
  2. Upload Your Audio: Drag and drop your audio file into the input box, or click to browse your files.
  3. Convert: Click the Convert button to start the voice conversion process. The time it takes will depend on the length of your audio and the speed of your computer.
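A model comes as a pair of files: a .pth containing the trained weights and an optional .index file of training-set features. Applio matches these for you in the dropdown, but as a rough illustration of how such pairs relate by filename, here is a small hypothetical helper (not part of Applio's actual code):

```python
def pair_model_files(filenames):
    """Pair each .pth model with a matching .index file, if one exists.

    Illustrative only: mimics matching a model's weights to its feature
    index by filename stem, the way they appear together in the UI.
    """
    indexes = [f for f in filenames if f.endswith(".index")]
    pairs = {}
    for pth in (f for f in filenames if f.endswith(".pth")):
        stem = pth.rsplit(".", 1)[0]
        # An index typically shares (or contains) the model's name.
        match = next((i for i in indexes if stem in i), None)
        pairs[stem] = (pth, match)
    return pairs
```

A model without a matching .index still converts; the index only guides the feature search controlled by the Search Feature Ratio setting below.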

Applio provides a wide range of advanced settings to help you achieve the perfect voice conversion.

  • Export Format: Choose the audio format for your converted file (e.g., WAV, FLAC, MP3).
  • Split Audio: This option chops the audio into smaller segments for processing and then stitches them back together at the end. This can be useful for very long audio files to avoid memory issues.
  • Autotune: Applies a gentle autotune effect to the output. This is often recommended for singing conversions.
  • Clean Audio: Uses noise reduction algorithms to clean up the output audio. This is recommended for speech-to-speech conversions.
  • Upscale Audio: Enhances the quality of the output audio. This is useful if your input audio is low quality.
  • Clean Strength: Controls the intensity of the audio cleaning. Higher values remove more noise but can sometimes introduce artifacts or muffle the audio.
  • Pitch: Adjusts the pitch of the output in semitones. Use negative values to lower the pitch (more masculine) and positive values to raise it (more feminine). A common starting point is +12 for male-to-female conversions and -12 for female-to-male.
  • Filter Radius: Applies median filtering to the extracted pitch; values of 3 or higher can reduce breathiness and other artifacts.
  • Search Feature Ratio: Controls how closely the model searches for matching features in the index file. Higher values can be more accurate but may also introduce artifacts. It’s usually best to leave this at the default value.
  • Volume Envelope: Blends the volume envelope of the output with the original audio’s volume.
  • Protect Voiceless Consonants: Helps to preserve the clarity of consonants and breathing sounds, which can prevent artifacts.
  • Hop Length: Determines how quickly the model adapts to large pitch changes. Smaller values can result in more accurate pitch tracking but require more processing time.
  • Pitch Extraction Algorithm: Choose the algorithm used for pitch detection. rmvpe and crepe are popular choices, but you can experiment to see which one works best for your audio.
  • Embedder Model: Select the embedder model that was used to train your voice model. This must be the same embedder used during training.
  • Formant Shifting: This can be used to make male-to-female and female-to-male conversions sound more natural. You can adjust the Quefrency and Timbre for formant shifting.
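The Pitch setting above works in semitones: each semitone multiplies the fundamental frequency by 2^(1/12), so a shift of ±12 semitones (one octave) doubles or halves it. The relationship can be sketched in a few lines (plain math, not Applio code):

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)

# A 110 Hz voice shifted up 12 semitones lands at 220 Hz,
# which is why +12 is the usual starting point for male-to-female conversion.
shifted = 110.0 * semitone_ratio(12)
```

Smaller shifts (e.g. ±2 to ±4) are often enough when the source and target voices are already close in range.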

You can apply a variety of audio effects to your output file.

  • Reverb: Adds reverberation to the audio.
  • Pitch Shift: Shifts the pitch of the audio in semitones.
  • Limiter: Prevents the audio from clipping.
  • Gain: Adjusts the volume of the audio.
  • Distortion: Adds harmonic distortion to the audio.
  • Chorus: Adds a chorus effect.
  • Bitcrush: Reduces the bit depth of the audio, creating a lo-fi effect.
  • Clipping: Intentionally clips the audio at a certain threshold.
  • Compressor: Reduces the dynamic range of the audio.
  • Delay: Adds an echo effect to the audio.
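Two of the effects above are simple enough to express directly: gain scales every sample by 10^(dB/20), and clipping clamps samples to a threshold. A minimal pure-Python sketch of both (illustrative only; Applio applies its effects internally):

```python
def apply_gain(samples, gain_db):
    """Scale samples by a gain in decibels (+6 dB roughly doubles amplitude)."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]

def hard_clip(samples, threshold=1.0):
    """Clamp samples to [-threshold, threshold], as a clipping stage would."""
    return [max(-threshold, min(threshold, s)) for s in samples]
```

Note the difference between the Limiter and Clipping effects: a limiter turns the gain down smoothly to stay under the threshold, while clipping simply chops the waveform off at it, which adds the harsh distortion some users want deliberately.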

If your output audio sounds robotic or has artifacts, here are a few things you can try:

  • Improve Input Quality: The better your input audio, the better the output will be.
  • Check Model Training: If you’re using a custom model, it may be undertrained or overtrained.
  • Clean Your Audio: Use a tool like Ultimate Vocal Remover (UVR) to remove any reverb, background noise, or double vocals from your input audio.
  • Verify Dataset Quality: If your model was trained on a noisy dataset, you’ll need to clean the dataset and retrain the model.
  • Experiment with Settings: Don’t be afraid to experiment with the advanced settings to find the best combination for your specific audio.