Understanding Embedders

What is an Embedder?

An embedder is a core component of the voice conversion process. It’s a neural network that analyzes an audio file and converts it into a sequence of numerical vectors, called “embeddings.” These embeddings capture the essential acoustic and linguistic features of the audio, such as the spoken content, pronunciation, and accent.

Think of an embedder as a translator that turns complex audio waves into a simplified language that the voice conversion model can understand and work with.
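To make the idea of "audio in, vectors out" concrete, here is a toy sketch. It is not a real embedder and not how Applio works internally: it replaces the neural network with two hand-written measurements per frame, purely to show the shape of the result. The 16 kHz sample rate and 20 ms frame length are assumptions chosen for illustration, and `toy_embed` is a hypothetical name.

```python
import math

SAMPLE_RATE = 16_000  # assumed input rate for this illustration
FRAME_SIZE = 320      # assumed frame length: 20 ms at 16 kHz


def toy_embed(samples):
    """Map a waveform to one small vector per frame (a stand-in, NOT a real embedder)."""
    embeddings = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        # Feature 1: root-mean-square level of the frame (rough loudness).
        energy = math.sqrt(sum(x * x for x in frame) / FRAME_SIZE)
        # Feature 2: fraction of adjacent samples whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        embeddings.append([energy, crossings / FRAME_SIZE])
    return embeddings


# One second of a 440 Hz sine wave as stand-in audio.
audio = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
vectors = toy_embed(audio)
print(len(vectors), "frames,", len(vectors[0]), "values per frame")
```

A real embedder follows the same pattern but produces a much longer, learned vector for each frame instead of two hand-picked numbers.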

How to Use Embedders in Applio

You’ll interact with embedders at two key stages of the voice conversion process:

Training: When you’re training a new voice model, you’ll need to select an embedder in the Extraction Settings.
Inference: When you’re using a trained model to convert a voice, you must select the same embedder the model was trained with in the Advanced Settings.

Where to Find Embedders

A wide variety of pre-trained embedders is available on Hugging Face. To locate them:

Go to the Hugging Face model hub.
In the sidebar, filter by Task > Feature Extraction.
Use the search bar to find specific embedder types, such as “HuBERT” or “ContentVec”.
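The steps above can also be expressed as a direct link. The following is a minimal sketch, assuming the model hub's listing page reads the task filter and search term from the `pipeline_tag` and `search` query parameters; the helper name `hub_search_url` is hypothetical.

```python
from urllib.parse import urlencode

HUB_MODELS_URL = "https://huggingface.co/models"


def hub_search_url(query, task="feature-extraction"):
    """Build a model-hub URL filtered by task and search term."""
    # Assumption: the hub reads `pipeline_tag` and `search` from the query string.
    params = urlencode({"pipeline_tag": task, "search": query})
    return f"{HUB_MODELS_URL}?{params}"


for name in ("HuBERT", "ContentVec"):
    print(hub_search_url(name))
```

Opening one of the printed links in a browser should show the same results as applying the sidebar filter and search by hand.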

Best Practices for Using Embedders

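Consistency is Key: Always use the same embedder for a given model, from training all the way through to inference. A model learns to interpret the output of the embedder it was trained with, so switching embedders at inference will degrade or break the conversion.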
Keep Track: If you’re working with multiple models, keep a record of which embedder you used for each one.
Experiment: Don’t be afraid to experiment with different embedders to see which one works best for your specific use case. Some embedders may be better suited for singing, while others excel at speech.
Stay Updated: The field of audio processing is constantly evolving. Keep an eye out for new and improved embedders that may offer better performance.
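The first two practices above can be supported with a small record file. This is a minimal sketch, not an Applio feature: the file name `embedder_registry.json`, the function names, and the model and embedder strings are all hypothetical and chosen for illustration.

```python
import json
from pathlib import Path

REGISTRY = Path("embedder_registry.json")  # hypothetical file name


def load_registry(path=REGISTRY):
    """Return the {model_name: embedder} mapping, or an empty dict if no file exists."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def record_embedder(model_name, embedder, path=REGISTRY):
    """Store which embedder a model was trained with."""
    registry = load_registry(path)
    registry[model_name] = embedder
    Path(path).write_text(json.dumps(registry, indent=2))


def check_embedder(model_name, embedder, path=REGISTRY):
    """Raise if the embedder chosen for inference differs from the recorded one."""
    trained_with = load_registry(path).get(model_name)
    if trained_with is None:
        raise KeyError(f"No embedder recorded for model {model_name!r}")
    if trained_with != embedder:
        raise ValueError(
            f"Model {model_name!r} was trained with {trained_with!r}, not {embedder!r}"
        )


if __name__ == "__main__":
    record_embedder("my_voice", "contentvec")  # after training (illustrative names)
    check_embedder("my_voice", "contentvec")   # before inference; passes silently
```

Calling `check_embedder` before each conversion turns a silent quality problem into an explicit error when the wrong embedder is selected.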