What Is RVC (Retrieval-Based Voice Conversion)?

RVC is an open-source AI method for converting audio to sound like a trained voice model. Here's what it is and how apps like VibeSing use it.

The quick answer

RVC stands for Retrieval-Based Voice Conversion. It's an open-source AI architecture that converts audio — typically singing or speech — to sound like a target voice model. It became the dominant method for AI cover songs starting in 2023 because it offers fast training, good quality, and relatively low compute costs.

If you've heard an AI cover on social media since 2023, there's a reasonable chance it was made with RVC or a derivative.

How it works at a high level

Traditional voice conversion approaches tried to directly map the acoustic features of one voice onto another. RVC takes a different route: it adds a retrieval step, which is where the name comes from.

When converting audio, RVC:

Encodes the input: The source audio (a vocal stem) is encoded into a sequence of feature vectors, one per short frame, that represent the content (what's being sung) separately from the voice characteristics (how it sounds).
Retrieves similar features: Rather than purely synthesizing new audio, RVC searches an index of features built from the target voice's training audio, finds the closest matches to each incoming frame, and blends them with the input features.
Decodes: A vocoder converts the resulting features back into audio in the target voice.

The retrieval step is what makes RVC distinctive. By finding the closest matching features in the target model rather than purely generating them, the output tends to preserve more natural vocal texture and avoids some of the artifacts that plagued earlier methods.
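The retrieval step described above can be sketched in a few lines of Python. This is a minimal illustration, not the RVC project's actual code: the function name, the `k` and `index_rate` parameters, and the plain-list feature format are all assumptions made for the example, and a brute-force search stands in for the approximate nearest-neighbour index a real system would use.

```python
from math import dist


def retrieve_and_blend(source_frames, target_index, k=2, index_rate=0.75):
    """For each source frame, find the k nearest target-voice features and blend them in.

    Hypothetical helper for illustration only.
    index_rate = 0 keeps the source features unchanged;
    index_rate = 1 uses only the retrieved target features.
    """
    blended = []
    for frame in source_frames:
        # Rank the stored target features by Euclidean distance to this frame.
        nearest = sorted(target_index, key=lambda feat: dist(frame, feat))[:k]
        # Average the k nearest neighbours, dimension by dimension.
        retrieved = [sum(vals) / len(nearest) for vals in zip(*nearest)]
        # Linear mix of the retrieved features and the original frame.
        blended.append([
            index_rate * r + (1.0 - index_rate) * s
            for r, s in zip(retrieved, frame)
        ])
    return blended


# Toy data: one 2-dimensional source frame and three stored target features.
source = [[0.0, 0.0]]
index = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]]
print(retrieve_and_blend(source, index, k=2, index_rate=0.5))  # [[1.0, 1.0]]
```

The blend weight is the interesting design choice: pulling each frame toward features the target voice actually produced is what gives the output its natural texture, while keeping part of the source frame preserves the original content.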

Why RVC became popular for AI covers

Fast training: A usable voice model can be trained on a few minutes of audio in 20–40 minutes on a consumer GPU. Earlier commercial systems, by comparison, required hours of audio and proprietary training infrastructure.

Open source: The code is publicly available, so developers could build on it without licensing fees or access restrictions. This kicked off a wave of consumer tools.

Reasonable quality: RVC output is good enough for social sharing, which is all most use cases require. It is not studio-perfect, but it is convincing on phones and small speakers.

Low compute at inference: Once a model is trained, generating a cover is fast and inexpensive.

RVC vs. TTS-based voice cloning

These are two different approaches to making audio sound like a specific person:

TTS-based (text-to-speech) cloning takes text as input and generates speech from scratch in the target voice. It starts with a written script and produces audio. Examples include ElevenLabs and similar commercial tools.

RVC takes audio as input and converts it to the target voice while preserving the original timing, melody, and phonetic content. It's a voice-to-voice transformation — which is exactly what you need for singing.

For AI cover songs, RVC-style voice conversion is the right tool because you're not starting from text — you're starting from the original vocal performance and swapping the voice out.

How consumer apps build on top of it

Apps like VibeSing wrap the underlying voice conversion technology in a user-friendly interface:

Voice samples are recorded in the browser (no technical setup)
Training runs on cloud hardware (no need for a GPU)
The cover generation pipeline handles stem separation and vocal replacement automatically
The output is delivered as a shareable link

The underlying voice conversion engine handles the hard part. The app handles everything around it.
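As a rough picture of that division of labor, the cover pipeline can be sketched as three stages: separate the stems, convert the vocal, and remix. This is a hypothetical outline, not VibeSing's implementation: `make_cover`, `toy_separate`, and `toy_convert` are names invented for the example, audio is modeled as a plain list of samples, and the toy stages only stand in for the real separation and conversion models.

```python
from typing import Callable, List, Tuple

Audio = List[float]  # mono samples in [-1.0, 1.0]; a stand-in for a real audio buffer


def make_cover(
    song: Audio,
    separate: Callable[[Audio], Tuple[Audio, Audio]],
    convert: Callable[[Audio], Audio],
    vocal_gain: float = 1.0,
) -> Audio:
    """Hypothetical cover pipeline: split the song, convert the vocal, remix."""
    # 1. Stem separation: split the mix into a vocal stem and an instrumental stem.
    vocals, instrumental = separate(song)
    # 2. Vocal replacement: run the vocal stem through the voice conversion engine.
    new_vocals = convert(vocals)
    # 3. Remix: sum the stems sample by sample and clip to the valid range.
    return [
        max(-1.0, min(1.0, vocal_gain * v + i))
        for v, i in zip(new_vocals, instrumental)
    ]


# Toy stand-ins so the sketch runs; real stages would be trained models.
def toy_separate(song: Audio) -> Tuple[Audio, Audio]:
    half = [s * 0.5 for s in song]
    return half, list(half)


def toy_convert(vocals: Audio) -> Audio:
    return [-v for v in vocals]


cover = make_cover([0.4, -0.8], toy_separate, toy_convert)
```

Passing the stages in as functions keeps the app layer independent of the engine: the conversion model can be swapped without touching the code that handles separation and remixing.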

Want to experience voice conversion without the technical setup? Open VibeSing Studio and make your first AI cover in a few minutes.