Understanding Voice Cloning

A simple guide to voice cloning technology.
By Boris Delovski • Updated on Oct 8, 2025

The digital landscape is undergoing a profound transformation, largely driven by rapid advancements in Artificial Intelligence. Much of the recent attention has been on the text-based capabilities of Large Language Models (LLMs).

Nonetheless, another equally revolutionary area is quietly gaining ground: voice synthesis. This field has evolved far beyond the basic virtual assistants we once knew. Recent breakthroughs in voice cloning are particularly noteworthy, with the potential to fundamentally reshape how we interact with both technology and one another.

In this article, we'll explain what voice cloning is, why it matters, and how it works. We'll also demonstrate how easily a voice can be cloned using just a short audio sample.

What Is Voice Cloning and How Does It Work

At its core, voice cloning is the process of generating artificial speech that convincingly mimics a specific person’s voice. Traditional text-to-speech (TTS) systems, by contrast, typically rely on a generic narrator voice or a limited set of voices. These standard TTS voices may sound smooth and technically perfect. However, that very perfection often reveals their artificial nature. They lack the “soul” and nuance of a real human voice.

Voice cloning, instead, aims to capture the distinctive qualities that make someone’s voice truly their own, such as pitch, accent, tone, and speaking style. As a result, the generated voice might sound slightly less polished than a typical TTS voice. But this slight roughness is desirable. It's the subtle imperfections that give a voice its authenticity and emotional depth.

Understanding voice cloning requires delving into its technical foundations, particularly its relationship with Text-to-Speech (TTS) systems and the innovative methods used to train models to adapt and replicate voices. The typical process of AI voice cloning involves several key steps:

  • Data Collection 
  • Feature Extraction
  • Model Training
  • Voice Synthesis

How Do Data Collection and Preprocessing Work in Voice Cloning

Everything begins with audio recordings of the target voice. In this stage, both the quality and quantity of the data are immensely important. Some models require less data, while others require more, but all models require high-quality recordings of the original voice. Ideally, the person should record their voice using a high-quality microphone, preferably a condenser microphone, in a quiet and controlled environment. 

Background noise, echo, and distortion should be avoided, as they can reduce the model’s ability to accurately extract vocal features. Even subtle artifacts, such as the hum of an air conditioner, reverberation from bare walls, or digital compression, can compromise the clarity of the speech signal. These imperfections may introduce unwanted "noise" into the cloned voice and diminish its authenticity.

It is important to note that recording in less-than-ideal conditions does not necessarily mean you will get poor results. Modern voice cloning models are highly advanced in that they can still extract and clone your voice effectively even if the recording contains some background noise. However, for professional-grade results, a carefully prepared recording setup is essential.

Finally, if you plan to use a model that requires a larger amount of voice data, it is important to include variety in your recordings. Providing multiple clips with different intonations, emotional expressions, and speaking speeds allows the model to better understand how your voice behaves in different contexts. This results in a more dynamic and expressive cloned voice, rather than a flat or monotonous one.
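
If you want to lightly clean up your recordings before handing them to a model, a few lines of Python are usually enough. Below is a minimal preprocessing sketch using torchaudio; the file name raw_recording.wav is just a placeholder, and the target sample rate depends on the model you plan to use.

import torchaudio

# Load a recording (placeholder file name)
waveform, sample_rate = torchaudio.load("raw_recording.wav")

# Mix down to mono if the recording has multiple channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to a common rate (16 kHz here; use whatever your model expects)
target_rate = 16000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Peak-normalize so the loudest sample sits just below clipping
waveform = waveform / waveform.abs().max().clamp(min=1e-8)

# Save the cleaned-up clip
torchaudio.save("clean_recording.wav", waveform, target_rate)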

How Does Feature Extraction Work in Voice Cloning 

Raw audio cannot be fed directly into deep learning models. Instead, it is necessary to extract the most relevant features from the audio and use those as input for the model. Traditional pipelines extract acoustic features like spectrograms or mel-frequency cepstral coefficients (MFCCs), which represent the audio’s frequencies over time.
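
To make this concrete, here is a short sketch of how such features can be computed with torchaudio. The file name and parameter values are placeholders, not something a specific voice cloning model requires.

import torchaudio

waveform, sample_rate = torchaudio.load("reference_audio.wav")

# Mel spectrogram: energy per mel-frequency band over time
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
mel_spec = mel_transform(waveform)    # shape: (channels, n_mels, frames)

# MFCCs: a compact summary of the spectral envelope
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
mfccs = mfcc_transform(waveform)      # shape: (channels, n_mfcc, frames)

print(mel_spec.shape, mfccs.shape)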

Modern approaches go a step further. They use audio tokenizers or codecs to compress speech into discrete tokens. For instance, neural codecs, like Meta’s EnCodec, convert audio into a sequence of numeric codes. In essence, the AI “dissects” the voice recording, analyzing thousands of micro-characteristics, such as pronunciations, intonation patterns, accent nuances, and other subtle details.
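
As a rough illustration of the codec approach, the sketch below uses the EnCodec checkpoint available through the Hugging Face transformers library to turn a clip into discrete codes. The exact class and output names depend on your transformers version, so treat this as a sketch rather than a guaranteed recipe.

import torchaudio
from transformers import AutoProcessor, EncodecModel

# Load the 24 kHz EnCodec checkpoint from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# Load a clip, mix it down to mono, and resample it to the codec's expected rate
waveform, sr = torchaudio.load("reference_audio.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, processor.sampling_rate)

# Encode the audio into sequences of discrete codec tokens
inputs = processor(raw_audio=waveform.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt")
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)    # integer codes instead of raw samples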

How Does Model Training Work in Voice Cloning 

In this phase, the model processes the features extracted from the input audio. More precisely, the model learns to map text to these voice characteristics. Early voice cloning systems, such as Google’s SV2TTS, used a multi-step approach to achieve this. Today, however, state-of-the-art models use an end-to-end integrated model.

For example, neural codec language models like Microsoft’s VALL-E train a Transformer to directly predict the codec tokens for the audio from the input text, conditioned on a short sample of the target voice. During training, the model is exposed to many different speakers, which enables it to generalize and accurately replicate diverse voice characteristics.
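
To build some intuition for what “predicting codec tokens” means, here is a deliberately tiny toy sketch in PyTorch. It is not VALL-E and omits details such as causal masking and teacher forcing; it only shows a Transformer consuming text tokens plus a short "speaker prompt" of codec tokens and producing a distribution over codec tokens for the target utterance.

import torch
import torch.nn as nn

TEXT_VOCAB, CODEC_VOCAB, DIM = 100, 1024, 256

class ToyCodecLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, DIM)
        self.codec_emb = nn.Embedding(CODEC_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEC_VOCAB)

    def forward(self, text_ids, prompt_codec_ids, target_codec_ids):
        # Concatenate text, speaker-prompt, and target codec embeddings into one sequence
        x = torch.cat([
            self.text_emb(text_ids),
            self.codec_emb(prompt_codec_ids),
            self.codec_emb(target_codec_ids),
        ], dim=1)
        hidden = self.backbone(x)
        # Predict a codec-token distribution for each target position
        return self.head(hidden)[:, -target_codec_ids.size(1):, :]

model = ToyCodecLM()
text = torch.randint(0, TEXT_VOCAB, (1, 12))       # fake text token IDs
prompt = torch.randint(0, CODEC_VOCAB, (1, 75))    # codec tokens from a short voice sample
target = torch.randint(0, CODEC_VOCAB, (1, 150))   # codec tokens the model should learn to predict
logits = model(text, prompt, target)
loss = nn.functional.cross_entropy(logits.reshape(-1, CODEC_VOCAB), target.reshape(-1))
print(loss.item())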

How Does Speech Synthesis Work in Voice Cloning

Finally, given some input text and a target voice, the system generates the audio. The process begins by breaking down the written text into its basic sound units, called phonemes. For example, the word "cat" can be broken down into the following phonemes:

  • /k/
  • /æ/
  • /t/

Next, the model predicts the specific audio characteristics needed to produce those sounds in the target voice. This includes factors like pitch, speaking speed, and the unique quality that defines the person's voice. To complete the voice synthesis process, a special component called a vocoder takes these predicted sound features and converts them into a waveform. This waveform can then be played back as audio. The final result should sound like the target person speaking the new phrase naturally.
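
If you want to see what the phoneme step looks like in practice, the small sketch below uses the g2p_en package, which is just one convenient choice and not part of any specific voice cloning system discussed here. It converts text into ARPABET-style phonemes.

from g2p_en import G2p  # pip install g2p_en

g2p = G2p()
phonemes = g2p("The cat sat on the mat.")
print(phonemes)
# Prints something like: ['DH', 'AH0', ' ', 'K', 'AE1', 'T', ' ', 'S', 'AE1', 'T', ...]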

What Are the Different Types of Voice Cloning Models 

Most voice cloning systems can be categorized based on the amount of data required to learn a new voice. Another key factor is whether they need an explicit training phase for that voice. The three main types are:

  • Zero-Shot
  • One-Shot
  • Few-Shot

What Is Zero-Shot Voice Cloning 

“Zero-shot” means the model can clone a new voice without any additional training on that specific voice. Essentially, it works right out of the box. We provide an audio sample of the target voice along with the text we want the cloned voice to say, and that’s it. These models typically require only a very short recording, usually just 3 to 5 seconds, to successfully clone a voice.

While most commercial voice cloning applications still don't rely on zero-shot models, much research is focused on improving their performance. In fact, many of the latest research models are zero-shot by design. Notable examples include Microsoft’s VALL-E, Meta’s Voicebox, and the Spark-TTS model released by Alibaba’s DAMO Academy in collaboration with others.

What Is One-Shot Voice Cloning 

This term is sometimes used interchangeably with zero-shot in literature. However, it generally means the system requires only one example of the new voice and a minimal adaptation before it can start cloning. In other words, one-shot cloning may perform a quick training or fine-tuning step using that single example. It does not simply generate speech out of the box. This process often involves techniques like transfer learning or meta-learning. These methods enable the model to rapidly adapt from just one instance.

Using this approach, it is possible to achieve a closer resemblance to the target voice than with zero-shot models. This happens because the model can adjust itself to better match that voice. However, this process demands more computational resources. It also begins to overlap with few-shot models. In general, one-shot models are less common today. They do not offer a significant advantage over zero-shot models while being slower and requiring more resources.

What Is Few-Shot Voice Cloning 

Few-shot cloning means the system is given a small dataset of the target voice. For example, it might receive 20 voice samples that are a few minutes long. The system then uses these samples to fine-tune a pre-trained multispeaker TTS model. This process adjusts the model’s weights so it can reproduce that specific voice. This can be done through full model fine-tuning or lighter methods such as embedding training or adapter layers.

Nowadays, most companies offering voice cloning services rely on few-shot models. These models produce the best end results, even though the cloning process takes longer and is more involved. They also require a large number of audio samples to work properly.

Examples of commercial services include Microsoft’s Custom Neural Voice and the voice cloning feature of ElevenLabs. There are also open-source models, mostly available through TTS frameworks. A good example is the Coqui TTS framework, which allows users to fine-tune a model on a small dataset of a new speaker.
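
For reference, here is roughly what cloning looks like with Coqui TTS and its XTTS v2 model through the framework's Python API. Note that XTTS v2 clones directly from reference clips without fine-tuning, while the few-shot fine-tuning workflow relies on the framework's training recipes instead; file names here are placeholders.

from TTS.api import TTS  # pip install TTS

# Load Coqui's multilingual XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Generate speech in the voice of the reference clip
tts.tts_to_file(
    text="This is a quick test of voice cloning with Coqui TTS.",
    speaker_wav="my_voice_sample.wav",   # placeholder reference recording
    language="en",
    file_path="coqui_clone.wav",
)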

How Does a Voice Cloning Demo Work 

In this demo, we will clone a voice using Spark-TTS, an open-source text-to-speech model introduced in early 2025 by Alibaba’s DAMO Academy in collaboration with others. At its core, Spark-TTS uses an LLM. The key innovation it introduces is a novel BiCodec representation for speech.

Unlike most models, Spark-TTS uses vector quantization (VQ) to encode speech into two types of tokens:

  • Global Tokens
  • Semantic Tokens

Global tokens are a small set of tokens that capture speaker-specific attributes such as timbre, accent, and intonation patterns. Semantic tokens, on the other hand, are a sequence of tokens that capture the linguistic content and coarse prosody.

During synthesis, the LLM at the core of the model takes in textual tokens, which come from the user’s text input, and global tokens, which come from a reference voice. It then directly predicts the sequence of semantic tokens needed for the speech. These semantic tokens, combined with the global tokens, are fed into a decoder. The decoder converts them into an audio waveform, essentially reconstructing speech via the codec.

Additionally, the Spark-TTS model can do more than just clone voices. It also supports controllable voice generation. If you don’t have a reference audio sample, you can specify attributes such as “old male voice, low-pitched, slow speaking rate.” Spark-TTS will then adjust the global tokens to create a new voice matching that description.
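
As a rough sketch of what controllable generation looks like with the Python API we use below, the repository's CLI exposes gender, pitch, and speed options, and the same parameters can be passed to the model's inference() method. Treat the exact parameter names and accepted values as assumptions that may vary between versions of the model.

# Assumes `tts` is a SparkTTS instance created as shown later in this demo
generated = tts.inference(
    "The future of AI voice is here.",
    gender="male",       # assumed values: "male" or "female"
    pitch="low",         # assumed values range from "very_low" to "very_high"
    speed="moderate",    # assumed values range from "very_low" to "very_high"
)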

Let's demonstrate how to clone a voice using Spark-TTS. The first step is to clone the GitHub repo of Spark-TTS by running:

 git clone https://github.com/SparkAudio/Spark-TTS 

After that, you need to create a new Python 3.12 environment using your terminal. For this demo, let's call it voice_cloning_env. Once the environment is created, activate it and navigate to the cloned repo directory on your PC. There, run the command pip install -r requirements.txt to install all the necessary libraries for running the model. From here, there are two approaches you can take.

You can use the CLI, which can sometimes be confusing for beginners. Alternatively, you can use Spark-TTS’ Python API, which is simpler. For this demo, I will take the Python API approach.

Create a new Python file in the Spark-TTS folder, and include the following code:

import torch
import pathlib
import os
import torchaudio
from cli.SparkTTS import SparkTTS
from huggingface_hub import snapshot_download

# Download the pretrained Spark-TTS model
snapshot_download(
	"SparkAudio/Spark-TTS-0.5B", 
	local_dir='model/Spark-TTS-0.5B'
)

# Setup Device (GPU if available, otherwise CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# Define model location

model_dir = 'model/Spark-TTS-0.5B'

# Create the model
tts = SparkTTS(model_dir=model_dir, device=device)

# Define reference audio

reference_audio = "reference_audio.wav"   

# Define reference text

reference_text = "Hi, my name is Boris Delovski and I'm a trainer and consultant at Edlitera." 


# Define the text you want the cloned voice to say
target_text = "Hello, let's see if the cloned voice actually sounds like me."

# Run synthesis
clone = tts.inference(
    target_text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text,
)

# Save the result
wav_tensor = torch.from_numpy(clone).unsqueeze(0)
out_path = pathlib.Path("sparktts_output.wav")
torchaudio.save(str(out_path), wav_tensor, tts.sample_rate)

Let's break down the code above. First, we need to import everything that we will use.

import torch
import pathlib
import os
import torchaudio
from cli.SparkTTS import SparkTTS
from huggingface_hub import snapshot_download

We import the following:

•    Helper libraries for manipulating files: os, pathlib.  
•    Libraries for working with AI models and manipulating tensors: torch, torchaudio.
•    A library that allows us to access the pre-trained Spark-TTS model: huggingface_hub.
•    The Spark model itself: cli.SparkTTS.

Next, we need to download the pre-trained model.

# Download the pretrained Spark-TTS model
snapshot_download(
	"SparkAudio/Spark-TTS-0.5B", 
	local_dir='model/Spark-TTS-0.5B'
)

This is a one-time operation. You can safely keep this call in your code after running it once: snapshot_download() will detect that the Spark-TTS weights are already downloaded and will not download the files again.

Next, we need to specify the device where we want to run the model. Then, we initialize an instance of the model class:

# Setup Device (GPU if available, otherwise CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# Define model location

model_dir = 'model/Spark-TTS-0.5B'

# Create the model
tts = SparkTTS(model_dir=model_dir, device=device)

At this point, we are almost ready to perform inference. We just need to provide the necessary input data before proceeding:

# Define reference audio

reference_audio = "reference_audio.wav"   


# Define reference text

reference_text = "Hi, my name is Boris Delovski and I'm a trainer and consultant at Edlitera." 

# Define the text you want the cloned voice to say
target_text = "Hello, let's see if the cloned voice actually sounds like me."

We need to provide the model with three inputs: the reference audio sample, the transcription of that audio sample, and the text we want the cloned voice to say. Including the transcription of the reference audio is highly important, as it helps the model accurately process the input voice.

Finally, it is time to perform inference:

# Run synthesis
clone = tts.inference(
    target_text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text,
)

As shown in the code above, we first input the text that we want the cloned voice to say. In addition, we provide the reference audio and its transcription. Based on these three inputs, the model generates a one-dimensional NumPy array of audio samples, for example, 240,000 samples for a ten-second clip at 24 kHz. Next, we convert this NumPy array into an audio file:

# Save the result
wav_tensor = torch.from_numpy(clone).unsqueeze(0)
out_path = pathlib.Path("sparktts_output.wav")
torchaudio.save(str(out_path), wav_tensor, tts.sample_rate)

Here, we first convert the NumPy array into a tensor in an efficient way:

•    torch.from_numpy(clone): shares memory with the original NumPy array and converts it into a tensor.
•    .unsqueeze(0): inserts a new dimension at index 0. This ensures that the generated tensor has the shape expected by torchaudio.save.

After converting the array into a tensor, the next step is to specify where to store the output audio file and choose its format. We can use pathlib.Path to define the file location and name. In this case, we will save the file in WAV format.

To wrap up, torchaudio.save() serializes the tensor into a WAV file.  It requires three inputs: the file path where the audio will be saved, the tensor we generated by converting the NumPy array, and the sample rate. To ensure the saved file plays back at the intended speed and pitch, always use the same sample rate as the Spark-TTS model. You can get this by accessing the model's sample_rate attribute.
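
As a quick optional check, you can load the saved file back and confirm its sample rate and duration:

# Optional sanity check on the generated file
waveform, sr = torchaudio.load("sparktts_output.wav")
duration_seconds = waveform.shape[1] / sr
print(f"Sample rate: {sr} Hz, duration: {duration_seconds:.2f} s")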

In this article, we explained the fundamentals of voice cloning, why it matters, and how it works. We also covered the different types of voice cloning. Some require just a small audio sample, like zero-shot models, while others need more recordings to achieve higher quality, such as few-shot models. Most companies prefer few-shot approaches for the best results. Finally, we demonstrated a practical example using Spark-TTS, showing how to clone a voice with just a few lines of code.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.