Building a Video Editing App in Python: How Do Preprocessing and Transcription Work

Preprocessing and transcription for video editing in Python.
By Boris Delovski • Updated on Apr 22, 2025

In the previous article of this series, we discussed combining multiple Python libraries to build a simple yet robust video editing application. In this installment, we focus on the first two components: preprocessing and transcription. These steps are essential, as the transcription quality significantly impacts the overall effectiveness of the video editing application. 

Our first task is to extract the audio from the input video and save it as a WAV file. This step is required by our transcription model. Once the audio is ready, we process it with a specialized version of the Whisper model called Whisper Timestamped. This model generates the transcription that users will edit within the application. By deleting specific sections of the transcription, users remove the corresponding video segments. This approach makes the editing process more straightforward and efficient.

How Does Preprocessing Work 

Achieving a high-quality transcription depends on providing clear audio to the Whisper Timestamped model. The complexity of the preprocessing pipeline varies based on the recording conditions of the audio being processed. In controlled environments with minimal background noise, noise filtering may not be necessary before extracting the audio. However, in less optimal recording conditions, additional preprocessing steps are often required. These steps help filter out distracting background noise and improve audio quality, ensuring accurate transcription. 

To simplify the process and focus primarily on the video editing aspect of the application, I will assume that the videos have clean audio with minimal noise. This assumption allows us to extract the audio as a WAV file without requiring complex noise filtering. Once the WAV file is ready, we can input it into the transcription model to generate the transcription of the video.

We will use the Python library MoviePy to extract audio from our video. This versatile and widely used tool simplifies video editing tasks, making them both efficient and straightforward. While we will revisit MoviePy later to overlay dubbed audio onto the original video for the final output, our current focus is on audio extraction, which requires just a few lines of code.
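To give a sense of how little code is involved, here is a minimal sketch of the extraction on its own, assuming MoviePy 1.x (the moviepy.editor import used throughout this article) and the example original_video.mp4 file used later in this article:

from moviepy.editor import VideoFileClip

# Open the video and write its audio track out as a WAV file
with VideoFileClip("original_video.mp4") as video:
    video.audio.write_audiofile("original_video_audio.wav")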

However, we will enhance the process by structuring it into a pipeline. To achieve this, a Preprocessor class will be created. This class will serve as our preprocessing pipeline, providing an organized and scalable method for extracting audio from videos. The code for building the class will look like this:

from moviepy.editor import VideoFileClip
import os


class Preprocessor:
    def __init__(self, video_path):
        """
        Initialize the Preprocessor with the path to the original video.

        :param video_path: Path to the original video file (e.g., MP4 file)
        """
        self.video_path = video_path
        self.video = None
        self.audio = None

    def extract_audio(self):
        """
        Convert the MP4 video audio to a WAV format and save it.
        """
        # Set a name for the output wav
        base_name = os.path.splitext(self.video_path)[0]
        wav_output_path = f"{base_name}_audio.wav"

        # The video clip is closed automatically when the with-block ends
        with VideoFileClip(self.video_path) as video:
            # The extracted audio clip is closed automatically as well
            with video.audio as audio_clip:
                audio_clip.write_audiofile(wav_output_path)

Encapsulating functionality within a Preprocessor class makes the code modular, reusable, and easier to extend with additional preprocessing methods. The code leverages Python's context managers (with statements) to manage resources like video and audio files efficiently. Context managers ensure proper setup and cleanup of resources, preventing issues such as memory leaks or locked files. Additionally, this approach dynamically generates output file names by appending _audio.wav to the input file name. This reduces the risk of overwriting files or hardcoding names.

To extract audio using this pipeline, simply create a Preprocessor object and call the extract_audio() method:

# Define the video location
path_to_original_video = " original_video.mp4"

# Extract the audio from the original videpreprocessing_pipeline = Preprocessor(path_to_original_video)
preprocessing_pipeline.extract_audio()


How Does Transcription Work

Generating a high-quality transcription is crucial, as the app's core functionality depends on editing the transcription and reflecting those changes in the video by removing the corresponding segments. This requires both an accurate transcription and precise timestamps for each word. The standard Whisper model does not offer this level of precision. Therefore, as mentioned earlier, a modified version called Whisper Timestamped will be used to meet these requirements.

The standard Whisper model was designed to predict approximate timestamps for speech segments, typically accurate to about one second. However, it was not trained to provide word-level timestamps. For the purpose of precise edits, such as removing specific words from a video, word-level timestamps are essential. Whisper Timestamped achieves this by incorporating an algorithm called Dynamic Time Warping (DTW). While we will not dive deeply into DTW, here is a brief overview of how it helps Whisper generate more accurate timestamps.

During transcription, the model's decoder generates cross-attention weights, which show how the model focuses on different audio parts when predicting each word. DTW uses these cross-attention weights to align the sequence of words with corresponding audio frames. This alignment identifies the best match between the audio's temporal structure and the transcribed words. As a result, it allows mapping each word to its exact start and end times. By analyzing these aligned weights, Whisper Timestamped accurately extracts word-level timestamps. This process is achieved without adding extra steps to the model’s decoding. 
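To build a rough intuition for the alignment step, here is a toy sketch of DTW on two short numeric sequences. This is only an illustration of the general idea, not the cross-attention-based alignment that Whisper Timestamped performs internally:

import numpy as np

def dtw_cost(a, b):
    """Toy DTW: cheapest cumulative cost of warping sequence a onto sequence b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three possible previous alignments
            cost[i, j] = distance + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The second sequence is a stretched version of the first, so they align perfectly
print(dtw_cost([1, 2, 3], [1, 1, 2, 3, 3]))  # 0.0

In Whisper Timestamped, the two sequences being aligned are the transcribed words and the audio frames, and the cross-attention weights described above take the place of the simple absolute difference used here.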

At this point, I do have to mention one thing. The Whisper Timestamped model is provided under the GNU Affero General Public License, also known as the AGPL-3.0 license. If this is a concern, you might prefer to avoid this model and instead use a fine-tuned variant of the standard Whisper model called Whisper large-v3-turbo. This model can also generate word-level timestamps and is released under the MIT license, which is far more permissive with minimal restrictions, while the AGPL-3.0 is a strong copyleft license.

For instance, a company could integrate MIT-licensed software into their product, modify it, and sell it without needing to disclose their modifications. In contrast, while the AGPL-3.0 license allows commercial use, it comes with significant conditions. For example, if a company uses AGPL-licensed software in a web application, they must make the source code of their entire application publicly available if it interacts with the AGPL-licensed code.

For demonstration purposes, I will show how to use both the Whisper Timestamped model and the Whisper large-v3-turbo model. Ultimately, your choice of model depends purely on your goals. The Whisper Timestamped model is slightly more accurate, whereas the Whisper large-v3-turbo model is faster and covered under a much more permissive license.

How to Use the Whisper large-v3-turbo Model

Let's first build the Whisper large-v3-turbo model:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Create class for the Whisper Large V3 Turbo transcription model
class WhisperLargeTurbo:
    """
    A class for transcribing audio using the Whisper Large V3 Turbo model from OpenAI.
    This class handles model loading, processing, and transcription of audio files
    using a pre-trained speech-to-text model. The language for transcription can be specified.

    Attributes:
        device (str): The device to run the model on ('cuda:0' if a GPU is available, otherwise 'cpu').
        torch_dtype (torch.dtype): The data type for the model's tensors (float16 if GPU is available, otherwise float32).
        model (AutoModelForSpeechSeq2Seq): The pre-trained Whisper model for speech-to-text tasks.
        processor (AutoProcessor): The processor that handles tokenization and feature extraction for the model.
        pipe (pipeline): A Hugging Face pipeline object for automatic speech recognition with the model.
    """

    def __init__(self, model_name="openai/whisper-large-v3-turbo", language="en"):
        """
        Initializes the WhisperLargeTurbo class by loading the pre-trained Whisper model
        and setting up the necessary components for transcription. The user can specify the
        language for transcription.

        Args:
            model_name (str): The name of the pre-trained model to use. Defaults to "openai/whisper-large-v3-turbo".
            language (str): The language code for transcription (e.g., 'en' for English, 'es' for Spanish). Defaults to "en".
        """
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_name, torch_dtype=self.torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
        )
        self.model.to(self.device)
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=self.model,
            generate_kwargs={"language": language, "task": "transcribe"},
            tokenizer=self.processor.tokenizer,
            feature_extractor=self.processor.feature_extractor,
            torch_dtype=self.torch_dtype,
            device=self.device,
            return_timestamps=True
        )

    def transcribe(self, audio_path):
        """
        Transcribes the audio from the given file path using the Whisper Large V3 Turbo model.
        The result includes the transcribed text as well as timestamps for when certain words were spoken.

        Args:
            audio_path (str): The file path to the audio file that needs to be transcribed.

        Returns:
            dict: A dictionary containing the transcription result, including:
                - 'text' (str): The full transcribed text.
                - 'chunks' (list): A list of word-level entries, each containing the word and its (start, end) timestamp.
        """
        result = self.pipe(audio_path, return_timestamps="word")
        return result

We will make use of the Transformers library to create our model. More precisely, we will build a class that encapsulates the whisper-large-v3-turbo model, which is hosted on Hugging Face and can be loaded through the Transformers library.

Upon initialization, the class verifies GPU availability to enhance performance and configures the appropriate data types. It then loads the pre-trained Whisper model and processor based on the specified model name and language. Additionally, the class sets up a speech recognition pipeline designed to transcribe audio. This pipeline returns a dictionary containing the transcribed text along with timestamps indicating when each word is spoken.

The model is loaded using AutoModelForSpeechSeq2Seq from the Transformers library, while the processor is loaded using AutoProcessor, also from Transformers. The AutoModelForSpeechSeq2Seq class simplifies the process: given only the model name on Hugging Face, its from_pretrained method identifies and loads the appropriate model class. Similarly, AutoProcessor automatically selects and loads the correct processor for the specified model, ensuring that the input data is preprocessed to meet the model's requirements.

Once both are defined, we can combine them into a pipeline by creating an instance of the pipeline class. The pipeline class in Hugging Face's Transformers library is a streamlined, high-level interface that simplifies working with pre-trained models for various tasks. It integrates the model, tokenizer, and processing logic into a single framework. This enables easy execution of tasks such as text generation, translation, question answering, and, in this case, automatic speech recognition (ASR). Our pipeline is defined as follows:

self.pipe = pipeline(
    "automatic-speech-recognition",
    model=self.model,
    generate_kwargs={"language": language, "task": "transcribe"},
    tokenizer=self.processor.tokenizer,
    feature_extractor=self.processor.feature_extractor,
    torch_dtype=self.torch_dtype,
    device=self.device,
    return_timestamps=True
)

Let's explain each argument separately:

  • task: Specifies the task. In this case, it instructs the pipeline to transcribe speech into text using "automatic-speech-recognition".
  • model: Specifies the pre-trained Whisper model used for performing the transcription.
  • tokenizer: The tokenizer that converts text into tokens and vice versa.
  • feature_extractor: The feature extractor that processes raw audio input into features that the model can understand.
  • generate_kwargs: Additional parameters for generating results:
    •  "language": Specifies the transcription language.
    • "task": "transcribe" tells the model to perform transcription.
  • torch_dtype: Ensures the pipeline uses the correct data type for computations (float16 on GPU, float32 otherwise).
  • device: Specifies whether the pipeline runs on the CPU or GPU.
  • return_timestamps: Enables timestamps in the output, indicating when specific words or segments were spoken in the audio.

The transcribe method is responsible for performing the transcription. It accepts the path to an audio file and processes it using the pipeline. The method returns a dictionary containing the transcribed text along with corresponding timestamps. This enables precise and efficient speech-to-text conversion. Here, the key point is to ensure that word-level timestamps are requested. To do this, set the return_timestamps argument value to "word".
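Putting the class to use is straightforward. The snippet below is a usage sketch that assumes the WAV file produced earlier by the Preprocessor, here named original_video_audio.wav:

# Create the model wrapper and transcribe the extracted audio
whisper_turbo = WhisperLargeTurbo(language="en")
transcription = whisper_turbo.transcribe("original_video_audio.wav")

print(transcription["text"])        # the full transcription
print(transcription["chunks"][:3])  # the first few words with their (start, end) timestamps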

How to Use the Whisper Timestamped Model

Our other class, which will allow us to transcribe text using the Whisper Timestamped model, will look like this:

import torch
import whisper_timestamped as whisper

class WhisperTimestamped:
    """
    A class for transcribing audio using the Whisper Timestamped library.

    This class provides functionality to transcribe audio files into text
    while also including detailed timestamp information for each word. The
    transcription results are reformatted to match the WhisperLargeTurbo model's output format.

    Attributes:
        device (str): The device used for computation, either 'cuda' if a GPU is available, or 'cpu'.
        language (str): The language of the audio to be transcribed, default is English ('en').
        model: The Whisper model loaded based on the specified model name.
    """

    def __init__(self, model_name="openai/whisper-large-v3", language="en"):
        """
        Initializes the WhisperTimestamped instance with a specified model and language.

        Args:
            model_name (str, optional): The name of the Whisper model to load. Defaults to 'openai/whisper-large-v3'.
            language (str, optional): The language of the audio to be transcribed. Defaults to English ('en').
        """
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.language = language
        self.model = whisper.load_model(model_name, device=self.device)

    def transcribe(self, audio_path):
        """
        Transcribes the given audio file into text with word-level timestamps.

        Args:
            audio_path (str): The file path of the audio to be transcribed.

        Returns:
            dict: A dictionary containing the transcription text and word-level timestamp information.
        """
        # Load and transcribe audio
        audio = whisper.load_audio(audio_path)
        result = whisper.transcribe(self.model, audio, language=self.language)
        # Reformat the output to match WhisperLargeTurbo
        final_result = self.reformat_result(result)
        return final_result

    def reformat_result(self, timestamped_result):
        """
        Reformats the output of the Whisper Timestamped model to match the WhisperLargeTurbo format.

        Args:
            timestamped_result (dict): The result dictionary from the Whisper Timestamped model.
                Expected structure includes 'text' for the overall transcription and 'segments',
                which contains details of words and their timestamps.

        Returns:
            dict: Reformatted result containing:
                - 'text': The full transcription as a string.
                - 'chunks': A list of dictionaries, where each dictionary represents a word with:
                    - 'text': The word as a string.
                    - 'timestamp': A tuple (start_time, end_time) indicating the word's timing in seconds.
        """
        reformatted_result = {
            "text": timestamped_result["text"],
            "chunks": []
        }

        for segment in timestamped_result["segments"]:
            for word in segment["words"]:
                reformatted_result["chunks"].append({
                    "text": word["text"],
                    "timestamp": (word["start"], word["end"])
                })

        return reformatted_result

This class allows us to create an instance of the Whisper Timestamped model. The model can then be used to transcribe audio files. It supports execution on both GPU and CPU. Additionally, it allows transcription in multiple languages.

You may notice that we specify the whisper-large-v3 model as the base. This is similar to what we did in the previous class. The reason for this is that the Whisper Timestamped model is essentially a Whisper model enhanced with an additional algorithm. This algorithm improves the precision of its timestamps. Therefore, we still need to select a "base" Whisper model to build upon. We then integrate the extra functionality for more accurate timestamps.

However, the process of loading the model is now even simpler, thanks to the whisper-timestamped library. We only need to define the model name. The library then manages the rest automatically. This includes handling tokenizers, processors, and other components. This greatly streamlines the setup process.

The transcribe method is quite simple: it only needs the path to the audio file that should be transcribed. The whisper-timestamped library returns word-level timestamps by default, which gives us the precision we need for removing individual words from the video.
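Using the class mirrors the previous one. Here is a minimal sketch, again assuming the hypothetical original_video_audio.wav file:

# Create the model wrapper and transcribe the extracted audio
whisper_ts = WhisperTimestamped(language="en")
transcription = whisper_ts.transcribe("original_video_audio.wav")

print(transcription["text"])       # the full transcription
print(transcription["chunks"][0])  # the first word with its (start, end) timestamp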

The reformat_result method is not strictly necessary for transcription. However, it is crucial to ensure consistency in the output format. Whisper models from the Transformers library and the Whisper Timestamped model return results in different formats. To allow interchangeable use of the two models, we need to establish a consistent interface for their outputs. This will be formalized into a third class to unify both models. For now, the reformat_result method ensures that the transcription from the Whisper Timestamped model follows the same structure as the output from the Whisper large-v3-turbo model.

The reformat_result method is responsible for converting the raw transcription output of the Whisper Timestamped model into a standardized and user-friendly format. By default, the transcription is divided into segments. Each segment contains metadata and word-level timestamps, which are nested under the "words" key. This method processes these segments and consolidates all words into a single list called chunks.

Each chunk contains the word's text and a tuple with its start and end timestamps. Moreover, the complete transcription text is extracted and stored in the text key. This transformation ensures consistency and makes the output easier to use, especially when integrating with the Whisper large-v3-turbo model's format.
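For comparison, the raw Whisper Timestamped output that reformat_result starts from looks roughly like this (the values are illustrative, and both segments and words carry additional fields, such as confidence scores, that we do not use here):

{'text': 'A neural network is a computational model...',
 'segments': [{'start': 0.34,
               'end': 8.2,
               'text': 'A neural network is a computational model...',
               'words': [{'text': 'A', 'start': 0.34, 'end': 0.5, 'confidence': 0.99},
                         {'text': 'neural', 'start': 0.5, 'end': 0.72, 'confidence': 0.98},
                         ...]}]}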

An example of a transcription from the Whisper large-v3-turbo model, which serves as the target format for the output reformatted from the Whisper Timestamped model, looks like this:

{'text': ' A neural network is a computational model inspired by the structure and function of biological neural networks, such as the human brain. It consists of layers of interconnected units called neurons, which process information and learn patterns from data.',
 'chunks': [{'text': ' A', 'timestamp': (0.34, 0.5)},
  {'text': ' neural', 'timestamp': (0.5, 0.72)},
  {'text': ' network', 'timestamp': (0.72, 1.12)},
  {'text': ' is', 'timestamp': (1.12, 1.42)},
  {'text': ' a', 'timestamp': (1.42, 1.52)},
  {'text': ' computational', 'timestamp': (1.52, 2.02)},
  {'text': ' model', 'timestamp': (2.02, 2.48)},
  {'text': ' inspired', 'timestamp': (2.48, 3.2)},
  {'text': ' by', 'timestamp': (3.2, 3.52)},
  {'text': ' the', 'timestamp': (3.52, 3.66)},
  {'text': ' structure', 'timestamp': (3.66, 4.18)},
  {'text': ' and', 'timestamp': (4.18, 4.42)},
  {'text': ' function', 'timestamp': (4.42, 4.8)},
  {'text': ' of', 'timestamp': (4.8, 5.06)},
  {'text': ' biological', 'timestamp': (5.06, 5.58)},
  {'text': ' neural', 'timestamp': (5.58, 6.1)},
  {'text': ' networks,', 'timestamp': (6.1, 6.74)},
  {'text': ' such', 'timestamp': (6.74, 7.04)},
  {'text': ' as', 'timestamp': (7.04, 7.16)},
  {'text': ' the', 'timestamp': (7.16, 7.26)},
  {'text': ' human', 'timestamp': (7.26, 7.56)},
  {'text': ' brain.', 'timestamp': (7.56, 8.2)},
  {'text': ' It', 'timestamp': (8.2, 8.54)},
  {'text': ' consists', 'timestamp': (8.54, 8.9)},
  {'text': ' of', 'timestamp': (8.9, 9.2)},
  {'text': ' layers', 'timestamp': (9.2, 9.7)},
  {'text': ' of', 'timestamp': (9.7, 9.96)},
  {'text': ' interconnected', 'timestamp': (9.96, 10.6)},
  {'text': ' units', 'timestamp': (10.6, 11.26)},
  {'text': ' called', 'timestamp': (11.26, 11.92)},
  {'text': ' neurons,', 'timestamp': (11.92, 12.88)},
  {'text': ' which', 'timestamp': (12.88, 13.34)},
  {'text': ' process', 'timestamp': (13.34, 13.74)},
  {'text': ' information', 'timestamp': (13.74, 14.38)},
  {'text': ' and', 'timestamp': (14.38, 14.74)},
  {'text': ' learn', 'timestamp': (14.74, 14.98)},
  {'text': ' patterns', 'timestamp': (14.98, 15.46)},
  {'text': ' from', 'timestamp': (15.46, 15.96)},
  {'text': ' data.', 'timestamp': (15.96, 17.1)}]}

How to Build a Unified Transcription Interface

To simplify and standardize audio transcription, we can create a unified interface. This interface will provide flexibility in working with different models. Instead of using the two classes we built earlier directly, we will define a Transcriber class to serve as this interface. This approach ensures that transcription outputs follow a consistent format. This consistency makes it easier to swap or add models in the future. If a model's output format differs, we can reformat it before returning the results. Below is the structure of this class:

# Create general class for transcription models
class Transcriber:
    """
    A general class for working with different transcription models. This class provides a unified
    interface for transcribing audio regardless of the specific transcription model being used.

    Attributes:
        recognizer: An instance of a transcription model that implements a 'transcribe' method.
    """

    def __init__(self, recognizer):
        """
        Initializes the Transcriber class with a transcription model.

        Args:
            recognizer: A transcription model instance with a 'transcribe' method.
        """
        self.recognizer = recognizer

    def transcribe(self, audio_path, **kwargs):
        """
        Transcribes audio using the provided transcription model.

        Args:
            audio_path (str): Path to the audio file.
            **kwargs: Additional arguments to pass to the recognizer's transcribe method.

        Returns:
            dict: A dictionary containing the transcription result, including:
                - 'text' (str): The full transcribed text.
                - 'chunks' (list): A list of word-level entries, each containing the word and its (start, end) timestamp.
        """
        return self.recognizer.transcribe(audio_path, **kwargs)

The Transcriber class accepts a recognizer object, such as the WhisperLargeTurbo or WhisperTimestamped class, as long as it implements a consistent transcribe method. By delegating the transcription task to the recognizer’s transcribe method, the Transcriber class simplifies integration with various models. This approach allows the core transcription pipeline to remain unchanged. Furthermore, it leverages the shared structure of many models in the Transformers library. As a result, it is easy to replace or extend functionality while maintaining compatibility across the video editing workflow.
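To tie everything together, here is a usage sketch that assumes the audio file produced earlier and shows how the two recognizers can be swapped without touching the rest of the code:

# Either recognizer can be plugged into the same interface
recognizer = WhisperLargeTurbo(language="en")  # or: WhisperTimestamped(language="en")
transcriber = Transcriber(recognizer)

result = transcriber.transcribe("original_video_audio.wav")
print(result["text"])
print(result["chunks"][0])  # e.g. {'text': ' A', 'timestamp': (0.34, 0.5)}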

In this article, we explored the crucial preprocessing and transcription stages for building a Python-based video editing application. First, we used the Preprocessor class to extract audio, followed by generating precise transcriptions with either the WhisperTimestamped or WhisperLargeTurbo models. This established the foundation for seamless audio-text alignment. Both models have their strengths: the Whisper Timestamped model offers enhanced accuracy, while the Whisper Large Turbo model provides speed and licensing flexibility.

To ensure consistency and adaptability, we introduced the Transcriber class as a unified interface for transcription. This design allows easy integration of future models while maintaining a consistent transcription format and ensuring compatibility across the editing pipeline. With these components in place, we have established a robust framework to manage the initial stages of video editing. This sets the stage for future articles, where we will integrate these transcriptions into an end-to-end editing workflow.

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.