Building a Video Editing App in Python: How to Validate and Synchronize Text Edits with Timestamps

Validate and sync text edits with timestamps in video editing.
By Boris Delovski • Updated on Apr 22, 2025

In the previous article of this series, we developed two pipelines for our video editing app: a preprocessing pipeline and a transcription pipeline. The preprocessing pipeline extracts audio from the original video, while the transcription pipeline converts that audio into text. These steps are essential as they enable the app to display a video's transcription immediately after upload. Furthermore, you can directly modify the video by editing the transcription, just as you would edit the text in a Word file.

In this article, we will continue developing the pipelines required for seamless video editing; once they are in place, we will move on to designing the user interface. Specifically, the focus here is on a set of helper functions that are essential for enabling video editing, as they allow simple modifications to the text generated by the transcription pipeline.

How to Validate Text Edits

After a user modifies the transcription produced by the transcription pipeline, the first task is to validate their changes. It is essential to ensure that users can only remove portions of the text returned by the transcription pipeline; they must not be allowed to add new content. Simply put, users can delete existing words from the transcription but cannot append anything to it. To enforce this restriction, we will create a function called is_valid_edit.

The is_valid_edit function will verify whether the edited string is a valid subsequence of the original string (case-insensitive). It will ensure that only deletions are allowed, with no rearrangements of the text. The function will look like this:

from nltk.tokenize import RegexpTokenizer

def is_valid_edit(original, edited):
    """
    Check if the edited string is a valid subsequence of the original string
    (case-insensitive), allowing only deletions and no rearrangements.

    Args:
        original (str): The original transcription.
        edited (str): The edited transcription.

    Returns:
        bool: True if the edited string is a valid modification, False otherwise.
    """
    tokenizer = RegexpTokenizer(r"\w+")
    original_tokens = tokenizer.tokenize(original.lower())
    edited_tokens = tokenizer.tokenize(edited.lower())

    orig_idx, edit_idx = 0, 0
    while edit_idx < len(edited_tokens):
        if orig_idx >= len(original_tokens):
            return False
        if edited_tokens[edit_idx] == original_tokens[orig_idx]:
            edit_idx += 1
        orig_idx += 1
    return True

This function verifies whether the edited string is a valid subsequence of the original string by ensuring the following:

  • the edited string must be case-insensitively derived from the original by removing words.
  • no words can be rearranged or added to the edited string.

The process is simple. We begin by using the RegexpTokenizer from the NLTK library to tokenize the input strings. This step breaks the strings into words (tokens) based on a regular expression. The tokenizer removes punctuation and ensures that the tokens are alphanumeric. To maintain case-insensitivity, we convert both the original and edited text to lowercase during tokenization. Once tokenized, the words from the original string are stored in original_tokens, while those from the edited string are stored in edited_tokens.
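To see what this step produces, here is a quick example of the tokenizer applied to a short sample sentence (the expected output is shown in the comment):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize("Hello, world! This is a test.".lower()))
# ['hello', 'world', 'this', 'is', 'a', 'test']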

Next, we initialize two pointers: orig_idx and edit_idx. These pointers are used to track the current position in original_tokens and edited_tokens, respectively. We then loop through the tokens, comparing them step by step.

There are three scenarios we may encounter during the loop:

  1. If the current word in edited_tokens matches the current word in original_tokens, we advance both pointers and move on to the next comparison.
  2. If the current word in edited_tokens does not match the word in original_tokens, we move to the next word in original_tokens to continue checking.
  3. If we reach the end of original_tokens before matching all the words in edited_tokens, the edit is considered invalid.

In essence, the function ensures that all words in edited_tokens appear in the same order as they do in original_tokens. It permits skipping words from original_tokens but prohibits reordering or adding new words to the edited string.
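For instance, given a short sample sentence, the function accepts pure deletions and rejects reordered or newly added words (expected results shown in the comments):

original = "A neural network is a computational model"

print(is_valid_edit(original, "A network is a model"))     # True: only deletions
print(is_valid_edit(original, "network A is a model"))     # False: words reordered
print(is_valid_edit(original, "A simple neural network"))  # False: "simple" was added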

How to Convert Text Edits to Timestamp Updates

The transcription models we use provide a dictionary containing two key elements: one with the transcription text and another with the words and their corresponding timestamps. A typical example looks like this:

{'text': ' A neural network is a computational model inspired by the structure and function of biological neural networks, such as the human brain. It consists of layers of interconnected units called neurons, which process information and learn patterns from data.',
 'chunks': [{'text': ' A', 'timestamp': (0.34, 0.5)},
  {'text': ' neural', 'timestamp': (0.5, 0.72)},
  {'text': ' network', 'timestamp': (0.72, 1.12)},
  {'text': ' is', 'timestamp': (1.12, 1.42)},
  {'text': ' a', 'timestamp': (1.42, 1.52)},
  {'text': ' computational', 'timestamp': (1.52, 2.02)},
  {'text': ' model', 'timestamp': (2.02, 2.48)},
  {'text': ' inspired', 'timestamp': (2.48, 3.2)},
  {'text': ' by', 'timestamp': (3.2, 3.52)},
  {'text': ' the', 'timestamp': (3.52, 3.66)},
  {'text': ' structure', 'timestamp': (3.66, 4.18)},
  {'text': ' and', 'timestamp': (4.18, 4.42)},
  {'text': ' function', 'timestamp': (4.42, 4.8)},
  {'text': ' of', 'timestamp': (4.8, 5.06)},
  {'text': ' biological', 'timestamp': (5.06, 5.58)},
  {'text': ' neural', 'timestamp': (5.58, 6.1)},
  {'text': ' networks,', 'timestamp': (6.1, 6.74)},
  {'text': ' such', 'timestamp': (6.74, 7.04)},
  {'text': ' as', 'timestamp': (7.04, 7.16)},
  {'text': ' the', 'timestamp': (7.16, 7.26)},
  {'text': ' human', 'timestamp': (7.26, 7.56)},
  {'text': ' brain.', 'timestamp': (7.56, 8.2)},
  {'text': ' It', 'timestamp': (8.2, 8.54)},
  {'text': ' consists', 'timestamp': (8.54, 8.9)},
  {'text': ' of', 'timestamp': (8.9, 9.2)},
  {'text': ' layers', 'timestamp': (9.2, 9.7)},
  {'text': ' of', 'timestamp': (9.7, 9.96)},
  {'text': ' interconnected', 'timestamp': (9.96, 10.6)},
  {'text': ' units', 'timestamp': (10.6, 11.26)},
  {'text': ' called', 'timestamp': (11.26, 11.92)},
  {'text': ' neurons,', 'timestamp': (11.92, 12.88)},
  {'text': ' which', 'timestamp': (12.88, 13.34)},
  {'text': ' process', 'timestamp': (13.34, 13.74)},
  {'text': ' information', 'timestamp': (13.74, 14.38)},
  {'text': ' and', 'timestamp': (14.38, 14.74)},
  {'text': ' learn', 'timestamp': (14.74, 14.98)},
  {'text': ' patterns', 'timestamp': (14.98, 15.46)},
  {'text': ' from', 'timestamp': (15.46, 15.96)},
  {'text': ' data.', 'timestamp': (15.96, 17.1)}]}

The transcription text, which is the first element, is used to verify the validity of a given edit. If the edit is valid, the dictionary must be updated to remove any words that are missing in the edited version. This results in an updated dictionary that can later be used to remove the corresponding parts of the video. 

As an example, let's say that our initial transcription looks like this:

transcription = {
    "text": "This is an example transcription.",
    "chunks": [
        {"text": "This", "timestamp": (0.0, 0.5)},
        {"text": "is", "timestamp": (0.5, 1.0)},
        {"text": "an", "timestamp": (1.0, 1.5)},
        {"text": "example", "timestamp": (1.5, 2.0)},
        {"text": "transcription", "timestamp": (2.0, 2.5)},
    ]
}

If the edited transcription text is "This is example," it represents a valid edit with two words removed: "an" and "transcription." Our goal is to create a class that, given a valid edit, removes these words from the original dictionary and returns an updated version of the chunks, which would look like this:

"chunks": [
        {"word": "This", "timestamp": (0.0, 0.5)},
        {"word": "is", "timestamp": (0.5, 1.0)},
        {"word": "example", "timestamp": (1.5, 2.0)},
    ]

To achieve this, we will create a class called TextToTimestampConverter. This class will convert text transcriptions into timestamps, identify missing words, and update the chunks accordingly. The class will look like this:

from decimal import Decimal, ROUND_UP
import string


class TextToTimestampConverter:
    """
    Convert text transcriptions into timestamps, identify missing words, and update
    or remove associated chunks accordingly.
    """
    def __init__(self, transcription: dict):
        self.full_text = transcription["text"].strip()
        self.chunks = transcription["chunks"]

    @staticmethod
    def clean_text(text):
        """
        Lowercase, remove punctuation, and strip whitespace.

        Args:
            text (str): Input text.

        Returns:
            str: Cleaned text.
        """
        return text.translate(str.maketrans("", "", string.punctuation)).strip().lower()

    @staticmethod
    def round_to_next_0_1(value):
        """
        Round a floating-point value up to the nearest 0.1.

        Args:
            value (float): Value to round.

        Returns:
            float: Rounded value.
        """
        return float((Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_UP)))

    @staticmethod
    def find_missing_word_indices(original_words, user_words):
        """
        Identify indices of words in the original text that are missing in the user input.

        Args:
            original_words (list): Tokenized words from the original text.
            user_words (list): Tokenized words from the edited text.

        Returns:
            list: Indices of missing words.
        """
        missing_indices = []
        user_idx = 0
        for i, orig_word in enumerate(original_words):
            if user_idx >= len(user_words) or orig_word != user_words[user_idx]:
                missing_indices.append(i)
            else:
                user_idx += 1
        return missing_indices

    def get_missing_words_timestamps(self, user_input):
        """
        Retrieve timestamps of words missing from the user input.

        Args:
            user_input (str): Edited transcription.

        Returns:
            list: List of (start_time, end_time) for missing words (rounded to 0.1).
        """
        orig_words = self.clean_text(self.full_text).split()
        user_words = self.clean_text(user_input).split()
        missing_indices = self.find_missing_word_indices(orig_words, user_words)

        missing_timestamps = []
        for i, chunk in enumerate(self.chunks):
            if i in missing_indices:
                start, end = chunk["timestamp"]
                rounded_start = self.round_to_next_0_1(start)
                rounded_end = self.round_to_next_0_1(end) + 0.05
                missing_timestamps.append((rounded_start, rounded_end))
        return missing_timestamps

    def remove_chunks(self, chunks_to_remove_indices):
        """
        Remove specific chunks by their indices.

        Args:
            chunks_to_remove_indices (list): Indices of chunks to remove.

        Returns:
            list: Updated chunks.
        """
        return [
            c for i, c in enumerate(self.chunks)
            if i not in chunks_to_remove_indices
        ]

    def update_chunks(self, user_input):
        """
        Update and remove chunks for missing words based on user edits.

        Args:
            user_input (str): Edited transcription.

        Returns:
            list: Updated chunks without the missing words.
        """
        orig_words = self.clean_text(self.full_text).split()
        user_words = self.clean_text(user_input).split()
        missing_indices = self.find_missing_word_indices(orig_words, user_words)
        return self.remove_chunks(missing_indices)

This class will do the following:

  • Clean and compare the original transcription with the user-edited version.
  • Retrieve timestamps corresponding to the missing words.
  • Remove the missing word “chunks” from the original transcription data if necessary.
  • Update the chunks to reflect only the words remaining in the user-edited text.

Let's break down how each part of the class works in detail.

How Is the Class Defined and Initialized

class TextToTimestampConverter:
    """
    Convert text transcriptions into timestamps, identify missing words, and update
    or remove associated chunks accordingly.
    """
    def __init__(self, transcription: dict):
        self.full_text = transcription["text"].strip()
        self.chunks = transcription["chunks"]

This part of the code initializes the converter with a transcription dictionary. The dictionary contains the full transcription text and a list of timestamp "chunks", where each chunk is associated with a word in the text. When storing the original transcription text, we trim any leading or trailing whitespace using the strip() method.


What Is the clean_text Static Method

@staticmethod
def clean_text(text):
    """
    Lowercase, remove punctuation, and strip whitespace.

    Args:
        text (str): Input text.

    Returns:
        str: Cleaned text.
    """
    return text.translate(str.maketrans("", "", string.punctuation)).strip().lower()

A static method in Python is a method that belongs to the class itself, rather than to instances of the class. It can be called without creating an instance and has no access to instance attributes. In this case, the static method normalizes our text by performing the following operations:

  • Remove punctuation using translate() with a translation table built by str.maketrans()
  • Trim leading and trailing whitespace with strip()
  • Convert the text to lowercase using lower() to ensure comparisons ignore case differences

This cleaning step is crucial for accurately matching words between the original text and its edited version.
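Because clean_text is a static method, it can be called directly on the class. Here is a quick demonstration on a sample string (the expected output is shown in the comment):

print(TextToTimestampConverter.clean_text("  Neural networks, such as the brain!  "))
# neural networks such as the brain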

What Is the round_to_next_0_1 Static Method

@staticmethod
def round_to_next_0_1(value):
    """
    Round a floating-point value up to the nearest 0.1.

    Args:
        value (float): Value to round.

    Returns:
        float: Rounded value.
    """
    return float((Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_UP)))

Transcription models often generate ultra-precise timestamps. However, this can sometimes lead to awkward results. Although these timestamps accurately capture when a word is spoken, they overlook aspects like breaths before the word or lingering sounds afterward. Relying too rigidly on such granular timestamps can result in choppy edits in audio or video. To address this, we adjust the raw values from our transcription models. We do this by quantizing floating-point numbers to a single decimal place (0.1), always rounding up. Here are a few examples:

  • 1.23 → 1.3
  • 3.01 → 3.1

Converting timestamps in this way ensures consistent and predictable timestamp boundaries.
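As a quick sanity check, here is the method applied to a few raw values (expected outputs shown in the comments):

print(TextToTimestampConverter.round_to_next_0_1(1.23))  # 1.3
print(TextToTimestampConverter.round_to_next_0_1(3.01))  # 3.1
print(TextToTimestampConverter.round_to_next_0_1(7.56))  # 7.6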

What Is the find_missing_word_indices Static Method

@staticmethod
def find_missing_word_indices(original_words, user_words):
    """
    Identify indices of words in the original text that are missing in the user input.

    Args:
        original_words (list): Tokenized words from the original text.
        user_words (list): Tokenized words from the edited text.

    Returns:
        list: Indices of missing words.
    """
    missing_indices = []
    user_idx = 0
    for i, orig_word in enumerate(original_words):
        if user_idx >= len(user_words) or orig_word != user_words[user_idx]:
            missing_indices.append(i)
        else:
            user_idx += 1
    return missing_indices

This method is used to compare the tokenized original text (original_words) to the tokenized user-edited text (user_words). It helps collect indices (missing_indices) of words that appear in the original but are missing in the updated version. The process is straightforward:

  • we loop through each word orig_word in original_words using enumerate. At the same time, we use a second pointer user_idx to track the current position in user_words.
  • if the current original word does not match the current user word (or if we have exhausted user_words), we record the index as missing.
  • if there is a match, we increment user_idx to compare the next user word.

In essence, this method flags every original word that the user has omitted and collects its index within the original text in missing_indices.
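Here is the method applied to the tokenized words from the earlier toy example (the expected output is shown in the comment):

original_words = ["this", "is", "an", "example", "transcription"]
user_words = ["this", "is", "example"]

print(TextToTimestampConverter.find_missing_word_indices(original_words, user_words))
# [2, 4] -> "an" and "transcription" were removed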

What Is the get_missing_words_timestamps Method

def get_missing_words_timestamps(self, user_input):
    """
    Retrieve timestamps of words missing from the user input.

    Args:
        user_input (str): Edited transcription.

    Returns:
        list: List of (start_time, end_time) for missing words (rounded to 0.1).
    """
    orig_words = self.clean_text(self.full_text).split()
    user_words = self.clean_text(user_input).split()
    missing_indices = self.find_missing_word_indices(orig_words, user_words)

    missing_timestamps = []
    for i, chunk in enumerate(self.chunks):
        if i in missing_indices:
            start, end = chunk["timestamp"]
            rounded_start = self.round_to_next_0_1(start)
            rounded_end = self.round_to_next_0_1(end) + 0.05
            missing_timestamps.append((rounded_start, rounded_end))
    return missing_timestamps

This method obtains the timestamps corresponding to any omitted words. Its approach is as follows:

  • Tokenize both the original and user-edited text after cleaning them with our clean_text method.
  • Use the find_missing_word_indices method to determine which words are missing from the user-edited text.
  • Iterate over each chunk in self.chunks using its index i.
  • If i is in the missing indices, it means the word was omitted by the user. In this case, do the following:
    • Use the round_to_next_0_1 method to round the timestamps.
    • Collect these rounded timestamps into missing_timestamps.

This method will produce a list of tuples. Each tuple will contain the start and end timestamps for the words that were removed during the editing process.
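Using the toy transcription dictionary from earlier, the method returns the rounded and buffered timestamp ranges of the two deleted words:

converter = TextToTimestampConverter(transcription)

print(converter.get_missing_words_timestamps("This is example"))
# [(1.0, 1.55), (2.0, 2.55)]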

What Is the remove_chunks Method

def remove_chunks(self, chunks_to_remove_indices):
    """
    Remove specific chunks by their indices.

    Args:
        chunks_to_remove_indices (list): Indices of chunks to remove.

    Returns:
        list: Updated chunks.
    """
    return [
        c for i, c in enumerate(self.chunks)
        if i not in chunks_to_remove_indices
    ]

This method filters out the specified chunks from self.chunks. It iterates over the chunks using enumerate and checks whether each chunk's index is in the chunks_to_remove_indices list. Any chunk whose index appears in that list is excluded. The result is a new list of chunks that do not include the removed ones.
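Reusing the toy transcription from earlier, removing the chunks at indices 2 and 4 leaves only the three remaining words:

converter = TextToTimestampConverter(transcription)

remaining = converter.remove_chunks([2, 4])
print([chunk["text"] for chunk in remaining])
# ['This', 'is', 'example']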

What Is the update_chunks Method

def update_chunks(self, user_input):
    """
    Update and remove chunks for missing words based on user edits.

    Args:
        user_input (str): Edited transcription.

    Returns:
        list: Updated chunks without the missing words.
    """
    orig_words = self.clean_text(self.full_text).split()
    user_words = self.clean_text(user_input).split()
    missing_indices = self.find_missing_word_indices(orig_words, user_words)
    return self.remove_chunks(missing_indices)

This higher-level method performs the removal of missing-word chunks in a single call. First, it cleans and tokenizes both the original transcription and the user's edited text. Next, it identifies the words that were removed by calling the find_missing_word_indices method. After that, it passes those indices to the remove_chunks method to eliminate the corresponding chunks. In practice, this simplifies the workflow: it combines text cleaning, identifying missing words, and removing chunks into one streamlined method call.
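Putting it all together with the toy transcription from earlier, a single call to update_chunks produces exactly the edited chunk list we set out to create:

converter = TextToTimestampConverter(transcription)

print(converter.update_chunks("This is example"))
# [{'text': 'This', 'timestamp': (0.0, 0.5)},
#  {'text': 'is', 'timestamp': (0.5, 1.0)},
#  {'text': 'example', 'timestamp': (1.5, 2.0)}]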

In this article, we expanded our video editing pipeline by introducing robust validation checks and a method to synchronize text edits with their corresponding timestamps. We first illustrated how to ensure that an edited transcription only omits existing words, without adding or reordering them, so that no new content can be introduced. Next, we presented the TextToTimestampConverter class, which identifies removed words, retrieves their timestamps, and updates the chunks accordingly. This enables smooth and precise video edits. With these elements in place, we have established a strong foundation for managing user modifications to transcribed content. This brings us one step closer to a fully functional, text-driven video editing experience.

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.