Python Modules

identify_voice

class multivoice.identify_voice.SpeakerVerifierCLI[source]

Bases: object

extract_embedding(audio_path)[source]

Extracts a speaker embedding from an audio file.

Parameters:

audio_path (str) – The path to the audio file.

Returns:

A numpy array containing the speaker embedding, or None if extraction fails.

Return type:

numpy.ndarray or None

identify_voice(file_path)[source]

Identifies a speaker from an audio file by comparing its embedding to those of known speakers.

Parameters:

file_path (str) – The path to the audio file for identification.

Returns:

The ID of the identified speaker if the score meets the threshold, otherwise returns None.

Return type:

str or None

load_embeddings()[source]

Loads speaker embeddings from a pickle file if it exists.

Returns:

A dictionary containing speaker IDs and their corresponding embeddings.

Returns an empty dictionary if the file does not exist.

Return type:

dict

multivoice.identify_voice.main()[source]

Main function to parse command line arguments and initiate the speaker identification process.
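
A minimal usage sketch based on the methods above; the no-argument constructor is an assumption, and the audio path is hypothetical:

    from multivoice.identify_voice import SpeakerVerifierCLI

    # Assumed: the class constructs without arguments; the real
    # constructor may take configuration such as an embeddings path.
    verifier = SpeakerVerifierCLI()

    # load_embeddings() returns {} when no embeddings file exists yet.
    known = verifier.load_embeddings()
    print(f"{len(known)} registered speaker(s)")

    # identify_voice() returns the matching speaker ID, or None when
    # no registered speaker meets the score threshold.
    speaker_id = verifier.identify_voice("sample.wav")  # hypothetical path
    print(speaker_id if speaker_id is not None else "no match")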

register_voice

class multivoice.register_voice.SpeakerVerifierCLI[source]

Bases: object

extract_embedding(audio_path)[source]

Extracts a speaker embedding from an audio file.

Parameters:

audio_path (str) – The path to the audio file.

Returns:

A numpy array containing the speaker embedding, or None if extraction fails.

Return type:

numpy.ndarray or None

load_embeddings()[source]

Loads speaker embeddings from a file if it exists.

Returns:

A dictionary of user IDs mapped to their speaker embeddings.

Returns an empty dictionary if the file does not exist.

Return type:

dict

register_voice(file_path, user_id)[source]

Registers a new speaker by extracting their voice embedding and storing it.

Parameters:
  • file_path (str) – The path to the audio file for registration.

  • user_id (str) – The user ID for the speaker being registered.

save_embeddings()[source]

Saves the current speaker embeddings to a file.

multivoice.register_voice.main()[source]

Main function to parse command line arguments and register a new voice from an audio file.
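
A registration sketch under the same assumption of a no-argument constructor; the path and user ID are hypothetical:

    from multivoice.register_voice import SpeakerVerifierCLI

    verifier = SpeakerVerifierCLI()  # assumed no-argument constructor

    # Extract and store an embedding for a new speaker, then persist
    # the mapping; save_embeddings() may be redundant if
    # register_voice() already saves internally.
    verifier.register_voice("alice_enrolment.wav", "alice")
    verifier.save_embeddings()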

stt_align_transcription

multivoice.lib.stt_align_transcription.align_transcription(full_transcript, audio_waveform, device, batch_size, info)[source]

Perform forced alignment on a given audio waveform using a pre-defined transcript.

Parameters:
  • full_transcript (str) – The transcription of the audio.

  • audio_waveform (np.ndarray) – The audio waveform as a NumPy array.

  • device (str) – The device to run the model on, e.g., “cpu” or “cuda”.

  • batch_size (int) – The batch size for processing the audio.

  • info (object) – An object containing additional information such as the language.

Returns:

A list of word timestamps corresponding to the aligned transcription.

Return type:

list
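
A hedged sketch of calling the aligner directly; the waveform is placeholder silence, and the shape of the info object is an assumption (the description above only says it carries the language):

    import numpy as np

    from multivoice.lib.stt_align_transcription import align_transcription

    class Info:
        """Stand-in for the transcription info object (assumed shape)."""
        language = "en"

    # One second of 16 kHz silence as a placeholder mono waveform.
    audio_waveform = np.zeros(16000, dtype=np.float32)

    word_timestamps = align_transcription(
        "hello world", audio_waveform, device="cpu", batch_size=8, info=Info()
    )
    for word in word_timestamps:
        print(word)  # each entry should carry a word and its timing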

stt_args

multivoice.lib.stt_args.parse_arguments()[source]

Parse command-line arguments for the Speech-to-Text (STT) module.

This function sets up and parses command-line arguments using Python’s argparse library. It defines options such as the audio file, model name, language, and device, which configure the behavior of the STT system.

Returns:

An object containing all the parsed arguments.

Return type:

Namespace
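
Because parse_arguments() reads sys.argv, a script can stage arguments before calling it. The flag names below are assumptions; run the module with --help for the real option names:

    import sys

    from multivoice.lib.stt_args import parse_arguments

    # Hypothetical flags for illustration only.
    sys.argv = ["stt", "--audio", "meeting.wav", "--device", "cpu"]
    args = parse_arguments()
    print(args)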

stt_config

multivoice.lib.stt_config.create_config(output_dir)[source]

Creates and configures the configuration for speaker diarization.

This function sets up the necessary directories and modifies the configuration with specific parameters for audio processing and diarization.

Parameters:

output_dir (str) – The directory where configuration files and processed data will be stored.

Returns:

A configuration object containing all necessary settings for speaker diarization.

Return type:

OmegaConf.DictConfig
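
A short sketch; the output path is hypothetical:

    from multivoice.lib.stt_config import create_config

    # Creates the working directories under the output path and
    # returns an OmegaConf DictConfig.
    cfg = create_config("/tmp/multivoice_out")
    print(list(cfg.keys()))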

stt_diarize_audio

multivoice.lib.stt_diarize_audio.diarize_audio(temp_path, device)[source]

Diarizes the audio to identify speaker turns.

Parameters:
  • temp_path (str) – The path where temporary files will be stored.

  • device (str) – The device to use for processing (‘cpu’ or ‘cuda’).

Returns:

A list of lists containing start time, end time, and speaker ID for each identified segment.

Return type:

list
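
A sketch of the diarization call; it assumes temp_path already holds the intermediate audio prepared by the earlier pipeline stages:

    from multivoice.lib.stt_diarize_audio import diarize_audio

    segments = diarize_audio("/tmp/multivoice_tmp", device="cpu")

    # Each segment is [start, end, speaker_id].
    for start, end, speaker in segments:
        print(f"{speaker}: {start} -> {end}")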

stt_langs

multivoice.lib.stt_langs.process_language_arg(language: str, model_name: str)[source]

Process the language argument to make sure it’s valid and convert language names to language codes.

Parameters:
  • language (str) – The name or code of the language.

  • model_name (str) – The name of the model being used.

Returns:

The processed language code.

Return type:

str

Raises:

ValueError – If the provided language is not supported.
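
A sketch of the expected behaviour; the model name is hypothetical, and the "english" to "en" conversion is inferred from the description above:

    from multivoice.lib.stt_langs import process_language_arg

    code = process_language_arg("english", model_name="large-v2")
    print(code)  # expected: "en"

    try:
        process_language_arg("klingon", model_name="large-v2")
    except ValueError as exc:
        print(f"unsupported: {exc}")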

stt_process_audio

multivoice.lib.stt_process_audio.process_audio(args, temp_path)[source]

Process the input audio to generate speaker-segment mapping with timestamps and transcriptions.

Parameters:
  • args (Namespace) – Command line arguments parsed by argparse.

  • temp_path (str) – Temporary directory path for intermediate files.

Returns:

A list of dictionaries, each representing a sentence segment with its start time, end time, transcription, and speaker ID.

Return type:

list
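
A sketch of driving the full processing step; it assumes parse_arguments() supplies a compatible Namespace:

    import tempfile

    from multivoice.lib.stt_args import parse_arguments
    from multivoice.lib.stt_process_audio import process_audio

    args = parse_arguments()  # assumes CLI-style invocation

    with tempfile.TemporaryDirectory() as temp_path:
        sentences = process_audio(args, temp_path)

    # Each dict holds start time, end time, transcription, speaker ID.
    for sentence in sentences:
        print(sentence)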

stt_punctuation

multivoice.lib.stt_punctuation.get_first_word_idx_of_sentence(word_idx, word_list, speaker_list, max_words)[source]

Finds the index of the first word in a sentence based on speaker continuity and punctuation.

Parameters:
  • word_idx (int) – The current word index.

  • word_list (list) – A list of words.

  • speaker_list (list) – A list of speakers corresponding to each word.

  • max_words (int) – Maximum number of words to consider as part of the same sentence.

Returns:

Index of the first word in the sentence or -1 if not found.

Return type:

int

multivoice.lib.stt_punctuation.get_last_word_idx_of_sentence(word_idx, word_list, max_words)[source]

Finds the index of the last word in a sentence based on punctuation.

Parameters:
  • word_idx (int) – The current word index.

  • word_list (list) – A list of words.

  • max_words (int) – Maximum number of words to consider as part of the same sentence.

Returns:

Index of the last word in the sentence or -1 if not found.

Return type:

int

multivoice.lib.stt_punctuation.get_realigned_ws_mapping_with_punctuation(word_speaker_mapping, max_words_in_sentence=50)[source]

Realigns the speaker mapping with punctuation to ensure consistent speaker labels for sentences.

Parameters:
  • word_speaker_mapping (list) – A list of dictionaries containing words and their corresponding speakers.

  • max_words_in_sentence (int) – Maximum number of words considered in a sentence for alignment.

Returns:

The realigned list of dictionaries with updated speaker information.

Return type:

list

multivoice.lib.stt_punctuation.restore_punctuation(wsm, info)[source]

Restores punctuation to a list of words based on the detected language.

Parameters:
  • wsm (list) – A list of dictionaries containing word and speaker information.

  • info (object) – An object containing information about the language.

Returns:

The modified list of dictionaries with restored punctuation.

Return type:

list
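
A sketch of the punctuation pass followed by realignment; the dict keys in the word-speaker mapping are an assumption inferred from the surrounding functions:

    from multivoice.lib.stt_punctuation import (
        get_realigned_ws_mapping_with_punctuation,
        restore_punctuation,
    )

    class Info:
        """Stand-in carrying the detected language (assumed shape)."""
        language = "en"

    # Hypothetical word-speaker mapping.
    wsm = [
        {"word": "hello", "speaker": "SPEAKER_00"},
        {"word": "there", "speaker": "SPEAKER_00"},
        {"word": "hi", "speaker": "SPEAKER_01"},
    ]

    wsm = restore_punctuation(wsm, Info())
    wsm = get_realigned_ws_mapping_with_punctuation(wsm, max_words_in_sentence=50)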

stt_separate_vocals

multivoice.lib.stt_separate_vocals.separate_vocals(audio_file, temp_path, device)[source]

Separates vocals from the given audio file using the Demucs model and saves the result in the specified temporary path.

Parameters:
  • audio_file (str) – The path to the input audio file from which vocals need to be separated.

  • temp_path (str) – The directory where the separated vocal track will be saved.

  • device (str) – The device (CPU or GPU) on which the separation process should run.

Returns:

The path to the separated vocal track if successful, otherwise the original audio file path.

Return type:

str

Notes

This function uses the demucs.separate command-line tool with a specific model (htdemucs) and extracts only the vocals. If the separation fails (non-zero return code), it logs a warning and returns the original audio file path.
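
Because the function falls back to the original path on failure, the return value is always usable downstream. A sketch with hypothetical paths:

    from multivoice.lib.stt_separate_vocals import separate_vocals

    vocal_target = separate_vocals("meeting.wav", "/tmp/multivoice_tmp", device="cpu")
    print(f"transcribing from: {vocal_target}")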

stt_speaker_mapping

multivoice.lib.stt_speaker_mapping.get_sentences_speaker_mapping(word_speaker_mapping, spk_ts)[source]

Group words into sentences and map each sentence to its speaker.

Parameters:
  • word_speaker_mapping (list) – A list of dictionaries mapping words to speakers with timestamps.

  • spk_ts (list) – A list of lists containing speaker start and end times and speaker IDs.

Returns:

A list of dictionaries, each representing a sentence with the speaker, start time, end time, and text.

Return type:

list

multivoice.lib.stt_speaker_mapping.get_speaker_aware_transcript(sentences_speaker_mapping, f)[source]

Write a speaker-aware transcript to the provided file object.

Parameters:
  • sentences_speaker_mapping (list) – A list of dictionaries representing sentences with speakers and timestamps.

  • f (file object) – The file object where the transcript will be written.

multivoice.lib.stt_speaker_mapping.get_word_ts_anchor(s, e, option='start')[source]

Determine the anchor time for a word based on the specified option.

Parameters:
  • s (int) – Start time of the word in milliseconds.

  • e (int) – End time of the word in milliseconds.

  • option (str) – The option to determine the anchor time (‘start’, ‘mid’, or ‘end’).

Returns:

The anchor time for the word based on the option.

Return type:

int

multivoice.lib.stt_speaker_mapping.get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option='start')[source]

Map each word to its corresponding speaker based on time segments.

Parameters:
  • wrd_ts (list) – A list of dictionaries containing word timestamps and text.

  • spk_ts (list) – A list of lists containing speaker start and end times and speaker IDs.

  • word_anchor_option (str) – The option to determine the anchor time for words (‘start’, ‘mid’, or ‘end’).

Returns:

A list of dictionaries mapping each word to its speaker, with timestamps and text.

Return type:

list
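
A sketch that chains the mapping helpers; the word-timestamp dict keys are assumptions, while the speaker turn format ([start, end, speaker_id]) follows diarize_audio above:

    from multivoice.lib.stt_speaker_mapping import (
        get_sentences_speaker_mapping,
        get_word_ts_anchor,
        get_words_speaker_mapping,
    )

    wrd_ts = [{"word": "hello", "start": 0, "end": 400}]  # assumed keys
    spk_ts = [[0, 1000, "SPEAKER_00"]]

    print(get_word_ts_anchor(0, 400, option="mid"))  # midpoint anchor

    wsm = get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option="start")
    sentences = get_sentences_speaker_mapping(wsm, spk_ts)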

stt_timestamps

multivoice.lib.stt_timestamps.filter_missing_timestamps(word_timestamps, initial_timestamp=0, final_timestamp=None)[source]

Filters and fills in missing start timestamps for words in a list of word timestamps.

Parameters:
  • word_timestamps (list) – A list of dictionaries containing word and its associated timestamps.

  • initial_timestamp (float, optional) – The initial timestamp to be used if the first word lacks a start timestamp. Defaults to 0.

  • final_timestamp (float, optional) – The final timestamp to be used for calculations. Defaults to None.

Returns:

A filtered list of dictionaries containing words and their associated timestamps with missing start times filled in.

Return type:

list

multivoice.lib.stt_timestamps.format_timestamp(milliseconds: float, always_include_hours: bool = False, decimal_marker: str = '.')[source]

Formats a given time in milliseconds into a human-readable timestamp string.

Parameters:
  • milliseconds (float) – The time in milliseconds to be formatted.

  • always_include_hours (bool, optional) – Whether to include hours even if they are zero. Defaults to False.

  • decimal_marker (str, optional) – The character used as the decimal marker. Defaults to ‘.’.

Returns:

A formatted timestamp string in the form of ‘HH:MM:SS.mmm’ or ‘MM:SS.mmm’.

Return type:

str
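
format_timestamp is easy to exercise directly; the expected outputs below follow the documented ‘MM:SS.mmm’ and ‘HH:MM:SS.mmm’ forms:

    from multivoice.lib.stt_timestamps import format_timestamp

    # 83456 ms is 1 minute, 23.456 seconds.
    print(format_timestamp(83456.0))                             # 01:23.456
    print(format_timestamp(83456.0, always_include_hours=True))  # 00:01:23.456
    print(format_timestamp(83456.0, decimal_marker=","))         # 01:23,456 (SRT style)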

stt_transcribe_audio

multivoice.lib.stt_transcribe_audio.find_numeral_symbol_tokens(tokenizer)[source]

Identifies tokens in the tokenizer’s vocabulary that contain numeral symbols.

Parameters:

tokenizer – The tokenizer object containing the model’s vocabulary.

Returns:

A list of token IDs corresponding to tokens that include numeral symbols.

multivoice.lib.stt_transcribe_audio.transcribe_audio(vocal_target, language, suppress_numerals, batch_size, device, mtypes, args)[source]

Transcribes the audio file using a Whisper model.

Parameters:
  • vocal_target – The path to the audio file or the preprocessed audio data.

  • language – The target language for transcription.

  • suppress_numerals – Boolean flag to indicate whether numeral symbols should be suppressed in the transcription.

  • batch_size – The number of segments to process in one batch. If 0, processes without batching.

  • device – The device (CPU or GPU) on which to run the model.

  • mtypes – A dictionary mapping devices to their respective compute types.

  • args – Namespace object containing additional arguments and configurations.

Returns:

A tuple (full_transcript, info): full_transcript is the complete transcribed text as a string, and info carries additional information about the transcription process, such as language detection details.

Return type:

tuple
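
A sketch of a direct call; the compute-type mapping and the audio path are assumptions, and args is expected to come from parse_arguments():

    from multivoice.lib.stt_args import parse_arguments
    from multivoice.lib.stt_transcribe_audio import transcribe_audio

    args = parse_arguments()

    # Assumed example of a device-to-compute-type mapping.
    mtypes = {"cpu": "int8", "cuda": "float16"}

    full_transcript, info = transcribe_audio(
        vocal_target="vocals.wav",  # hypothetical path
        language="en",
        suppress_numerals=False,
        batch_size=8,
        device="cpu",
        mtypes=mtypes,
        args=args,
    )
    print(full_transcript)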

stt_write_outputs

multivoice.lib.stt_write_outputs.cleanup(path: str)[source]

Clean up the given path; the path may be either relative or absolute.

multivoice.lib.stt_write_outputs.write_outputs(ssm, args)[source]

Write the speaker-aware transcript to both a text file and an SRT file.

Parameters:
  • ssm (object) – An object containing the speech segment mapping.

  • args (argparse.Namespace) – Command-line arguments including output directory and audio file path.

multivoice.lib.stt_write_outputs.write_srt(transcript, file)[source]

Write a transcript to a file in SRT format.

Parameters:
  • transcript (list) – A list of dictionaries containing segment information.

  • file (file object) – The file object where the SRT content will be written.
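
A sketch of writing an SRT file directly; the segment keys are an assumption based on get_sentences_speaker_mapping’s output:

    from multivoice.lib.stt_write_outputs import write_srt

    transcript = [
        {"speaker": "SPEAKER_00", "start_time": 0, "end_time": 1200,
         "text": "Hello there."},  # hypothetical segment
    ]

    with open("meeting.srt", "w", encoding="utf-8") as srt_file:
        write_srt(transcript, srt_file)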

stt

multivoice.stt.main()[source]

Main function to execute the speech-to-text process. Parses arguments, sets up logging, processes audio, writes outputs, and cleans up.

multivoice.stt.setup_logging(args)[source]

Set up logging based on command line arguments.

Parameters:

args (argparse.Namespace) – The parsed command line arguments.
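
The pipeline can also be driven programmatically by staging sys.argv before calling main(); the flag names here are assumptions:

    import sys

    from multivoice import stt

    sys.argv = ["multivoice", "--audio", "meeting.wav"]  # hypothetical flags
    stt.main()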

stt_dir

multivoice.stt_dir.check_media_file(file_path)[source]

Check the media file type using ffprobe.

This function uses ffprobe to determine the types of streams present in a given media file (e.g., audio, video).

Parameters:

file_path (str) – The path to the media file.

Returns:

A set containing the stream types found in the file.

Return type:

set

multivoice.stt_dir.find_best_audio_file(directory)[source]

Find the best audio file in a directory.

This function searches for suitable audio files in a directory, prioritizing those that contain only audio streams.

Parameters:

directory (str) – The path to the directory to search.

Returns:

The path to the best audio file found, or None if none were found.

Return type:

str or None

multivoice.stt_dir.find_deepest_directories(root_directory)[source]

Find all the deepest directories in a given root directory.

This function traverses the directory tree and collects paths to all directories that do not contain any subdirectories (i.e., the deepest ones).

Parameters:

root_directory (str) – The path to the root directory.

Returns:

A list of paths to the deepest directories.

Return type:

list
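
The three helpers above compose naturally; a sketch with a hypothetical media tree:

    from multivoice.stt_dir import (
        check_media_file,
        find_best_audio_file,
        find_deepest_directories,
    )

    # Pick one audio file per leaf directory, preferring files that
    # contain only audio streams.
    for leaf in find_deepest_directories("/media/recordings"):
        best = find_best_audio_file(leaf)
        if best is not None:
            print(best, check_media_file(best))  # e.g. {'audio'}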

multivoice.stt_dir.main()[source]

Main function for processing audio files in a directory.

This function parses command line arguments, sets up logging, finds the deepest directories in the specified top-level directory, and processes each of them.

multivoice.stt_dir.parse_arguments()[source]

Parse command line arguments for directory processing.

This function sets up the argument parser and defines all possible options that the user can specify when running the script from the command line.

Returns:

An object containing all parsed arguments as attributes.

Return type:

argparse.Namespace

multivoice.stt_dir.process_directory(dir_path, dest_dir, args)[source]

Process a directory containing media files.

This function finds the best audio file in the given directory and processes it using the multivoice command-line tool.

Parameters:
  • dir_path (str) – The path to the directory to process.

  • dest_dir (str) – The destination output directory for processed files.

  • args (argparse.Namespace) – The parsed command line arguments.

multivoice.stt_dir.setup_logging(args)[source]

Set up logging based on command line arguments.

Parameters:

args (argparse.Namespace) – The parsed command line arguments.