Python Modules
identify_voice
- class multivoice.identify_voice.SpeakerVerifierCLI[source]
Bases:
object
- extract_embedding(audio_path)[source]
Extracts a speaker embedding from an audio file.
- Parameters:
audio_path (str) – The path to the audio file.
- Returns:
A numpy array containing the speaker embedding, or None if extraction fails.
- Return type:
numpy.ndarray or None
- identify_voice(file_path)[source]
Identifies a speaker from an audio file by comparing its embedding to known speakers.
- Parameters:
file_path (str) – The path to the audio file for identification.
- Returns:
The ID of the identified speaker if the score meets the threshold, otherwise None.
- Return type:
str or None
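The identification step can be sketched as a nearest-neighbor search over stored embeddings using cosine similarity. This is an illustrative sketch only: the similarity measure, the 0.7 threshold, and the `known_speakers` dictionary are assumptions, not the library's actual internals.

```python
import numpy as np

def identify_speaker(embedding, known_speakers, threshold=0.7):
    """Illustrative: return the best-matching speaker ID, or None.

    known_speakers maps user IDs to stored embeddings; the 0.7
    threshold is a hypothetical value, not the library's default.
    """
    best_id, best_score = None, -1.0
    for speaker_id, ref in known_speakers.items():
        # Cosine similarity between the query and a reference embedding.
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= threshold else None

known = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
print(identify_speaker(np.array([0.9, 0.1]), known))  # best match: alice
```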
register_voice
- class multivoice.register_voice.SpeakerVerifierCLI[source]
Bases:
object
- extract_embedding(audio_path)[source]
Extracts a speaker embedding from an audio file.
- Parameters:
audio_path (str) – The path to the audio file.
- Returns:
A numpy array containing the speaker embedding, or None if extraction fails.
- Return type:
numpy.ndarray or None
- load_embeddings()[source]
Loads speaker embeddings from a file if it exists.
- Returns:
A dictionary of user IDs mapped to their speaker embeddings, or an empty dictionary if the file does not exist.
- Return type:
dict
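The load behavior described above can be sketched with a simple on-disk store. The pickle format and the file name are assumptions for this sketch; the library may persist embeddings differently.

```python
import os
import pickle

def load_embeddings(path="embeddings.pkl"):
    """Illustrative: load a {user_id: embedding} dict from disk.

    Returns an empty dict when the file does not exist, matching the
    documented behavior. The pickle format is an assumption.
    """
    if not os.path.exists(path):
        return {}  # no registered speakers yet
    with open(path, "rb") as f:
        return pickle.load(f)
```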
stt_align_transcription
- multivoice.lib.stt_align_transcription.align_transcription(full_transcript, audio_waveform, device, batch_size, info)[source]
Perform forced alignment on a given audio waveform using a pre-defined transcript.
- Parameters:
full_transcript (str) – The transcription of the audio.
audio_waveform (np.ndarray) – The audio waveform as a NumPy array.
device (str) – The device to run the model on, e.g., “cpu” or “cuda”.
batch_size (int) – The batch size for processing the audio.
info (object) – An object containing additional information such as the language.
- Returns:
A list of word timestamps corresponding to the aligned transcription.
- Return type:
list
stt_args
- multivoice.lib.stt_args.parse_arguments()[source]
Parse command-line arguments for the Speech-to-Text (STT) module.
This function sets up and parses command-line arguments using Python’s argparse library. It defines various options such as audio file, model name, language, device, and more, which are essential for configuring the behavior of the STT system.
- Returns:
An object containing all the parsed arguments.
- Return type:
Namespace
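An argparse setup along the lines described above might look like the following. The flag names and defaults here are assumptions based on the options listed (audio file, model name, language, device, batch size), not the tool's actual interface.

```python
import argparse

def parse_arguments(argv=None):
    """Illustrative sketch of an STT argument parser.

    Flag names and defaults are hypothetical; consult the actual
    multivoice CLI for the real options.
    """
    parser = argparse.ArgumentParser(description="Speech-to-Text")
    parser.add_argument("--audio", required=True, help="path to the audio file")
    parser.add_argument("--model-name", default="large-v2", help="Whisper model to load")
    parser.add_argument("--language", default=None, help="language name or code")
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    parser.add_argument("--batch-size", type=int, default=8)
    return parser.parse_args(argv)
```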
stt_config
- multivoice.lib.stt_config.create_config(output_dir)[source]
Creates and configures the configuration for speaker diarization.
This function sets up the necessary directories and modifies the configuration with specific parameters for audio processing and diarization.
- Parameters:
output_dir (str) – The directory where configuration files and processed data will be stored.
- Returns:
A configuration object containing all necessary settings for speaker diarization.
- Return type:
OmegaConf.DictConfig
stt_diarize_audio
- multivoice.lib.stt_diarize_audio.diarize_audio(temp_path, device)[source]
Diarizes the audio to identify speaker turns.
- Parameters:
temp_path (str) – The path where temporary files will be stored.
device (str) – The device to use for processing (‘cpu’ or ‘cuda’).
- Returns:
A list of lists containing start time, end time, and speaker ID for each identified segment.
- Return type:
list
stt_langs
- multivoice.lib.stt_langs.process_language_arg(language: str, model_name: str)[source]
Process the language argument to make sure it’s valid and convert language names to language codes.
- Parameters:
language (str) – The name or code of the language.
model_name (str) – The name of the model being used.
- Returns:
The processed language code.
- Return type:
str
- Raises:
ValueError – If the provided language is not supported.
stt_process_audio
- multivoice.lib.stt_process_audio.process_audio(args, temp_path)[source]
Process the input audio to generate speaker-segment mapping with timestamps and transcriptions.
- Parameters:
args (Namespace) – Command line arguments parsed by argparse.
temp_path (str) – Temporary directory path for intermediate files.
- Returns:
A list of dictionaries, each representing a sentence segment with its start time, end time, transcription, and speaker ID.
- Return type:
list
stt_punctuation
- multivoice.lib.stt_punctuation.get_first_word_idx_of_sentence(word_idx, word_list, speaker_list, max_words)[source]
Finds the index of the first word in a sentence based on speaker continuity and punctuation.
- Parameters:
word_idx (int) – The current word index.
word_list (list) – A list of words.
speaker_list (list) – A list of speakers corresponding to each word.
max_words (int) – Maximum number of words to consider as part of the same sentence.
- Returns:
Index of the first word in the sentence or -1 if not found.
- Return type:
int
- multivoice.lib.stt_punctuation.get_last_word_idx_of_sentence(word_idx, word_list, max_words)[source]
Finds the index of the last word in a sentence based on punctuation.
- Parameters:
word_idx (int) – The current word index.
word_list (list) – A list of words.
max_words (int) – Maximum number of words to consider as part of the same sentence.
- Returns:
Index of the last word in the sentence or -1 if not found.
- Return type:
int
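The boundary scan described above can be sketched as a forward search for sentence-final punctuation, bounded by `max_words`. This is a simplified illustration of the documented behavior, not the library's exact implementation.

```python
def get_last_word_idx_of_sentence(word_idx, word_list, max_words):
    """Illustrative: scan forward from word_idx until a word ends with
    sentence-final punctuation, giving up after max_words words.
    """
    sentence_end = (".", "?", "!")
    is_end = lambda w: w.endswith(sentence_end)
    remaining = max_words
    while (word_idx < len(word_list) and remaining > 0
           and not is_end(word_list[word_idx])):
        word_idx += 1
        remaining -= 1
    if word_idx < len(word_list) and is_end(word_list[word_idx]):
        return word_idx
    return -1  # no sentence end found within the window
```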
- multivoice.lib.stt_punctuation.get_realigned_ws_mapping_with_punctuation(word_speaker_mapping, max_words_in_sentence=50)[source]
Realigns the speaker mapping with punctuation to ensure consistent speaker labels for sentences.
- Parameters:
word_speaker_mapping (list) – A list of dictionaries containing words and their corresponding speakers.
max_words_in_sentence (int) – Maximum number of words considered in a sentence for alignment.
- Returns:
The realigned list of dictionaries with updated speaker information.
- Return type:
list
- multivoice.lib.stt_punctuation.restore_punctuation(wsm, info)[source]
Restores punctuation to a list of words based on the detected language.
- Parameters:
wsm (list) – A list of dictionaries containing word and speaker information.
info (object) – An object containing information about the language.
- Returns:
The modified list of dictionaries with restored punctuation.
- Return type:
list
stt_separate_vocals
- multivoice.lib.stt_separate_vocals.separate_vocals(audio_file, temp_path, device)[source]
Separates vocals from the given audio file using the Demucs model and saves the result in the specified temporary path.
- Parameters:
audio_file (str) – The path to the input audio file from which vocals need to be separated.
temp_path (str) – The directory where the separated vocal track will be saved.
device (str) – The device (CPU or GPU) on which the separation process should run.
- Returns:
The path to the separated vocal track if successful, otherwise the original audio file path.
- Return type:
str
Notes
This function uses the demucs.separate command-line tool with a specific model (htdemucs) and extracts only the vocals. If the separation fails (non-zero return code), it logs a warning and returns the original audio file path.
stt_speaker_mapping
- multivoice.lib.stt_speaker_mapping.get_sentences_speaker_mapping(word_speaker_mapping, spk_ts)[source]
Group words into sentences and map each sentence to its speaker.
- Parameters:
word_speaker_mapping (list) – A list of dictionaries mapping words to speakers with timestamps.
spk_ts (list) – A list of lists containing speaker start and end times and speaker IDs.
- Returns:
A list of dictionaries, each representing a sentence with the speaker, start time, end time, and text.
- Return type:
list
- multivoice.lib.stt_speaker_mapping.get_speaker_aware_transcript(sentences_speaker_mapping, f)[source]
Write a speaker-aware transcript to the provided file object.
- Parameters:
sentences_speaker_mapping (list) – A list of dictionaries representing sentences with speakers and timestamps.
f (file object) – The file object where the transcript will be written.
- multivoice.lib.stt_speaker_mapping.get_word_ts_anchor(s, e, option='start')[source]
Determine the anchor time for a word based on the specified option.
- Parameters:
s (int) – Start time of the word in milliseconds.
e (int) – End time of the word in milliseconds.
option (str) – The option to determine the anchor time (‘start’, ‘mid’, or ‘end’).
- Returns:
The anchor time for the word based on the option.
- Return type:
int
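The anchor selection described above reduces to a small branch over the three options. A minimal sketch of that behavior:

```python
def get_word_ts_anchor(s, e, option="start"):
    """Illustrative: pick an anchor time (ms) for a word spanning [s, e].

    'start' returns s, 'end' returns e, 'mid' returns the midpoint;
    any other value falls back to 'start'.
    """
    if option == "end":
        return e
    if option == "mid":
        return (s + e) // 2
    return s  # default: 'start'
```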
- multivoice.lib.stt_speaker_mapping.get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option='start')[source]
Map each word to its corresponding speaker based on time segments.
- Parameters:
wrd_ts (list) – A list of dictionaries containing word timestamps and text.
spk_ts (list) – A list of lists containing speaker start and end times and speaker IDs.
word_anchor_option (str) – The option to determine the anchor time for words (‘start’, ‘mid’, or ‘end’).
- Returns:
A list of dictionaries mapping each word to its speaker, with timestamps and text.
- Return type:
list
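The word-to-speaker mapping can be sketched as: anchor each word at a single time, then find the speaker segment containing that time. The dictionary keys used here are assumptions for illustration; the real implementation also handles words that fall outside every speaker segment.

```python
def get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option="start"):
    """Illustrative: assign each word the speaker whose diarization
    segment contains the word's anchor time.
    """
    def anchor(s, e):
        if word_anchor_option == "end":
            return e
        if word_anchor_option == "mid":
            return (s + e) // 2
        return s

    mapping = []
    for w in wrd_ts:
        t = anchor(w["start"], w["end"])
        # First speaker segment [s, e] that contains the anchor time.
        speaker = next((spk for s, e, spk in spk_ts if s <= t <= e), None)
        mapping.append({"word": w["word"], "start_time": w["start"],
                        "end_time": w["end"], "speaker": speaker})
    return mapping
```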
stt_timestamps
- multivoice.lib.stt_timestamps.filter_missing_timestamps(word_timestamps, initial_timestamp=0, final_timestamp=None)[source]
Filters and fills in missing start timestamps for words in a list of word timestamps.
- Parameters:
word_timestamps (list) – A list of dictionaries containing word and its associated timestamps.
initial_timestamp (float, optional) – The initial timestamp to be used if the first word lacks a start timestamp. Defaults to 0.
final_timestamp (float, optional) – The final timestamp to be used for calculations. Defaults to None.
- Returns:
A filtered list of dictionaries containing words and their associated timestamps with missing start times filled in.
- Return type:
list
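The gap-filling described above can be sketched as: a missing start borrows the previous word's end (or `initial_timestamp` for the first word), and a missing end borrows the next word's start (or `final_timestamp` at the tail). This is a simplified illustration, not the library's exact logic.

```python
def filter_missing_timestamps(word_timestamps, initial_timestamp=0,
                              final_timestamp=None):
    """Illustrative: fill in missing 'start'/'end' values from the
    neighboring words' timestamps.
    """
    filled = []
    for i, w in enumerate(word_timestamps):
        w = dict(w)  # avoid mutating the caller's dicts
        if w.get("start") is None:
            w["start"] = filled[-1]["end"] if filled else initial_timestamp
        if w.get("end") is None:
            nxt = word_timestamps[i + 1] if i + 1 < len(word_timestamps) else None
            w["end"] = (nxt or {}).get("start") or final_timestamp
        filled.append(w)
    return filled
```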
- multivoice.lib.stt_timestamps.format_timestamp(milliseconds: float, always_include_hours: bool = False, decimal_marker: str = '.')[source]
Formats a given time in milliseconds into a human-readable timestamp string.
- Parameters:
milliseconds (float) – The time in milliseconds to be formatted.
always_include_hours (bool, optional) – Whether to include hours even if they are zero. Defaults to False.
decimal_marker (str, optional) – The character used as the decimal marker. Defaults to ‘.’.
- Returns:
A formatted timestamp string in the form of ‘HH:MM:SS.mmm’ or ‘MM:SS.mmm’.
- Return type:
str
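The formatting described above follows the pattern common to Whisper-style SRT writers: successive divmods from hours down to milliseconds. A minimal sketch:

```python
def format_timestamp(milliseconds, always_include_hours=False,
                     decimal_marker="."):
    """Illustrative: format milliseconds as 'HH:MM:SS.mmm' or 'MM:SS.mmm'."""
    ms = int(round(milliseconds))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    # Hours are shown only when nonzero, unless explicitly requested.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{seconds:02d}{decimal_marker}{ms:03d}"

print(format_timestamp(61_500))  # 01:01.500
```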
stt_transcribe_audio
- multivoice.lib.stt_transcribe_audio.find_numeral_symbol_tokens(tokenizer)[source]
Identifies tokens in the tokenizer’s vocabulary that contain numeral symbols.
- Parameters:
tokenizer – The tokenizer object containing the model’s vocabulary.
- Returns:
A list of token IDs corresponding to tokens that include numeral symbols.
- Return type:
list
- multivoice.lib.stt_transcribe_audio.transcribe_audio(vocal_target, language, suppress_numerals, batch_size, device, mtypes, args)[source]
Transcribes the audio file using a Whisper model.
- Parameters:
vocal_target – The path to the audio file or the preprocessed audio data.
language – The target language for transcription.
suppress_numerals – Boolean flag to indicate whether numeral symbols should be suppressed in the transcription.
batch_size – The number of segments to process in one batch. If 0, processes without batching.
device – The device (CPU or GPU) on which to run the model.
mtypes – A dictionary mapping devices to their respective compute types.
args – Namespace object containing additional arguments and configurations.
- Returns:
A tuple of the complete transcribed text as a string (full_transcript) and an info object with additional details about the transcription process, such as language detection.
- Return type:
tuple(str, object)
stt_write_outputs
- multivoice.lib.stt_write_outputs.cleanup(path: str)[source]
Remove the file or directory at the given path. The path may be either relative or absolute.
- multivoice.lib.stt_write_outputs.write_outputs(ssm, args)[source]
Write the speaker-aware transcript to both a text file and an SRT file.
- Parameters:
ssm (object) – An object containing the speech segment mapping.
args (argparse.Namespace) – Command-line arguments including output directory and audio file path.
stt
stt_dir
- multivoice.stt_dir.check_media_file(file_path)[source]
Check the media file type using ffprobe.
This function uses ffprobe to determine the types of streams present in a given media file (e.g., audio, video).
- Parameters:
file_path (str) – The path to the media file.
- Returns:
A set containing the stream types found in the file.
- Return type:
set
- multivoice.stt_dir.find_best_audio_file(directory)[source]
Find the best audio file in a directory.
This function searches for suitable audio files in a directory, prioritizing those that contain only audio streams.
- Parameters:
directory (str) – The path to the directory to search.
- Returns:
The path to the best audio file found, or None if none were found.
- Return type:
str or None
- multivoice.stt_dir.find_deepest_directories(root_directory)[source]
Find all the deepest directories in a given root directory.
This function traverses the directory tree and collects paths to all directories that do not contain any subdirectories (i.e., the deepest ones).
- Parameters:
root_directory (str) – The path to the root directory.
- Returns:
A list of paths to the deepest directories.
- Return type:
list
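The traversal described above can be sketched with `os.walk`: a directory is "deepest" when it has no subdirectories. A minimal, self-contained illustration:

```python
import os

def find_deepest_directories(root_directory):
    """Illustrative: collect every directory under root_directory that
    contains no subdirectories (the leaves of the directory tree).
    """
    deepest = []
    for dirpath, dirnames, _filenames in os.walk(root_directory):
        if not dirnames:  # no subdirectories: this is a leaf
            deepest.append(dirpath)
    return deepest
```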
- multivoice.stt_dir.main()[source]
Main function for processing audio files in a directory.
This function parses command line arguments, sets up logging, finds the deepest directories in the specified top-level directory, and processes each of them.
- multivoice.stt_dir.parse_arguments()[source]
Parse command line arguments for directory processing.
This function sets up the argument parser and defines all possible options that the user can specify when running the script from the command line.
- Returns:
An object containing all parsed arguments as attributes.
- Return type:
argparse.Namespace
- multivoice.stt_dir.process_directory(dir_path, dest_dir, args)[source]
Process a directory containing media files.
This function finds the best audio file in the given directory and processes it using the multivoice command-line tool.
- Parameters:
dir_path (str) – The path to the directory to process.
dest_dir (str) – The destination output directory for processed files.
args (argparse.Namespace) – The parsed command line arguments.