Annotation

Speech to text

expert.data.annotation.speech_to_text.transcribe_video(video_path: Union[str, os.PathLike], lang: Optional[str] = 'en', model: Optional[str] = 'server', device: Optional[torch.device] = None) → Dict

Run speech recognition on a video file.

Parameters
  • video_path (Union[str, PathLike]) – Path to the local video file.

  • lang (Optional[str]) – Language for speech recognition [‘ru’, ‘en’]. Defaults to ‘en’.

  • model (Optional[str]) – Model configuration for speech recognition [‘server’, ‘local’]. Defaults to ‘server’.

  • device (Optional[torch.device]) – Device on the local machine to run on (GPU recommended). Defaults to None.

Raises
  • NotImplementedError – If ‘lang’ is not equal to ‘en’ or ‘ru’.

  • NotImplementedError – If ‘model’ is not equal to ‘server’ or ‘local’.
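The two error conditions above can be illustrated with a minimal sketch. The helper name `check_transcribe_args` and the constants are hypothetical and are not part of the library; the sketch only mirrors the documented behaviour for unsupported `lang` and `model` values.

```python
# Hypothetical constants mirroring the documented option lists.
SUPPORTED_LANGS = ("en", "ru")
SUPPORTED_MODELS = ("server", "local")


def check_transcribe_args(lang: str = "en", model: str = "server") -> None:
    """Raise NotImplementedError for values outside the documented options."""
    if lang not in SUPPORTED_LANGS:
        raise NotImplementedError(
            f"'lang' must be one of {SUPPORTED_LANGS}, got {lang!r}"
        )
    if model not in SUPPORTED_MODELS:
        raise NotImplementedError(
            f"'model' must be one of {SUPPORTED_MODELS}, got {model!r}"
        )


# Supported values pass silently; anything else raises.
check_transcribe_args(lang="ru", model="local")
```

Validating both arguments before any model is loaded keeps the failure cheap, which is presumably why the documented errors are tied to argument values rather than to the recognition step.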

expert.data.annotation.speech_to_text.get_all_words(transcribation: Dict) → Tuple[List, str]

Get all words with their timestamps from the transcribed text.

Parameters

transcribation (Dict) – Speech recognition module results.
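As a rough illustration of what collecting timestamped words involves, the sketch below flattens a transcription dictionary into a word list plus the full text. The input layout (a "segments" list whose items carry a "words" list with "text", "start" and "end" keys) and the helper name `collect_words` are assumptions made for this example; the actual structure returned by the library may differ.

```python
from typing import Dict, List, Tuple


def collect_words(transcription: Dict) -> Tuple[List[Dict], str]:
    """Flatten per-segment word lists into one list and join the full text.

    Assumes each word is a dict with "text", "start" and "end" keys.
    """
    words: List[Dict] = []
    for segment in transcription.get("segments", []):
        for word in segment.get("words", []):
            words.append(
                {"text": word["text"], "start": word["start"], "end": word["end"]}
            )
    full_text = " ".join(word["text"] for word in words)
    return words, full_text


# Usage with a hand-written transcription in the assumed layout.
sample = {
    "segments": [
        {"words": [{"text": "hello", "start": 0.0, "end": 0.4},
                   {"text": "world", "start": 0.5, "end": 0.9}]},
        {"words": [{"text": "again", "start": 1.2, "end": 1.6}]},
    ]
}
all_words, text = collect_words(sample)
```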

expert.data.annotation.speech_to_text.get_phrases(all_words: list, duration: Optional[int] = 10) → list

Split transcribed text into segments of a fixed length.

Parameters
  • all_words (List) – All words with their timestamps from the transcribed text.

  • duration (int, optional) – Length of intervals for extracting phrases from speech. Defaults to 10.
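Splitting into fixed-length segments can be sketched as bucketing words by the window their start time falls into. This is an illustrative sketch only: the helper name `split_into_phrases`, the word layout ("text", "start", "end" keys) and the shape of the returned items are assumptions, not the library's actual output format.

```python
from typing import Dict, List


def split_into_phrases(all_words: List[Dict], duration: int = 10) -> List[Dict]:
    """Group words into consecutive windows of `duration` time units."""
    buckets: Dict[int, List[str]] = {}
    for word in all_words:
        # Window index is decided by where the word starts.
        window = int(word["start"] // duration)
        buckets.setdefault(window, []).append(word["text"])
    return [
        {"start": w * duration, "end": (w + 1) * duration, "text": " ".join(texts)}
        for w, texts in sorted(buckets.items())
    ]


# Usage: four words spread over three 10-unit windows.
demo_words = [
    {"text": "one", "start": 0.5, "end": 0.9},
    {"text": "two", "start": 9.9, "end": 10.3},
    {"text": "three", "start": 10.0, "end": 10.6},
    {"text": "four", "start": 25.0, "end": 25.4},
]
phrases = split_into_phrases(demo_words, duration=10)
```

In this sketch a word that straddles a window boundary is assigned to the window in which it starts, and windows without any words are simply omitted.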

expert.data.annotation.speech_to_text.between_timestamps(all_words: List, start: float, end: float) → str

Get the phrase between two timestamps (start, end), given in seconds. The closest word index to the left of the start timestamp and the closest word index to the right of the end timestamp are used.

Parameters
  • all_words (List) – All words with their timestamps from the transcribed text.

  • start (float) – Start timestamp of the interval (in seconds).

  • end (float) – End timestamp of the interval (in seconds).

Returns

Phrase between timestamps.

Return type

str
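The "closest left index for start, closest right index for end" lookup can be sketched with a binary search. The helper name `phrase_between` and the word layout ("text", "start", "end" keys, sorted by time) are assumptions for this example, and the library's own handling of edge cases may differ.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List


def phrase_between(all_words: List[Dict], start: float, end: float) -> str:
    """Join the words spanning [start, end], widening to the nearest word edges."""
    if not all_words:
        return ""
    starts = [word["start"] for word in all_words]
    ends = [word["end"] for word in all_words]
    # Last word starting at or before `start` (closest on the left).
    left = max(bisect_right(starts, start) - 1, 0)
    # First word ending at or after `end` (closest on the right).
    right = min(bisect_left(ends, end), len(all_words) - 1)
    return " ".join(word["text"] for word in all_words[left:right + 1])


# Usage with a short, time-sorted word list.
timeline = [
    {"text": "hello", "start": 0.0, "end": 0.5},
    {"text": "world", "start": 0.6, "end": 1.0},
    {"text": "again", "start": 1.2, "end": 1.8},
    {"text": "bye", "start": 2.0, "end": 2.4},
]
snippet = phrase_between(timeline, start=0.7, end=1.5)
```

Because both indices are clamped to the list bounds, a query that lies entirely after the last word returns that last word in this sketch rather than an empty string.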

Summarization

Transcribe

expert.data.annotation.transcribe.transcribe_timestamped(model, audio, language=None, task='transcribe', remove_punctuation_from_words=False, compute_word_confidence=True, include_punctuation_in_confidence=False, refine_whisper_precision=0.5, min_word_duration=0.02, word_alignement_most_top_layers=None, seed=1234, detect_disfluencies=False, trust_whisper_timestamps=True, temperature=0.0, best_of=None, patience=None, length_penalty=None, compression_ratio_threshold=2.4, logprob_threshold=-1.0, no_speech_threshold=0.6, fp16=None, condition_on_previous_text=True, initial_prompt=None, suppress_tokens='-1', sample_len=None)

Transcribe an audio file using Whisper.

Parameters
  • model (str) – The Whisper model instance.

  • audio (str | np.ndarray | torch.Tensor) – The path to the audio file to open, or the audio waveform.

  • language (str, optional) – The language to use for the transcription. If None, the language is detected automatically.

  • task (str, optional) – The task to perform: either “transcribe” or “translate”.

  • remove_punctuation_from_words (bool, optional) – If False, words will be glued with the next punctuation mark (if any). If True, there will be no punctuation mark in the words[:][“text”] list. It only affects these strings. It has no influence on the computation of the word confidence, whatever the value of include_punctuation_in_confidence is.

  • compute_word_confidence (bool, optional) – Whether to compute word confidence. If True, a finer confidence for each segment will be computed as well.

  • detect_disfluencies (bool) – Whether to detect disfluencies that the Whisper model might have omitted in the transcription. This should make the word timestamp prediction more accurate. Probable disfluencies will be marked as special words "[*]".

  • trust_whisper_timestamps (bool) – Whether to rely on Whisper's timestamps to get an approximate first estimate of segment positions.

  • include_punctuation_in_confidence (bool, optional) – Whether to include the probability of punctuation in the computation of the (previous) word confidence.

  • refine_whisper_precision (float, optional) – How much Whisper segment positions can be refined, in seconds. Must be a multiple of 0.02.

  • min_word_duration (float, optional) – Minimum duration of a word, in seconds. If a word is shorter than this, timestamps will be adjusted.

  • seed (int, optional) – Random seed to use for temperature sampling, for the sake of reproducibility. Choose None for unpredictable randomness.

  • temperature (float, optional) – Temperature for sampling.

  • compression_ratio_threshold (float, optional) – If the gzip compression ratio is above this value, treat the decoding as failed.

  • logprob_threshold (float, optional) – If the average log probability over sampled tokens is below this value, treat the decoding as failed.

  • no_speech_threshold (float, optional) – If the no_speech probability is higher than this value AND the average log probability over sampled tokens is below logprob_threshold, consider the segment as silent.

  • condition_on_previous_text (bool, optional) – If True, the previous output of the model is provided as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

  • initial_prompt (str, optional) – Optional text to provide as a prompt for the first window.

  • suppress_tokens (str, optional) – Comma-separated list of token ids to suppress during sampling; ‘-1’ will suppress most special characters except common punctuation.

Returns

A dictionary containing the resulting text (“text”), segment-level details (“segments”), and the spoken language (“language”), which is detected when decode_options[“language”] is None.

Return type

Dict
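The interplay of compression_ratio_threshold, logprob_threshold and no_speech_threshold described in the parameter list can be sketched as a small decision function. This is an illustration of the documented rules only, not the library's decoding loop: the helper names `compression_ratio` and `classify_segment` are hypothetical, and the gzip ratio is approximated here with `zlib.compress`.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw to compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def classify_segment(
    text: str,
    avg_logprob: float,
    no_speech_prob: float,
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
) -> str:
    """Return "silent", "failed" or "ok" following the documented rules."""
    # Silent: high no-speech probability AND low average log probability.
    if no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold:
        return "silent"
    # Failed: output too repetitive (compresses too well).
    if compression_ratio(text) > compression_ratio_threshold:
        return "failed"
    # Failed: model too unsure about its own tokens.
    if avg_logprob < logprob_threshold:
        return "failed"
    return "ok"


# Usage: a repetition loop compresses very well and is flagged as failed.
looping = classify_segment("la " * 200, avg_logprob=-0.2, no_speech_prob=0.1)
```

A high compression ratio is a useful signal because a model stuck in a repetition loop produces text that compresses far better than ordinary speech.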

expert.data.annotation.transcribe.perform_word_alignment(tokens, attention_weights, tokenizer, use_space=True, mfcc=None, refine_whisper_precision_nframes=0, remove_punctuation_from_words=False, include_punctuation_in_timing=False, unfinished_decoding=False, alignment_heads=None, medfilt_width=9, qk_scale=1.0, detect_disfluencies=True, subwords_can_be_empty=True)

Perform word alignment on the given tokens and attention weights. Returns a list of (word, start_time, end_time) tuples.

Parameters
  • tokens – List of tokens (integers).

  • attention_weights – List of attention weights (torch tensors).

  • tokenizer – Tokenizer used to tokenize the text.

  • use_space – Whether to use spaces to split the tokens into words (should be true for all languages except Japanese, Chinese, …).

  • mfcc – MFCC features (used to identify the padded region).

  • refine_whisper_precision_nframes – Precision time.

  • remove_punctuation_from_words – Whether to remove punctuation from words.

  • include_punctuation_in_timing – Whether to include punctuation in the timing of (previous) words.

  • unfinished_decoding – Whether the decoding is unfinished (e.g. because the model is stuck).

  • alignment_heads – List of attention heads to use for alignment.

  • medfilt_width – Width of the median filter used to smooth the attention weights.

  • qk_scale – Scale factor applied to the attention weights.
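To give a feel for what medfilt_width controls, here is a minimal one-dimensional median filter in plain Python. It is a sketch of the general technique only: the helper name `median_filter` is hypothetical, it works on a flat list rather than on attention tensors, and it pads by repeating the edge values, which may not match the padding used by the library.

```python
from statistics import median
from typing import List, Sequence


def median_filter(values: Sequence[float], width: int = 9) -> List[float]:
    """Replace each value by the median of a `width`-wide window around it."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    if not values:
        return []
    half = width // 2
    # Repeat the edge values so every position has a full window (assumption).
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(values))]


# Usage: an isolated spike is removed, while a steady ramp is preserved.
smoothed_spike = median_filter([0, 0, 10, 0, 0], width=3)
smoothed_ramp = median_filter([1, 2, 3, 4, 5], width=3)
```

A wider window removes longer bursts of noise but also blurs sharp transitions, which is the trade-off behind choosing a width such as the default of 9.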

expert.data.annotation.transcribe.remove_last_null_duration_words(transcription, words, recompute_text=False)

Remove words with null duration occurring at the end of a chunk (probable Whisper hallucinations).
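The idea of trimming null-duration words from the end of a chunk can be sketched as follows. The helper name `drop_trailing_null_words` and the word layout ("text", "start", "end" keys) are assumptions; unlike the documented function, this sketch only handles the word list and does not touch the transcription or recompute any text.

```python
from typing import Dict, List


def drop_trailing_null_words(words: List[Dict]) -> List[Dict]:
    """Return a copy of `words` without zero-duration words at the end."""
    kept = list(words)
    # Only the tail is trimmed; zero-duration words in the middle are kept.
    while kept and kept[-1]["end"] <= kept[-1]["start"]:
        kept.pop()
    return kept


# Usage: the two trailing zero-duration words are dropped.
chunk = [
    {"text": "thanks", "start": 0.0, "end": 0.5},
    {"text": "for", "start": 0.5, "end": 0.5},
    {"text": "watching", "start": 0.5, "end": 0.5},
]
trimmed = drop_trailing_null_words(chunk)
```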

expert.data.annotation.transcribe.find_start_padding(mfcc)

Return start of padding given the mfcc, or None if there is no padding.

expert.data.annotation.transcribe.ensure_increasing_positions(segments, min_duration: int = 0)

Ensure that “start” and “end” come in increasing order.
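One way to enforce increasing positions is to clamp each segment against the end of the previous one. The sketch below is an assumption about how such a pass could look, not the library's implementation; the helper name `make_increasing` and the segment layout ("start" and "end" keys) are hypothetical.

```python
from typing import Dict, List


def make_increasing(segments: List[Dict], min_duration: float = 0.0) -> List[Dict]:
    """Return copies of `segments` whose start/end values never go backwards."""
    fixed: List[Dict] = []
    previous_end = 0.0
    for segment in segments:
        # A segment may not start before the previous one ended.
        start = max(segment["start"], previous_end)
        # A segment may not end before it starts (plus the minimum duration).
        end = max(segment["end"], start + min_duration)
        fixed.append({**segment, "start": start, "end": end})
        previous_end = end
    return fixed


# Usage: the second segment overlaps the first and ends before it starts.
raw_segments = [
    {"start": 0.0, "end": 1.0},
    {"start": 0.5, "end": 0.4},
    {"start": 2.0, "end": 3.0},
]
ordered = make_increasing(raw_segments)
```

The inputs are left unmodified and corrected copies are returned, which keeps the original timestamps available for comparison.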