Machine Learning Engineer – Speech Processing

Intercambios Transorganicos (Nov 2024 - Today)

Python
Voice activity detection
ASR dataset generator

Overview

Transorganic Exchanges (IT) is a transdisciplinary research team. We develop interactive interfaces that allow the incorporation of new media and technologies in the field of health and education through inclusive strategies. We are interested in helping to build in these areas an approach that organically links the human, subjective, individual and identity with the functional aspects. Our goal is to draw new possibilities in the dialogue between art and science.

Webpage

Work Strategy

Audio Input: An audio file is received containing one or more spoken sections with different speeds.

Speed Categorization: The speaker's speed in different parts of the audio is identified and classified into different levels: slow, normal, and fast.

First Voice Detection: A voice activity detection (VAD) is applied to segment the audio based on moments when speech is present.

Result Comparison: The segments detected by the VAD are compared with the segments previously categorized by the speaker's speed.

Re-Optimization of Segments: If a segment detected by the VAD corresponds to a higher speed category, it is re-optimized to improve detection accuracy.

Adjustment of Original Times: The start and end times of each segment are adjusted to maintain correspondence with the original audio and allow for coherent processing.

Generation of Final Results: A list of optimized segments is obtained, each with its start and end time adjusted, and each segment is saved as a new audio file.