Authors: Alessio Desogus, Haoyu Wen, and Jiewei Li
-
This study explores the factors that influence venture teams' ability to identify new opportunities in the context of emerging technologies. Using a multimodal dataset and machine learning techniques, we introduce three novel metrics and find that skewness is the most predictive.
-
We employ WhisperX to improve the quality of audio transcription and diarization. Compared to previous models such as Pyannote and Deepgram, WhisperX improves the speaker diarization error rate (SDER) by 9% without a margin and by 28% with a margin of 1, indicating a significant enhancement in the accuracy of capturing team dialogues.
-
Furthermore, we introduce three novel metrics: TF-IDF, skewness, and dominance (along with its squared version), and use a negative binomial regression model to predict the number of opportunities identified by these teams. We find that Model 3, which uses the skewness metric, achieves the lowest AIC (668.075), and that skewness has the most significant coefficient (p = 0.034, β = -0.271), indicating that a larger quantity of long continuous speaking time helps develop a more in-depth investigation of ideas and generate more ideas. The random forest model confirms this finding.
- To install `WhisperX`, all the instructions can be found here.
- Before running the `transcribing.ipynb` notebook (to reproduce our results), be sure to have added your personal Hugging Face token `YOUR_HF_TOKEN`:

```python
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
```
- To run the `transcribing.ipynb` notebook on Google Colab, as we did for more computational power, don't forget to run the following cell:

```python
# Mount Google Drive and install WhisperX
from google.colab import drive
drive.mount('/content/drive')

!pip install git+https://github.com/m-bain/whisperx.git
```
- For this project, in addition to `WhisperX` and the basics (`pandas`, `numpy`, `matplotlib`, and other libraries), we used the following specific Python libraries, which can be used directly on Google Colab or installed locally in a conda environment:
  - `nltk` (natural language processing): `conda install -c anaconda nltk`
  - `statsmodels` (negative binomial regression model): `conda install -c anaconda statsmodels`
  - `scikit-learn` (machine learning models and functions): `conda install -c anaconda scikit-learn`
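As an illustration of how a TF-IDF-based uniqueness score can be computed with `scikit-learn` — a minimal sketch with made-up sentences, not the project's actual pipeline or scoring formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the clean team transcripts.
transcripts = [
    "we could build a sensor network for smart farming",
    "a drone platform for crop monitoring and smart farming",
    "an app that tracks personal carbon footprint",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(transcripts)  # rows: teams, cols: terms

# One possible per-team score: the mean TF-IDF weight of the terms
# the team actually uses (rarer vocabulary -> higher weights).
totals = np.asarray(tfidf.sum(axis=1)).ravel()
nonzero_counts = np.asarray((tfidf > 0).sum(axis=1)).ravel()
scores = totals / nonzero_counts

print(dict(zip(["team_1", "team_2", "team_3"], scores.round(3))))
```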
- For each notebook, the only thing necessary to run it is to set your base file path (from Google Drive if running on Google Colab, or the local one) in the following assignment:

```python
base_path = 'drive/MyDrive/...'  # on Google Colab
base_path = '.'                  # locally
```
- Data from the ENTC: `wavs`, `jsons`, `rttms`, and `dataset.csv`
- Data generated by us: `json`, `transcripts_teams.csv`, `transcripts_speakers.csv`, `speaking_time.csv`, and `regression.txt`

📌 NOTE: Only the `.csv` and `.txt` files are provided in this repository for confidentiality purposes.
| File name | File Input Directory | File Input Description | File Output Directory | File Output Description |
|---|---|---|---|---|
| `transcribing.ipynb` | `/wavs` | Audio file of each team meeting (`.wav`) | `/json` | Diarized transcript from WhisperX (`.json`) |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `transcripts_teams.csv`: organized at the team level, this dataset is crafted for team-specific analyses. It provides the team identification numbers, initial transcripts, filtered transcripts, and final clean transcripts. |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `transcripts_speakers.csv`: organized at the speaker level for speaker-specific analyses. It offers a view of individual speaker contributions. |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `speaking_time.csv`: provides a temporal perspective on each team speaker's continuous speaking duration. |
| `main.ipynb` | `/csv` | `dataset.csv`, `transcripts_teams.csv`, `transcripts_speakers.csv`, `speaking_time.csv` | cell output | WhisperX benchmark, negative binomial regression results, and classification results |
| `main.ipynb` | `/jsons` | Diarized transcript from Deepgram (`.json`) | cell output | Deepgram benchmark |
| `main.ipynb` | `/rttms` | Diarized transcript from Pyannote (`.rttm`) | cell output | Pyannote benchmark |
🚀 BEST PRACTICE: If you want to reproduce this work, the best practice is to run `transcribing.ipynb` on Google Colab, as it needs some computational power.
- We used approximately 50 units of computational power from Google Colab with a Tesla V100 GPU, and it took approximately 5 hours to transcribe and diarize the 116 audio files with `WhisperX`.
📌 NOTE: This notebook takes approximately 2 minutes to run in its entirety and can be done locally.
- Description of the dataset: `transcripts_teams.csv` 📜

| Column | Description |
|---|---|
| `team_id` | ID number of the team (name of the audio file) |
| `initial_transcript` | The initial transcript from WhisperX without any modifications |
| `filtered_transcript` | The filtered transcript containing only words with a confidence level > 0.5 |
| `clean_final_transcript` | The final clean transcript filtered with the `clean_text()` function |
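As an illustration of the confidence filter, here is a minimal sketch. The segment layout mirrors WhisperX's word-level aligned output and the 0.5 threshold comes from the column description above, but the exact field names in the project's JSON files are an assumption:

```python
# Keep only words whose alignment confidence exceeds 0.5.
# Field names ("words", "word", "score") mirror WhisperX's aligned
# output but are assumptions here.
segments = [
    {"words": [{"word": "we", "score": 0.9},
               {"word": "uh", "score": 0.3},
               {"word": "pivot", "score": 0.8}]},
]

filtered = [w["word"] for seg in segments
            for w in seg["words"] if w.get("score", 0) > 0.5]
print(" ".join(filtered))  # "we pivot"
```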
- Description of the dataset: `transcripts_speakers.csv` 🎙️

| Column | Description |
|---|---|
| `speaker_id` | ID number of the team speaker |
| `speaker_initial_transcript` | Initial transcript segments of the speaker |
| `speaker_filtered_transcript` | Filtered transcript segments of the speaker with a confidence level > 0.5 |
| `speaker_clean_final_transcript` | Final clean transcript segments of the speaker filtered with the `clean_text()` function |
- Description of the dataset: `speaking_time.csv` ⌛

| Column | Description |
|---|---|
| `Team_id` | ID number of the team (name of the file) |
| `speaker` | ID number of the team speaker |
| `length` | Continuous speaking time for each speaker in seconds |
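As an illustration of how a per-team skewness could be derived from this table — a minimal sketch with made-up rows; the exact formula used in the report may differ:

```python
import pandas as pd

# Hypothetical rows mirroring speaking_time.csv.
df = pd.DataFrame({
    "Team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "speaker": [0, 1, 0, 1, 0, 1, 0, 1],
    "length":  [3.0, 4.0, 5.0, 40.0, 6.0, 7.0, 6.5, 7.5],
})

# Skewness of continuous speaking times per team: a long right tail
# (a few very long turns) yields a large positive skewness.
skew_per_team = df.groupby("Team_id")["length"].skew()
print(skew_per_team)
```

Here team 1 has one very long turn (40 s) and comes out strongly right-skewed, while team 2's near-symmetric turns give a skewness close to zero.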
📌 NOTE: This notebook takes approximately 5 minutes to run in its entirety and can be done locally.
- This is the main notebook, which serves as the central hub for all the results presented in the report. To enhance clarity and keep the code modular, we've also included a helper file (`helpers.py`) that houses the various functions used in `main.ipynb`.
The main structure of the notebook is as follows:
- Parameters Initialization
- Idea 1: Uniqueness of Information through TF-IDF
- Speaker Diarization Error Rate (SDER)
- Idea 2: Speaking Time Concentration
- Negative Binomial Regression as Prediction Model
- Feature Importance through Random Forest Classification
- Appendix: Gaussian Mixture Model (GMM)
Feel free to navigate through the sections to explore specific analyses and findings. 🚀
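The feature-importance step above can be sketched as follows — a minimal example on synthetic data; the actual features and labels come from the project's CSV files, and the feature names here are only placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-ins for the team-level metrics (TF-IDF, skewness, dominance).
n = 200
X = rng.normal(size=(n, 3))
# Make the label depend mostly on the second feature ("skewness").
y = (X[:, 1] + 0.1 * rng.normal(size=n) < 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
for name, imp in zip(["tfidf", "skewness", "dominance"], clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```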