Authors: Alessio Desogus, Haoyu Wen, and Jiewei Li
-
This study explores the factors that influence venture teams' ability to identify new opportunities in the context of emerging technologies. Using a multimodal dataset and machine learning techniques, we introduce three novel metrics and find that skewness is the most predictive.
-
We employ WhisperX to improve the quality of audio transcription and diarization. Compared to previous models such as Pyannote and Deepgram, WhisperX improves the speaker diarization error rate (SDER) by 9% without a margin and by 28% with a margin of 1, indicating a significant enhancement in the accuracy of capturing team dialogues.
-
Furthermore, we introduce three novel metrics: TF-IDF, skewness, and dominance (along with its squared version), and use a negative binomial regression model to predict the number of opportunities identified by these teams. We find that Model 3, which uses the skewness metric, achieves the lowest AIC (668.075), and that skewness has the most significant coefficient (p = 0.034, β = -0.271), indicating that a larger quantity of long continuous speaking time helps develop a more in-depth investigation of ideas and generate more ideas. The random forest model confirms this finding.
- To install `WhisperX`, all the instructions can be found here.
- Before running the `transcribing.ipynb` notebook (to reproduce our results), be sure to have added your personal Hugging Face token `YOUR_HF_TOKEN`:

```python
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
```
- To run the `transcribing.ipynb` notebook on Google Colab, as we did for more computational power, don't forget to run the following cell:

```python
# Mount Google Drive and install WhisperX
from google.colab import drive
drive.mount('/content/drive')

!pip install git+https://github.com/m-bain/whisperx.git
```
- For this project, in addition to `WhisperX` and the basics (`pandas`, `numpy`, `matplotlib`, and other libraries), we used the following specific Python libraries, which can be used directly on Google Colab or installed locally in a conda environment:
  - `nltk` (natural language processing): `conda install -c anaconda nltk`
  - `statsmodels` (negative binomial regression model): `conda install -c anaconda statsmodels`
  - `scikit-learn` (machine learning models and functions): `conda install -c anaconda scikit-learn`
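As an illustration of how a TF-IDF-based uniqueness score can be computed with `scikit-learn` — a minimal sketch with made-up sentences, not the project's actual pipeline or scoring formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the clean team transcripts.
transcripts = [
    "we could build a sensor network for smart farming",
    "a drone platform for crop monitoring and smart farming",
    "an app that tracks personal carbon footprint",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(transcripts)  # rows: teams, cols: terms

# One possible per-team score: the mean TF-IDF weight of the terms
# the team actually uses (rarer vocabulary -> higher weights).
totals = np.asarray(tfidf.sum(axis=1)).ravel()
nonzero_counts = np.asarray((tfidf > 0).sum(axis=1)).ravel()
scores = totals / nonzero_counts

print(dict(zip(["team_1", "team_2", "team_3"], scores.round(3))))
```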
- For each notebook, the only thing necessary to run it is to set your base file path (from Google Drive if running on Google Colab, or the local one) in the following assignment:

```python
base_path = 'drive/MyDrive/...'  # on Google Colab
base_path = '.'                  # locally
```
- Data from the ENTC: `wavs`, `jsons`, `rttms`, and `dataset.csv`
- Data generated by us: `json`, `transcripts_teams.csv`, `transcripts_speakers.csv`, `speaking_time.csv`, and `regression.txt`

📌 NOTE: Only the `.csv` and `.txt` files are provided in this repository for confidentiality purposes.
| File name | File Input Directory | File Input Description | File Output Directory | File Output Description |
|---|---|---|---|---|
| `transcribing.ipynb` | `/wavs` | Audio file of each team meeting (`.wav`) | `/json` | Diarized transcript from WhisperX (`.json`) |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `transcripts_teams.csv`: organized at the team level, this dataset is crafted for team-specific analyses. It provides the team identification numbers, initial transcripts, filtered transcripts, and final clean transcripts. |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `transcripts_speakers.csv`: organized at the speaker level for speaker-specific analyses. It offers a view of individual speaker contributions. |
| `cleaning.ipynb` | `/json` | Diarized transcript from WhisperX (`.json`) | `/csv` | `speaking_time.csv`: provides a temporal perspective on each team speaker's continuous speaking duration. |
| `main.ipynb` | `/csv` | `dataset.csv`, `transcripts_teams.csv`, `transcripts_speakers.csv`, `speaking_time.csv` | cell output | WhisperX benchmark, negative binomial regression results, and classification results |
| `main.ipynb` | `/jsons` | Diarized transcript from Deepgram (`.json`) | cell output | Deepgram benchmark |
| `main.ipynb` | `/rttms` | Diarized transcript from Pyannote (`.rttm`) | cell output | Pyannote benchmark |
🚀 BEST PRACTICE: If you want to reproduce this work, the best practice is to run `transcribing.ipynb` on Google Colab, as it needs some computational power.
- We used approximately 50 units of computational power from Google Colab with a Tesla V100 GPU, and it took approximately 5 hours to transcribe and diarize the 116 audio files with `WhisperX`.
📌 NOTE: This notebook takes approximately 2 minutes to run in its entirety and can be done locally.
- Description of the dataset: `transcripts_teams.csv` 📜

| Column | Description |
|---|---|
| `team_id` | ID number of the team (name of the audio file) |
| `initial_transcript` | The initial transcript from WhisperX without any modifications |
| `filtered_transcript` | The filtered transcript containing only words with a confidence level > 0.5 |
| `clean_final_transcript` | The final clean transcript filtered with the `clean_text()` function |
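As an illustration of the confidence filter, here is a minimal sketch. The segment layout mirrors WhisperX's word-level aligned output and the 0.5 threshold comes from the column description above, but the exact field names in the project's JSON files are an assumption:

```python
# Keep only words whose alignment confidence exceeds 0.5.
# Field names ("words", "word", "score") mirror WhisperX's aligned
# output but are assumptions here.
segments = [
    {"words": [{"word": "we", "score": 0.9},
               {"word": "uh", "score": 0.3},
               {"word": "pivot", "score": 0.8}]},
]

filtered = [w["word"] for seg in segments
            for w in seg["words"] if w.get("score", 0) > 0.5]
print(" ".join(filtered))  # "we pivot"
```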
- Description of the dataset: `transcripts_speakers.csv` 🎙️

| Column | Description |
|---|---|
| `speaker_id` | ID number of the team speaker |
| `speaker_initial_transcript` | Initial transcript segments of the speaker |
| `speaker_filtered_transcript` | Filtered transcript segments of the speaker with a confidence level > 0.5 |
| `speaker_clean_final_transcript` | Final clean transcript segments of the speaker filtered with the `clean_text()` function |
- Description of the dataset: `speaking_time.csv` ⌛

| Column | Description |
|---|---|
| `Team_id` | ID number of the team (name of the file) |
| `speaker` | ID number of the team speaker |
| `length` | Continuous speaking time for each speaker in seconds |
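As an illustration of how a per-team skewness could be derived from this table — a minimal sketch with made-up rows; the exact formula used in the report may differ:

```python
import pandas as pd

# Hypothetical rows mirroring speaking_time.csv.
df = pd.DataFrame({
    "Team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "speaker": [0, 1, 0, 1, 0, 1, 0, 1],
    "length":  [3.0, 4.0, 5.0, 40.0, 6.0, 7.0, 6.5, 7.5],
})

# Skewness of continuous speaking times per team: a long right tail
# (a few very long turns) yields a large positive skewness.
skew_per_team = df.groupby("Team_id")["length"].skew()
print(skew_per_team)
```

Here team 1 has one very long turn (40 s) and comes out strongly right-skewed, while team 2's near-symmetric turns give a skewness close to zero.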
📌 NOTE: This notebook takes approximately 5 minutes to run in its entirety and can be done locally.
- This is the main notebook, which serves as the central hub for all the results presented in the report. To enhance clarity and keep the code modular, we've also included a helper file (`helpers.py`) that houses the various functions used in `main.ipynb`.
The main structure of the notebook is as follows:
- Parameters Initialization
- Idea 1: Uniqueness of Information through TF-IDF
- Speaker Diarization Error Rate (SDER)
- Idea 2: Speaking Time Concentration
- Negative Binomial Regression as Prediction Model
- Feature Importance through Random Forest Classification
- Appendix: Gaussian Mixture Model (GMM)
Feel free to navigate through the sections to explore specific analyses and findings. 🚀
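The feature-importance step above can be sketched as follows — a minimal example on synthetic data; the actual features and labels come from the project's CSV files, and the feature names here are only placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-ins for the team-level metrics (TF-IDF, skewness, dominance).
n = 200
X = rng.normal(size=(n, 3))
# Make the label depend mostly on the second feature ("skewness").
y = (X[:, 1] + 0.1 * rng.normal(size=n) < 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
for name, imp in zip(["tfidf", "skewness", "dominance"], clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```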