Explaining venture teams’ opportunity identification through multimodal data

Authors: Alessio Desogus, Haoyu Wen, and Jiewei Li

Abstract 📝

  • This study explores the factors that influence venture teams' ability to identify new opportunities in the context of emerging technologies. Using a multimodal dataset and machine learning techniques, we introduce three novel metrics and find that skewness is the most predictive.

  • We employ WhisperX to improve the quality of audio transcription and diarization. Compared to previous models such as Pyannote and Deepgram, WhisperX improves the speaker diarization error rate (SDER) by 9% without a margin and by 28% with a margin of 1, a significant gain in the accuracy of capturing team dialogues.

  • Furthermore, we introduce three novel metrics: TF-IDF, skewness, and dominance (along with its squared version), and use a Negative Binomial regression model to predict the number of opportunities identified by these teams. We find that Model 3, built on the skewness metric, achieves the lowest AIC (668.075), and that skewness is the most significant coefficient (p = 0.034, β = -0.271), indicating that longer stretches of continuous speaking time help teams investigate ideas in more depth and generate more of them. The random forest model confirms this finding.
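To make the skewness metric concrete, here is a minimal, illustrative sketch of how the skewness of a team's continuous speaking durations could be computed (standard Fisher-Pearson moment formula in pure Python; the project's actual implementation lives in its notebooks and may differ):

```python
def continuous_speaking_skewness(durations):
    """Fisher-Pearson skewness of a team's continuous speaking durations.

    The sign and magnitude describe how the distribution of turn lengths
    leans toward short interjections versus long uninterrupted turns.
    """
    n = len(durations)
    mean = sum(durations) / n
    m2 = sum((d - mean) ** 2 for d in durations) / n  # second central moment
    m3 = sum((d - mean) ** 3 for d in durations) / n  # third central moment
    return m3 / m2 ** 1.5

# A symmetric distribution of turn lengths has zero skewness
print(continuous_speaking_skewness([2.0, 4.0, 6.0]))  # → 0.0
```

A distribution dominated by short turns with a few very long ones yields a positive value; the regression result above relates this shape statistic to the count of identified opportunities.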

Setup ⚙️

WhisperX Setup:

  • To install WhisperX, follow the instructions in the official repository: https://github.com/m-bain/whisperx.
  • Before running transcribing.ipynb (to reproduce our results), be sure to have added your personal Hugging Face token as YOUR_HF_TOKEN:
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
  • To run the transcribing.ipynb notebook on Google Colab, as we did for more computational power, don't forget to run the following cell first:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install WhisperX (a shell command, hence the leading "!")
!pip install git+https://github.com/m-bain/whisperx.git

Python Libraries Setup:

  • For this project, in addition to WhisperX and the usual pandas, numpy, matplotlib, and other standard libraries, we used the following Python libraries, which can be used directly on Google Colab or installed locally in a conda environment:
  1. nltk: natural language processing (NLP)
conda install -c anaconda nltk
  2. statsmodels: negative binomial regression model
conda install -c anaconda statsmodels
  3. scikit-learn: machine learning models and functions
conda install -c anaconda scikit-learn

Files Path Setup:

  • For each notebook, the only setup required is to set your base file path (the Google Drive path if running on Google Colab, or the local one) in the following variable:
base_path = 'drive/MyDrive/...'  # on Google Colab
base_path = '.'                  # locally
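With base_path set, the notebooks can resolve the data files relative to it. A small illustrative sketch, using the /csv directory and dataset.csv listed in the I/O matrix below:

```python
import os

base_path = '.'  # or 'drive/MyDrive/...' on Google Colab

# Resolve a data file relative to the chosen base path
dataset_path = os.path.join(base_path, 'csv', 'dataset.csv')
print(dataset_path)  # e.g. './csv/dataset.csv' on Colab/Linux
```

Using os.path.join rather than hard-coded separators keeps the same notebook runnable both on Colab and locally.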

Data Origin and Generation 🔍

  • Data from the ENTC: wavs, jsons, rttms and dataset.csv

  • Data generated by us: json, transcripts_teams.csv, transcripts_speakers.csv, speaking_time.csv and regression.txt

📌 NOTE: Only the .csv and .txt files are provided in this repository, for confidentiality reasons.

Data Input-Output Matrix Overview 🔄

| File name | Input Directory | Input Description | Output Directory | Output Description |
|---|---|---|---|---|
| transcribing.ipynb | /wavs | Audio file of each team meeting (.wav) | /json | Diarized transcript from WhisperX (.json) |
| cleaning.ipynb | /json | Diarized transcript from WhisperX (.json) | /csv | transcripts_teams.csv: organized at the team level for team-specific analyses; provides the team identification numbers, initial transcripts, filtered transcripts, and final clean transcripts |
| cleaning.ipynb | /json | Diarized transcript from WhisperX (.json) | /csv | transcripts_speakers.csv: organized at the speaker level for speaker-specific analyses; offers a view of individual speaker contributions |
| cleaning.ipynb | /json | Diarized transcript from WhisperX (.json) | /csv | speaking_time.csv: provides a temporal perspective on each speaker's continuous speaking duration |
| main.ipynb | /csv | dataset.csv, transcripts_teams.csv, transcripts_speakers.csv, speaking_time.csv | cell output | WhisperX benchmark, Negative Binomial regression results, and classification results |
| main.ipynb | /jsons | Diarized transcript from Deepgram (.json) | cell output | Deepgram benchmark |
| main.ipynb | /rttms | Diarized transcript from Pyannote (.rttm) | cell output | Pyannote benchmark |

Audio Transcription and Diarization transcribing.ipynb

🚀 BEST PRACTICE: To reproduce this work, run transcribing.ipynb on Google Colab, as it requires significant computational power.

  • We used approximately 50 Colab compute units with a Tesla V100 GPU, and it took approximately 5 hours to transcribe and diarize the 116 audio files with WhisperX.

Transcript Processing and Cleaning cleaning.ipynb

📌 NOTE: This notebook takes approximately 2 minutes to run in its entirety and can be run locally.

  • Description of the dataset: transcripts_teams.csv 📜

| Column | Description |
|---|---|
| team_id | Id number of the team (name of the audio file) |
| initial_transcript | The initial transcript from WhisperX, without any modifications |
| filtered_transcript | The filtered transcript, containing only words with a confidence level > 0.5 |
| clean_final_transcript | The final clean transcript, filtered with the clean_text() function |
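The confidence filter described above can be illustrated on WhisperX-style word entries. A minimal sketch under that assumption (the word dicts and threshold mirror the description; clean_text() is the project's own helper and is not reproduced here):

```python
def filter_words(words, threshold=0.5):
    """Keep only word entries whose alignment confidence exceeds the threshold."""
    return [w for w in words if w.get("score", 0.0) > threshold]

# Example with WhisperX-style word entries
words = [
    {"word": "great", "score": 0.91},
    {"word": "uh", "score": 0.32},   # low-confidence filler, dropped
    {"word": "idea", "score": 0.77},
]
filtered = filter_words(words)
print([w["word"] for w in filtered])  # → ['great', 'idea']
```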
  • Description of the dataset: transcripts_speakers.csv 🎙️

| Column | Description |
|---|---|
| speaker_id | Id number of the team speaker |
| speaker_initial_transcript | Initial transcript segments of the speaker |
| speaker_filtered_transcript | Filtered transcript segments of the speaker, with a confidence level > 0.5 |
| speaker_clean_final_transcript | Final clean transcript segments of the speaker, filtered with the clean_text() function |
  • Description of the dataset: speaking_time.csv

| Column | Description |
|---|---|
| Team_id | Id number of the team (name of the file) |
| speaker | Id number of the team speaker |
| length | Continuous speaking time for each speaker, in seconds |
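A table like speaking_time.csv can be derived by collapsing consecutive diarized segments from the same speaker into one continuous turn. A minimal sketch under that assumption (the segment format is illustrative, not the exact WhisperX output):

```python
def continuous_speaking_times(segments):
    """Collapse consecutive segments by the same speaker into continuous
    speaking turns and return (speaker, duration_in_seconds) pairs."""
    turns = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if turns and turns[-1][0] == seg["speaker"]:
            # Same speaker keeps the floor: extend the current turn
            speaker, length = turns[-1]
            turns[-1] = (speaker, length + duration)
        else:
            turns.append((seg["speaker"], duration))
    return turns

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_00", "start": 4.0, "end": 7.0},
    {"speaker": "SPEAKER_01", "start": 7.5, "end": 9.5},
]
print(continuous_speaking_times(segments))
# → [('SPEAKER_00', 7.0), ('SPEAKER_01', 2.0)]
```

The resulting durations are exactly the kind of per-turn lengths over which the skewness and dominance metrics are computed.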

Main Notebook Overview main.ipynb

📌 NOTE: This notebook takes approximately 5 minutes to run in its entirety and can be run locally.

  • This is the main notebook, which serves as the central hub for all the results presented in the report. To enhance clarity and keep the code modular, we also include a helper file (helpers.py) that houses the functions used in main.ipynb.

The main structure of the notebook is as follows:

  1. Parameters Initialization
  2. Idea 1: Uniqueness of Information through TF-IDF
  3. Speaker Diarization Error Rate (SDER)
  4. Idea 2: Speaking Time Concentration
  5. Negative Binomial Regression as Prediction Model
  6. Feature Importance through Random Forest Classification
  7. Appendix: Gaussian Mixture Model (GMM)
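To give a flavor of Idea 1, here is a self-contained TF-IDF sketch in pure Python (the project's own computation lives in helpers.py and main.ipynb, so this is only illustrative): a term concentrated in one team's transcript scores higher than a term shared across all teams.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF: term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [
    "drone delivery startup idea".split(),
    "fintech payments startup idea".split(),
]
scores = tf_idf(docs)
# 'drone' is unique to team 0, 'startup' is shared by both teams,
# so 'drone' carries more unique information
print(scores[0]["drone"] > scores[0]["startup"])  # → True
```

Terms that appear in every transcript get an IDF of log(1) = 0, so only team-specific vocabulary contributes to the uniqueness-of-information metric.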

Feel free to navigate through the sections to explore specific analyses and findings. 🚀
