Graduation thesis project. Title: Explainable AI in Job Recommendation System
Four research questions:
- RQ1. How well do State-Of-The-Art (SOTA) algorithms perform on job recommendations?
- RQ2. Which features contribute mainly to the ranking results?
- RQ3. How to evaluate the explanations generated by different XAI techniques?
- RQ4. How to make explanations digestible for lay users?
Data: Kaggle's CareerBuilder 2012
Requirements: important Python libraries
- sklearn, pandas, numpy
- myfm
- interpretml
- SHAP
- LIME
Apart from the public libraries, please import all modules in the utils folder when generating recommendations.
- Final report: link
- High-resolution figures in the report: link
- Summary analysis spreadsheets used in discussion: link
Raw data can be obtained directly from Kaggle's website or from folder data_raw.
4.1: Data pre-processing and Feature Engineering: folder
NOTE: Large datasets have been compressed. Please extract them to their original format (e.g., .csv / .tsv) before running the notebooks.
- 4.1.1 Data cleaning
- 4.1.2 Data augmentation: Negative sampling for interaction data - link
- 4.1.3 Feature Engineering: TF-IDF for both jobs and user history - link
- 4.1.4 Feature Engineering: Generate location matching features
- 4.1.5 Feature Engineering: Transform text features
LDA for jobs - link, LDA for user history - link
- 4.1.6 Feature Engineering: Discretizing user profile features: link
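The negative-sampling step (4.1.2) can be sketched as follows. This is a minimal illustration, not the notebook's exact code: it assumes an interaction table with `UserID`/`JobID` columns, labels observed applications 1, and draws jobs the user never applied to as label-0 negatives.

```python
import numpy as np
import pandas as pd

def negative_sample(interactions: pd.DataFrame, ratio: int = 1, seed: int = 42) -> pd.DataFrame:
    """For each user, keep observed (UserID, JobID) applications as positives
    (label 1) and sample `ratio` unseen jobs per positive as negatives (label 0)."""
    rng = np.random.default_rng(seed)
    all_jobs = interactions["JobID"].unique()
    applied = interactions.groupby("UserID")["JobID"].apply(set)

    rows = []
    for user, pos_jobs in applied.items():
        n_neg = len(pos_jobs) * ratio
        # rejection-sample: draw extra candidates, discard jobs the user applied to
        candidates = rng.choice(all_jobs, size=n_neg * 3, replace=True)
        negs = [j for j in candidates if j not in pos_jobs][:n_neg]
        rows += [(user, j, 1) for j in pos_jobs]
        rows += [(user, j, 0) for j in negs]
    return pd.DataFrame(rows, columns=["UserID", "JobID", "label"])

# toy interaction data (illustrative only)
interactions = pd.DataFrame({"UserID": [1, 1, 2], "JobID": [10, 11, 10]})
sampled = negative_sample(interactions)
```

The actual pipeline may control the positive/negative ratio differently; the key invariant is that no sampled negative collides with a user's real application history.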
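For the TF-IDF features (4.1.3), a minimal sketch of the idea: fit one vocabulary over job texts and user histories so both live in the same vector space, then use cosine similarity as a user-job matching feature. The texts below are placeholders, not CareerBuilder data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_texts = ["python developer machine learning", "nurse hospital care"]
user_histories = ["applied python data engineer roles"]

# fit a single vocabulary over both corpora so the vectors are comparable
vec = TfidfVectorizer(stop_words="english")
vec.fit(job_texts + user_histories)
job_mat = vec.transform(job_texts)
user_mat = vec.transform(user_histories)

# cosine similarity between a user's history and every job description
sims = cosine_similarity(user_mat, job_mat)
```

The same pattern applies to the LDA features in 4.1.5, with topic distributions in place of TF-IDF vectors.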
4.2: Generating potential applications: folder
- 4.2.1 Potential application generation by random sampling with control on positive label
- 4.2.2 Potential application generation by unsupervised KNN models (2 variations: knn_lda, knn_tfidf)
You can re-train the models using notebooks or download the pre-trained models.
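The unsupervised KNN candidate generation (4.2.2) can be sketched with sklearn's `NearestNeighbors`: index the job vectors (TF-IDF or LDA, hence the knn_tfidf / knn_lda variations) and retrieve the k jobs closest to each user's profile vector as potential applications. The 2-D vectors here are toy stand-ins.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy dense vectors standing in for the job TF-IDF (or LDA) matrix
job_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
user_vecs = np.array([[1.0, 0.05]])  # one user's profile vector

# cosine distance matches the text-similarity setting
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(job_vecs)
dist, idx = knn.kneighbors(user_vecs)
# idx[0] holds the indices of the 2 jobs most similar to this user
```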
- White-box and black-box models: 7 models link, pre-trained models
- Factorization Machine models: 4 models link. You can re-train the models using the notebook (~5-10 mins) or download the pre-trained models. The pickled pre-trained models are large (>10 GB/model) and need to be downloaded separately: GoogleDrive link
- Explainable Boosting Machine models: 3 EBM models and 3 DPEBM models link, with pre-trained models
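The Factorization Machine models above are trained with myfm; as a sketch of what an FM actually scores (not myfm's API), here is the standard prediction equation in numpy, using the O(kn) identity for the pairwise-interaction term. All parameters below are random toy values.

```python
import numpy as np

def fm_predict(X, w0, w, V):
    """Factorization Machine score:
    y = w0 + X.w + sum_{i<j} <v_i, v_j> x_i x_j,
    computed via 0.5 * sum_f [ (X V)_f^2 - (X^2)(V^2)_f ]."""
    linear = X @ w
    inter = 0.5 * np.sum((X @ V) ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return w0 + linear + inter

rng = np.random.default_rng(0)
X = rng.random((4, 6))   # 4 examples, 6 one-hot/numeric features
w0, w = 0.1, rng.random(6)
V = rng.random((6, 3))   # rank-3 latent factor matrix
scores = fm_predict(X, w0, w, V)
```

The latent factors `V` are what let the FM generalize over sparse (UserID, JobID) crosses that never co-occur in training.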
Generate top-20 recommendations: 20 jobs per user
Output format: UserID, JobID, Y_pred, Y_prob, rank
(Y_pred: predicted label, Y_prob: probability of prediction, rank: ranking based on probability)
Each model has 2 potential sources of applications. Please import all modules in the utils folder when generating recommendations.
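Producing the output format above (UserID, JobID, Y_pred, Y_prob, rank) from raw model probabilities can be sketched in pandas; the probabilities below are made up, and the 0.5 threshold for Y_pred is an assumption.

```python
import pandas as pd

preds = pd.DataFrame({
    "UserID": [1, 1, 1, 2],
    "JobID":  [10, 11, 12, 10],
    "Y_prob": [0.9, 0.2, 0.7, 0.6],
})
# predicted label from the probability (0.5 threshold assumed)
preds["Y_pred"] = (preds["Y_prob"] >= 0.5).astype(int)
# rank jobs within each user by probability, highest first
preds["rank"] = (preds.groupby("UserID")["Y_prob"]
                      .rank(ascending=False, method="first")
                      .astype(int))
top20 = (preds[preds["rank"] <= 20]
         .sort_values(["UserID", "rank"])
         [["UserID", "JobID", "Y_pred", "Y_prob", "rank"]])
```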
- 4.5.2 Global explanation by model-specific approach: EBM models link
- 4.5.3 Global explanation by model-specific approach: DPEBM models link
- 4.5.4 Global self-explanation by white-box models and XGBoost link
- 4.5.5 KernelSHAP: Local feature importance link, output
- 4.5.6 LIME: Local feature importance link, output