Click above for the app :)!

Joblisting-Modeling

A machine learning project on the job listings scraped in the Joblisting Webscraper project and explored in Joblisting Cleaning EDA.

Motivation

In short, I wanted to learn more about the data science pipeline and to venture into the unknown! I've done countless scikit-learn projects, but I pursued this one because I wanted something more in-depth.

Structure


Figure 1. Data science lifecycle.

This project is one step in a larger, three-part project! The projects in this series are:

  1. Joblisting-Webscraper
  2. Joblisting-Cleaning-EDA
  3. Joblisting-Modeling

About the structure of this repo:

  • csv stores the CSVs I generated from the previous project
  • diagrams stores the diagrams from the past 2 projects and this project
  • img stores images from the past 2 projects and this project
  • input stores the dataset I scraped, along with the split and preprocessed data
  • pages stores the subpage for my app
  • pipelines stores the pipelines I tested in modeling.ipynb
  • 1_🧠_Predictive_Modeling.py is the main page of my streamlit app
  • modeling.ipynb is the source code
  • resources.md is a list of all the resources I reference throughout this project
  • utils.py contains a few helper functions for my app

Note: the package versions listed in requirements.txt may not be the exact versions I ran my code with. The exact versioning is less important here; what matters is that every library I used is listed.
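If you'd like to see which versions you actually have installed before running anything, here is a minimal sketch using only the standard library. The `installed_versions` helper and the `PACKAGES` list are hypothetical (they are not part of this repo), and the list is just an assumed subset of the libraries mentioned in this README.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical subset of the libraries this project mentions.
PACKAGES = ["scikit-learn", "streamlit", "optuna"]

for name, ver in installed_versions(PACKAGES).items():
    # Print in requirements.txt style so the output is easy to compare.
    print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```

Comparing this output against requirements.txt is a quick way to spot a mismatch if something fails to import.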

Dataset

A little about the dataset: the data was webscraped from Glassdoor.com's job listings for data science jobs. I used my own webscraper for it! That can be found here: https://github.com/alckasoc/Joblisting-Webscraper. The dataset is small and can be found in this repo under input. As an alternative, I've also stored this on Kaggle publicly: https://www.kaggle.com/datasets/vincenttu/glassdoor-joblisting.

Difficulties

Note: I talk more about this in the app! I faced a ton of difficulties going into this project. For one, prior to this, I had only ever made simple projects that modeled tidy data in fixed environments without much depth. Venturing into the unknown meant a lot of searching, reading, and learning! Along the way, I ran into countless problems, both code-wise and model-wise.

What I Learned

Note: I talk more about this in the app! I learned more about each step of the machine learning pipeline. I had never gone this in-depth on any of these topics, whether feature engineering or hyperparameter tuning. In this project, I aimed to flesh out each and every aspect to the best of my ability. I learned tools like Optuna, Ray Tune, and Hyperopt for hyperparameter tuning. I learned various feature engineering methods and libraries like lofo. I learned a bit about AutoML through tools like autofeat and EvalML. More importantly, I learned about the experimentation process and how crucial it is to have a strong cross-validation framework for testing what works and what doesn't. This is something every Kaggler knows!
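To give a feel for what a cross-validation framework can look like, here is a minimal scikit-learn sketch. It is not the code from modeling.ipynb: the synthetic data, the regression framing, and the two candidate models are all assumptions chosen for illustration. The key idea is that every candidate is scored on the same fixed folds, so the comparison is fair.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the real project uses the job listings.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Fix the folds once so every candidate is scored on identical splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidates; swap in whatever pipelines you want to compare.
candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores  # flip the sign to get plain RMSE per fold
    print(f"{name}: RMSE {results[name].mean():.2f} +/- {results[name].std():.2f}")
```

Looking at the spread across folds, and not just the mean, helps tell a real improvement from noise when trying out a new feature or hyperparameter.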

References

For information on references and resources, refer to resources.md.

Author Info

Contact me:

Gmail: [email protected]
LinkedIn: Vincent Tu
Kaggle: vincenttu

Thank you

Psst, I've written another thank-you note in my app (check it out). I'd just like to reiterate that I'm grateful for the tools, documentation, and articles available to me. They have been a great help, and without them this project would've been much like any other I've made! And thank you again, Catherine, for your help with the visuals and banners! This project would be incomplete without you. ❤️

Lastly, thank you for viewing!