
Project Title

Job Title Prediction from Tweets Using Word Embedding and Deep Neural Networks

Description

Here is the official implementation of our paper accepted at the ICEE conference.

Getting Started

Dependencies

  • Twint library
  • Emojis library
  • GloVe pre-trained model

Installing

  1. Installing twint:
    • pip3 install twint
  2. Installing emojis:
    • pip3 install emojis
  3. Download the GloVe zip file from here
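
Once the GloVe zip file is downloaded and unzipped, the vectors can be loaded into memory. Below is a minimal sketch, assuming the glove.6B.300d.txt file and a simple dictionary-based loader; the repository's own loading code may differ.

    # Minimal sketch: load GloVe vectors into a word -> vector dictionary.
    # glove.6B.300d.txt is an assumed file name; use whichever file you downloaded.
    import numpy as np

    def load_glove(path="glove.6B.300d.txt"):
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors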

Executing program

  • Folders
    • The "Excel_files" folder contains four files: the list of users together with their features, updated at each stage of the process. You can find the users' complete information in User.FInfo.xlsx and the final preprocessed dataset in FDF.xlsx. The "Extract_data" folder contains five files, which implement the extraction steps used to construct the whole dataset.
    • After that, we preprocessed the dataset; the embedding and cleaning steps are located in the "Preprocessing" folder.
    • Finally, we fed the dataset to the models designed in the "Models" folder.
  • Step-by-step
    • First, we searched Twitter for two kinds of terms: emojis and hashtags. The list of hashtags is provided in the Hashtag.xlsx file, while the emojis are listed inside the code. Tweets with fewer than 50k likes were removed, which means we were targeting well-known groups of users, especially celebrities and influencers on Twitter. The code for these two search methods is located in the "Extract_data" folder, in Search_by_hashtags.py and Search_by_emojis.py (a minimal twint sketch of this search is given after this list).
    • After finding users with these two methods, we extracted each user's tweets by searching for their username. We applied two filters when collecting tweets:
      • Only English-language tweets were extracted.
      • We kept the last 10 to 40 tweets per user, depending on how many tweets were available for that user. The whole process is in Extract_tweets.py inside the "Extract_data" folder (a per-user extraction sketch follows after this list).
    • Next, we retrieved each user's bio by looking up the user's profile. The code is in the "Extract_data" folder, in Extract_bio.py.
    • The final extraction step is to find the user's jobs. We used Wikipedia as the reference source for user information, especially occupation. We developed an algorithm to separate the job phrase from the user's Wikipedia summary, and then a cleaning algorithm that returns all the jobs the user is involved in, separated by commas. For example, if the user's Wikipedia summary is "John Joseph Nicholson is an American retired actor and filmmaker whose career spanned more than 50 years.", the output of the process is "actor, filmmaker". You can find the code in the "Extract_data" folder, in Extract_jobs.py (a rough illustration of this idea is sketched after this list).
      Here are the 20 most frequent unigrams appearing in the job titles: [Figure: newplot]
    • After preparing the dataset, we applied the preprocessing steps. We first employed the GloVe pre-trained model for embedding. To keep the model general, we kept only the first two of the (possibly many) jobs each user has. We then converted each job title to a single word, since we measure similarity between individual terms: for example, the string "news anchor" is converted to the combination of "news" and "anchor" and then replaced by the word most similar to that combination. After that, we used K-means clustering to group the job titles so that jobs with similar meanings fall into the same cluster. Using the elbow method, we found 9 unique labels, as shown below (a small embedding-and-clustering sketch is given after this list):
      [Figure: fig3]
      Also, you can see the patterns appearing in each label:
      [Figure: pattern]
      The code is located in the "Preprocessing" folder, in User_jobs_embedding.ipynb.
    • We also proposed three methods for handling hashtags and compared their performance. You can find the code in the "Preprocessing" folder, in User_finall_processing.ipynb.
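
For reference, here is a minimal sketch of how the hashtag/emoji search described above could look with the twint library. The helper name, search term, and tweet limit are illustrative assumptions; the actual configuration lives in Search_by_hashtags.py and Search_by_emojis.py.

    # Sketch only: search Twitter for a hashtag or emoji and keep popular tweets.
    import twint

    def search_popular(term, min_likes=50_000, limit=500):
        c = twint.Config()
        c.Search = term          # e.g. "#oscars" or a single emoji character
        c.Lang = "en"            # English tweets only
        c.Min_likes = min_likes  # drop tweets with fewer than 50k likes
        c.Limit = limit
        c.Pandas = True          # collect the results into a pandas DataFrame
        c.Hide_output = True
        twint.run.Search(c)
        return twint.storage.panda.Tweets_df

    hashtag_users = search_popular("#oscars")["username"].unique()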
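The per-user extraction (tweets and bio) can be sketched along the same lines. The function names and exact filters below are assumptions; the real logic is in Extract_tweets.py and Extract_bio.py.

    # Sketch only: fetch a user's recent English tweets and their profile bio.
    import twint

    def fetch_tweets(username, max_tweets=40):
        c = twint.Config()
        c.Username = username
        c.Lang = "en"          # keep only English-written tweets
        c.Limit = max_tweets   # the last 10-40 tweets are kept per user
        c.Pandas = True
        twint.run.Search(c)
        return twint.storage.panda.Tweets_df

    def fetch_bio(username):
        c = twint.Config()
        c.Username = username
        c.Pandas = True
        twint.run.Lookup(c)
        return twint.storage.panda.User_df["bio"].iloc[0]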
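The job-extraction idea can be illustrated with the wikipedia package and a simple pattern over the lead sentence of a page summary. This is only a rough approximation of the algorithm in Extract_jobs.py; the regular expression and the filter-word list are assumptions.

    # Rough illustration: pull job words from the first sentence of a Wikipedia summary.
    import re
    import wikipedia  # pip3 install wikipedia

    NON_JOB = {"american", "british", "iranian", "retired", "former"}  # assumed filter list

    def extract_jobs(name):
        summary = wikipedia.summary(name, sentences=1)
        # Capture the phrase that follows "is a/an ..." up to a clause boundary.
        match = re.search(r"\bis an? ([^.,;]+?)(?: who| whose| known|\.|,|$)", summary)
        if not match:
            return ""
        words = [w.lower() for w in re.split(r"\band\b|,|\s+", match.group(1)) if w]
        return ", ".join(w for w in words if w not in NON_JOB)

    print(extract_jobs("Jack Nicholson"))  # e.g. "actor, filmmaker"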
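Finally, the embedding-and-clustering step can be sketched with gensim and scikit-learn. The file name, toy job list, and the way k is chosen below are illustrative; the actual pipeline is in User_jobs_embedding.ipynb.

    # Sketch only: map each job title to a single GloVe word, then cluster the vectors.
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    # GloVe text files can be read directly with gensim >= 4 using no_header=True.
    glove = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False, no_header=True)

    def job_to_word(job):
        """Collapse a multi-word job ("news anchor") into its most similar single word."""
        words = [w for w in job.lower().split() if w in glove]
        return glove.most_similar(positive=words, topn=1)[0][0] if words else None

    jobs = ["news anchor", "actor", "filmmaker", "singer", "footballer"]  # toy example
    single = [job_to_word(j) for j in jobs]
    X = np.stack([glove[w] for w in single if w])

    # Elbow method: inspect inertia over k and pick the bend (the paper finds 9 labels).
    inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, len(X) + 1)]
    labels = KMeans(n_clusters=min(9, len(X)), n_init=10).fit_predict(X)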

Results

Finally, we evaluate the dataset with the three models designed in the "Models" folder. The performance obtained with each method and the classification report of the DDN model are shown below:
[Figure: Model_summary]

Help

If you run into connection issues such as "Cannot connect to host twitter.com", check out this link.

Authors

Acknowledgments

I would like to thank the Adak Vira Iranian Rahjo (Avir) company for its contribution to this project.