UCLA 17S CS 249 Final Project

Dataset

The data come from the Stack Exchange Data Dump

Overall Framework

  1. Feature Extraction
  2. Learning-to-Rank
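
The two stages form a pipeline: feature extraction turns each (question, answer) pair into a vector, and the learning-to-rank stage orders each question's answers by a learned score. Below is a minimal sketch of one way the ranking stage could work, reducing ranking to binary classification on pairwise feature differences (RankNet-style); the model choice, feature shapes, and all variable names are illustrative assumptions, not this project's actual implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-in: 3 questions, each with an (n_answers, n_features) matrix
# and the index of its best answer (the ground truth).
features = {q: rng.normal(size=(4, 7)) for q in range(3)}
best = {0: 1, 1: 0, 2: 3}

def pairwise_dataset(features, best):
    # Label each (best - other) feature difference 1 and the reversed
    # difference 0, then fit a binary classifier on the differences.
    X, y = [], []
    for qid, F in features.items():
        b = best[qid]
        for i in range(len(F)):
            if i != b:
                X.append(F[b] - F[i]); y.append(1)
                X.append(F[i] - F[b]); y.append(0)
    return np.array(X), np.array(y)

X, y = pairwise_dataset(features, best)
ranker = LogisticRegression().fit(X, y)

# At test time, rank a question's answers by the learned linear score.
test_answers = rng.normal(size=(5, 7))
print(np.argsort(-(test_answers @ ranker.coef_.ravel())))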

Implemented Features

  • User Features
    • user_age [1 numerical feature]
    • user_badge [categorical features]
    • user_reputation [1 numerical feature]
    • user_views [1 numerical feature]
    • user_votes [1 numerical feature]
  • User-User Features
    • user-user interactions [1 numerical feature]
  • Post Features
    • comment_cnt [1 numerical feature]
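
A minimal sketch of how the features above might be assembled into a single vector per answer, concatenating the numerical features with a one-hot encoding of the categorical user_badge; the badge categories, dict fields, and example values are assumptions for illustration.

import numpy as np

BADGE_LEVELS = ["none", "bronze", "silver", "gold"]  # assumed categories

def answer_feature_vector(user, answer):
    # One-hot encode the categorical user_badge feature.
    badge = np.zeros(len(BADGE_LEVELS))
    badge[BADGE_LEVELS.index(user["badge"])] = 1.0
    numeric = [
        user["age"],            # user_age
        user["reputation"],     # user_reputation
        user["views"],          # user_views
        user["votes"],          # user_votes
        answer["interaction"],  # user-user interactions with the asker
        answer["comment_cnt"],  # comment_cnt on the post
    ]
    return np.concatenate([numeric, badge])

vec = answer_feature_vector(
    {"age": 3.2, "badge": "silver", "reputation": 1520, "views": 87, "votes": 40},
    {"interaction": 2, "comment_cnt": 5},
)
print(vec.shape)  # (10,): 6 numerical dimensions + 4 one-hot badge dimensions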

Instructions

Downloading Data

Download the raw XML dumps (Posts.xml, Users.xml, etc.) for the chosen site from the Stack Exchange Data Dump and place them in raw/[name of dataset]/; the file names must not be modified.

Preprocessing

cd src/preprocess
./preprocess.py [name of dataset]
  • Convert the format from XML to JSON
  • Convert HTML-like contents into plaintext
  • Link each question to the corresponding answers
    • See data/[name of dataset]/question_answer_mapping.json after preprocessing
  • Split the whole set into training and testing sets
    • See data/[name of dataset]/train.* and data/[name of dataset]/test.*
    • Questions without a best answer (the ground truth) and questions with fewer than two answers are removed
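
The internals of preprocess.py are not shown here, but a minimal sketch of the plaintext-conversion and question-answer linking steps might look as follows, assuming the standard Stack Exchange dump schema (Posts.xml holds one row element per post, with PostTypeId 1 for questions and 2 for answers); the paths and schema details are assumptions.

import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Convert HTML-like post bodies into plaintext by keeping text nodes.
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def to_plaintext(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()

print(to_plaintext("<p>Use <code>sort</code>, it is faster.</p>"))

mapping = {}  # question Id -> list of answer Ids
for _, row in ET.iterparse("raw/StackOverflow/Posts.xml"):
    if row.tag == "row" and row.get("PostTypeId") == "2":
        # An answer's ParentId is the Id of its question.
        mapping.setdefault(row.get("ParentId"), []).append(row.get("Id"))
    row.clear()  # keep memory bounded on large dumps

with open("data/StackOverflow/question_answer_mapping.json", "w") as f:
    json.dump(mapping, f)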

Example of Feature Extraction

Extract the user_age feature

cd src/feature_extraction
./user_age.py [name of dataset]
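
A minimal sketch of what such a script might compute, assuming user_age is the account age in days derived from the CreationDate attribute in Users.xml; the snapshot date, attribute names, and output path are assumptions for illustration.

import json
import xml.etree.ElementTree as ET
from datetime import datetime

DUMP_DATE = datetime(2017, 6, 1)  # assumed snapshot date of the dump

ages = {}
for _, row in ET.iterparse("raw/StackOverflow/Users.xml"):
    if row.tag == "row":
        # CreationDate is an ISO-8601 timestamp in the dump schema.
        created = datetime.fromisoformat(row.get("CreationDate"))
        ages[row.get("Id")] = (DUMP_DATE - created).days
    row.clear()

with open("data/StackOverflow/user_age.json", "w") as f:
    json.dump(ages, f)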

Directory Descriptions

The following descriptions briefly explain the purpose of each directory.

  • raw/: The directory for the raw data (e.g., Posts.xml, Users.xml)
    • raw/[name of dataset]/: the corresponding raw data for a certain dataset (e.g., StackOverflow)
    • Note that the file names should not be modified.
  • data/: The directory for the preprocessed data
    • data/[name of dataset]/: the corresponding preprocessed data for a certain dataset (e.g., StackOverflow)
  • src/: The directory for all source code
    • src/preprocess/: Code for preprocessing the raw data
  • model/: The directory for models trained on a large English corpus