A Policy Retrieval and Recommendation System Developed with PLM-BERT and Graph Inference
Explore the docs »
2023/4/15 - The project has just been deployed on the server: Link. 2023/5/19 - The results are out; off to Nanjing for the final defense!
Table of Contents
This system consists of two parts: a policy retrieval system and a policy recommendation system.
As shown in the figure, the retrieval system is divided into two channels: keyword retrieval and semantic retrieval. We use Elasticsearch (ES) for keyword search. Our semantic retrieval channel mainly consists of 4 modules: an encoding module, a community embedding module, a high-dimensional vector retrieval module, and a triplet-loss training module. Finally, we use XGBoost to fuse the results from the different channels.
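As a rough illustration of the keyword channel, a minimal sketch of an Elasticsearch query is shown below (assuming the elasticsearch-py 8.x client; the host, index name, and field names are placeholders, not the project's actual configuration):

```python
from elasticsearch import Elasticsearch

# Hypothetical keyword-retrieval channel: the host, index name ("policy") and
# field names ("title", "body") are assumptions for illustration only.
es = Elasticsearch("http://localhost:9200")

def keyword_search(query_text, size=50):
    resp = es.search(
        index="policy",
        query={"multi_match": {"query": query_text, "fields": ["title^2", "body"]}},
        size=size,
    )
    # Return (policy id, BM25 score) pairs for the later fusion step.
    return [(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]]
```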
The input data consists of the title, body, and other attributes of each policy. We use a PLM (e.g., BERT/RoBERTa) to extract semantic information from a policy's title and body, and use one-hot embedding to embed the other attributes of the policy.
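A minimal sketch of this encoding step, assuming a BERT-family checkpoint such as bert-base-chinese and an illustrative attribute vocabulary (both are placeholders, not necessarily what the project uses):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder PLM checkpoint; the project may use a different BERT/RoBERTa model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_text(text):
    # Encode a policy title or body and take the [CLS] vector as its embedding.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state[:, 0].squeeze(0)

def onehot(value, vocab):
    # One-hot embedding for a categorical policy attribute.
    vec = torch.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

# Illustrative attribute vocabulary; the real attribute set comes from the data.
title_vec = encode_text("an example policy title")
attr_vec = onehot("provincial", ["national", "provincial", "municipal"])
policy_vec = torch.cat([title_vec, attr_vec])   # concatenated policy representation
```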
The community embedding module and the high-dimensional vector retrieval module are implemented by my teammates.
With the community embedding module, we can split all policies into different communities. To help the language model better distinguish policies from different communities, we build triplets of the form $(anchor, positive, negative)$ from the community graph and fine-tune the PLM with triplet loss.
Besides, considering the difference between a policy title and a policy body, we treat the title as a special kind of keyword. Therefore, we split our retrieval system into another two channels: Title2Body and Body2Body. In the Title2Body channel, we use the concatenation of the policy title and other attributes as anchor points, and bodies as positive and negative points. In the Body2Body channel, we use policy bodies as anchor, positive, and negative points.
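A minimal sketch of one fine-tuning step with triplet loss, where `encode` is a stand-in for the PLM encoder run with gradients enabled (for Title2Body the anchor is the title plus attributes, for Body2Body it is a body); the function and argument names are illustrative:

```python
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

def training_step(encode, anchor_text, positive_text, negative_text, optimizer):
    a = encode(anchor_text)    # title+attributes (Title2Body) or body (Body2Body)
    p = encode(positive_text)  # body of a policy from the same community
    n = encode(negative_text)  # body of a policy from a different community
    loss = triplet_loss(a, p, n)   # pull anchor toward positive, push away from negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```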
The two retrieval channels therefore yield two models. We feed the original policy data into both models in parallel, and each model's output is passed to the high-dimensional vector retrieval module to produce the semantic retrieval results.
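The high-dimensional vector retrieval module itself is implemented by teammates; purely as an illustration, a nearest-neighbour search over the encoded policy vectors could look like the following FAISS sketch (FAISS is an assumption here, not necessarily what the module actually uses):

```python
import numpy as np
import faiss

dim = 768                                                     # e.g. BERT hidden size
policy_vecs = np.random.rand(10000, dim).astype("float32")    # placeholder policy embeddings
faiss.normalize_L2(policy_vecs)                               # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)
index.add(policy_vecs)

query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)                     # top-10 semantically similar policies
```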
To sum up, we obtain a three-channel result: ES, Title2Body, and Body2Body. The intersection of these results is the most appropriate, but computing it is also time-consuming. Thus, we use an XGBoost model (or other algorithms such as TA) to learn the ranking within the intersection. In detail, we calculate the average similarity within the intersection, set the similarity of policies not in the intersection to 0, and then apply XGBoost to fit this mean similarity.
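A minimal sketch of this fusion step under stated assumptions (random placeholder similarities, a toy intersection test, and illustrative hyperparameters):

```python
import numpy as np
import xgboost as xgb

# Placeholder per-channel similarities: one column each for ES, Title2Body, Body2Body.
sims = np.random.rand(1000, 3)

# Toy intersection test; in practice a policy is "in the intersection" when all
# three channels return it.
in_intersection = (sims > 0.5).all(axis=1)

# Target: mean similarity over the intersection, 0 for policies outside it.
target = np.where(in_intersection, sims.mean(axis=1), 0.0)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(sims, target)

# At query time, rank candidates by the fused score predicted from the three channels.
ranking = np.argsort(-model.predict(sims))
```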
The traditional method in the recommendation field is collaborative filtering, which ignores the original features of users and items. Inspired by GCMC and LSR, we treat the recommendation task as link prediction, using a graph convolutional model to gather features between users and items, and further treating the graph structure as a learnable latent parameter.
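A minimal sketch of this idea (not the exact GCMC/LSR architecture): a single graph convolution over the joint user-item graph plus a learnable residual adjacency as the latent structure, with a dot-product link score; the dimensions and the dense adjacency are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GCNLinkPredictor(nn.Module):
    def __init__(self, num_nodes, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        # Latent, learnable correction to the observed graph structure.
        self.adj_residual = nn.Parameter(torch.zeros(num_nodes, num_nodes))

    def forward(self, x, adj):
        # x: (N, in_dim) user/item features; adj: (N, N) normalized observed adjacency.
        adj_eff = adj + torch.sigmoid(self.adj_residual)   # learned latent structure
        return torch.relu(adj_eff @ self.lin(x))           # aggregate neighbour features

    def score(self, h, users, items):
        # Link-prediction score for (user, item) index pairs.
        return (h[users] * h[items]).sum(dim=-1)
```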
My code works with the following environment.
python=3.7
pytorch=1.13.1+cu116
transformers=4.18.0
ubuntu 20.04
pip install -r requirements.txt  # install requirements
- Download data from dataLink
- Put the downloaded data under ./init_data_process/data
- Preprocess the data:
  cd ./init_data_process
  python preprocess2DBLP.py
- The result used to build the community is stored under ./init_data_process/results
- Produce random anchors:
  python Random_sample.py
- The random sample result is stored under ./init_data_process/results/random_sample
- Produce triplets:
  python read_sample.py
- Put triplets_body.csv, data_sample.csv, and category_index.txt produced in Preprocess under ./train_model/data
- Put the original policy data policyinfo_new.tsv under ./train_model/Conver2vec/data
- Train and evaluate the model:
  cd ../
  sh run_BERT_MLP.sh gpu_id # train model
  sh test_BERT_MLP.sh gpu_id # evaluate
- Index_type: 'Title' means the Title2Body channel; 'Body' means the Body2Body channel.
- Convert the original policy data into vectors:
  cd ./Conver2vec
  sh convert_data.sh gpu_id
- Put the results of the 3 channels under ./Fusion/data
- Fuse the channel results:
  cd ../../Fusion
  python TA.py
- Run UserCF:
  cd ../UserCF
  python UserCF.py
This project is licensed under the MIT License - see the LICENSE file for details.