Source Code For the ICLR Submission "More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness"
.
├── README.md
├── attribution/
│   ├── convert_lora.py
│   ├── attribute_sft.py
│   ├── attribute_ppo.py
│   └── attribute_dpo.py
├── data/
│   ├── toxicity/
│   ├── stereotype/
│   ├── machine_ethics/
│   ├── truthfulness/
│   └── privacy/
├── toxicity_eval/
├── bias_eval/
├── ethics_eval/
├── truthfulness_eval/
└── privacy_eval/
For RLHF, we rely on existing implementations from https://github.com/eric-mitchell/direct-preference-optimization and https://github.com/lauraaisling/trlx-pythia/.
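For reference, a minimal sketch of the DPO objective is shown below. It is illustrative only; the actual PPO and DPO training code lives in the repositories linked above.

```python
# Illustrative sketch of the DPO loss, computed from sequence log-probabilities
# under the policy and the frozen reference model (not the training code we used;
# see the linked repositories for that).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Encourage a larger implicit-reward margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```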
The evaluation code for each trustworthiness aspect is included in the corresponding _eval/
directory.
To reproduce the results, first run the files ending with _exp.py
to run the evaluation and store the results, and then run the files ending with _analysis.py
to parse the model generations and compute the final scores. Remember to replace ...
with actual model paths.
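A rough sketch of this evaluate-then-analyze workflow is given below. The run_exp.py and run_analysis.py names are placeholders (check each _eval/ directory for the actual script names), and the ... placeholders inside the scripts must be edited to point to your models before running.

```python
# Sketch of the two-step evaluation workflow described above.
# Script names are placeholders; edit the ... model-path placeholders inside
# each script before running.
import subprocess

EVAL_DIRS = ["toxicity_eval", "bias_eval", "ethics_eval",
             "truthfulness_eval", "privacy_eval"]

for d in EVAL_DIRS:
    # Step 1: run the evaluation and store the model generations.
    subprocess.run(["python", f"{d}/run_exp.py"], check=True)
    # Step 2: parse the generations and compute the final scores.
    subprocess.run(["python", f"{d}/run_analysis.py"], check=True)
```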
The code for our data attribution experiments is included in the attribution/
directory, which consists of the attribution code for SFT, PPO, and DPO, as well as the code to convert our fully fine-tuned models to LoRA-based models.
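As a rough illustration of what converting a fully fine-tuned checkpoint into LoRA form can involve, each weight delta can be approximated by a truncated SVD. This is a generic sketch under the usual low-rank-delta assumption, not necessarily what convert_lora.py implements.

```python
# Generic sketch: approximate the delta between a fully fine-tuned weight matrix
# and its base counterpart with a rank-r factorization via truncated SVD, i.e.
# express the delta in LoRA form (illustration only, not the repo's convert_lora.py).
import torch

def lowrank_delta(w_base: torch.Tensor, w_finetuned: torch.Tensor, rank: int = 8):
    delta = (w_finetuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    lora_B = U[:, :rank] * S[:rank]   # shape (out_features, rank)
    lora_A = Vh[:rank, :]             # shape (rank, in_features)
    return lora_A, lora_B             # delta ≈ lora_B @ lora_A
```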