- We have m characters in total, each with a simulated profile and exact personality trait scores.
- For each character we have n real tweets, each annotated with potential knowledge; every tweet is strongly related to the character's resume, personality traits, and potential knowledge.
- A request body is constructed from the information above.
- LLMs are asked to publish tweets based on the prompt.
- After collecting the responses from the LLMs, we evaluate the performance of the model according to the following criteria:
- Overlap
  - BLEU
  - ROUGE
  - Distinct
- LLM Judger
  - Resume Related (+1)
  - Personality Related (+1)
  - Potential Knowledge Related (+1)
- BigFive Personality Consistency
  - Evaluate the character's personality trait scores based on the n tweets generated by the LLM.
  - Compare these scores against the ground-truth personality trait scores.
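Two of the metric families above can be sketched in a few lines. `distinct_n` below is one common formulation of the Distinct metric (unique n-grams over total n-grams), and `trait_consistency` is an illustrative mean-absolute-error stand-in for the BigFive consistency check; the benchmark's exact formulas may differ.

```python
def distinct_n(texts, n):
    """Distinct-n: ratio of unique n-grams to total n-grams across texts.

    A common formulation of the Distinct metric; higher means more diverse.
    """
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def trait_consistency(predicted, ground_truth):
    """Mean absolute error between predicted and ground-truth trait scores.

    A hypothetical consistency measure (lower = more consistent); dicts map
    trait names (e.g. the Big Five) to scores.
    """
    assert predicted.keys() == ground_truth.keys()
    return sum(abs(predicted[t] - ground_truth[t]) for t in predicted) / len(predicted)
```

For example, `distinct_n(["a b a b"], 1)` yields 0.5 (two unique unigrams out of four), and a character whose generated tweets score exactly the ground-truth traits gets a `trait_consistency` of 0.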
-
```shell
python main.py \
  --platform zhipuai \
  --base-url https://open.bigmodel.cn/api/paas/v4 \
  --api-key 120985c00120985c00120985c00 \
  --model glm-4-flash \
  --max-tokens 1024 \
  --temperature 0.6 \
  --top-p 0.7 \
  --platform-critic openai \
  --base-url-critic https://api.openai.com/v1 \
  --api-key-critic 120985c00120985c00120985c00 \
  --model-critic gpt-4o \
  --max-tokens-critic 1024 \
  --temperature-critic 0.01 \
  --convs-per-chunk 10 \
  --qps 30 \
  --qps-critic 30 \
  --max-retry-times 5
```
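The generation flags above map naturally onto an OpenAI-compatible chat-completions request body. A minimal sketch of how such a body might be assembled (`build_request` is hypothetical, for illustration only, and not `main.py`'s actual code):

```python
def build_request(model, prompt, max_tokens, temperature, top_p):
    """Assemble an OpenAI-compatible chat-completions request body
    from the CLI flags (hypothetical helper, for illustration)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


# Values taken from the example command above.
req = build_request(
    model="glm-4-flash",
    prompt="Write a tweet in character.",
    max_tokens=1024,
    temperature=0.6,
    top_p=0.7,
)
```

The critic flags (`--model-critic`, `--temperature-critic`, etc.) would feed a second request of the same shape, sent to `--base-url-critic`.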
To cite this work, please use:
```bibtex
@misc{huang2024orcaenhancingroleplayingabilities,
  title={Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits},
  author={Yuxuan Huang},
  year={2024},
  eprint={2411.10006},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.10006},
}
```
OrcaBench is released under the Apache-2.0 license; see LICENSE for details.