This repository contains the code for our NeurIPS 2024 paper *Uncovering Safety Risks of Large Language Models through Concept Activation Vector*.
- [2024-11-17] The code for visualizing the embedding-level attack is released.
- [2024-10-28] The code for the prompt-level attack is released.
- [2024-09-30] The code for the embedding-level attack is released.
- [2024-04-18] The paper is available on arXiv.
If you find this work helpful, please consider citing our paper:
@inproceedings{Xu2024uncovering,
  title     = {Uncovering Safety Risks of Large Language Models through Concept Activation Vector},
  author    = {Zhihao Xu and Ruixuan Huang and Changyu Chen and Xiting Wang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024},
  url       = {https://openreview.net/forum?id=Uymv9ThB50}
}
This project can be used to attack LLMs and is intended for academic research only; any illegal use is prohibited. The authors have disclosed the identified vulnerabilities to OpenAI and Microsoft.