Documentation

This is the dataset for the paper "C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits", published at the EMNLP 2024 conference.

Annotation

The annotations folder has two subfolders, "DB" and "WS". Each subfolder contains the annotations for the dataset. The annotations are stored as individual CSV files with the following columns:

RANumb: The number identifying the person who annotated the policy.
Text: The text span that is annotated.
Label: The label of the text span.
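
As a rough illustration of working with these columns, the sketch below reads one annotation file and counts how often each label occurs. It assumes each CSV has a header row with exactly the column names listed above and is UTF-8 encoded; the function names and the example path are hypothetical and not part of the dataset.

```python
import csv
from collections import Counter


def read_annotations(csv_path):
    """Read one annotation CSV into a list of dicts (one per annotated span).

    Assumes a header row containing RANumb, Text, and Label.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            {"RANumb": row["RANumb"], "Text": row["Text"], "Label": row["Label"]}
            for row in csv.DictReader(f)
        ]


def label_counts(rows):
    """Count how many annotated spans carry each label."""
    return Counter(row["Label"] for row in rows)
```

A call might look like `label_counts(read_annotations("Annotations/DB/1.csv"))`, where the path is an assumed location inside a local copy of the repository.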

Crawl

The crawl folder also has two subfolders, "DB" and "WS". Each subfolder contains the privacy policy crawler data for the dataset. The crawled data is stored as CSV files with the following columns:

Link: The URL of the privacy policy.
IsHomepage: Whether the link is the homepage of the website.
Textmatch_P: Set of regulation-specific primary keywords that are matched in the policy.
Textmatch_S: Set of regulation-specific secondary keywords that are matched in the policy.
Textmatch_PP: Set of generic primary keywords that are matched in the policy.
Link_Match: Set of regulation-specific and generic keywords that are matched in the URL.
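
A minimal sketch of reading a crawl file and selecting the links that are not homepages is shown below. It assumes a header row with the column names above. How IsHomepage is encoded is not documented here, so the set of "true" values is a guess, and the keyword-set columns are left as raw strings because their serialization is likewise unspecified. The function names are hypothetical.

```python
import csv

# Assumption: IsHomepage is stored as one of these strings (case-insensitive).
TRUTHY = {"true", "1", "yes"}


def read_crawl(csv_path):
    """Read one crawl CSV into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def non_homepage_links(rows):
    """Return the Link value of every row not flagged as a homepage."""
    return [
        row["Link"]
        for row in rows
        if row["IsHomepage"].strip().lower() not in TRUTHY
    ]
```

If the actual files encode IsHomepage differently, only the TRUTHY set needs adjusting.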

Htmls

The Htmls folder has two subfolders, "DB" and "WS". Each subfolder contains the HTML files of the privacy policies. The HTML files are named by number (e.g., 1.html). File N.html corresponds to row N+1 of the crawl data (so 1.html matches the 2nd row) and to the annotation file N.csv.
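
The numbering rule can be sketched as a small helper that, given a policy number, returns the matching HTML path, annotation path, and crawl row. This is one reading of the rule, under the assumption that the "+1" accounts for a header line in the crawl CSV, so that policy N is the N-th data record. The top-level folder names (Annotations, Htmls) and the function name are also assumptions for illustration.

```python
from pathlib import Path


def policy_files(root, subset, n):
    """Locate the files for policy number n in subset "DB" or "WS".

    Assumes the crawl CSV's first line is a header, so policy n sits on
    line n + 1 of that file, i.e. at index n - 1 among the data records.
    """
    root = Path(root)
    return {
        "html": root / "Htmls" / subset / f"{n}.html",
        "annotation": root / "Annotations" / subset / f"{n}.csv",
        "crawl_line": n + 1,          # 1-based line in the crawl CSV
        "crawl_record_index": n - 1,  # 0-based index after skipping the header
    }
```

For example, `policy_files(".", "DB", 1)` points at `1.html`, `1.csv`, and the 2nd line of the crawl data.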

Citation

If you use this dataset, please cite the following paper:

@inproceedings{c3pa,
  title={C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits},
  author={Maaz Bin Musa and Steven M. Winston and Garrison Allen and Jacob Schiller and Kevin Moore and Sean Quick and Johnathan Melvin and Padmini Srinivasan and Mihailis E. Diamantis and Rishab Nithyanand},
  booktitle={Empirical Methods in Natural Language Processing},
  year={2024}
}
