We have used real passwords belonging to individuals who were phished and consequently tricked into revealing them. This raises a few ethical questions: will this in-depth analysis harm those users? Will it reveal any other secrets? Will it identify any individuals? The dataset dates back to 2009, so it is highly unlikely that the affected users still use the same passwords, even in the unlikely event that an individual's identity could be recovered. Identities are highly unlikely to be revealed through our data: we have not used any PII or any other information that could link back to individuals. This dataset has also been used in previous research papers.
This new dataset and repository have been created solely for academic research purposes. The data, sourced from publicly accessible repositories like SecList, does not reveal any novel information. Our analysis also does not expose any new details about the original users associated with these passwords. Importantly, this dataset is devoid of any personally identifiable information (PII).
The use of breached and leaked password datasets is well-established in published research, with some studies even incorporating PII such as email addresses and dates of birth. As such, this work aligns with existing practices in the field and does not necessitate additional scrutiny. Analyzing leaked datasets is crucial for advancing our understanding of how human-chosen secrets are employed, ultimately enabling us to enhance their resilience against malicious actors.
If you have any concerns, please submit an issue through this GitHub repository.