replaces wordlist-5-dice with a new word list #39

Open
sts10 wants to merge 5 commits into main

@sts10 sts10 commented Jan 17, 2023

Similar to #38, I made a new word list for assets/wordlist/wordlist-5-dice.js. I understand that, for the 5-dice list, I'm going up against the EFF long list, so in that way it's a bit more controversial than #38, but we'll see.

I understand that currently, this program uses a slightly modified version of the EFF long list as its 7,776-word list. One interesting property of the original EFF long list is that it is free of prefix words ("We also ensured that no word is an exact prefix of any other word."). This offers a key advantage: users can combine words from the list without delimiters or camel case (e.g. twigstarfishrefusalretentiontheftfreezing is safe to use). A better way to describe this property is to say that the list is "uniquely decodable".

However, there is a trade-off to removing all prefix words: since words that are prefixes of other words are often themselves common words, a prefix-word-free list has to use slightly less common words than a list that does contain prefix words.

This diceware web app capitalizes the first character of every word (e.g. AloofUncladCartridgeAlike), meaning that it is free to use word lists that do contain prefix words.
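As a rough illustration of what "free of prefix words" means, here is a minimal Python sketch. `find_prefix_words` is a hypothetical helper written for this example; it is not part of Tidy or of this project.

```python
def find_prefix_words(words):
    """Return (prefix, longer_word) pairs where one word starts another."""
    ordered = sorted(set(words))
    pairs = []
    for i, word in enumerate(ordered):
        # In sorted order, all words that start with `word` sit directly after it,
        # so we can stop at the first word that does not share the prefix.
        for other in ordered[i + 1:]:
            if not other.startswith(word):
                break
            pairs.append((word, other))
    return pairs


# "act" is a prefix of "actor", so this toy list is not free of prefix words.
print(find_prefix_words(["twig", "act", "starfish", "actor"]))
# A list with no such pairs can be concatenated without delimiters.
print(find_prefix_words(["twig", "starfish", "refusal"]))
```

An empty result means the list is free of prefix words. A non-empty result is only a problem if words are joined with no delimiter and no capitalization.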

With this in mind, I made a new 7,776-word list for this project. It's based on 2012 Google Ngram data, and I used a tool I call Tidy to create it. The list does contain prefix words. Of course, we may still prefer the EFF list. But again, thought I'd submit this PR.

Some attributes describing the new list:

List length               : 7776 words
Mean word length          : 7.08 characters
Length of shortest word   : 3 characters (act)
Length of longest word    : 11 characters (willingness)
Free of prefix words?     : false
Entropy per word          : 12.925 bits

Pseudorandomly generated sample passphrases
-------------------------------------------
interact developers terrace rolling factory legislative 
alloy almost pathways duties percentages address 
tiny orbit truth resident villa injury 
run suite ran regarded chemicals enclosed 
citizen overt structures substitute figured villagers
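
For reference, the "Entropy per word" figure follows directly from the list length, assuming every word is picked uniformly at random. A small Python sketch (the six-word passphrase length is taken from the samples above):

```python
import math

LIST_LENGTH = 7776  # 6**5: one word for every possible roll of five dice

# Bits of entropy contributed by one uniformly chosen word.
entropy_per_word = math.log2(LIST_LENGTH)
print(f"{entropy_per_word:.3f} bits per word")

# Entropy of a passphrase is additive across independently chosen words.
words_in_passphrase = 6
print(f"{words_in_passphrase * entropy_per_word:.2f} bits for six words")
```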

@sts10
Author

sts10 commented Apr 6, 2023

I've given this list a bit of a refresh for spring, incorporating words from Wikipedia, thanks to this project, and ensuring that some (more) British spellings of English words are not on the list (sorry Britain!).

Information/attributes of the updated list

List length               : 7776 words
Mean word length          : 7.04 characters
Length of shortest word   : 3 characters (ace)
Length of longest word    : 11 characters (willingness)
Uniquely decodable?       : false
Entropy per word          : 12.925 bits
Efficiency per character  : 1.835 bits
Assumed entropy per char  : 4.308 bits
Mean edit distance        : 7.035

Pseudorandomly generated sample passphrases
-------------------------------------------
straw provinces humble impressions ion gradually
transmitted readings defenders thrown whenever leaned
actress things reversed troy management specialist
whatever obvious wide literal risk operational
sensible bodily matched schedules blocked damages
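
The two per-character figures above can be approximated as follows. This is a sketch under assumed definitions (entropy per word divided by mean word length, and entropy per word divided by the length of the shortest word); Tidy's exact definitions and rounding may differ. `per_character_stats` is a hypothetical name.

```python
import math


def per_character_stats(words):
    """Assumed definitions; not taken from Tidy's source."""
    entropy_per_word = math.log2(len(words))
    mean_length = sum(len(w) for w in words) / len(words)
    shortest = min(len(w) for w in words)
    return {
        # Average bits carried by each typed character.
        "efficiency_per_character": entropy_per_word / mean_length,
        # Worst case: all of a word's entropy packed into the shortest word.
        "assumed_entropy_per_char": entropy_per_word / shortest,
    }


# Toy list of 4 words: 2 bits per word, mean length 4, shortest word 3.
print(per_character_stats(["ace", "bead", "cider", "dusk"]))
```

Under these assumptions, a 7,776-word list whose shortest word has 3 characters gives log2(7776) / 3 ≈ 4.308 bits, which agrees with the "Assumed entropy per char" line above.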

Licensing (updated)

Given that Wikipedia text is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License ("CC BY-SA"), I'm using that license for this list as well (see updated comment at top of the file). Hope that doesn't disqualify this PR! But given that this project is licensed under GPL-2.0, I think it should be fine!

@dmuth
Owner

dmuth commented May 5, 2023

BTW, thanks for making these--I need to dive into them, as well as some other wordlists at some point, but it will require front-end changes. (I got some ideas on that, but suggestions are always welcome)

@sts10
Author

sts10 commented May 5, 2023

No worries. I'm happy to have the time to push new changes, hopefully improving each proposed list with each pre-merge commit. Let me know if I can be of assistance diving into them.

...as well as some other wordlists at some point...

Curious to find out what other lists you're considering. Separately from my two PRs to this project, I've been working on a set of wordlists, so I'm interested in what passphrase generator developers like yourself are looking for in word lists.

@dmuth
Owner

dmuth commented May 5, 2023

Curious to find out what other lists you're considering.

I'm looking at the ones from Strongbox:

https://github.com/strongbox-password-safe/Strongbox/tree/master/resources/wordlists

I've noticed the stats on your wordlists, is there a utility that generates those? We might be getting to the point where it would make sense to start putting the stats into a spreadsheet for analysis purposes.

@sts10
Author

sts10 commented May 6, 2023

I've noticed the stats on your wordlists, is there a utility that generates those?

Yes, from a Rust tool I built called Tidy. Once installed, running tidy -AAAA --samples wordlist.txt prints the list, then the full suite of stats, then some passphrase samples. Tidy does far more than just print stats about a word list: You can also combine multiple lists and perform numerous other edits.

We might be getting to the point where it would make sense to start putting the stats into a spreadsheet for analysis purposes.

That sounds like a great project! I coincidentally started a short list of password managers and the word lists they use.

I'm looking at the ones from Strongbox...

I've actually had a look at those word lists recently. While I love that Strongbox offers many non-English lists, those lists in particular seem a bit under-developed.

For example, most lists start with 200 lines of symbols and numbers, plus non-words like "aa" (see their French list for example), which, imo, betrays the promise of a "passphrase". Also, as further evidence of poor list work, at least four of the lists have issues:

  • finnish-diceware.wordlist.utf8.txt has two copies of small words such as "a", "aa", "ab", "abc", "ad", "ar", "cj"
  • french-diceware.wordlist.utf8.txt seems to have a blank line (word) at line 40
  • icelandic-diceware.wordlist.utf8.txt has two copies of small words like "aa", "ad" and "ae"
  • swedish-diceware.wordlist.utf8.txt has two copies of the line "abc"
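
Issues like the ones listed above (duplicate entries, blank lines) are easy to detect mechanically. A minimal Python sketch, where `audit_wordlist` is a hypothetical helper and not something from Strongbox or Tidy:

```python
from collections import Counter


def audit_wordlist(lines):
    """Return (blank_line_numbers, duplicated_words) for a one-word-per-line list."""
    stripped = [line.strip() for line in lines]
    # Line numbers are 1-based, matching what a text editor shows.
    blank_line_numbers = [n for n, word in enumerate(stripped, start=1) if not word]
    counts = Counter(word for word in stripped if word)
    duplicated_words = sorted(word for word, count in counts.items() if count > 1)
    return blank_line_numbers, duplicated_words


# Toy input with one blank line (line 3) and one repeated word ("aa").
print(audit_wordlist(["a", "aa", "", "abc", "aa"]))
```

To run it on a real file, something like `audit_wordlist(open(path, encoding="utf-8"))` should work, with `path` pointing at the list in question.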

Likewise, some of the EFF fandom lists in the Strongbox repo contain profane words, and 2 or 3 of them contain words with non-ASCII characters. I've offered two solutions to this issue for another password project, if you want to take a look at that. However, note that there are only 4,000 unique words on each of the fandom lists -- they're doubled to make it to 8,000 -- so they couldn't reach the 7,776 words needed for a 5-dice list without adding words.
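
A quick way to see whether a list could serve as a 5-dice list is to count its unique entries and flag non-ASCII words at the same time. This is only a sketch; `fits_five_dice` is a hypothetical name, and the toy data below merely imitates a doubled 4,000-word list.

```python
def fits_five_dice(words, required=6 ** 5):
    """Return (enough_words, unique_count, non_ascii_words)."""
    unique = {w.strip() for w in words if w.strip()}
    # str.isascii() is available in Python 3.7 and later.
    non_ascii = sorted(w for w in unique if not w.isascii())
    return len(unique) >= required, len(unique), non_ascii


# 4,000 distinct entries repeated twice: 8,000 lines, but only 4,000 unique words.
doubled = [f"word{i}" for i in range(4000)] * 2
print(fits_five_dice(doubled + ["café"]))
```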

If you want to add word lists in foreign languages, I'd consider starting with the Wikipedia word frequency project I used to create this PR, which has word frequency data from multiple languages. It wouldn't be too difficult to use a tool like Tidy to cut them to 7,776-word lists for this project (something like tidy -C -l --print-first 7776 --locale es-ES -z nfc -d s -m 3 -M 12 --straighten -o spanish-diceware-list.txt eswiki-2022-08-29.txt) -- my only hesitation is not knowing which words are profane or otherwise inappropriate in languages I don't speak/read.
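
For readers without Tidy installed, the general idea of cutting a frequency file down to a fixed-size list can be sketched in Python. This is not a reimplementation of the Tidy command above. The input format (a word followed by a count on each line, most frequent first), the alphabetic-only filter, and the name `trim_frequency_list` are all assumptions made for this example, and it does nothing about profanity.

```python
import unicodedata


def trim_frequency_list(lines, size=7776, min_len=3, max_len=12):
    """Keep the first `size` acceptable words from 'word count' lines (assumed format)."""
    seen = set()
    result = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Normalize to NFC and lowercase so visually identical words compare equal.
        word = unicodedata.normalize("NFC", parts[0]).lower()
        if not word.isalpha() or not (min_len <= len(word) <= max_len):
            continue  # drop digits, symbols, and words outside the length bounds
        if word in seen:
            continue  # drop duplicates, keeping the more frequent occurrence
        seen.add(word)
        result.append(word)
        if len(result) == size:
            break
    return result


sample = ["the 100", "of 90", "El 80", "a 60", "x1y 5", "casa 4", "Casa 3"]
print(trim_frequency_list(sample, size=3))
```

If the function returns fewer than `size` words, the source data was too small to produce a full 5-dice list under these filters.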

@dmuth
Owner

dmuth commented May 7, 2023

I did a deep dive on my code last night, and have been thinking about this. Here's what I came up with:

  • First, I gotta do some refactoring, so I just opened Refactor Javascript In Preparation for Multiple Wordlists #46 to track that. The big benefit in relation to additional wordlists is that I'll be able to just use text files going forward. (Right now, my wordlist is Javascript)
  • Second, I agree with some of the quality concerns you raised about the other wordlists.
  • And that brings me to my third point--I wonder if it might be in the best interests of us, and any other project that has password generation capabilities, to create a separate repo that simply holds password lists as plaintext files with one word per line, along with details that relate to the quality of that password file.

After I finish up #46, I think we can talk about next steps in terms of this specific PR.

@sts10
Author

sts10 commented May 7, 2023

First, I gotta do some refactoring... After I finish up #46, I think we can talk about next steps in terms of this specific PR.

Totally get it.

create a separate repo that simply holds password lists as plaintext files with one word per line, along with details that relate to the quality of that password file.

Do you mean word lists that other password managers and generators currently use, plus information about the passphrases they generate (word count, entropy per word, etc.)? I can try getting a start on that. I don't think there are too many out there in use...

Update: Here's a first pass at it: https://github.com/sts10/wordlist-information

@dmuth
Owner

dmuth commented May 7, 2023

Update: Here's a first pass at it: https://github.com/sts10/wordlist-information

That's a great start!

So I think I'm gonna continue my refactoring work over in #46, then I want to play around with Tidy (I am also teaching myself Rust, so that's great timing!), and run it against the Strongbox lists.

I have some ideas for the wordlist-information repo, but I'm gonna let them bounce around in my head while I work on Diceware for the next few nights.

@dmuth
Owner

dmuth commented May 10, 2023

All done with #46, and deployed it last night. I'll start poking at Tidy tomorrow. You may see some PRs from me for the wordlist-information repo later on.

@sts10
Author

sts10 commented May 14, 2023

I see that my branch is a bit behind now, especially after #47.

I'd update my branch and PR, but I'm not sure if you want word lists to be in .txt files, with no quotes or commas, now? Or maybe my proposed new lists don't make sense anymore. FYI this proposed list lives here.

@dmuth
Owner

dmuth commented May 15, 2023

I'm not sure if you want word lists to be in .txt files, with no quotes or commas, now

As a peek under the hood, what I do in the Javascript code is create a random number between 1 and 7776 and then pick the line out from the list. The dice rolls that are shown on the page are the results of me converting the number from Base 10 to Base 6. 😹
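
The approach described above (pick a number from 1 to 7776, then show it as dice rolls) can be sketched in Python. The project itself is JavaScript, and the exact mapping it uses is not shown in this thread, so the 1-based index and the "digit plus one" convention below are assumptions; `index_to_dice` is a hypothetical name.

```python
import secrets


def index_to_dice(index, dice=5):
    """Map a 1-based list position to `dice` die faces, each from 1 to 6."""
    if not 1 <= index <= 6 ** dice:
        raise ValueError("index out of range for this many dice")
    n = index - 1  # shift to 0-based so it fits in `dice` base-6 digits
    faces = []
    for _ in range(dice):
        n, digit = divmod(n, 6)  # peel off the lowest base-6 digit
        faces.append(digit + 1)  # base-6 digit 0-5 becomes die face 1-6
    return faces[::-1]  # most significant die first


# Cryptographically secure choice of a line number between 1 and 7776.
index = secrets.randbelow(7776) + 1
print(index, index_to_dice(index))
```

With this convention, line 1 corresponds to rolling all ones and line 7776 to rolling all sixes.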

Going forward, I'm just going to have lists be a text file with one word per line, because that's a format that is the easiest to work with.

I should add a list selection dropdown to the page, so I can just grab your list and put it in the dropdown I'll create. The big question I have is what would you like your list called? I was thinking something like Google 2012 Common Words or something similar? I'm also open to clever/fancy names with me putting details into the README.

-- Doug

@sts10
Author

sts10 commented May 15, 2023

I should add a list selection dropdown to the page, so I can just grab your list and put it in the dropdown I'll create. The big question I have is what would you like your list called? I was thinking something like Google 2012 Common Words or something similar? I'm also open to clever/fancy names with me putting details into the README.

The list has frequent word data from Wikipedia mixed in now, so "Google 2012 Common Words" doesn't fit anymore. I'll try to think of a name for it!

Separately, we can consider adding my Orchard Street Medium list instead or in addition.

The Orchard Street Medium list is uniquely decodable, which brings us to an interesting question regarding your project. If your app continues to enforce a delimiter between words, the word lists you use need not be uniquely decodable. Not being uniquely decodable usually allows the list to have shorter, more common words. This is why, in this PR, I submitted a list that is not uniquely decodable.
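
To make the ambiguity concrete, here is a small Python sketch that counts how many ways an undelimited passphrase can be split back into list words. `count_parses` and the toy word list are hypothetical, chosen only for illustration.

```python
def count_parses(text, words):
    """Count the ways `text` can be segmented into words from `words`."""
    vocabulary = set(words)
    longest = max(len(w) for w in vocabulary)
    # ways[i] = number of ways to segment the first i characters of text.
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty prefix has exactly one segmentation
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - longest), end):
            if ways[start] and text[start:end] in vocabulary:
                ways[end] += ways[start]
    return ways[-1]


toy_list = ["act", "actor", "or", "bit"]
# "actorbit" could be act+or+bit or actor+bit, so it has two readings.
print(count_parses("actorbit", toy_list))
print(count_parses("bitact", toy_list))
```

More than one parse means an attacker guessing three-word passphrases would also hit some two-word ones, so the real entropy is lower than the word count suggests. With a delimiter between words, each passphrase has exactly one reading regardless of the list.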

@dmuth
Owner

dmuth commented May 17, 2023

If your app continues to enforce a delimiter between words, the word lists you use need-not be uniquely decodable. Not being uniquely decodable usually allows the list to have shorter, more common words.

You mean the CamelCase capitalization? Yep, I plan on keeping that, because it makes the words much easier to read.

@sts10
Author

sts10 commented May 17, 2023

You mean the CamelCase capitalization?

Yes, sorry -- CamelCase is effectively a delimiter.
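
A short Python sketch of why capitalization works as a delimiter: as long as every word is lowercase ASCII apart from its first letter (an assumption about the list), the capitals mark the word boundaries. `split_camel_case` is a hypothetical helper.

```python
import re


def split_camel_case(passphrase):
    """Split at each capital letter; assumes words are ASCII letters only."""
    return re.findall(r"[A-Z][a-z]*", passphrase)


print(split_camel_case("AloofUncladCartridgeAlike"))
# Two passphrases that would collide without capitals stay distinct with them.
print(split_camel_case("ActOrBit"), split_camel_case("ActorBit"))
```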
