To do:

  • Rewrite the TSV writer in an object-oriented style, so that each group of words is an object that wraps all of its own functionality

Most common Greek words

I pulled lists of the most common Greek words in a corpus of web pages from SketchEngine, then used WiktionaryParser to fetch their definitions from Wiktionary. I packaged the results as a TSV file that can be imported into Anki, a flashcard app.

Files

Data files

  • anki.tsv is the Anki flashcard list
  • db.json is a database with the words, their frequencies, and their definitions
  • words.txt is a plain list of the words covered by the frequency lists
  • raw/ contains the untracked HTML files downloaded from SketchEngine, which parse_html.py parses

Script files

  • parse_html.py turns the html files in raw/ into database entries
  • fetch_definitions.py populates the database with Wiktionary definitions
  • make_anki_tsv.py translates the database into the Anki-ready file
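
As a rough illustration of the last step, here is a minimal sketch of turning database entries into Anki-ready rows. It is not the code in make_anki_tsv.py. The layout of db.json is an assumption (a mapping from each word to a "frequency" number and a "definitions" list), and the function name db_to_anki_tsv and the sample values are hypothetical.

```python
import csv
import io
import json


def db_to_anki_tsv(db, out):
    """Write one tab-separated row per word: the word, then its definitions.

    Assumes each entry looks like {"frequency": int, "definitions": [str, ...]};
    the real db.json may be organized differently.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Most frequent words first, so the cards follow the frequency ranking.
    for word, entry in sorted(db.items(), key=lambda kv: -kv[1]["frequency"]):
        writer.writerow([word, "; ".join(entry["definitions"])])


# Hypothetical sample data in the assumed layout (not taken from db.json).
sample = json.loads(
    '{"να": {"frequency": 500, "definitions": ["to", "in order to"]},'
    ' "και": {"frequency": 900, "definitions": ["and"]}}'
)
buffer = io.StringIO()
db_to_anki_tsv(sample, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Each row has two tab-separated fields, which Anki can map to the front and back of a card when importing a text file. Using the csv module rather than joining strings by hand means a definition that happens to contain a tab or a quote character is quoted instead of breaking the row.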

Alternative data sources

I considered scraping a Greek-English dictionary website (e.g., dict.com or Word Reference) but did not pursue it, partly because the pages are difficult to parse reliably and partly because of licensing concerns.