TODO


# XXX cf "word" translations under "watchword" and "proverb" - what kind of
# link/ext is this?  {{trans-see|watchword}}

# XXX implement parsing {{ja-see-kango|xxx}} specially, see むひ

# XXX check use of sense numbers in translations (check "eagle"/English)

# XXX in Coordinate terms, handle bullet point titles like Previous:, Next:
# (see "ten", both Coordinate terms and Synonyms)

# XXX parse "alter" tags; see where they are used (they seem to be alternate
# forms)

# XXX parse "enum" tags (args: lang, prev, next, value)

# XXX make sure parenthesized parts in the middle of form-of descriptions
# in gloss are handled properly.  Check "vise" (Swedish verb).

# XXX distinguish non-gloss definition from gloss, see e.g. βούλομαι

# XXX check "unsupported tag component 'E' warning / in "word"

# XXX parse "id" field in {{trans-top|id=...}} (and overall parse wikidata ids)

# XXX implement support for image files in {{mul-symbol|...}} head, e.g.,
# Unsupported titles/Cifrão

# XXX Some declination/conjugation template arguments contain templates, e.g.
# mi/Serbo-Croatian (pronoun).  Should probably strip these templates, but
# not strip as aggressively as normal clean_value() does.  HTML tags and links
# should be stripped.

# XXX parse "Derived characters" section.  It may contain multiple
# space-separated characters in same list item.

# XXX extract Readings section (e.g., Kanji characters)

# XXX handle qualifiers starting with "of " specially.  They are quite common
# for adjectives, describing what the adjective can characterize

# XXX parse "object of a preposition", "indirect object of a verb",
# "(direct object of a verb)", "(as the object of a preposition)",
# "(as the direct object of a verbal noun)",
# in parenthesis at the end of gloss

# XXX parse [+infinitive = XXX], [+accusative = XXX], etc in gloss
# see e.g. βούλομαι.  These come from {{+obj|...}}, might also just parse
# the template.

# XXX parse "construed with XXX" from sense qualifiers or capture "construed with" template

# XXX how about gloss "Compound of XXX and YYY".  There are 100k of these.
#   - sometimes followed by semicolon and notes or "-" and notes
#   - sometimes Compound of gerund of XXX and YYY
#   - sometimes Compound of imperative (noi form) of XXX and YYY
#   - sometimes Compound of imperative (tu form) of XXX and YYY
#   - sometimes Compound of imperative (vo form) of XXX and YYY
#   - sometimes Compound of imperative (voi form) of XXX and YYY
#   - sometimes Compound of imperative of XXX and YYY
#   - sometimes Compound of indicative present of XXX and YYY
#   - sometimes Compound of masculine plural past participle of XXX and YYY
#   - sometimes Compound of past participle of XXX and YYY
#   - sometimes Compound of present indicative of XXX and YYY
#   - sometimes Compound of past participle of XXX and YYY
#   - sometimes Compound of plural past participle of XXX and YYY
#   - sometimes Compound of second-person singular imperative of XXX and YYY
#   - sometimes Compound of in base a and YYY
#   - sometimes Compound of in merito a and YYY
#   - sometimes Compound of in mezzo a and YYY
#   - sometimes Compound of in seguito a and YYY
#   - sometimes Compound of nel bel mezzo di and YYY
#   - sometimes Compound of per mezzo di and YYY
#   - sometimes Compound of per opera di and YYY
#   - sometimes Compound of XXX, YYY and ZZZ
#   - sometimes Compound of XXX + YYY
#   - sometimes Compound of the gerund of XXX and YYY
#   - sometimes Compound of the imperfect XXX and the pronoun YYY
#   - sometimes Compound of the infinitive XXX and the pronoun YYY

# XXX handle "Wiktionary appendix of terms relating to animals" ("animal")

# XXX check "Kanji characters outside the ..."

# XXX recognize "See also X" from end of gloss and move to a "See also" section

# Look at "prenderle"/Italian - two heads under same part-of-speech.  Second
# head ends up inside first gloss.

# XXX handle (classifier XXX) at beginning of gloss; also (classifier: XXX)
# and (classifiers: XXX, XXX, XXX)
# E.g.: 煙囪
# Also: 筷子 (this also seems to have synonym problems!)

# XXX parse redirects, create alt_of

# XXX Handle "; also YYY" syntax in initialisms (see AA/Proper name)

# XXX parse {{zh-see|XXX}} - see 共青团

# XXX is the Usage notes section parseable in any useful way?  E.g., there
# is a template {{cmn-toneless-note}}.  Are there other templates that would
# be useful and common enough to parse, e.g., into tags?

# Check te/Spanish/pron, See also section (warning, has unexpected format)

# Check alt_of with "." remaining.  Find them all to a separate list
# and analyze.

# XXX wikipedia, Wikispecies links.  See "permit"/English/Noun

# XXX There are way too many <noinclude> not properly closed warnings
# from Chinese, e.g., 內 - find out what causes these, and either fix or
# ignore

/derived terms (seem to be referenced with {{section link|...}}

數/derived terms
仔/derived terms
人/derived terms
龍/derived terms
馬/derived terms

# In htmlgen, create links from gloss, at minimum when whole gloss matches
# a word form in English (or maybe gloss as an alternative?)

htmlgen: form-of (and alt-of) links in generated html should look at
canonical form in addition to word - now many items with accents are
not properly linked

htmlgen: strange Categories non-disambiguated in: https://kaikki.org/dictionary/English/meaning/v/vo/volva.html#volva-noun

htmlgen: compound-of field of sense not displayed in generated html

htmlgen: implement support for redirects

htmlgen: render audio-ipa for audios

htmlgen: when looking for linkage references, also check canonical
word form

Word entires may be generated from entries like "Thesaurus:ar:make happy" that
are in a wrong language.  Check if they are referenced from somewhere and
maybe restrict generation to entries that are actually referenced.

Consider caching template expansions in an LRU cache.  Make sure
template arguments that can be accessed from Lua are considered
properly.  Cache must be reset per-page.  (Does DISPLAYPAGE break this?)

Check * at the end of some derived forms in wort/English/Noun (e.g., dropwort*)
  - it actually is in the original wikitext and is used as a reference symbol,
    with explanation at the end of the table
  - should probably just remove the *
  - check if there are other similar reference symbols and how common they are

Apparently the following is not parsed correctly:
{{ws sense|any of enclosing symbols "(", ")", "[", "]", "{", "}" etc.}}

Recognize and parse {{zh-dial|...}} in linkage, e.g., 鼎/Chinese
  - Variety = language (these are more like translation than synonyms)
  - Location = tags, parenthesized part separate tag?
  - Words may have Notes references, which may need to be parsed as senses
    (not clear how common they are with other words)

{{syn|...}} not properly handled in nested glosses, see frons/Latin
sense "the forehead, brow, front", search for "vultus"

In translations, should process {{trans-see|...}} (cf. "abound with",
the only translations are with {{trans-see|...}})

ख/Translingual/Letter: DEBUG: unhandled parenthesized prefix: (Devanagari script letters) अ,‎ आ,‎ इ,‎ ई,‎ उ,‎ ऊ,‎ ऋ,‎ ए,‎ ऐ,‎ ओ,‎ औ,‎ अं,‎ अः,‎ अँ,‎ क,‎ ख,‎ ग,‎ घ,‎ ङ,‎ च,‎ छ,‎ ज,‎ झ,‎ ञ,‎ ट,‎ ठ,‎ ड,‎ ढ,‎ ण,‎ त,‎ थ,‎ द,‎ ध,‎ न,‎ प,‎ फ,‎ ब,‎ भ,‎ म,‎ य,‎ र,‎ ल,‎ व,‎ श,‎ ष,‎ स,‎ ह,‎ त्र,‎ ज्ञ,‎ क्ष,‎ क़,‎ ख़,‎ ग़,‎ ज़,‎ झ़,‎ ड़,‎ ढ़,‎ ष़ <nowiki>[</nowiki><nowiki>]</nowiki> at ['ख']
- is nowiki broken in templates?  Perhaps core.py:_template_to_body should
  call preprocess_text for the template body?

Check:
can/English/Verb: ERROR: no language name in translation item: : ... at ['can']
can/English/Verb: ERROR: no language name in translation item: : -bil- (tr) (verbal infix) at ['can']

Make sure rtl (right-to-left) and ltr markers are properly inserted
and matched in words (including translations and linkages), even when
splitting them.

Implement better parsing of compound_of base.  There are over 135k
compound_of entries in Wiktionary, so this is common enough to care
about.

Use something similar to tr "note" detection in linkages (e.g.,
pitää/Finnish search +).  Also, the parenthesized tags in Derived
terms as subheadings are currently not handled.

Check strange and incorrectly processed category link in
अधिगच्छति/Sanskrit (I suspect this may be coding error in Wiktionary,
looks like the category has a nested link)

Fix parsing linkages for letters in scripts, see P/English/Number
  - longer sequences of characters
  - P/Latvian "burti:" prefix handled incorrectly (should be separate linkage)
  - P/Vietnamese first word incorrectly split (never split before semicolon)
  - P/Vietnamese multi-char names inside parentheses handled incorrectly
    (replace parentheses by comma? colon by comma?)
Handle " / " as a separator in linkages; see be/Faroese/related
(Latin-script letter names)

have/English/Translations/"to possess"/Arabic: have "- you have" at end of tr;
similarly "- I have"

language/English has weird IPAs coming from qualifier (see
/ae/ raising)

Add more sanity checks:
  - number of words extracted for a few major languages (English, Korean, Zulu)
  - number of extraction errors of various kinds
  - number of translations extracted for major languages
  - number of linkages extracted for major languages
  - number of suspicious linkages
  - number of suspicious alt-of/form-of
  - number of alt-of
  - number of form-of
  - number of IPAs
  - number of forms
  - number of inflection templates
  - number of non-disambiguated translations
  - number of non-disambiguated linkages
  - number of senses with tags
  - number of senses with topics

お茶/Japanese/Verb:
  - "ocha suru" as romanization, "suru" should not be there
  - "intransitive suru" gets into end of canonical form
  - stem and past are not correctly parsed
  - related terms are not correctly parsed - English goes into alt

Handle and remove right-to-left mark (and left-to-right mark) from titles,
forms, translations, linkages.  Add tag "right-to-left" where they were
present.  (Or should we keep the marks but still add tag?)

Fix support for DISPLAYTITLE - sets page default display title, which
would be useful.  Maybe extract too?  Note that this may prevent
caching templates/Lua calls.

Check "class II noun" and similar "note" in translations, cf. form desc,
define required tags/mappings

Consider improving romanization recognition in translation by
capturing strings inside <span class="tr [Latn]"> and never
interpreting such strings as English.  This would would probably hook
into clean_node().

Review handling of 0x200e (left-to-right mark) and 0x200f
(right-to-left mark) in linkages.  Also, what if only one is present?
cf. test_script5 in test_linkages.py

B/Portugues list of characters is broken
Finish writing tests for character lists

Synonyms are not collected from nested senses.  See cat/English/Noun
sense 1/1 "A domesticated species..."

Move hieroglyph conversion to postprocessing

婬/Japanese/Kanji/Definitions has a table that is not correctly handled,
and is being parsed as head form.

Handle (* Poetic.) and (** Antiquanted.) and */** at end of form, e.g.
камень/Russian

Check topics in word head, cf. अंगूठा/Hindi, which has (anatomy) in head
Fix parsing "wait on hand, foot and finger"/English head parsing
(comma splits related forms).  Maybe similar magic character kludge as we do
for linkages? No, that won't work here... maybe just merge standalone
parts that are comma-separated parts of the title to the previous entry
with a comma (as a pre-processing step after splitting?).  Inflection though
might be in any of the parts, but I think usually the first.
Also: "wait on someone hand, foot and finger"/English.
Ah, similar bug in alt-of/inflection-of parsing.

Translation lists missing a comma between translations are handled incorrectly.
See tree/English/Tr/Punjabi (both Gurmukhi and Shahmukhi).

Implement special case in parsing word head for paren desc:
"на (na) accusative, by"; see помножать/Russian

Something wrong in parsing آتي/Arabic "coming"

Finish datautils.py:split_slashes and use it for translations/linkages/heads
  - warn when clearly ambigious
  - add more tests in test_utils.py

Which language should be used in translation when langcode and language in the
entry differ?  This is particularly an issue for Chinese.  Or should we treat
the second-level entries (e.g., Mandarin) as tags?

Change clean_value in clean.py to receive ctx.  Add "hieroglyphs" tag
if the string contains <hiero> tag.  (Or maybe this needs to go in
clean_node in page.py?  Or maybe these two should be merged?)

Handle Chinese usage examples better - separate Chinese (traditional,
simplified), Pinyin, translation.

Translations for Mandarin seem to have Chinese language but cmn code.
What is correct?

Parse eumhun related forms (Korean) in more detail, particularly
separate romanization and hanja forms, e.g. 度/Korean

Collect RGB values and other similar data

XXX chinese glyphs, see 內
 - I'm getting dial-syn page does not exist in synonyms, but Wiktionary
   shows a big list of synonyms

Some Chinese words have weird /wiki.local/ in sounds, see: 屋
 - also in pronunciations, several IPAs are captured, but they are not
   annotated with dialect information
 - this page also has dial-syn warning in synonyms (also noted elsewhere)
 - in Japanese compounds, roman transliterations incorrectly go into tags
 - in Japanese, does it make sense to have most senses annotated with "kanji"?

Parse readings subsection in Japanese kanji characters, see 內
 - these are apparently pronunciations, but not IPA
 - perhaps these might be parsed as forms with "reading" tag and tag
   for the type of reading?

Korean kanji seem to have Phonetic hangeul in pronunciation sections, see 內
Parse Phonetic hangeul in Pronunciation section (see 死/Korean)

Check extraction of pronunciations and their romanizations in 공중/Korean

# XXX add parsing chinese pronunciations, see 傻瓜 https://en.wiktionary.org/wiki/%E5%82%BB%E7%93%9C#Chinese

Check ignored_cat_re in page.py

Is Rhymes collected and displayed correctly?  cf. environ/English

-abil/Romanian declension table produces wikitext that is parsed incorrectly.
Looks like table cell tags (! at least) can be preceded by whitespace, here \t.

Test Pali Alternative Scripts, see cakka/Pali

Check if form-of imports tags from head (av/Swedish) - seems to be missing

LUA ERRORS:
कट्नु/Nepali/Verb: ERROR: LUA error in #invoke ('ne-conj', 'show') parent ('Template:ne-conj', {'i': 'y', 'a': 'y'}) at ['कट्नु', 'ne-conj', '#invoke']

ogun/Yoruba: ERROR: LUA error in #invoke ('number list', 'show_box') parent ('Template:number box', {1: 'yo', 2: '20'}) at ['ogun', 'number box', '#invoke']

Heleb/Northern Kurdish: ERROR: LUA error in #invoke ('wikipedia', 'wikipedia_box') parent ('Template:wp', {'lang': 'ku'}) at ['Heleb', 'wp', '#invoke']

তাহুন/Assamese/Pronoun: ERROR: LUA error in #invoke ('links/templates

Parse declension/conjugation tables

Interpret "see also" sections.  This seems to be junk in some languages/words,
but related words in others (e.g., tänään/Finnish)

Implement support for "See also translations at" (e.g., god/translations)

htmlgen: Change when htmlgen computes indexes - only do after disambiguation

htmlgen: disambiguate translation & linkage targets

htmlgen: change disambiguated links

check jakes/English - ]] at the beginning of second sense

Improve #time date format parsing.  cf. hard/English.

Parse inflection-like tables from See also sections (e.g.,
sie/Karelian, "Karelian personal pronouns" table)

wikitextprocessor error: '''c''''est is parsed incorrectly (should leave quote).
This needs to be fixed.
  - must change example extraction to use node_to_html and then parse from there
  - this is a major change that needs proper testing!  It is likely to break
    a lot of existing cases.
  - compare current extraction to extraction after the change, checking all
    examples that change.

Working on changes to compute_coltags
  - changed how column headers are merged
  - still something wrong, look at aamupala/Finnish 3rd pers possessive

Fix parsing tell/English/Verb word head "(dialectal or nonstandard)"
(appears very new, not yet in my test environment)

götürmek/Turkish - inflection table "interrogative" not properly inherited
to cells below (possibly other subtitles to).
  - there are also participles per person, before persons are defined???

burn/English
  - pronoun before / (e.g., burn past first-person singular) is not
    handled correctly - second alternative does not get the person tags

grind/English
  - implement handling of lead-in text before table (here "Weak conjugation",
    "Strong conjugation")
  - check tables, they are unusual
  - add a test for this, because of the parenthesized parts

Improve audio file URL generation in wiktextract to handle redirects
in Wikimedia Commons.  Options include using Commons API to query the
actual location or parsing Commons dumps to extract redirects.  (Note
that some files also seems to have missing transcoded versions -
probably about 0.1-0.5% of them)
  - See download at https://kaikki.org/dictionary/rawdata.html, audio
    file bulk download.  We would like to make this download automatically
    updated and to support redirect.
  - Modify wiktextract (git@github.com:tatuylonen/wiktextract) to take
    wikimedia commons redirect list as a command line option
    (--commons-redirect=FILE) and change parse_pronunciation() in
    wiktextract/page.py to use this information when computing mp3_url
    and ogg_url, so that URLs for redirected media files point to
    the correct media file in Wikimedia commons.
  - Take sound-file-downloader.py (under kaikki/webgen/wiktextract) and
    turn it into a proper downloader that only downloads those files that
    it does not yet have.  Add support for argparse, with
      --datadir=DIR
      --rawdata=PATH  (extract.json)
  - The general idea is to be able to run the downloader after extraction
    (should be fast as there are usually only a small number of new files),
    recreate the wiktionary-audios.tar file, and copy it to the website.
    This way we can update the audio files daily.  This does not handle
    modified existing files, so additionally we may want to re-download
    everything every few months.  (Maybe we could have --check-updates option
    that uses HTTP HEAD command to get file size and checks if it has changed;
    it could also check if the date is newer than the existing file
    modification date.  Alternative is to periodically re-download everything,
    which takes 1-2 weeks.)
  - As an alternative, we could perhaps use the existing Wikimedia commons
    download and first attempt to fetch the files from there?  Perhaps
    a --wikimedia-commons-archive=DIR option?


Add tests to ensure that etymology templates and audio URLs are properly extracted.
  - add a new unittest file under tests
  - take some sample articles, pass them to parse_page() (wiktextract/page.py),
    and check that the results are as expected (i..e, etymology templates
    correctly extracted, audio URLs properly extracted)
  - NOTE: you cannot expect templates to be expanded, let's discuss if there
    are major problems

PLAN for 7/2022:
  - continue bug fixing
  - converting tags to UniMorph and Universal Dependencies tags (as additional
    fields, leave existing tags field as-is)
  - start moving constant tables to separate files in preparation for
    parsing non-English Wiktionaries
  - encode links to other sections from translations (e.g.,
    "intelligence -- see intelligence" in wit/English/Translations)
  - make it easier to control tag bleed in inflection tables
  - code refactoring to make maintenance easier (especially page.py)

Moving tags to external files:
  - create directory wiktextract/wiktextract/data
  - language-specific data would go under
    wiktextract/wiktextract/data/<editioncode> (e.g., "en" is an <editioncode>)
  - non-language-specific data (e.g., tags) would go under
    wiktextract/wiktextract/data/shared
  - above directories need to be included in pypi distributions and
    must be installed; when used, code must properly search for them
    using pkg_resources package (see luaexec.py in wiktextract for an example)

  -