2024-11-13: Missing terms are hard to find. #753

HeidiSeibold · 2024-11-13T11:14:36Z

It is hard to find out which terms are missing, and in which languages they are missing.

Ideas on how to solve:

Ask a large language model (for example: "Please give me the terms in the following YAML that exist in English (en) but not in German (de)")
Read the YAML into R or Python and reshape it into a table, then build a missing-data table (one possible format: rows are the terms, columns are the languages; the definitions are not needed to find the gaps)
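A minimal Python sketch of the second idea, for illustration only. It assumes the glossary has already been parsed into a list of dicts (for example with PyYAML's `yaml.safe_load`), that every key other than `slug` and `ref` is a language code, and it uses made-up sample entries. The names `presence_table` and `NON_LANGUAGE_KEYS` are hypothetical.

```python
# Sketch only: `sample_entries` mimics what a YAML parser might return for the
# glossary (a list of dicts). The entries below are made up for illustration.
NON_LANGUAGE_KEYS = {"slug", "ref"}  # assumed: all other keys are language codes


def presence_table(entries):
    """Return (languages, rows); each row maps a slug to a 1/0 flag per language."""
    languages = sorted(
        {key for entry in entries for key in entry if key not in NON_LANGUAGE_KEYS}
    )
    rows = {
        entry["slug"]: [1 if lang in entry else 0 for lang in languages]
        for entry in entries
    }
    return languages, rows


sample_entries = [
    {"slug": "absolute_path", "en": {"term": "absolute path"}, "de": {"term": "absoluter Pfad"}},
    {"slug": "git", "ref": ["absolute_path"], "en": {"term": "Git"}},
]

languages, rows = presence_table(sample_entries)
print("slug", *languages)
for slug, flags in rows.items():
    print(slug, *flags)
```

Each printed row is one term and each column is one language, so a 0 marks a gap that still needs a translation.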

Any other ideas?

tobyhodges · 2024-11-13T11:38:51Z

Somewhat related: the checklist in #239 is quite outdated. A lot of the unchecked terms have now been defined.

HeidiSeibold · 2024-11-13T11:59:15Z

I wrote R code to do this:

library(yaml)

# Load the YAML file
yaml_data <- yaml.load_file("https://github.com/HeidiSeibold/glosario/raw/refs/heads/HeidiSeibold-glosario-1/glossary.yml",
                            as.named.list = TRUE)


# Initialize a list to hold one (slug, language) record per translation
terms_data <- list()

# Loop through each term entry in the YAML data
for (entry in yaml_data) {
  slug <- entry$slug

  # Every field except 'slug' and 'ref' is treated as a language code
  for (lang in setdiff(names(entry), c("slug", "ref"))) {
    # Record that this slug has an entry in this language
    terms_data <- append(terms_data, list(data.frame(
      slug = slug,
      language = lang
    )))
  }
}

# Combine all individual data frames into one
terms_df <- do.call(rbind, terms_data)

# Pivot the data to have languages as columns
library(dplyr)
library(tidyr)

# Convert to wide format with 1 for presence and 0 for absence
df_wide <- terms_df %>%
  mutate(present = 1) %>%  # Add a column to indicate presence
  pivot_wider(
    names_from = language,  # Make language codes the columns
    values_from = present,  # Use presence indicator
    values_fill = list(present = 0)  # Fill missing combinations with 0
  )

# View the final table
print(df_wide)

Rows: slugs
Columns: Languages
Values: 1 if the word has an entry in that language, 0 if not

TODO:

Replace the link to my fork with a link to the main repo
Export the result as a table that is visible to contributors, so that they can detect and fill in missing entries
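One way the export step in the TODO could look, sketched in Python rather than R. It assumes the presence matrix is available as a list of language codes plus a dict of 0/1 flags per slug; the sample values, the `MISSING` marker, and the function name `to_markdown` are all hypothetical choices for illustration.

```python
# Sketch only: render a presence matrix as a Markdown table that could be
# pasted into an issue or saved in the repo. Sample data is made up.
def to_markdown(languages, rows):
    """Return a Markdown table with one row per slug and one column per language."""
    header = "| slug | " + " | ".join(languages) + " |"
    divider = "|" + " --- |" * (len(languages) + 1)
    body = [
        "| " + slug + " | "
        + " | ".join("x" if flag else "MISSING" for flag in flags)
        + " |"
        for slug, flags in sorted(rows.items())
    ]
    return "\n".join([header, divider, *body])


languages = ["de", "en"]
rows = {"git": [0, 1], "absolute_path": [1, 1]}

table = to_markdown(languages, rows)
print(table)
```

Marking gaps with a word instead of 0 makes them easy to find with a plain text search; the resulting string could be written to a file that contributors can browse.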

elletjies · 2024-11-15T09:17:38Z

@HeidiSeibold thank you so much for this - we are also working on this issue in #752

HeidiSeibold added the "question" label (Further information is requested) on Nov 13, 2024

elletjies mentioned this issue Nov 15, 2024

Language links removed #752

Open
