2024-11-13: Missing terms are hard to find. #753

Open
HeidiSeibold opened this issue Nov 13, 2024 · 3 comments
Labels
question Further information is requested

Comments

@HeidiSeibold

It is hard to find out which terms are missing and in which languages they are missing.

Ideas on how to solve:

  • Ask a large language model (for example: "Please give me the terms in the following YAML that exist in English (en) but not in German (de)")
  • Read the YAML into R/Python and reformat it into a table, then build a missing-data table (possible format: rows are the terms, columns are the languages; the definitions are not needed to find the gaps)
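
The second idea could be sketched in Python roughly as follows. This is a minimal sketch, not a tested tool: it assumes each glossary entry is a mapping with a `slug`, an optional `ref`, and one block per language code, and that every other key is a language. The names `load_glossary`, `presence_table`, and `NON_LANGUAGE_KEYS` are hypothetical, the sample data is made up, and loading the real file would need the third-party PyYAML package.

```python
from typing import Any

# Assumption: every key other than these is a language code such as "en" or "de".
NON_LANGUAGE_KEYS = {"slug", "ref"}


def load_glossary(path: str) -> list[dict[str, Any]]:
    """Parse a local copy of glossary.yml (requires PyYAML)."""
    import yaml  # imported lazily so the rest of the sketch has no dependency

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def presence_table(entries: list[dict[str, Any]]) -> dict[str, dict[str, bool]]:
    """Map each slug to {language: True/False} over all languages seen."""
    languages = sorted(
        {key for entry in entries for key in entry if key not in NON_LANGUAGE_KEYS}
    )
    return {
        entry["slug"]: {lang: lang in entry for lang in languages}
        for entry in entries
    }


# Made-up sample in the assumed shape of the glossary file.
sample = [
    {
        "slug": "alpha",
        "en": {"term": "alpha", "def": "sample definition"},
        "de": {"term": "Alpha", "def": "Beispieldefinition"},
    },
    {
        "slug": "beta",
        "ref": ["alpha"],
        "en": {"term": "beta", "def": "sample definition"},
    },
]

table = presence_table(sample)
for slug, row in table.items():
    print(slug, row)
```

With a local copy of the file, `presence_table(load_glossary("glossary.yml"))` would produce the same kind of table for the real glossary, provided the assumptions above hold.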

Any other ideas?

@HeidiSeibold added the question label (Further information is requested) on Nov 13, 2024
@tobyhodges
Member

Somewhat related: the checklist in #239 is quite outdated. A lot of the unchecked terms have now been defined.

@HeidiSeibold
Author

HeidiSeibold commented Nov 13, 2024

I wrote R code to do this:

library(yaml)
library(dplyr)
library(tidyr)

# Load the glossary (a list of term entries) from the YAML file
yaml_data <- yaml.load_file("https://github.com/HeidiSeibold/glosario/raw/refs/heads/HeidiSeibold-glosario-1/glossary.yml",
                            as.named.list = TRUE)

# Keys of an entry that are not language codes
non_language_keys <- c("slug", "ref")

# Build one (slug, language) row for every language in which an entry exists
terms_data <- lapply(yaml_data, function(entry) {
  langs <- setdiff(names(entry), non_language_keys)
  if (length(langs) == 0) {
    return(NULL)  # no language blocks; bind_rows() drops NULL elements
  }
  data.frame(slug = entry$slug, language = langs)
})

# Combine all individual data frames into one
terms_df <- bind_rows(terms_data)

# Convert to wide format with 1 for presence and 0 for absence
df_wide <- terms_df %>%
  mutate(present = 1) %>%  # Add a column to indicate presence
  pivot_wider(
    names_from = language,  # Make language codes the columns
    values_from = present,  # Use presence indicator
    values_fill = list(present = 0)  # Fill missing combinations with 0
  )

# View the final table
print(df_wide)

Rows: slugs
Columns: Languages
Values: 1 if the word has an entry in that language, 0 if not
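
Given a table of this shape, the gaps per language can be read off directly. A minimal Python sketch under the assumption that the table is available as a nested mapping `{slug: {language: 0 or 1}}`; the function name `missing_by_language` and the sample data are made up for illustration.

```python
def missing_by_language(presence: dict[str, dict[str, int]]) -> dict[str, list[str]]:
    """Return, for each language, the slugs that have no entry in it."""
    missing: dict[str, list[str]] = {}
    for slug, row in presence.items():
        for lang, present in row.items():
            # Make sure every language shows up, even with no gaps.
            missing.setdefault(lang, [])
            if not present:
                missing[lang].append(slug)
    return missing


# Made-up presence table: rows are slugs, columns are languages.
presence = {
    "alpha": {"en": 1, "de": 1},
    "beta": {"en": 1, "de": 0},
}

gaps = missing_by_language(presence)
print(gaps["de"])
```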

TODO:

  • Replace the link to my fork with the link to the main repo
  • Export the result as a table that is visible to contributors, so that they can spot and fill in missing entries
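
One possible way to approach the export item is to render the presence table as a Markdown table, which GitHub displays in issues and README files. This is only a sketch: the function name `to_markdown`, the input shape `{slug: {language: 0 or 1}}`, and the sample data are assumptions, not part of the R code above.

```python
def to_markdown(presence: dict[str, dict[str, int]], languages: list[str]) -> str:
    """Render a presence table as a Markdown table with 1/0 cells."""
    header = "| slug | " + " | ".join(languages) + " |"
    divider = "| --- | " + " | ".join("---" for _ in languages) + " |"
    rows = []
    for slug in sorted(presence):
        # A language missing from the row counts as absent (0).
        cells = ["1" if presence[slug].get(lang) else "0" for lang in languages]
        rows.append("| " + slug + " | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


# Made-up presence table for illustration.
presence = {
    "beta": {"en": 1, "de": 0},
    "alpha": {"en": 1, "de": 1},
}

markdown = to_markdown(presence, ["en", "de"])
print(markdown)
```

The resulting text could be pasted into an issue or written to a file in the repository; which of the two suits contributors better is left open here.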

@elletjies
Member

@HeidiSeibold thank you so much for this. We are also working on the issue in #752.
