Skip to content

atw-9/dataset

 
 

Repository files navigation

Darija Open Dataset

Welcome to the Darija Open Dataset (DODa), an ambitious open-source project dedicated to the Moroccan dialect. With about 150,000 entries, DODa is arguably the largest open-source collaborative project for Darija <=> English translation built for Natural Language Processing purposes.

In fact, besides semantic categorization, DODa also adopts a syntactic one, presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugation of hundreds of verbs in different tenses, as well as more that 86,000 translated sentences.

Additionally, DODa takes into account the diversity of Darija spellings used in various contexts, making it a versatile resource for language enthusiasts and NLP practitioners. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications.

Our primary goal is to establish DODa as the go-to reference for NLP in Darija. By providing a robust and diverse dataset, we aim to facilitate the development of NLP applications that can cater to the specific linguistic needs of the Moroccan community.

While we have made significant progress in compiling and organizing the dataset, it's important to note that parts of the dataset are still either under review or in progress, especially in the sentences.csv file. We welcome contributions from the Moroccan IT community to help us refine and expand the dataset further, ensuring its accuracy and completeness. Together, we can build a powerful foundation for future NLP innovations tailored to Moroccan culture and language.


How to contribute

You're free to navigate straight to the AtlasIA interface and start your contributions 🔥🔥.

Otherwise, if you prefer using dev tools, we've made a detailed video for you on how to contribute

TL;DW (Too Long Didn't Watch):

  1. Go to Issues
  2. Choose one and comment to have it assigned to you
  3. Fork the Dataset Repository
  4. Translate and fix typos in the file corresponding to your assigned issue
  5. Open a Pull Request

Thank you for your contribution!!!

Guidelines / Recommendations

  1. 3ndk ح dir ح xD (shout-out to this guy 😆), often try to use:
darija 3 7 9 8 2 - 'a' - 'i' 5 - 'kh'
arabic ع ح ق ه همزة خ
  1. Try to use capitalization to differentiate between the following letters:
t T s S d D
ت ط س ص د ض
  1. Arabic characters with two-letters Latin equivalent:
Arabic alphabet ش غ خ
Latin alphabet ch gh kh
  1. Double characters to refer to the emphasis or "الشدة":
darija 7mam 7mmam
english pigeons bathroom
  1. We usually don't add "e" in the end of darija words : louz instead of louze

  2. We usually don't use "Z" or "th" for ظ ، ذ ، ث , because we generally don't use these letters in darija (except in northern Morocco, but for the sake of simplicity, we are focusing primarily on standard darija)

  3. When using commas, don't forget to surround the expression by quotation marks (as we are using csv files)

  4. We use spaces as word delimiters, not _ nor - : thank you instead of thank_you

  5. Respect the number of columns in every row you add, you can use empty quotation marks "", or just empty placeholder, in case you don't have extra variations

"sou9","souk","","سوق","market"

sou9,souk,,مارشي,market

  1. In each row, always start with the most used form (in your opinion of course) of the word in question

  2. For future use of this dataset, try to reserve each row to similar variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two different rows:

sou9,souk,souq,سوق,market

marchi,,,مارشي,market

  1. verbs.csv: The darija translation is reserved to the past tense of the third pronoun "he", whereas the other pronouns and tenses are handled in separate files. The English translation present the basic form (or root) of the English verb.

ghnna,ghenna,ghanna,,,,غنّا,sing

  1. masculine_feminine_plural.csv: If it does exist, feminine-plural translation column is for nouns. Regarding adjectives feminine-plural = feminine.

PyDODa - Python wrapper for the DODa

Python Badge

Pydoda is a comprehensive Python library that simplifies access and analysis of the DODa dataset. It enables effortless exploration of linguistic content for researchers, developers, and language enthusiasts by providing an intuitive interface for accessing various dataset categories, retrieving spellings and translations.

Integrating Pydoda into your Python workflow grants access to a wide range of functionalities, facilitating insights extraction from the DODa dataset, including semantic and syntactic analysis, translation retrieval, spelling variations exploration, and more.

Usage example

Pydoda could easily be installed using pip:

pip install pydoda

Here is a small code snippet:

from pydoda import Category

# Create an instance of Category
my_category = Category('semantic', 'animals')

# Get the Darija translation of a word
darija_translation = my_category.get_darija_translation('dog')
print(darija_translation)
# Output: klb

# Get the English translation of a word
english_translation = my_category.get_english_translation('mch')
print(english_translation)
# Output: 'cat'

For further details, visit the official Pydoda GitHub repository & official Pydoda documentation.

Usage Terms

  • Research and Personal Use: You are welcome to use DODa for research, personal projects, and educational purposes, free of charge, in accordance with the terms of the open-source license

  • Commercial Use: For commercial purposes or any other usage not covered by the open-source license, please contact the copyright holders Aissam or Hamza to discuss licensing options and permissions.

Citation

@misc{outchakoucht2024evolution,
      title={The Evolution of Darija Open Dataset: Introducing Version 2}, 
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2024},
      eprint={2405.13016},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

darija <-> english dataset

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published