[Dataset] Poetry dataset is required #26

Open
ravi-prakash1907 opened this issue Oct 2, 2023 · 2 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@ravi-prakash1907
Contributor

Description: 📝

This project requires an NLP model trained on a multilingual poetry dataset, with a current focus on English and Hindi. The dataset should meet the following constraints:

  1. English short poems.
  2. Hindi short poems.
  3. The datasets must be of high quality, sufficiently large, and diverse.
  4. Ensure that these short poems contain figures of speech.

For longer poems, consider the following contributions:

  1. Provide scripts to preprocess the data and divide lengthy poems into shorter segments.
  2. Provide the processed dataset.
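
A minimal sketch of such a preprocessing script is shown below. It assumes stanzas are separated by blank lines and that a "short" segment means at most `max_lines` lines; the function name `split_poem` and the default of 4 lines are illustrative choices, not requirements stated in this issue. Because it works line by line, it should apply to both English and Hindi text.

```python
import re


def split_poem(text: str, max_lines: int = 4) -> list[str]:
    """Split a poem into segments of at most `max_lines` lines.

    Assumption: stanzas are separated by one or more blank lines.
    Stanza boundaries are preserved; only stanzas longer than
    `max_lines` are cut into smaller pieces.
    """
    segments = []
    for stanza in re.split(r"\n\s*\n", text.strip()):
        # Drop empty lines and surrounding whitespace within the stanza.
        lines = [ln.strip() for ln in stanza.splitlines() if ln.strip()]
        # Emit the stanza in chunks of at most `max_lines` lines.
        for i in range(0, len(lines), max_lines):
            segments.append("\n".join(lines[i:i + max_lines]))
    return segments
```

Splitting on stanza boundaries first keeps lines that belong together in the same segment, which is likely to matter when the segment is later rendered on an image.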

Note:

  • It's crucial to collect short poems as the ultimate goal of dyPixa is to render the input text on the generated image.
  • Utilize publicly available datasets whenever possible.
  • If providing poems not openly available for analysis, ensure proper credit is given to the authors by adding an additional field for author names.
  • Poems may be the intellectual property of their authors; therefore, obtain authors' consent before uploading data to this repository.
  • For further dataset requirements, refer to Dataset: Collection and Description #6 for additional details.
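
One possible record layout with the author field mentioned above is sketched here, stored as JSON Lines. The field names (`text`, `language`, `author`, `source`) and the `PoemRecord` class are assumptions for illustration; the authoritative schema is whatever issue #6 specifies.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PoemRecord:
    # Hypothetical field names; adjust to the schema agreed in issue #6.
    text: str      # the short poem or poem segment
    language: str  # e.g. "en" or "hi"
    author: str    # credit for the poem's author
    source: str    # where the poem was obtained


def to_jsonl_line(record: PoemRecord) -> str:
    """Serialize one record as a single JSON Lines entry.

    ensure_ascii=False keeps Devanagari text readable in the file
    instead of escaping it as \\uXXXX sequences.
    """
    return json.dumps(asdict(record), ensure_ascii=False)
```

For example, `to_jsonl_line(PoemRecord(text="...", language="hi", author="...", source="..."))` yields one line that can be appended to a `.jsonl` file.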
@Nabanita29

I plan to curate a diverse dataset of poems from online repositories and public domain collections. With a focus on balanced sentiment representation and accurate annotations, I will ensure the dataset's quality and integrity.

@ravi-prakash1907
Contributor Author

That will be nice, @Nabanita29. Please try to collect multilingual data and, as mentioned, treat short poems as the priority.
You may join the community's Discord server for further discussion, queries, and suggestions!


Note: As the milestone "Dataset Collection" is nearing its deadline, pull requests associated with this issue will be considered a priority. 📅
