Dataset: Collection and Description #6

Open
Team-thedatatribune opened this issue Oct 11, 2021 · 5 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@Team-thedatatribune
Contributor

Team-thedatatribune commented Oct 11, 2021

Dataset Requirements 📦📋

TL;DR 🥱

This issue is a great starting point for beginners in the open-source community. Here you can:

  • share authentic data
  • contribute new APIs for collecting it
  • provide original scripts (in any language you prefer) for data collection and/or preparation, such as cleaning

Issue Description:

For the dyPixa project, this task covers gathering and thoroughly documenting the datasets used to train and test the machine learning models. It addresses the following key aspects:

  1. Data Collection Scripts: Define a systematic approach and Python (preferred) code for sourcing diverse datasets. This may include acquiring text data from social media, product reviews, and news articles, as well as images with associated sentiments from public image repositories.
  2. Dataset Documentation: Document the specifics of each existing or required dataset (source, size, language distribution, and preprocessing) in Documentation #7 once it is uploaded, or raise any concerns about them here. Refer to Documentation #7 for detailed documentation guidelines.
  3. Data Quality Assurance: Ensure dataset integrity and consistency, and confirm that the data is taken from authentic sources.
  4. Multilingual Considerations: Explore strategies for multilingual datasets.
  5. Collaboration with Contributors: Engage contributors in dataset sourcing and curation.
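
To make the documentation item above more concrete, here is a minimal sketch of how the size, source, and language distribution of a text dataset could be summarized before writing it up. The record layout (dicts with "text", "source", and "language" keys) and the sample rows are assumptions made for illustration, not a format the project has agreed on.

```python
from collections import Counter


def summarize_dataset(records):
    """Return the size plus source and language distributions of a dataset.

    Each record is assumed to be a dict with the hypothetical keys
    "text", "source", and "language".
    """
    sources = Counter(r["source"] for r in records)
    languages = Counter(r["language"] for r in records)
    return {
        "size": len(records),
        "sources": dict(sources),
        "languages": dict(languages),
    }


# Made-up rows, only to show the expected shape of the input.
sample = [
    {"text": "Great phone, love the camera", "source": "product_review", "language": "en"},
    {"text": "यह फिल्म बहुत अच्छी थी", "source": "social_media", "language": "hi"},
    {"text": "Markets closed higher today", "source": "news", "language": "en"},
]

summary = summarize_dataset(sample)
print(summary)
```

The resulting dictionary could be pasted into the dataset description, or extended with preprocessing notes once the documentation guidelines are available.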

Types of Data Needed:

For the NLP and color suggestion models to be highly usable and effective, the following types of data should be considered:

  • Text Data:

    • Social media posts
    • Product reviews
    • News articles
    • Sentiment-labeled text in English and Hindi
    • Multilingual text data to enhance language support
  • Image Data:

    • Images with associated sentiment labels
    • Diverse images representing a wide range of emotions
    • Abstract images showcasing various color combinations
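
One possible way to keep contributions for the data types above consistent is a small shared record schema. The sketch below is only a suggestion: the class names, field names, and the `is_valid` helper are hypothetical and not part of the dyPixa codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSample:
    text: str
    sentiment: str
    language: str  # e.g. "en" or "hi"
    source: str    # e.g. "social_media", "product_review", "news"


@dataclass(frozen=True)
class ImageSample:
    path: str
    sentiment: str
    source: str


def is_valid(sample):
    """A sample is usable only if every field is a non-empty string."""
    return all(
        isinstance(value, str) and bool(value.strip())
        for value in vars(sample).values()
    )


review = TextSample("Loved it", "joy", "en", "product_review")
print(is_valid(review))
```

Frozen dataclasses are used here so that a sample cannot be modified by accident after it has been checked.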

By addressing these components and collecting the appropriate types of data, this issue will lay the foundation for robust machine learning model development and further enhancements in the dyPixa project. Your contributions here will greatly advance the project's capabilities. 🚀🌈

@Team-thedatatribune Team-thedatatribune added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Oct 11, 2021
@ravi-prakash1907 ravi-prakash1907 changed the title from "Define the Dataset" to "Dataset: Collection and Description" Sep 29, 2023
@ravi-prakash1907 ravi-prakash1907 pinned this issue Sep 29, 2023
@ravi-prakash1907 ravi-prakash1907 unpinned this issue Oct 4, 2023
@dharmraj617

Hey, I am currently working on ML applications. I have some experience in Data Collection. Please Assign this issue to me.

@Addy0000
Contributor

heya, i'd like to work on writing python scripts for collecting data.

@Team-thedatatribune
Contributor Author

> Hey, I am currently working on ML applications. I have some experience in Data Collection. Please Assign this issue to me.

@dharmraj617, we require a diverse dataset of poetic content gathered from various platforms, including:

  1. Social media platforms such as Twitter.
  2. News editorial sections.
  3. Haiku poetry, and more.

Your assistance in creating this dataset would be greatly appreciated, with the following key considerations in mind:

  1. Each data point (in this case, poems) should be concise, consisting of no more than 3-4 lines.
  2. We are primarily focused on English poems.
  3. Ensure proper data cleaning, such as removing emojis and extraneous characters.
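
The considerations above could be sketched as a small cleaning step like the one below. It is a rough starting point under stated assumptions: the upper limit is taken as 4 lines, "extraneous characters" is interpreted as anything outside ASCII letters, digits, and common punctuation (so accented letters are dropped too), and the function name `clean_poem` is made up for this example.

```python
import re

MAX_LINES = 4  # the request above asks for no more than 3-4 lines

# Keep ASCII letters, digits, whitespace, and common punctuation; drop the rest
# (emojis, hashtags' "#", stray symbols). Assumes English-only poems.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")


def clean_poem(raw):
    """Return the cleaned poem, or None if it is empty or longer than MAX_LINES."""
    lines = []
    for line in raw.splitlines():
        line = _DISALLOWED.sub("", line)
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after cleaning
            lines.append(line)
    if not lines or len(lines) > MAX_LINES:
        return None
    return "\n".join(lines)


haiku = "An old silent pond \U0001F338\nA frog jumps in ##\nSplash! Silence again"
print(clean_poem(haiku))
```

Poems that are too long are rejected rather than truncated, on the assumption that a cut-off poem is worse training data than a missing one.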

For further discussion and information, please join the dyPixa Discord server. We look forward to your valuable contributions! 🙌

@ravi-prakash1907
Contributor

ravi-prakash1907 commented Oct 16, 2023

> heya, i'd like to work on writing python scripts for collecting data.

@Addy0000, we currently have a program (here) that's been trained on go_emotions, capable of classifying any given (English) text into one of 28 different emotions.

Now, we're on an exciting new mission. We need a dataset to generate and recommend colors for each of these sentiments. It would be fantastic if you could contribute by providing:

  1. Images/Thumbnails corresponding to each of the 28 emotions.
  2. The finest color sets corresponding to each emotion (at least 5 for each emotion).

For a detailed description, I recommend visiting issue #58.

You can find the complete list of all 28 emotions at https://huggingface.co/SamLowe/roberta-base-go_emotions. 🎨
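
Before submitting color sets, a contributor could run a quick completeness check like this sketch. The data layout (a dict mapping each emotion label to a list of palettes of "#RRGGBB" strings) and the function name are assumptions for illustration; the example passes only two labels, whereas a real run would pass all 28 labels from the model card linked above.

```python
import re

MIN_PALETTES = 5  # "at least 5 for each emotion"
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def find_palette_problems(palettes, emotions):
    """Return a list of human-readable problems; an empty list means all good.

    `palettes` maps an emotion label to a list of palettes, where each
    palette is a list of "#RRGGBB" strings (a hypothetical layout).
    """
    problems = []
    for emotion in emotions:
        sets = palettes.get(emotion, [])
        if len(sets) < MIN_PALETTES:
            problems.append(f"{emotion}: {len(sets)} palette(s), need {MIN_PALETTES}")
        for palette in sets:
            for color in palette:
                if not HEX_COLOR.fullmatch(color):
                    problems.append(f"{emotion}: invalid color {color!r}")
    return problems


# Illustrative input: "joy" is complete, "anger" is short and has a bad value.
palettes = {
    "joy": [["#FFD700", "#FFA500"]] * 5,
    "anger": [["#B22222", "red"]],
}
problems = find_palette_problems(palettes, ["joy", "anger"])
print(problems)
```

Returning a list of problems instead of raising on the first one lets a contributor fix everything in a single pass.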

I'll assign you the issue if you're interested.

@Addy0000
Contributor

@ravi-prakash1907 i went through it, would like to work on it


4 participants