🧠 DataMint – Training Data Curation Bot

DataMint Logo

Enterprise-grade training data curation for LLM fine-tuning with professional web scraping & AI-powered workflows.


🌟 Overview

DataMint transforms raw content into high-quality training datasets for machine learning models. With professional web scraping, AI-powered content generation, and enterprise-grade quality control, it gives you everything needed to build world-class datasets.


✨ Key Features

🌐 Professional Web Scraping

  • Decodo Integration – Enterprise-grade scraping with authentication.
  • Smart Fallback System – Never lose content to scraping errors (see the sketch after this list).
  • Optimized for Target Types – News, e-commerce, blogs, and more.
  • Multi-Region Support – Scrape data globally.
  • Parallel Processing – High-performance concurrent scraping.
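
A minimal sketch of the fallback idea: try the Decodo scraper first, and fall back to a plain HTTP fetch so a scraping error never drops a source. The endpoint URL and request shape here are illustrative assumptions, not the documented Decodo API:

import os
import requests

def fetch(url: str, timeout: int = 30) -> str:
    """Fetch page content via Decodo, falling back to a direct request."""
    try:
        # Hypothetical Decodo scrape endpoint; check your Decodo dashboard for the real one.
        resp = requests.post(
            "https://scrape.decodo.example/v1/scrape",
            json={"url": url},
            auth=(os.environ["DECODO_USERNAME"], os.environ["DECODO_PASSWORD"]),
            timeout=timeout,
        )
        resp.raise_for_status()
        return resp.text
    except Exception:
        # Fallback: plain GET so the pipeline still receives content.
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text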

🎯 AI-Powered Content Generation

  • Multiple AI Providers – OpenAI GPT-4, GPT-3.5, Anthropic Claude (generation sketch after this list).
  • Task Diversity – Q&A, classification, summarization, NER, red teaming.
  • Quality Validation – Automated scoring & filtering.
  • Cost Tracking – Real-time monitoring & budgeting.
  • Prompt Templates – Optimized prompts for different tasks.
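
As a rough sketch of the generation step (assuming the official openai Python client, v1+): one text chunk goes in, a prompt template asks for Q&A pairs, and the model's reply becomes candidate training examples:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa(chunk: str, n: int = 3) -> str:
    """Ask the model for n question-answer pairs grounded in the chunk."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0.7,
        messages=[
            {"role": "system", "content": "You write question-answer pairs for training data."},
            {"role": "user", "content": f"Write {n} Q&A pairs based only on this text:\n\n{chunk}"},
        ],
    )
    return resp.choices[0].message.content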

📊 Enterprise Dashboard

  • Dual Input – Upload files or scrape URLs.
  • Live Progress – Real-time updates.
  • Analytics – Interactive charts and reports.
  • Export Options – JSONL, CSV, HuggingFace format.

πŸ” Quality Control

  • Multi-Metric Evaluation – Toxicity, bias, coherence, diversity.
  • Automated Filtering – Enforce quality thresholds (see the sketch after this list).
  • Detailed Reports – Exportable validation reports.
  • Error Recovery – Robust logging & error handling.
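
The filtering step reduces to: score each example on several metrics, average (or weight) the scores, and keep only examples above the configured threshold. A minimal sketch with toy scorers; the real pipeline plugs in toxicity/bias/coherence models:

from statistics import mean

def passes_quality(text, scorers, threshold=0.7):
    """Average per-metric scores (each in [0, 1], higher is better) against the threshold."""
    return mean(fn(text) for fn in scorers.values()) >= threshold

# Toy scorers for illustration only.
scorers = {
    "substance": lambda t: min(len(t) / 200, 1.0),
    "clean": lambda t: 0.0 if "lorem ipsum" in t.lower() else 1.0,
}
candidates = [{"text": "Q: What is overfitting? A: Memorizing noise in the training set."}]
kept = [ex for ex in candidates if passes_quality(ex["text"], scorers)]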

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

# OpenAI API Key
OPENAI_API_KEY=your_openai_api_key_here

# Decodo Web Scraping
DECODO_USERNAME=your_decodo_username
DECODO_PASSWORD=your_decodo_password
DECODO_BASIC_AUTH=encoded_basic_auth_token
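
A quick way to confirm the file is picked up (assuming the python-dotenv package):

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "DECODO_USERNAME", "DECODO_PASSWORD"):
    print(key, "set" if os.getenv(key) else "MISSING")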

3. Launch Dashboard

streamlit run src/training_data_bot/dashboard/app.py

4. Use CLI

# Process documents
tdb process --source-dir ./documents --output-dir ./results

# Generate tasks
tdb generate qa --input-file document.txt --output-file qa.jsonl

# Evaluate quality
tdb evaluate --dataset-file results.jsonl --output-report quality.html

📂 Supported Content Types

File Types: PDF, DOCX, TXT, MD, HTML, CSV, JSON
Web: Wikipedia, blogs, news, e-commerce, technical docs, research papers
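
Internally this amounts to routing each source to the right loader by extension, or to the scraper for URLs. A hedged sketch of that dispatch, with loader names as placeholders rather than the project's actual functions:

from pathlib import Path

# Placeholder loader names; the real pipeline maps these to pypdf, python-docx, etc.
LOADERS = {
    ".pdf": "load_pdf", ".docx": "load_docx", ".txt": "load_text",
    ".md": "load_text", ".html": "load_html", ".csv": "load_csv", ".json": "load_json",
}

def route(source: str) -> str:
    """Pick a handler name for a local file or URL."""
    if source.startswith(("http://", "https://")):
        return "scrape_url"
    return LOADERS[Path(source).suffix.lower()]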


🎨 Training Data Types

  • Q&A Generation – comprehension, factual, analytical, MCQs (example records after this list).
  • Classification – topic, sentiment, difficulty, content type.
  • Summarization – extractive, abstractive, multi-length.
  • NER – people, orgs, locations, dates.
  • Red Teaming – safety, bias, adversarial prompts.
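
For a feel of the output shape, two illustrative JSONL records for these task types (field names are examples, not a guaranteed schema):

{"task": "qa_generation", "question": "What is gradient descent?", "answer": "An iterative optimization method that follows the negative gradient.", "source": "documents/ml_notes.pdf"}
{"task": "classification", "text": "Stock markets rallied today on strong earnings.", "label": "finance"}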

📊 Analytics & Reporting

  • Quality Metrics – toxicity, bias, coherence, diversity, relevance.
  • Cost Tracking – usage reports, provider comparison.
  • Performance – processing speed, success rates, trend analysis.

πŸ—οΈ Architecture

DataMint Architecture
├── 🌐 Web Scraping Layer (Decodo + fallback)
├── 📄 Document Processing Pipeline
├── 🔄 Text Preprocessing & Chunking
├── 🎯 Task Management System
├── 🤖 AI Generation Modules
├── 🔍 Quality Control Layer
├── 📊 Export & Storage
└── 🖥️ Streamlit Dashboard
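
The preprocessing layer's chunking step is a sliding window over the text: CHUNK_SIZE characters per chunk with CHUNK_OVERLAP characters of overlap, so content at chunk boundaries is not lost. A minimal character-based sketch (the real pipeline may count tokens instead):

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows (defaults match the config below)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]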

📋 Project Structure

DataMint/
├── src/training_data_bot/   # Core logic & pipelines
│   ├── dashboard/           # Streamlit UI (see the Quick Start launch path)
│   └── cli/                 # Command line interface (tdb)
├── tests/                   # Unit tests
├── configs/                 # YAML configurations
├── requirements.txt         # Python dependencies
└── .env.example             # Environment example

🔒 Configuration

Environment Variables

OPENAI_API_KEY=sk-...
DECODO_USERNAME=...
DECODO_PASSWORD=...
ANTHROPIC_API_KEY=sk-ant-...
MAX_WORKERS=8
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
QUALITY_THRESHOLD=0.7

Config File (configs/config.yaml)

processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_workers: 8
quality:
  threshold: 0.7
  metrics: [toxicity, bias, diversity, coherence, relevance]
ai:
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 1000
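
Reading these values back in code (assuming PyYAML) is straightforward, with environment variables taking precedence:

import os
import yaml

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)

# Environment variables override the YAML defaults.
chunk_size = int(os.getenv("CHUNK_SIZE", cfg["processing"]["chunk_size"]))
quality_threshold = float(os.getenv("QUALITY_THRESHOLD", cfg["quality"]["threshold"]))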

🎯 Use Cases

  • Education – textbooks → Q&A, quizzes, flashcards.
  • Enterprise – documentation → searchable KB, training data.
  • Research – summarizing papers, annotation, bias testing.
  • Compliance – regulatory training datasets.

🛠️ Advanced Features

Batch Processing

# Inside an async context; `bot` is an initialized bot instance from the package (assumed API).
sources = [
    "https://wikipedia.org/wiki/AI",
    "documents/research.pdf",
    "https://news.example.com/article",
]
dataset = await bot.process_sources(
    sources,
    task_types=["qa_generation", "classification"],
    quality_threshold=0.8,
)

Custom Task

# TaskTemplate and TaskType are imported from the package's task module (assumed API).
custom_task = TaskTemplate(
    name="Custom QA Generator",
    task_type=TaskType.QA_GENERATION,
    prompt_template="""
    Create domain-specific questions from: {text}
    Focus: {domain}
    Difficulty: {difficulty}
    """,
    parameters={"domain": "machine_learning", "difficulty": "advanced"},
)

Export Formats

  • JSONL – one JSON object per line (sketch after this list)
  • CSV
  • HuggingFace datasets
  • Custom formats
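
JSONL is the simplest of these: one JSON object per line. A minimal export sketch, plus the round-trip back into a Hugging Face dataset (assuming the datasets library):

import json
from datasets import load_dataset  # pip install datasets

records = [{"question": "What is NER?", "answer": "Named entity recognition."}]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

ds = load_dataset("json", data_files="train.jsonl")  # back into a HF Dataset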

🤝 Contributing

  1. Fork the repo
  2. Create feature branch (git checkout -b feature-name)
  3. Commit (git commit -m 'Add feature')
  4. Push & open PR

📄 License

MIT License © 2025 Pratik Mandalkar


🙏 Acknowledgments

  • OpenAI – GPT models
  • Anthropic – Claude
  • Decodo – web scraping infra
  • Streamlit – dashboard
  • HuggingFace – ML ecosystem

Transform raw content into professional training data with DataMint 🚀
