Enterprise-grade training data curation for LLM fine-tuning with professional web scraping & AI-powered workflows.
DataMint transforms raw content into high-quality training datasets for machine learning models. With professional web scraping, AI-powered content generation, and enterprise-grade quality control, it gives you everything needed to build world-class datasets.
- Decodo Integration – Enterprise-grade scraping with authentication.
- Smart Fallback System – Never lose content due to scraping errors.
- Optimized for Target Types – News, e-commerce, blogs, and more.
- Multi-Region Support – Scrape data globally.
- Parallel Processing – High-performance concurrent scraping (see the sketch after this list).
- Multiple AI Providers – OpenAI GPT-4, GPT-3.5, Anthropic Claude.
- Task Diversity – Q&A, classification, summarization, NER, red teaming.
- Quality Validation – Automated scoring & filtering.
- Cost Tracking – Real-time monitoring & budgeting.
- Prompt Templates – Optimized prompts for different tasks.
- Dual Input – Upload files or scrape URLs.
- Live Progress – Real-time updates.
- Analytics – Interactive charts and reports.
- Export Options – JSONL, CSV, HuggingFace format.
- Multi-Metric Evaluation – Toxicity, bias, coherence, diversity.
- Automated Filtering – Enforce quality thresholds.
- Detailed Reports – Exportable validation reports.
- Error Recovery – Robust logging & error handling.
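A minimal sketch of the fallback-plus-parallelism idea, using only the standard library and `requests`; the Decodo endpoint URL and function names here are illustrative assumptions, not DataMint's actual API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical endpoint -- stands in for Decodo's authenticated scraping API.
DECODO_URL = "https://scraper-api.decodo.example/v1/scrape"

def fetch(url: str, timeout: int = 30) -> str:
    """Try the authenticated scraper first, then fall back to a plain GET."""
    try:
        resp = requests.post(
            DECODO_URL,
            json={"url": url},
            auth=(os.environ["DECODO_USERNAME"], os.environ["DECODO_PASSWORD"]),
            timeout=timeout,
        )
        resp.raise_for_status()
        return resp.text
    except Exception:
        # Smart fallback: a direct request so content is never lost outright.
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text

def fetch_all(urls: list[str], max_workers: int = 8) -> list[str]:
    """Scrape many URLs concurrently with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```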
Install the dependencies:

```bash
pip install -r requirements.txt
```
Create a `.env` file in the project root:

```env
# OpenAI API key
OPENAI_API_KEY=your_openai_api_key_here

# Decodo web scraping
DECODO_USERNAME=your_decodo_username
DECODO_PASSWORD=your_decodo_password
DECODO_BASIC_AUTH=encoded_basic_auth_token
```
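If you load these values in your own scripts, `python-dotenv` is the usual route; a minimal sketch, assuming that package is installed:

```python
import os

from dotenv import load_dotenv

# Reads .env from the working directory and populates os.environ.
load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```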
Launch the Streamlit dashboard:

```bash
streamlit run src/training_data_bot/dashboard/app.py
```

Or use the CLI:
```bash
# Process documents
tdb process --source-dir ./documents --output-dir ./results

# Generate tasks
tdb generate qa --input-file document.txt --output-file qa.jsonl

# Evaluate quality
tdb evaluate --dataset-file results.jsonl --output-report quality.html
```
- File types: PDF, DOCX, TXT, MD, HTML, CSV, JSON
- Web sources: Wikipedia, blogs, news, e-commerce, technical docs, research papers
- Q&A Generation – comprehension, factual, analytical, MCQs.
- Classification – topic, sentiment, difficulty, content type.
- Summarization – extractive, abstractive, multi-length.
- NER – people, orgs, locations, dates.
- Red Teaming – safety, bias, adversarial prompts.
- Quality Metrics – toxicity, bias, coherence, diversity, relevance (threshold filtering sketched below).
- Cost Tracking – usage reports, provider comparison.
- Performance – processing speed, success rates, trend analysis.
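To make the thresholding concrete, here is a minimal sketch of score-based filtering; the metric names mirror the list above, but the record shape and scoring rule are assumptions for illustration, not DataMint internals:

```python
from statistics import mean

# Each generated example carries per-metric scores in [0, 1], higher is better
# (toxicity/bias assumed already inverted so that 1.0 means "safe").
examples = [
    {"text": "Q: What is overfitting? A: ...",
     "scores": {"coherence": 0.9, "relevance": 0.85, "diversity": 0.7}},
    {"text": "Q: ??? A: ...",
     "scores": {"coherence": 0.3, "relevance": 0.4, "diversity": 0.6}},
]

QUALITY_THRESHOLD = 0.7  # mirrors QUALITY_THRESHOLD in .env

def passes(example: dict, threshold: float = QUALITY_THRESHOLD) -> bool:
    """Keep an example only if its mean metric score clears the threshold."""
    return mean(example["scores"].values()) >= threshold

filtered = [ex for ex in examples if passes(ex)]
print(f"kept {len(filtered)} of {len(examples)} examples")
```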
```text
DataMint Architecture
├── Web Scraping Layer (Decodo + fallback)
├── Document Processing Pipeline
├── Text Preprocessing & Chunking
├── Task Management System
├── AI Generation Modules
├── Quality Control Layer
├── Export & Storage
└── Streamlit Dashboard
```
```text
DataMint/
├── src/training_data_bot/    # Core logic & pipelines
│   ├── dashboard/             # Streamlit UI
│   └── cli/                   # Command line interface
├── tests/                     # Unit tests
├── configs/                   # YAML configurations
├── requirements.txt           # Python dependencies
└── .env.example               # Environment example
```
Key environment variables:

```env
OPENAI_API_KEY=sk-...
DECODO_USERNAME=...
DECODO_PASSWORD=...
ANTHROPIC_API_KEY=sk-ant-...
MAX_WORKERS=8
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
QUALITY_THRESHOLD=0.7
```
Pipeline options can also be set in YAML (under `configs/`):

```yaml
processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_workers: 8

quality:
  threshold: 0.7
  metrics: [toxicity, bias, diversity, coherence, relevance]

ai:
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 1000
```
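The `chunk_size`/`chunk_overlap` settings describe standard sliding-window chunking; a minimal sketch of that scheme (character-based here for simplicity; the real pipeline may split on tokens or sentences instead):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so context spans chunk boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("word " * 600)  # ~3000 characters
print(len(chunks), len(chunks[0]))  # 4 chunks, the first exactly 1000 characters
```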
- Education – turn textbooks into Q&A, quizzes, flashcards.
- Enterprise – turn documentation into searchable KBs and training data.
- Research – summarizing papers, annotation, bias testing.
- Compliance – regulatory training datasets.
Programmatic usage (assuming an initialized `bot` from the `training_data_bot` package):

```python
sources = [
    "https://wikipedia.org/wiki/AI",
    "documents/research.pdf",
    "https://news.example.com/article",
]

dataset = await bot.process_sources(
    sources,
    task_types=["qa_generation", "classification"],
    quality_threshold=0.8,
)
```
Custom task templates can be defined with `TaskTemplate`:

```python
custom_task = TaskTemplate(
    name="Custom QA Generator",
    task_type=TaskType.QA_GENERATION,
    prompt_template="""
    Create domain-specific questions from: {text}
    Focus: {domain}
    Difficulty: {difficulty}
    """,
    parameters={"domain": "machine_learning", "difficulty": "advanced"},
)
```
Datasets can be exported in several formats:

- JSONL
- CSV
- HuggingFace datasets
- Custom formats
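JSONL is the simplest of these to produce by hand; a minimal sketch using only the standard library (the record shape is illustrative):

```python
import json

records = [
    {"question": "What is gradient descent?", "answer": "An iterative optimization method..."},
    {"question": "Define overfitting.", "answer": "When a model memorizes training noise..."},
]

# JSONL: one JSON object per line -- the common format for fine-tuning data.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```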
To contribute:

- Fork the repo
- Create a feature branch (`git checkout -b feature-name`)
- Commit (`git commit -m 'Add feature'`)
- Push & open a PR
MIT License © 2025 Pratik Mandalkar
- OpenAI – GPT models
- Anthropic – Claude
- Decodo – web scraping infra
- Streamlit – dashboard
- HuggingFace – ML ecosystem
Transform raw content into professional training data with DataMint.