- Overview
- Features
- Installation
- Usage
- Project Structure
- Contributing
- License
- Acknowledgements
- Citations
## Overview

The Thinking Dataset Project creates a dataset to support a variety of data tasks and analyses. The project uses advanced technologies and tools to manage, process, and analyze data efficiently. Our Thinking Dataset technology relies on two key components: STaR self-teaching and STaR Case Studies.
STaR self-teaching is a method in which the dataset acts as a model and uses other models (a Mixture of Models, or MoM) to generate new datasets. This process helps the model improve its evaluation scores and produce synthetic datasets that are more accurate and effective than those created by humans.
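For intuition only, the following is a minimal, hypothetical sketch of such a generate-evaluate-keep loop. None of the function or variable names below come from this repository; they simply stand in for an ensemble of generator models, an evaluator, and the retained synthetic records.

```python
# Hypothetical sketch of a STaR-style self-teaching loop (not project code).
import random
from typing import Callable, List

def generate_candidates(seed: str, generators: List[Callable[[str], str]]) -> List[str]:
    """Ask each model in the mixture (MoM) to draft a synthetic record from a seed."""
    return [generate(seed) for generate in generators]

def evaluate(record: str) -> float:
    """Placeholder scorer; a real evaluator would measure accuracy and usefulness."""
    return random.random()

def self_teach(seeds: List[str], generators, threshold: float = 0.8) -> List[str]:
    """Keep only candidates that score above the threshold; the kept set becomes new training data."""
    kept = []
    for seed in seeds:
        for candidate in generate_candidates(seed, generators):
            if evaluate(candidate) >= threshold:
                kept.append(candidate)
    return kept

if __name__ == "__main__":
    toy_generators = [lambda s: f"model A draft of: {s}",
                      lambda s: f"model B draft of: {s}"]
    print(self_teach(["example seed document"], toy_generators))
```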
STaR Case Studies (Situation, Task, Action, and Result) are structured narratives used to illustrate how specific business challenges were addressed and the outcomes achieved. These case studies apply to our various datasets like Cablegate, which provide real-world seed data for generating comprehensive business insights.
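To make that structure concrete, a single case-study record could be represented as follows. This is an illustrative sketch only; the field names mirror the Situation/Task/Action/Result pattern described above, and the example values are invented rather than drawn from any dataset.

```python
# Hypothetical shape of a single STaR case-study record (not the project's schema).
from dataclasses import dataclass, asdict

@dataclass
class CaseStudy:
    situation: str   # business or diplomatic context, e.g. seeded from real-world data
    task: str        # the challenge to be addressed
    action: str      # what was done
    result: str      # the outcome achieved

record = CaseStudy(
    situation="A regional office faces a supply-chain disruption.",
    task="Restore delivery times without raising costs.",
    action="Re-routed shipments through an alternate hub.",
    result="Delivery times recovered within two weeks.",
)

print(asdict(record))
```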
For more details, see the OVERVIEW.
## Features

- Structured Data Management: Centralized data storage using SQLite.
- Enhanced Logging: Integrated `rich` for robust console output and error handling.
- Automated Download/Upload: Fetch, download, upload, and create datasets using the Hugging Face CLI.
- Modular Codebase: Organized scripts and modules for better readability and maintenance.
- Environment Configuration: Flexible management of directories and environment variables.
- Database Operations: Modularized SQL database operations with a finite state machine for session management.
- Parquet File Processing: Tooling for working with parquet files and ingesting them into database tables (a minimal ingestion sketch follows this list).
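As a rough illustration of the storage and parquet features above, the sketch below reads a parquet file with pandas and appends its rows to a SQLite table. It is not the project's actual ingestion pipeline, and the file, database, and table names are hypothetical.

```python
# Illustrative parquet-to-SQLite ingestion sketch (not the project's pipeline).
import sqlite3
import pandas as pd

def ingest_parquet(parquet_path: str, db_path: str, table: str) -> int:
    """Read a parquet file and append its rows to a SQLite table."""
    frame = pd.read_parquet(parquet_path)  # requires pyarrow or fastparquet
    with sqlite3.connect(db_path) as connection:
        frame.to_sql(table, connection, if_exists="append", index=False)
    return len(frame)

if __name__ == "__main__":
    # Hypothetical paths and table name, for illustration only.
    rows = ingest_parquet("data/raw/cables.parquet", "data/thinking-dataset.db", "cables")
    print(f"Ingested {rows} rows")
```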
## Installation

### Prerequisites

- Python 3.10 or later
- Git
- A cloud-based account (e.g., OpenAI) or a GPU (RTX 3090 or greater) for processing, or both
### Setup

- Clone the repository:

  ```bash
  git clone https://github.com/MultiTonic/thinking-dataset.git
  cd thinking-dataset
  ```

- Install the `uv` package management tool using `pip`:

  ```bash
  pip install uv
  ```

- Install the required packages using `uv` and `thinking-dataset.toml`:

  ```bash
  uv install -f thinking-dataset.toml
  ```

- Set up environment variables:

  Copy the `.env.sample` file to `.env` and change the values as needed:

  ```bash
  cp .env.sample .env
  ```

  Update the `.env` file with the following variables (a sketch showing how they can be loaded at runtime follows these steps):

  ```
  HF_TOKEN=your_huggingface_token
  HF_DATASET=your_dataset_name
  HF_ORGANIZATION=your_organization_name
  ROOT_DIR=your_root_directory
  DATA_DIR=your_data_directory
  ```
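Once `.env` is populated, its values can be read at runtime. The sketch below assumes the python-dotenv package and is illustrative only; it is not taken from this project's codebase.

```python
# Illustrative .env loading sketch, assuming the python-dotenv package.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

HF_TOKEN = os.getenv("HF_TOKEN")
HF_DATASET = os.getenv("HF_DATASET")
ROOT_DIR = os.getenv("ROOT_DIR", ".")

print(f"Using dataset {HF_DATASET} under {ROOT_DIR}")
```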
## Usage

To download all parquet files from the Cablegate dataset using the Hugging Face CLI:

```bash
thinking-dataset download
```

To execute all CLI commands for the project:

```bash
python assets/scripts/run_cli_commands.py
```

For detailed usage instructions, please refer to USAGE in the docs directory.
## Project Structure

The following directory structure provides an overview of how the project is organized:
```
thinking-dataset/
├── config/                  # Configuration files
├── assets/                  # Assets directory for external resources
│   ├── prompts/             # Prompt templates
│   ├── scripts/             # Utility scripts
│   └── resources/           # External project data
├── data/                    # Data directory
├── docs/                    # Project documentation
├── reports/                 # Generated reports
├── tests/                   # Test files
├── thinking_dataset/        # Core project code
│   ├── commands/            # CLI command implementations
│   ├── connectors/          # Data connectors
│   ├── config/              # Configuration loaders and management
│   ├── datasets/            # Dataset definitions and processing
│   │   └── operations/      # Data operations and transformations
│   ├── db/                  # Database support
│   │   └── operations/      # Database operations and transactions
│   ├── io/                  # File I/O operations
│   ├── pipeworks/           # Pipelines and pipes for data processing
│   │   ├── pipelines/       # Pipeline management and control
│   │   └── pipes/           # Pipes used for data frame processing
│   ├── tonics/              # Data utility functions and helpers
│   ├── utilities/           # General-purpose utility helpers
│   └── main.py              # Main execution file
├── setup.py                 # Project setup
└── .env                     # Private environment variables file
```
## Contributing

Contributions are welcome! Please fork the repository and create a pull request with your changes. Ensure your code adheres to the project's coding standards and includes appropriate tests. See CONTRIBUTING for detailed guidelines.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements

- Kara Rawson - Lead Engineer
- Joseph Pollack - Creator & Business Leader
- MultiTonic Team - Support and Collaboration
- Hugging Face - Providing robust tools and infrastructure for dataset management
## Citations

Please use the following citation format for referencing this project:

```bibtex
@misc{thinking-dataset,
  author       = {Kara Rawson and Joseph Pollack and others},
  title        = {Thinking-Dataset: Leveraging Real-World Data for Strategic Business Insights and STaR Case Study Generation},
  year         = {2025},
  howpublished = {\url{https://github.com/MultiTonic/thinking-dataset}},
  note         = {Accessed: 2025-01-05}
}
```