- Overview
- Features
- Installation
- Usage
- Project Structure
- Contributing
- License
- Acknowledgements
- Citations
## Overview

The Thinking Dataset Project creates a dataset to support a variety of data tasks and analyses. The project uses advanced technologies and tools to manage, process, and analyze data efficiently. Our Thinking Dataset technology relies on two key components: STaR self-teaching and STaR Case Studies.
STaR self-teaching is a method in which the dataset acts as a model and uses other models (a Mixture of Models, or MoM) to generate new datasets. This process helps the model improve its evaluation scores and produce synthetic datasets that are more accurate and effective than those created by humans.
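For intuition only, the following is a minimal, hypothetical sketch of such a generate-evaluate-keep loop. None of the function or variable names below come from this repository; they simply stand in for an ensemble of generator models, an evaluator, and the retained synthetic records.

```python
# Hypothetical sketch of a STaR-style self-teaching loop (not project code).
import random
from typing import Callable, List

def generate_candidates(seed: str, generators: List[Callable[[str], str]]) -> List[str]:
    """Ask each model in the mixture (MoM) to draft a synthetic record from a seed."""
    return [generate(seed) for generate in generators]

def evaluate(record: str) -> float:
    """Placeholder scorer; a real evaluator would measure accuracy and usefulness."""
    return random.random()

def self_teach(seeds: List[str], generators, threshold: float = 0.8) -> List[str]:
    """Keep only candidates that score above the threshold; the kept set becomes new training data."""
    kept = []
    for seed in seeds:
        for candidate in generate_candidates(seed, generators):
            if evaluate(candidate) >= threshold:
                kept.append(candidate)
    return kept

if __name__ == "__main__":
    toy_generators = [lambda s: f"model A draft of: {s}",
                      lambda s: f"model B draft of: {s}"]
    print(self_teach(["example seed document"], toy_generators))
```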
STaR Case Studies (Situation, Task, Action, and Result) are structured narratives used to illustrate how specific business challenges were addressed and the outcomes achieved. These case studies apply to our various datasets like Cablegate, which provide real-world seed data for generating comprehensive business insights.
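To make that structure concrete, a single case-study record could be represented as follows. This is an illustrative sketch only; the field names mirror the Situation/Task/Action/Result pattern described above, and the example values are invented rather than drawn from any dataset.

```python
# Hypothetical shape of a single STaR case-study record (not the project's schema).
from dataclasses import dataclass, asdict

@dataclass
class CaseStudy:
    situation: str   # business or diplomatic context, e.g. seeded from real-world data
    task: str        # the challenge to be addressed
    action: str      # what was done
    result: str      # the outcome achieved

record = CaseStudy(
    situation="A regional office faces a supply-chain disruption.",
    task="Restore delivery times without raising costs.",
    action="Re-routed shipments through an alternate hub.",
    result="Delivery times recovered within two weeks.",
)

print(asdict(record))
```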
For more details, see the OVERVIEW.
## Features

- Structured Data Management: Centralized data storage using SQLite.
- Enhanced Logging: Integrated `rich` for robust console output and error handling.
- Automated Download/Upload: Fetch, download, upload, and create datasets using the Hugging Face CLI.
- Modular Codebase: Organized scripts and modules for better readability and maintenance.
- Environment Configuration: Flexible management of directories and environment variables.
- Database Operations: Modularized SQL database operations with a finite state machine for session management.
- Parquet File Processing: Tooling for working with parquet files and ingesting them into database tables (a minimal ingestion sketch follows this list).
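As a rough illustration of the storage and parquet features above, the sketch below reads a parquet file with pandas and appends its rows to a SQLite table. It is not the project's actual ingestion pipeline, and the file, database, and table names are hypothetical.

```python
# Illustrative parquet-to-SQLite ingestion sketch (not the project's pipeline).
import sqlite3
import pandas as pd

def ingest_parquet(parquet_path: str, db_path: str, table: str) -> int:
    """Read a parquet file and append its rows to a SQLite table."""
    frame = pd.read_parquet(parquet_path)  # requires pyarrow or fastparquet
    with sqlite3.connect(db_path) as connection:
        frame.to_sql(table, connection, if_exists="append", index=False)
    return len(frame)

if __name__ == "__main__":
    # Hypothetical paths and table name, for illustration only.
    rows = ingest_parquet("data/raw/cables.parquet", "data/thinking-dataset.db", "cables")
    print(f"Ingested {rows} rows")
```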
## Installation

### Prerequisites

- Python 3.10 or later
- Git
- A cloud-based account (e.g., OpenAI) or a GPU (RTX 3090 or greater) for processing, or both
### Setup

- Clone the repository:

  ```bash
  git clone https://github.com/MultiTonic/thinking-dataset.git
  cd thinking-dataset
  ```

- Install the `uv` package management tool using `pip`:

  ```bash
  pip install uv
  ```

- Install the required packages using `uv` and `thinking-dataset.toml`:

  ```bash
  uv install -f thinking-dataset.toml
  ```

- Set up environment variables:

  Copy the `.env.sample` file to `.env` and change the values as needed:

  ```bash
  cp .env.sample .env
  ```

  Update the `.env` file with the following variables (a sketch showing how they can be loaded at runtime follows these steps):

  ```
  HF_TOKEN=your_huggingface_token
  HF_DATASET=your_dataset_name
  HF_ORGANIZATION=your_organization_name
  ROOT_DIR=your_root_directory
  DATA_DIR=your_data_directory
  ```
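Once `.env` is populated, its values can be read at runtime. The sketch below assumes the python-dotenv package and is illustrative only; it is not taken from this project's codebase.

```python
# Illustrative .env loading sketch, assuming the python-dotenv package.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

HF_TOKEN = os.getenv("HF_TOKEN")
HF_DATASET = os.getenv("HF_DATASET")
ROOT_DIR = os.getenv("ROOT_DIR", ".")

print(f"Using dataset {HF_DATASET} under {ROOT_DIR}")
```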
## Usage

To download all parquet files from the Cablegate dataset using the Hugging Face CLI:

```bash
thinking-dataset download
```

To execute all CLI commands for the project:

```bash
python assets/scripts/run_cli_commands.py
```

For detailed usage instructions, please refer to USAGE in the docs directory.
## Project Structure

The following directory structure provides an overview of how the project is organized:
```
thinking-dataset/
├── config/                  # Configuration files
├── assets/                  # Assets directory for external resources
│   ├── prompts/             # Prompt templates
│   ├── scripts/             # Utility scripts
│   └── resources/           # External project data
├── data/                    # Data directory
├── docs/                    # Project documentation
├── reports/                 # Generated reports
├── tests/                   # Test files
├── thinking_dataset/        # Core project code
│   ├── commands/            # CLI command implementations
│   ├── connectors/          # Data connectors
│   ├── config/              # Configuration loaders and management
│   ├── datasets/            # Dataset definitions and processing
│   │   └── operations/      # Data operations and transformations
│   ├── db/                  # Database support
│   │   └── operations/      # Database operations and transactions
│   ├── io/                  # File I/O operations
│   ├── pipeworks/           # Pipelines and pipes for data processing
│   │   ├── pipelines/       # Pipeline management and control
│   │   └── pipes/           # Pipes used for data frame processing
│   ├── tonics/              # Data utility functions and helpers
│   ├── utilities/           # General-purpose utility helpers
│   └── main.py              # Main execution file
├── setup.py                 # Project setup
└── .env                     # Private environment variables file
```
## Contributing

Contributions are welcome! Please fork the repository and create a pull request with your changes. Ensure your code adheres to the project's coding standards and includes appropriate tests. See CONTRIBUTING for detailed guidelines.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements

- Kara Rawson - Lead Engineer
- Joseph Pollack - Creator & Business Leader
- MultiTonic Team - Support and Collaboration
- Hugging Face - Providing robust tools and infrastructure for dataset management
## Citations

Please use the following citation format for referencing this project:

```bibtex
@misc{thinking-dataset,
  author       = {Kara Rawson and Joseph Pollack and others},
  title        = {Thinking-Dataset: Leveraging Real-World Data for Strategic Business Insights and STaR Case Study Generation},
  year         = {2025},
  howpublished = {\url{https://github.com/MultiTonic/thinking-dataset}},
  note         = {Accessed: 2025-01-05}
}
```