Hugging Face Datasets Documentation

1. StackOverflow Q&A Dataset

Description

This dataset consists of Q&A data extracted from StackOverflow, related to different projects in the CNCF (Cloud Native Computing Foundation) landscape. It includes the following three columns:

Question: The question asked on StackOverflow.
Answer: The corresponding answer to the question.
Tag: The name of the project to which the question and answer are related.

The data was collected using the Stack Exchange API to extract Q&A pairs from StackOverflow for different projects in the CNCF landscape. Project-specific tags were used to query StackOverflow for questions and their accepted answers, which were then categorized under the relevant project name (tag).
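The pairing step described above can be sketched as follows. This is a minimal illustration, not the project's actual collection script: it assumes the questions and answers for one project tag have already been fetched as dictionaries, and the keys `accepted_answer_id`, `answer_id`, and `body` as well as the function name `pair_accepted_answers` are assumptions made for this example.

```python
from typing import Iterable


def pair_accepted_answers(
    questions: Iterable[dict],
    answers: Iterable[dict],
    project_tag: str,
) -> list[dict]:
    """Join questions to their accepted answers and label them with a project tag."""
    # Index answers by id so each question can look up its accepted answer.
    answers_by_id = {a["answer_id"]: a for a in answers}
    rows = []
    for q in questions:
        accepted = answers_by_id.get(q.get("accepted_answer_id"))
        if accepted is None:
            # Skip questions without an accepted answer (or one that was not fetched).
            continue
        rows.append(
            {"Question": q["body"], "Answer": accepted["body"], "Tag": project_tag}
        )
    return rows


# Hypothetical sample records, not real StackOverflow data.
sample_questions = [
    {"question_id": 1, "accepted_answer_id": 10, "body": "How do I list pods?"},
    {"question_id": 2, "body": "A question nobody answered"},
]
sample_answers = [{"answer_id": 10, "body": "Run kubectl get pods."}]

rows = pair_accepted_answers(sample_questions, sample_answers, "kubernetes")
print(rows)
```

Only the first sample question produces a row, because the second has no accepted answer. The output keys match the three columns listed above.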

License

This dataset is available under the MIT license.

Links

Hugging Face Dataset Page

2. CNCF Raw Dataset

Description: This dataset, named cncf-raw-data-for-llm-training, consists of markdown (MD) and PDF content extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. The data was collected by fetching MD and PDF files from different CNCF project repositories and converting them into JSON format. This dataset is intended as raw data for training large language models (LLMs).

The dataset includes the following two columns:
1. Tag: A JSON object that categorizes the file. For example:
```
{
  "category": "Runtime",
  "file_name": "log-attach-design.md",
  "project_name": "rkt",
  "subcategory": "Container Runtime"
}
```
  - Category: A broad classification representing the main functional area of the project.
  - File Name: The name of the file as it appears in the original repository.
  - Project Name: The name of the specific project to which the file belongs.
  - Subcategory: A more specific classification within the main category.
2. Content: The actual content of the file (MD or PDF) in text format.
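As a small sketch of how the Tag column might be read, the snippet below decodes the example object shown above. Whether the Tag is stored as a JSON string or as an already-decoded object is an assumption, so the hypothetical helper `parse_tag` handles both cases; the Content value is a made-up placeholder.

```python
import json

# One row as it might appear in the dataset; the Content value is a placeholder.
row = {
    "Tag": '{"category": "Runtime", "file_name": "log-attach-design.md", '
           '"project_name": "rkt", "subcategory": "Container Runtime"}',
    "Content": "placeholder text",
}


def parse_tag(tag):
    """Return the Tag as a dict, decoding it first if it is stored as a JSON string."""
    return json.loads(tag) if isinstance(tag, str) else dict(tag)


tag = parse_tag(row["Tag"])
print(tag["project_name"], "/", tag["category"], "/", tag["subcategory"])
# prints: rkt / Runtime / Container Runtime
```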

License

This dataset is available under the MIT license.

Links

Hugging Face Dataset Page

3. CNCF QA Dataset for LLM Tuning

Description

This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format suitable for fine-tuning an LLM.

The dataset includes the following six columns:

Question: The question derived from the content of the files.
Answer: The corresponding answer to the question.
Project: The name of the project from which the data was sourced.
File Name: The name of the file from which the data was extracted.
Category: A broad classification representing the main functional area of the project (e.g., Runtime, Orchestration, Storage, Networking).
Subcategory: A more specific classification within the main category (e.g., Container Runtime, Service Mesh, Monitoring).

How It Is Generated

The dataset was generated using a Python script that extracts content from PDF and MD files in CNCF project repositories. The script processes this content with a language model to create question-answer pairs. Each piece of information is transformed into a QA format and stored in a structured CSV file with relevant metadata such as project name, file name, category, and subcategory.
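The overall shape of such a pipeline might look like the sketch below. It is not the actual generation script: `generate_qa_pairs` is a hypothetical stand-in for the language-model step and simply returns one fixed question per document, and the input dictionary keys are assumptions. Only the six output column names are taken from the list above.

```python
import csv
import io

FIELDS = ["Question", "Answer", "Project", "File Name", "Category", "Subcategory"]


def generate_qa_pairs(text: str) -> list[tuple[str, str]]:
    """Placeholder for the language-model step: one trivial pair per non-empty text."""
    text = text.strip()
    if not text:
        return []
    return [("What does this file describe?", text)]


def write_qa_csv(documents, out) -> None:
    """Write one CSV row per generated QA pair, carrying over the file metadata."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for doc in documents:
        for question, answer in generate_qa_pairs(doc["content"]):
            writer.writerow({
                "Question": question,
                "Answer": answer,
                "Project": doc["project_name"],
                "File Name": doc["file_name"],
                "Category": doc["category"],
                "Subcategory": doc["subcategory"],
            })


# Hypothetical input document; the content string is illustrative only.
docs = [{
    "content": "Design notes for attaching to container logs.",
    "project_name": "rkt",
    "file_name": "log-attach-design.md",
    "category": "Runtime",
    "subcategory": "Container Runtime",
}]

buffer = io.StringIO()
write_qa_csv(docs, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file handle instead would produce the CSV file on disk.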

License

This dataset is available under the MIT license.

Links

Hugging Face Dataset Page

4. Merged_QAs Dataset

Description

The Merged_QAs dataset combines Q&A pairs from two primary sources: StackOverflow Q&A related to various projects within the CNCF (Cloud Native Computing Foundation) landscape and the cncf-qa-dataset-for-llm-tuning designed for fine-tuning large language models (LLMs).

StackOverflow Q&A Dataset

This dataset includes questions and their corresponding answers sourced from StackOverflow discussions pertaining to CNCF projects. Each entry is categorized under the respective project.

CNCF QA Dataset for LLM Tuning

Formatted in a question-answer (QA) style, this dataset originates from PDF and markdown (MD) files extracted from CNCF project repositories. It includes questions, answers, project names, file names, and classification details (category and subcategory).

Usage in Model Fine-Tuning

The Merged_QAs dataset is utilized for fine-tuning large language models (LLMs), incorporating diverse Q&A pairs from both StackOverflow and CNCF datasets. This integration enriches the dataset with a wide range of technical questions and expert answers relevant to cloud-native computing.
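One way the two sources could be combined is sketched below. The actual column layout of Merged_QAs is not stated here, so the output schema is an assumption: the StackOverflow Tag column is mapped to Project, and a hypothetical `Source` column records where each row came from. The sample rows are invented for illustration.

```python
def merge_qa(stackoverflow_rows, cncf_rows):
    """Combine both sources into rows with Question, Answer, Project, and Source keys."""
    merged = []
    for row in stackoverflow_rows:
        merged.append({
            "Question": row["Question"],
            "Answer": row["Answer"],
            "Project": row["Tag"],  # the StackOverflow Tag column holds the project name
            "Source": "stackoverflow",  # assumed provenance label
        })
    for row in cncf_rows:
        merged.append({
            "Question": row["Question"],
            "Answer": row["Answer"],
            "Project": row["Project"],
            "Source": "cncf-docs",  # assumed provenance label
        })
    return merged


# Hypothetical sample rows from each source.
so_rows = [{"Question": "How do I list pods?", "Answer": "Run kubectl get pods.", "Tag": "kubernetes"}]
doc_rows = [{
    "Question": "What does this file describe?",
    "Answer": "Design notes for attaching to container logs.",
    "Project": "rkt",
    "File Name": "log-attach-design.md",
    "Category": "Runtime",
    "Subcategory": "Container Runtime",
}]

merged = merge_qa(so_rows, doc_rows)
for item in merged:
    print(item["Source"], item["Project"], item["Question"])
```

In this sketch the file name, category, and subcategory are dropped so that rows from both sources share one schema.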

License

This dataset is available under the MIT license.

Links

Hugging Face Dataset Page

5. Q&A Dataset for Benchmarking DeepCNCF

This is a question-and-answer dataset of multiple-choice questions created for benchmarking our DeepCNCF LLM. Since there is no established LLM benchmark specific to CNCF projects, we use the model's performance on these questions as our measure. The questions were gathered from openly available online courses about CNCF projects. Because they were written by humans to test students' understanding of these projects, they should be a reasonable measure of our model's knowledge of CNCF topics.

Description

It includes the following two columns:

Question: A multiple-choice question.
Answer: The corresponding answer to the question. A question may have several correct answers (for example: a,b,d).
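Because an answer can contain several letters, a scorer has to compare sets of options, not raw strings. The sketch below shows one possible approach, exact-match accuracy over option sets. The scoring rule and the function names are assumptions for illustration, not the benchmark's documented procedure.

```python
def parse_choices(answer: str) -> frozenset[str]:
    """Turn an answer string such as "a,b,d" into a set of lowercase option letters."""
    return frozenset(
        part.strip().lower() for part in answer.split(",") if part.strip()
    )


def exact_match_accuracy(predictions, references) -> float:
    """Fraction of questions where the predicted option set equals the reference set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        parse_choices(p) == parse_choices(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical model outputs and reference answers.
predicted = ["a,b,d", "c", "b", "a"]
expected = ["d, a, b", "c", "a,b", "a"]
score = exact_match_accuracy(predicted, expected)
print(score)  # 0.75
```

Comparing sets makes the score independent of letter order and spacing, so "d, a, b" matches "a,b,d", while a partially correct answer such as "b" for "a,b" counts as wrong.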

License

This dataset is available under the MIT license.

Links

Hugging Face Dataset Page
