Hugging Face Datasets Documentation
This dataset consists of Q&A data extracted from StackOverflow and related to different projects in the CNCF (Cloud Native Computing Foundation) landscape. It includes the following three columns:
- Question: The question asked on StackOverflow.
- Answer: The corresponding answer to the question.
- Tag: The name of the project to which the question and answer are related.
The data was collected using the Stack Exchange API to extract Q&A pairs from StackOverflow for different projects in the CNCF (Cloud Native Computing Foundation) landscape. Project-specific tags were used to retrieve questions and their accepted answers, which were then categorized under the relevant project name (tag).
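The pairing step described above can be sketched as follows. This is a minimal illustration, not the project's actual collection script: it assumes the questions and answers have already been fetched, and the function name `pair_accepted_answers` and the record keys (`accepted_answer_id`, `answer_id`, `body`) are assumptions chosen for the example.

```python
from typing import Dict, Iterable, List


def pair_accepted_answers(
    questions: Iterable[dict],
    answers: Iterable[dict],
    project: str,
) -> List[Dict[str, str]]:
    """Pair each question with its accepted answer and label it with the project tag."""
    # Index answers by id so each lookup is a dictionary access.
    answers_by_id = {a["answer_id"]: a for a in answers}
    rows = []
    for q in questions:
        answer = answers_by_id.get(q.get("accepted_answer_id"))
        if answer is None:
            # Skip questions without an accepted answer (or one that was not fetched).
            continue
        rows.append({"Question": q["body"], "Answer": answer["body"], "Tag": project})
    return rows


# Illustrative records only; these are not rows from the dataset.
sample_questions = [
    {"question_id": 1, "accepted_answer_id": 10, "body": "How do I list pods?"},
    {"question_id": 2, "body": "Unanswered question"},
]
sample_answers = [{"answer_id": 10, "body": "Run kubectl get pods."}]
rows = pair_accepted_answers(sample_questions, sample_answers, "kubernetes")
```

Questions without an accepted answer are dropped here, which matches the three-column layout: every row has a question, an answer, and a tag.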
This dataset is available under the MIT license.
Description: This dataset, named cncf-raw-data-for-llm-training, consists of Markdown (MD) and PDF content extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. The data was collected by fetching MD and PDF files from different CNCF project repositories and converting them into JSON format. This dataset is intended as raw data for training large language models (LLMs).
The dataset includes the following two columns:
- Tag: A JSON object that categorizes the file. For example:
  { "category": "Runtime", "file_name": "log-attach-design.md", "project_name": "rkt", "subcategory": "Container Runtime" }
  - Category: A broad classification representing the main functional area of the project.
  - File Name: The name of the file as it appears in the original repository.
  - Project Name: The name of the specific project to which the file belongs.
  - Subcategory: A more specific classification within the main category.
- Content: The actual content of the file (MD or PDF) in text format.
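A short sketch of reading the Tag column, assuming it is stored as a JSON string with the four keys shown in the example above. The helper name `parse_tag` and the validation step are illustrative assumptions, not part of the dataset tooling.

```python
import json

# Keys taken from the example Tag object in this document.
REQUIRED_TAG_KEYS = {"category", "file_name", "project_name", "subcategory"}


def parse_tag(tag_json: str) -> dict:
    """Decode a Tag value and check that all four expected keys are present."""
    tag = json.loads(tag_json)
    missing = REQUIRED_TAG_KEYS - set(tag)
    if missing:
        raise ValueError(f"Tag is missing keys: {sorted(missing)}")
    return tag


example_tag = (
    '{ "category": "Runtime", "file_name": "log-attach-design.md", '
    '"project_name": "rkt", "subcategory": "Container Runtime" }'
)
tag = parse_tag(example_tag)
```

After parsing, `tag["project_name"]` and `tag["subcategory"]` can be used to filter or group the raw files by project.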
This dataset is available under the MIT license.
This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and Markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format suitable for fine-tuning an LLM.
The dataset includes the following six columns:
- Question: The question derived from the content of the files.
- Answer: The corresponding answer to the question.
- Project: The name of the project from which the data was sourced.
- File Name: The name of the file from which the data was extracted.
- Category: A broad classification representing the main functional area of the project (e.g., Runtime, Orchestration, Storage, Networking).
- Subcategory: A more specific classification within the main category (e.g., Container Runtime, Service Mesh, Monitoring).
The dataset was generated using a Python script that extracts content from PDF and MD files in CNCF project repositories. The script processes this content with a language model to create question-answer pairs. Each piece of information is transformed into a QA format and stored in a structured CSV file with relevant metadata such as project name, file name, category, and subcategory.
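The storage step of this pipeline could look roughly like the sketch below. It is not the original script: `generate_qa_pairs` is a hypothetical placeholder for the language-model call, the input record keys are assumed, and only the six column names come from this document.

```python
import csv
import io

# Column names as listed in this document.
COLUMNS = ["Question", "Answer", "Project", "File Name", "Category", "Subcategory"]


def generate_qa_pairs(text: str) -> list:
    """Hypothetical placeholder for the language-model step; returns (question, answer) tuples."""
    return [("What does this passage describe?", text.strip())]


def write_qa_csv(documents, out_file) -> int:
    """Write one CSV row per generated QA pair and return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for doc in documents:
        for question, answer in generate_qa_pairs(doc["content"]):
            writer.writerow({
                "Question": question,
                "Answer": answer,
                "Project": doc["project_name"],
                "File Name": doc["file_name"],
                "Category": doc["category"],
                "Subcategory": doc["subcategory"],
            })
            count += 1
    return count


# Illustrative input only; the content string is made up for the example.
docs = [{
    "content": "Example passage about attaching to container logs.",
    "project_name": "rkt",
    "file_name": "log-attach-design.md",
    "category": "Runtime",
    "subcategory": "Container Runtime",
}]
buffer = io.StringIO()
written = write_qa_csv(docs, buffer)
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with `newline=""` would produce the CSV on disk instead.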
This dataset is available under the MIT license.
The Merged_QAs dataset combines Q&A pairs from two primary sources: StackOverflow Q&A related to various projects within the CNCF (Cloud Native Computing Foundation) landscape, and the cncf-qa-dataset-for-llm-tuning dataset designed for fine-tuning large language models (LLMs).
The StackOverflow source includes questions and their corresponding answers from StackOverflow discussions pertaining to CNCF projects. Each entry is categorized under the respective project.
The cncf-qa-dataset-for-llm-tuning source is formatted in a question-answer (QA) style and originates from PDF and Markdown (MD) files extracted from CNCF project repositories. It includes questions, answers, project names, file names, and classification details (category and subcategory).
The Merged_QAs dataset is used for fine-tuning large language models (LLMs) and incorporates diverse Q&A pairs from both the StackOverflow and CNCF datasets. This integration enriches the dataset with a wide range of technical questions and expert answers relevant to cloud-native computing.
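One way such a merge could be done is sketched below. The merged schema is not documented here, so the output columns (including the `Source` column) and the function name `merge_qa_sources` are assumptions made for illustration.

```python
def merge_qa_sources(stackoverflow_rows, cncf_rows):
    """Normalize both sources to a shared set of columns and concatenate them."""
    merged = []
    for row in stackoverflow_rows:
        merged.append({
            "Question": row["Question"],
            "Answer": row["Answer"],
            # The StackOverflow data stores the project name in its Tag column.
            "Project": row["Tag"],
            "Source": "stackoverflow",  # assumed provenance column
        })
    for row in cncf_rows:
        merged.append({
            "Question": row["Question"],
            "Answer": row["Answer"],
            "Project": row["Project"],
            "Source": "cncf-qa",  # assumed provenance column
        })
    return merged


# Illustrative rows only; these are not taken from either dataset.
so_rows = [{"Question": "Q1", "Answer": "A1", "Tag": "kubernetes"}]
qa_rows = [{"Question": "Q2", "Answer": "A2", "Project": "rkt",
            "File Name": "log-attach-design.md", "Category": "Runtime",
            "Subcategory": "Container Runtime"}]
merged = merge_qa_sources(so_rows, qa_rows)
```

Keeping a provenance column makes it possible to filter or rebalance the two sources later; columns that exist in only one source (file name, category, subcategory) are dropped in this sketch.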
This dataset is available under the MIT license.
This is a question-and-answer dataset of multiple-choice questions created for benchmarking our DeepCNCF LLM. Since there is no reliable LLM benchmark specific to CNCF projects, we decided to measure our model by its performance on these questions. The dataset was gathered from openly available online courses about CNCF projects. The questions were written by humans to test students' understanding of these projects, so they should be a good measure of our model's knowledge of CNCF topics.
It includes the following two columns:
- Question: A multiple-choice question.
- Answer: The corresponding answer to the question. A question may have several correct answers (for example: a,b,d).
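A minimal sketch of scoring model output against the Answer column, assuming answers are comma-separated option letters as in the example above. The exact-match rule (a prediction counts only if it selects exactly the correct set of options) is an assumption; the benchmark's actual scoring method is not stated here.

```python
def parse_choices(answer: str) -> frozenset:
    """Turn a string such as "a,b,d" into a set of lowercase option letters."""
    return frozenset(part.strip().lower() for part in answer.split(",") if part.strip())


def is_correct(predicted: str, expected: str) -> bool:
    """Exact match on the set of selected options; order and spacing are ignored."""
    return parse_choices(predicted) == parse_choices(expected)


def accuracy(predictions, references) -> float:
    """Fraction of questions answered with exactly the expected set of options."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


# Two of the three illustrative predictions match their reference sets.
score = accuracy(["a,b,d", "c", "b, a"], ["a,b,d", "b", "a,b"])
```

Comparing sets rather than raw strings avoids penalizing a model for listing the correct options in a different order.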
This dataset is available under the MIT license.