Define the format of input for all metrics #46

Closed
faneshion opened this issue Feb 22, 2024 · 1 comment

Comments

@faneshion
Collaborator

Abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. For now, search_query and gt_search_query are left as future work.

Examples:

        >>> import pandas as pd
        >>> question = [
        ...     ['qid1', 'question 1'],
        ...     ['qid2', 'question 2']
        ... ]
        >>> answer = [
        ...     ['aid1', 'answer 1'],
        ...     ['aid2', 'answer 2']
        ... ]
        >>> question = pd.DataFrame(question)
        >>> answer = pd.DataFrame(answer)
        >>> # gt_answer, contexts, and gt_contexts are assumed to be built the same way.
        >>> dp = DataPack(
        ...     question=question,
        ...     answer=answer,
        ...     gt_answer=gt_answer,
        ...     contexts=contexts,
        ...     gt_contexts=gt_contexts,
        ... )
        >>> len(dp)
        2

This way, we can add many in-place functions for processing a DataPack. Basic usage looks like this:

        >>> import rageval as rl
        >>> data_pack = rl.datasets.toy.load_data()
        >>> data_pack.apply_on_question(preprocess_func)
        >>> data_pack.drop_label(inplace=True)
        >>> data_pack.has_label
        False
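
To make the proposal concrete, here is a minimal sketch of what such a container might look like. It is not the rageval implementation: the class body is hypothetical, plain lists stand in for the DataFrames used above so that the sketch has no dependencies, and the meaning of has_label (any ground-truth part present) is an assumption.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List, Optional


@dataclass
class DataPack:
    """Hypothetical container; field names follow the five parts listed above."""

    question: List[List[str]]                      # rows of [qid, text]
    answer: List[List[str]]                        # rows of [aid, text]
    gt_answer: Optional[List[List[str]]] = None    # ground-truth answers
    contexts: List[List[str]] = field(default_factory=list)
    gt_contexts: Optional[List[List[str]]] = None  # ground-truth contexts

    def __len__(self) -> int:
        # One row per question.
        return len(self.question)

    @property
    def has_label(self) -> bool:
        # Assumption: "label" means that any ground-truth part is present.
        return self.gt_answer is not None or self.gt_contexts is not None

    def apply_on_question(self, func: Callable[[str], str]) -> None:
        # In place: rewrite the text of every question and keep its id.
        self.question = [[qid, func(text)] for qid, text in self.question]

    def drop_label(self, inplace: bool = True) -> Optional["DataPack"]:
        # Clear the ground-truth parts on this object, or on a shallow copy.
        target = self if inplace else replace(self)
        target.gt_answer = None
        target.gt_contexts = None
        return None if inplace else target


dp = DataPack(
    question=[["qid1", "Question 1"], ["qid2", "Question 2"]],
    answer=[["aid1", "answer 1"], ["aid2", "answer 2"]],
    gt_answer=[["aid1", "gold 1"], ["aid2", "gold 2"]],
)
dp.apply_on_question(str.lower)  # any str -> str function works here
dp.drop_label(inplace=True)
```

After these calls, len(dp) is still 2, the question texts are lowercased, and dp.has_label is False.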
@faneshion faneshion added enhancement New feature or request good first issue Good for newcomers labels Feb 22, 2024
@faneshion faneshion added this to the Version 0.1 milestone Feb 22, 2024
@faneshion
Collaborator Author

faneshion commented Feb 22, 2024

The Dataset class from the Hugging Face datasets library is a good replacement for DataPack.

URL: https://github.com/huggingface/datasets

For now, each metric is required to implement the API: def compute(self, dataset: Dataset) -> (score, Dataset). The column names of the dataset object should be drawn from ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]. An example dataset looks like this:

>>> from datasets import Dataset
>>> data = {
...     "questions": ["what is snoopy", "where is beijing"],
...     "contexts": ["snoopy one", "snoopy two"],
...     "gt_contexts": [{"id": 1, "text": "snoopy 1", "label": 1}, {"id": 2, "text": "snoopy 2", "label": 2}],
...     "answers": ["a1", "a2"],
...     "gt_answers": ["a11", "aa2"]
... }
>>> dataset = Dataset.from_dict(data)
>>> len(dataset)
2
>>> dataset
Dataset({
    features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
    num_rows: 2
})
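
As a rough illustration of the compute contract described above, the sketch below shows one way a metric could satisfy it. The ExactMatch class, its name attribute, and the choice to return per-row scores as an extra column are all assumptions made for illustration, and a plain dict of columns stands in for datasets.Dataset so that the example has no dependencies.

```python
from typing import Dict, Tuple

# Column names taken from the comment above.
ALLOWED_COLUMNS = ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]

# Stand-in for datasets.Dataset: a mapping from column name to a list of rows.
Columns = Dict[str, list]


class ExactMatch:
    """Hypothetical metric: share of answers equal to their ground truth."""

    name = "exact_match"
    required = ("answers", "gt_answers")  # both are in ALLOWED_COLUMNS

    def compute(self, dataset: Columns) -> Tuple[float, Columns]:
        missing = [c for c in self.required if c not in dataset]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        per_row = [
            float(a == gt)
            for a, gt in zip(dataset["answers"], dataset["gt_answers"])
        ]
        score = sum(per_row) / len(per_row) if per_row else 0.0
        # Assumption: the returned dataset carries the per-row scores.
        return score, {**dataset, self.name: per_row}


score, scored = ExactMatch().compute(
    {"answers": ["a1", "a2"], "gt_answers": ["a1", "aa2"]}
)
```

Here one of the two answers matches its ground truth, so score is 0.5 and scored["exact_match"] is [1.0, 0.0].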

It is worth noting that each column can be extended to hold more complex data structures.
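
For instance, a column might hold a list per row rather than a single string. The sketch below is only one possible layout: the nested values are made up for illustration, and num_rows is a hypothetical helper that checks the one property every layout needs, namely that all columns have the same number of rows.

```python
# Each row of "contexts" and "gt_answers" is now a list instead of a string.
data = {
    "questions": ["what is snoopy", "where is beijing"],
    "contexts": [["snoopy one", "snoopy two"], ["beijing one"]],
    "answers": ["a1", "a2"],
    "gt_answers": [["a11", "a12"], ["aa2"]],
}


def num_rows(columns: dict) -> int:
    """Hypothetical helper: return the shared row count of all columns."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"columns differ in length: {sorted(lengths)}")
    return lengths.pop()


rows = num_rows(data)
```

The inner lists may differ in length from row to row; only the outer length of each column has to agree.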

@faneshion faneshion pinned this issue Feb 22, 2024
@faneshion faneshion changed the title from "Use DataPack to organize dataset." to "Define the format of input for all metrics" Feb 22, 2024
2 participants