Define the format of input for all metrics #46

faneshion · 2024-02-22T01:48:22Z

Abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. For now, search_query and gt_search_query are left as future work.

Examples:

        >>> import pandas as pd
        >>> question = [
        ...     ['qid1', 'question 1'],
        ...     ['qid2', 'question 2']
        ... ]
        >>> answer = [
        ...     ['aid1', 'answer 1'],
        ...     ['aid2', 'answer 2']
        ... ]
        >>> question = pd.DataFrame(question)
        >>> answer = pd.DataFrame(answer)
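        >>> # gt_answer, contexts and gt_contexts are not defined in this example;
        >>> # they are presumably DataFrames built the same way as question and answer.
        >>> dp = DataPack(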
        ...     question=question,
        ...     answer=answer,
        ...     gt_answer=gt_answer,
        ...     contexts=contexts,
        ...     gt_contexts=gt_contexts,
        ... )
        >>> len(dp)
        2

This way, we can add many in-place functions for processing a DataPack. Some basic usage is as follows:

        >>> import rageval as rl
        >>> data_pack = rl.datasets.toy.load_data()
        >>> data_pack.apply_on_question(preprocess_func)
        >>> data_pack.drop_label(inplace=True)
        >>> data_pack.has_label
        False
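
To make the proposal concrete, here is a minimal sketch of what such a DataPack could look like. It is not the rageval implementation: the class body, the reduced set of parts (only question, answer, gt_answer), and the choice to treat gt_answer as the "label" are all assumptions made for illustration.

```python
import pandas as pd


class DataPack:
    """Hypothetical container: one DataFrame per part, rows aligned by position."""

    def __init__(self, question, answer, gt_answer=None):
        self.question = question
        self.answer = answer
        self.gt_answer = gt_answer  # assumed to play the role of the "label"

    def __len__(self):
        # One row per question.
        return len(self.question)

    @property
    def has_label(self):
        return self.gt_answer is not None

    def apply_on_question(self, func):
        # Apply func to the question text (column 1) in place.
        self.question[1] = self.question[1].map(func)

    def drop_label(self, inplace=False):
        # Remove the ground-truth part, either in place or on a copy.
        target = self if inplace else DataPack(
            self.question.copy(), self.answer.copy(), self.gt_answer)
        target.gt_answer = None
        return None if inplace else target


question = pd.DataFrame([["qid1", "Question 1"], ["qid2", "Question 2"]])
answer = pd.DataFrame([["aid1", "answer 1"], ["aid2", "answer 2"]])
gt_answer = pd.DataFrame([["aid1", "gold 1"], ["aid2", "gold 2"]])

dp = DataPack(question=question, answer=answer, gt_answer=gt_answer)
dp.apply_on_question(str.lower)   # preprocess the question text
dp.drop_label(inplace=True)       # discard the ground truth
```

With inplace=False, drop_label in this sketch returns a new DataPack and leaves the original untouched.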


faneshion · 2024-02-22T08:13:48Z

The Dataset class from Hugging Face datasets is a good choice to replace the DataPack.

URL: https://github.com/huggingface/datasets

For now, we require each metric to implement the API: compute(self, dataset: Dataset) -> (score, Dataset). The column names of the dataset object should be drawn from ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]. An example dataset is as follows:

>>> from datasets import Dataset
>>> data = {
...     "questions": ["what is snoopy", "where is beijing"],
...     "contexts": ["snoopy one", "snoopy two"],
...     "gt_contexts": [{"id": 1, "text": "snoopy 1", "label": 1}, {"id": 2, "text": "snoopy 2", "label": 2}],
...     "answers": ["a1", "a2"],
...     "gt_answers": ["a11", "aa2"]
... }
>>> dataset = Dataset.from_dict(data)
>>> len(dataset)
2
>>> dataset
Dataset({
    features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
    num_rows: 2
})

It is worth noting that each column can be extended to hold more complicated data structures.

faneshion added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Feb 22, 2024

faneshion added this to the Version 0.1 milestone Feb 22, 2024

faneshion assigned Wenshansilvia Feb 22, 2024

faneshion pinned this issue Feb 22, 2024

faneshion changed the title ~~Use DataPack to organize dataset.~~ Define the format of input for all metrics Feb 22, 2024

faneshion mentioned this issue Feb 22, 2024

Hotfix/standardize metric input #49

Merged

Wenshansilvia mentioned this issue Feb 23, 2024

add answer exact match metric #51

Merged

Wenshansilvia closed this as completed Mar 5, 2024
