Question-answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.
Some question-answering models can even generate answers without any context (Hugging Face). Question answering comes in three main variants:
- Extractive Question Answering
- Open Generative Question Answering
- Closed Generative Question Answering
Our system is an Extractive Question Answering system: given a context and a question, the model assumes that the answer is contained within the provided context.
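For illustration, here is a minimal sketch of extractive QA using the Hugging Face Transformers `pipeline` API (the question and context strings are made up for the example):

```python
from transformers import pipeline

# Build an extractive QA pipeline with the model used in this project.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is the capital of France?",
    context="The capital of France is Paris, a city on the Seine.",
)
print(result["answer"])  # a span extracted from the context, e.g. "Paris"
```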
The question-answering system in this project is evaluated using the Stanford Question Answering Dataset (SQuAD). SQuAD is a widely used benchmark dataset for evaluating machine reading comprehension and question-answering systems. It contains a diverse set of passages from a variety of topics and genres, and each entry consists of:
- Context Paragraph: A passage that contains the information from which the answer can be extracted.
- Question: A question related to the context, formulated to prompt the model to extract the relevant answer.
- Answer Span: The exact span of text within the context paragraph that serves as the answer to the question.
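SQuAD is available through the Hugging Face `datasets` library; a quick way to inspect one record (the split is chosen here just for illustration):

```python
from datasets import load_dataset

# Load one split of SQuAD and look at a single entry.
squad = load_dataset("squad", split="validation")
example = squad[0]

print(example["context"])   # the context paragraph
print(example["question"])  # the question about that paragraph
print(example["answers"])   # {'text': [...], 'answer_start': [...]}: the answer span(s)
```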
- Python version >= 3.6 is recommended
- Operating System: Windows
- datasets==2.14.4
- numpy==1.24.4
- pandas==2.0.3
- tokenizers==0.13.3
- torch==2.0.1
- transformers
- pytest
1. Clone the source:
   `git clone https://github.com/geehaad/Question-Answering.git`
2. Go to the directory you cloned the repo into and open cmd:
   `cd Question-Answering`
3. Create a virtual environment (replace `venv` with your virtual environment name). Using conda, in CMD write:
   `conda create -p venv python==3.8`
4. Activate the virtual environment:
   `conda activate venv\`
5. Install the packages listed in requirements.txt:
   `pip install -r requirements.txt`
6. Run the main script:
   `python src/components/main.py`
   The output is a CSV file called `output` in your directory (a quick way to inspect it is shown after these steps).
7. To run the testing file:
   `pytest src/tests/test_answer_questions.py`
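Once the main script (step 6) has finished, the results file can be inspected with pandas. Note that the `output.csv` filename and the column contents here are assumptions based on the descriptions in this README:

```python
import pandas as pd

# Load the generated results; each row pairs a question with the dataset's
# original answer and the answer detected by the model.
df = pd.read_csv("output.csv")
print(df.head())
```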
- Model used: `distilbert-base-cased-distilled-squad`, a variant of the DistilBERT model that has been fine-tuned specifically on SQuAD. This model is designed to accurately extract answers from a given context.
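The checkpoint and its tokenizer can be loaded from the Hugging Face Hub like this:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Download (or load from cache) the fine-tuned checkpoint and its tokenizer.
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
```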
How the System Works:

Our Question Answering system takes a context paragraph and a question as inputs and aims to extract the relevant answer from the context. It works in multiple steps:

1. Tokenization: the context paragraph and the question are tokenized into subword tokens using the tokenizer provided by the Hugging Face Transformers library alongside `AutoModelForQuestionAnswering`.
2. Process input through the model: the tokenized inputs are passed through the `distilbert-base-cased-distilled-squad` model, which has been fine-tuned on the SQuAD dataset.
3. Extract the answer span: the model's output consists of logits (unnormalized scores) for each token in the context paragraph; the tokens with the highest start and end logits correspond to the beginning and end of the answer span within the context.
4. Generate the answer: decoding the answer-span tokens produces the final answer string, which is returned as the output of the system.
5. Evaluate the model: the first 100 rows of the SQuAD dataset are used to evaluate the performance of the QA system.
6. Testing: pytest is run with multiple test cases.
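Steps 1-4 map directly onto a few lines of Transformers code. Here is a self-contained sketch; the question and context are illustrative, and the real implementation lives in `src/components/helper.py`:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Where is the Eiffel Tower located?"
context = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."

# Step 1 -- Tokenization: encode question and context into subword tokens.
inputs = tokenizer(question, context, return_tensors="pt")

# Step 2 -- Forward pass: the model returns one start and one end logit per token.
with torch.no_grad():
    outputs = model(**inputs)

# Step 3 -- Answer span: the highest start/end logits mark the span boundaries.
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))

# Step 4 -- Decoding: turn the span's token ids back into a string.
answer_ids = inputs["input_ids"][0][start : end + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```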
Below is an overview of the key folders and files within the project:

```
Question-Answering/
|-- notebooks/
|   |-- trails.ipynb
|-- src/
|   |-- __init__.py
|   |-- components/
|   |   |-- __init__.py
|   |   |-- helper.py
|   |   |-- main.py
|   |-- tests/
|   |   |-- __init__.py
|   |   |-- test_answer_questions.py
|-- requirements.txt
|-- README.md
```
- `src/`: This folder contains the main source code of the project:
  - `components/`: The heart of the project, where the primary functionality resides. It contains:
    - `helper.py`: Contains the core functions that enable the question-answering system:
      - The `answer_questions` function takes a context and a question as input, tokenizes them, and extracts the answer using the chosen model.
      - The `apply_answer_questions` function applies `answer_questions` to a dataset, generating dictionaries containing the question, the original answer, and the detected answer.
    - `main.py`: The entry point of the project, where the main function applies `apply_answer_questions` to a subset of the dataset (100 rows) and saves the results in a CSV file (an illustrative usage sketch appears at the end of this README).
  - `tests/`:
    - `test_answer_questions.py`: Contains pytest test cases that validate the accuracy of the question-answering system. The tests use parameterized testing to check the behavior of the `answer_questions` function in different situations, as sketched below.
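A hedged sketch of what such parameterized tests can look like (the example cases and the exact assertion are assumptions, not copied from the actual test file):

```python
import pytest

from src.components.helper import answer_questions  # assumed import path

# Each tuple is one test case: a context, a question, and the expected answer.
@pytest.mark.parametrize(
    "context, question, expected",
    [
        ("The capital of France is Paris.", "What is the capital of France?", "Paris"),
        ("Python was created by Guido van Rossum.", "Who created Python?", "Guido van Rossum"),
    ],
)
def test_answer_questions(context, question, expected):
    # The model should find the expected span inside its returned answer.
    assert expected in answer_questions(context, question)
```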
- `notebooks/`:
  - `trails.ipynb`: A Jupyter notebook that serves as a sandbox for experimentation. It is used to explore the dataset and try different models before integrating them into the main system.
- `requirements.txt`: Lists the Python packages required for the project to run successfully.
- `README.md`: The central documentation file containing essential information about the project, its usage, and directory structure.
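Finally, to tie the pieces together, here is an illustrative end-to-end use of the helpers described above. The argument order, return types, and import paths are assumptions based on the descriptions in this README, not copied from the source:

```python
import pandas as pd
from datasets import load_dataset

from src.components.helper import answer_questions, apply_answer_questions  # assumed imports

# Evaluate on the first 100 rows of SQuAD, as main.py is described to do
# (the split is an assumption).
subset = load_dataset("squad", split="train").select(range(100))

# Single example: extract one answer from one context/question pair.
row = subset[0]
print(answer_questions(row["context"], row["question"]))

# Whole subset: one dict per row (question, original answer, detected answer),
# then save to CSV like the main script.
results = apply_answer_questions(subset)
pd.DataFrame(results).to_csv("output.csv", index=False)
```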