Creating a custom dataset for Alight Question Answering tasks.
- Note: This is still under construction! We are currently looking into methods to assess the generated questions and answers.
Using Questgen.ai currently requires a workaround: in its `mcq.py` file, change the import `from similarity.normalized_levenshtein import NormalizedLevenshtein` to `from strsimpy.normalized_levenshtein import NormalizedLevenshtein`.
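
The import change in `mcq.py` can also be applied with a small script instead of editing the file by hand. This is only a sketch: the helper name `patch_mcq_import` is hypothetical, and the location of `mcq.py` depends on where Questgen is installed in your environment.

```python
from pathlib import Path

OLD_IMPORT = "from similarity.normalized_levenshtein import NormalizedLevenshtein"
NEW_IMPORT = "from strsimpy.normalized_levenshtein import NormalizedLevenshtein"


def patch_mcq_import(mcq_path):
    """Rewrite the outdated import in place; return True if the file changed."""
    path = Path(mcq_path)
    source = path.read_text(encoding="utf-8")
    if OLD_IMPORT not in source:
        # Already patched, or a Questgen version that uses a different import.
        return False
    path.write_text(source.replace(OLD_IMPORT, NEW_IMPORT), encoding="utf-8")
    return True
```

Call it once with the path to your local copy of `mcq.py`; running it a second time leaves the file untouched and returns `False`.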
Instead of using the `pdf_to_clean_text` method, you may use your own method to gather clean text. There are, however, some points to keep in mind:
- The current model has a maximum input length of 512 tokens.
- Deprecated: Divide the passage into different paragraphs/strides.
- The model currently generates the QnAs at the sentence level.
- Deprecated: Since we are using the stride method, it is recommended to call the `clean_text()` method once your data is formatted like `text = [paragraph1, paragraph2, ...]`. The output has the same shape, `clean_text(text) = [paragraph1, paragraph2, ...]`, except that the last sentence of each paragraph is repeated as the first sentence of the next one.
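
The sentence overlap described in the deprecated stride point can be illustrated as follows. This is not the implementation of `clean_text()`; it is a minimal sketch that assumes sentences end with `.`, `!` or `?`, and the function names are hypothetical.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def add_sentence_overlap(paragraphs):
    """Prefix each paragraph with the last sentence of the previous one."""
    result = []
    previous_last = None
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        if not sentences:
            continue  # skip empty paragraphs
        if previous_last is None:
            result.append(paragraph.strip())
        else:
            result.append(previous_last + " " + paragraph.strip())
        previous_last = sentences[-1]
    return result


text = ["A first sentence. A second sentence.", "A third sentence. A fourth one."]
print(add_sentence_overlap(text))
```

The second paragraph in the result starts with "A second sentence.", which is the overlap that carries context across the paragraph boundary.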
`QGen.py` currently contains three classes: `QuestionGenerator(text)`, `BoolQAnswer(df)`, and `ImpossibleQuestions(df)`.
- Update: `get_sentences()`, a method of the `QuestionGenerator(text)` class, now automatically takes each paragraph and creates a list of sentences to prepare the input for the model.
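
To make the sentence-level input concrete, here is a rough sketch of turning a list of paragraphs into a flat list of sentences. It is an illustration under a simple assumption (sentences end with `.`, `!` or `?`), not the code of `get_sentences()`; the name `get_sentences_sketch` is hypothetical.

```python
import re


def get_sentences_sketch(paragraphs):
    """Flatten paragraphs into one list of sentences (naive punctuation split)."""
    sentences = []
    for paragraph in paragraphs:
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sentence:
                sentences.append(sentence)
    return sentences


paragraphs = [
    "They can have a small in-house team. They save on time and cost.",
    "Teachers can create worksheets easily.",
]
for sentence in get_sentences_sketch(paragraphs):
    print(sentence)
```

A splitter this simple breaks on abbreviations such as "e.g.", so a real pipeline would likely rely on a proper sentence tokenizer.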
To use `BoolQAnswer(df)`, pass it the output generated by `QuestionGenerator(text)`, which is stored in `QuestionGenerator.df`, a dataframe with the columns 'Questions', 'Answers_FAQ', 'Answers_AP', 'Contexts'.
- i.e. `BoolQAnswer(QuestionGenerator.df)`
- Keep in mind that the resulting dataframe contains `NaN` in the Answers_BoolQ and Scores_BoolQ columns for non-boolean questions.
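
Because non-boolean questions carry `NaN` in the BoolQ columns, downstream code usually needs to separate the two kinds of rows. The sketch below works on plain dictionaries (for example the result of `df.to_dict("records")`) so that it runs without pandas; the helper names are hypothetical. With pandas, `df["Answers_BoolQ"].notna()` gives the same mask directly.

```python
import math


def is_missing(value):
    """True for the NaN placeholder used on non-boolean rows."""
    return isinstance(value, float) and math.isnan(value)


def split_boolean_rows(rows):
    """Separate rows that received a BoolQ answer from rows that did not."""
    boolean_rows = [r for r in rows if not is_missing(r["Answers_BoolQ"])]
    other_rows = [r for r in rows if is_missing(r["Answers_BoolQ"])]
    return boolean_rows, other_rows


rows = [
    {"Questions": "What can teachers create with Questgen?",
     "Answers_BoolQ": float("nan"), "Scores_BoolQ": float("nan")},
    {"Questions": "Can you use questgen in a school?",
     "Answers_BoolQ": "Yes", "Scores_BoolQ": 0.99},
]
boolean_rows, other_rows = split_boolean_rows(rows)
print(len(boolean_rows), len(other_rows))
```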
To use `ImpossibleQuestions(df)`, pass it the output generated by `BoolQAnswer(df)`, which is stored in `BoolQAnswer.boolq_df`, a dataframe with the columns 'Questions', 'Answers_FAQ', 'Answers_AP', 'Contexts', 'Answers_BoolQ', 'Scores_BoolQ'.
- i.e. `ImpossibleQuestions(BoolQAnswer.boolq_df)`
- The result is available as the attribute `impos_df`.
# Imports:
from alight_transformers.QGen import QuestionGenerator, BoolQAnswer, ImpossibleQuestions
from alight_transformers.QGen import pdf_to_clean_text
path = "path_to_file"
# pdf_to_clean_text(path_to_file, start_page, last_page) extracts clean text from a page range
text = pdf_to_clean_text(path, 10, 11)
# For the purposes of this example, I will use a short hypothetical text:
text = ['HR teams can use Questgen to create assessments from compliance documents. Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies.',
'Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. They can have a small in-house team and save hugely on time and cost.',
'Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. They can avoid repetitive questions chosen from a fixed question bank every year.']
QnA = QuestionGenerator(text)
ZERO
C:\Users\a1052739\Anaconda3\envs\huggingface\lib\site-packages\transformers\tokenization_utils_base.py:2198: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
warnings.warn(
C:\Users\a1052739\Anaconda3\envs\huggingface\lib\site-packages\transformers\models\t5\tokenization_t5.py:190: UserWarning: This sequence already has </s>. In future versions this behavior may lead to duplicated eos tokens being added.
warnings.warn(
Running model for generation
{'questions': [{'Question': 'What can be generated to make sure that employees have read and understood the new policies?', 'Answer': 'assessments', 'id': 1, 'context': 'Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies.'}, {'Question': 'Who can be given an assessment every time a change in policies is made?', 'Answer': 'employees', 'id': 2, 'context': 'Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies.'}]}
Running model for generation
{'questions': [{'Question': 'What are some examples of companies that can use Questgen instead of outsourcing the assessment creation process?', 'Answer': 'textbook publishers', 'id': 1, 'context': 'Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process.'}]}
Running model for generation
{'questions': [{'Question': 'What is the best way to save time and money?', 'Answer': 'house team', 'id': 1, 'context': 'They can have a small in-house team and save hugely on time and cost.'}, {'Question': 'How can I save time and money by having a small in-house team?', 'Answer': 'cost', 'id': 2, 'context': 'They can have a small in-house team and save hugely on time and cost.'}]}
Running model for generation
{'questions': [{'Question': 'What can teachers create with Questgen?', 'Answer': 'worksheets', 'id': 1, 'context': 'Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds.'}]}
Running model for generation
{'questions': [{'Question': 'What is the best way to avoid repetitive questions?', 'Answer': 'question bank', 'id': 1, 'context': 'They can avoid repetitive questions chosen from a fixed question bank every year.'}, {'Question': 'How often do people avoid repetitive questions from a fixed question bank?', 'Answer': 'year', 'id': 2, 'context': 'They can avoid repetitive questions chosen from a fixed question bank every year.'}]}
ZERO
QnA.df
| | Questions | Answers_FAQ | Answers_AP | Contexts |
---|---|---|---|---|
0 | What can be generated to make sure that employees have read and understood the new policies? | assessments | Assessments | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. |
1 | Who can be given an assessment every time a change in policies is made? | employees | Employees | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. |
2 | What are some examples of companies that can use Questgen instead of outsourcing the assessment creation process? | textbook publishers | Textbook publishers and edtech companies can use questgen instead of outsourcing the assessment creation process. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. |
3 | What is the best way to save time and money? | house team | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. |
4 | How can I save time and money by having a small in-house team? | cost | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. |
5 | What can teachers create with Questgen? | worksheets | Teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. |
6 | What is the best way to avoid repetitive questions? | question bank | They can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. |
7 | How often do people avoid repetitive questions from a fixed question bank? | year | Every year | They can avoid repetitive questions chosen from a fixed question bank every year. |
8 | Is there a difference between true and false? | None | there is a difference between true and false. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. |
9 | Is there such a thing as an assessment? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. |
10 | Is there such a thing as an assessment of employee policies? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. |
11 | Is questgen the same as testgen? | None | questgen is the same as testgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. |
12 | Is questgen the same as questgen? | None | questgen is the same as questgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. |
13 | Can you use questgen to create an assessment? | None | you can use questgen to create an assessment. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. |
14 | Is it possible to have a small team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. |
15 | Is it possible to have a small in house team? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. |
16 | Is it possible to have a team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. |
17 | Can you use the questgen authoring tool in your school? | None | teachers and schools can use the questgen authoring tool in your school. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. |
18 | Can you use questgen in a school? | None | teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. |
19 | Is questgen true or false? | None | True | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. |
20 | Can you avoid repeating the same question every year? | None | they can avoid repeating the same question every year. | They can avoid repetitive questions chosen from a fixed question bank every year. |
21 | Is there such thing as a fixed question bank? | None | there is such thing as a fixed question bank. | They can avoid repetitive questions chosen from a fixed question bank every year. |
22 | Do you have to answer the same question every year? | None | they can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. |
QnA_boolq = BoolQAnswer(QnA.df)
C:\Users\a1052739\Projects\Question Generator\Questgen Folder\Package\alight_transformers\QGen.py:228: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
boolq_df['Answers_BoolQ'] = bool_answers
C:\Users\a1052739\Projects\Question Generator\Questgen Folder\Package\alight_transformers\QGen.py:229: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
boolq_df['Scores_BoolQ'] = scores
QnA_boolq.boolq_df
| | Questions | Answers_FAQ | Answers_AP | Contexts | Answers_BoolQ | Scores_BoolQ |
---|---|---|---|---|---|---|
0 | What can be generated to make sure that employees have read and understood the new policies? | assessments | Assessments | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | NaN | NaN |
1 | Who can be given an assessment every time a change in policies is made? | employees | Employees | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | NaN | NaN |
2 | What are some examples of companies that can use Questgen instead of outsourcing the assessment creation process? | textbook publishers | Textbook publishers and edtech companies can use questgen instead of outsourcing the assessment creation process. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | NaN | NaN |
3 | What is the best way to save time and money? | house team | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. | NaN | NaN |
4 | How can I save time and money by having a small in-house team? | cost | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. | NaN | NaN |
5 | What can teachers create with Questgen? | worksheets | Teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | NaN | NaN |
6 | What is the best way to avoid repetitive questions? | question bank | They can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | NaN | NaN |
7 | How often do people avoid repetitive questions from a fixed question bank? | year | Every year | They can avoid repetitive questions chosen from a fixed question bank every year. | NaN | NaN |
8 | Is there a difference between true and false? | None | there is a difference between true and false. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.89 |
9 | Is there such a thing as an assessment? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.92 |
10 | Is there such a thing as an assessment of employee policies? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.85 |
11 | Is questgen the same as testgen? | None | questgen is the same as testgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | No | 0.72 |
12 | Is questgen the same as questgen? | None | questgen is the same as questgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | Yes | 0.89 |
13 | Can you use questgen to create an assessment? | None | you can use questgen to create an assessment. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | Yes | 0.98 |
14 | Is it possible to have a small team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 |
15 | Is it possible to have a small in house team? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 |
16 | Is it possible to have a team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 |
17 | Can you use the questgen authoring tool in your school? | None | teachers and schools can use the questgen authoring tool in your school. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.99 |
18 | Can you use questgen in a school? | None | teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.99 |
19 | Is questgen true or false? | None | True | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.92 |
20 | Can you avoid repeating the same question every year? | None | they can avoid repeating the same question every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | Yes | 0.94 |
21 | Is there such thing as a fixed question bank? | None | there is such thing as a fixed question bank. | They can avoid repetitive questions chosen from a fixed question bank every year. | Yes | 0.99 |
22 | Do you have to answer the same question every year? | None | they can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | No | 0.99 |
QnA_pos = ImpossibleQuestions(QnA_boolq.boolq_df)
QnA_pos.impos_df
| | Questions | Answers_FAQ | Answers_AP | Contexts | Answers_BoolQ | Scores_BoolQ | Possible | Probability |
---|---|---|---|---|---|---|---|---|
0 | What can be generated to make sure that employees have read and understood the new policies? | assessments | Assessments | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | NaN | NaN | Possible | 0.991926 |
1 | Who can be given an assessment every time a change in policies is made? | employees | Employees | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | NaN | NaN | Possible | 0.982875 |
2 | What are some examples of companies that can use Questgen instead of outsourcing the assessment creation process? | textbook publishers | Textbook publishers and edtech companies can use questgen instead of outsourcing the assessment creation process. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | NaN | NaN | Possible | 0.986688 |
3 | What is the best way to save time and money? | house team | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. | NaN | NaN | Possible | 0.966886 |
4 | How can I save time and money by having a small in-house team? | cost | They can have a small in-house team and save hugely on time and cost. | They can have a small in-house team and save hugely on time and cost. | NaN | NaN | Impossible | 0.935677 |
5 | What can teachers create with Questgen? | worksheets | Teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | NaN | NaN | Possible | 0.981222 |
6 | What is the best way to avoid repetitive questions? | question bank | They can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | NaN | NaN | Possible | 0.976209 |
7 | How often do people avoid repetitive questions from a fixed question bank? | year | Every year | They can avoid repetitive questions chosen from a fixed question bank every year. | NaN | NaN | Possible | 0.900476 |
8 | Is there a difference between true and false? | None | there is a difference between true and false. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.89 | Possible | 0.961348 |
9 | Is there such a thing as an assessment? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.92 | Possible | 0.993999 |
10 | Is there such a thing as an assessment of employee policies? | None | assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Every time there is a change in policies, assessments could be generated and given to employees to make sure that they have read and understood the new policies. | Yes | 0.85 | Possible | 0.990870 |
11 | Is questgen the same as testgen? | None | questgen is the same as testgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | No | 0.72 | Possible | 0.928253 |
12 | Is questgen the same as questgen? | None | questgen is the same as questgen. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | Yes | 0.89 | Possible | 0.943139 |
13 | Can you use questgen to create an assessment? | None | you can use questgen to create an assessment. | Textbook publishers and edtech companies can use Questgen instead of outsourcing the assessment creation process. | Yes | 0.98 | Possible | 0.986241 |
14 | Is it possible to have a small team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 | Possible | 0.991944 |
15 | Is it possible to have a small in house team? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 | Possible | 0.996165 |
16 | Is it possible to have a team in house? | None | they can have a small in-house team. | They can have a small in-house team and save hugely on time and cost. | Yes | 0.99 | Possible | 0.984167 |
17 | Can you use the questgen authoring tool in your school? | None | teachers and schools can use the questgen authoring tool in your school. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.99 | Possible | 0.995307 |
18 | Can you use questgen in a school? | None | teachers and schools can use the questgen authoring tool to create worksheets easily in less than 5 seconds. | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.99 | Possible | 0.994938 |
19 | Is questgen true or false? | None | True | Teachers and Schools can use the Questgen authoring tool to create worksheets easily in less than 5 seconds. | Yes | 0.92 | Possible | 0.952583 |
20 | Can you avoid repeating the same question every year? | None | they can avoid repeating the same question every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | Yes | 0.94 | Possible | 0.980597 |
21 | Is there such thing as a fixed question bank? | None | there is such thing as a fixed question bank. | They can avoid repetitive questions chosen from a fixed question bank every year. | Yes | 0.99 | Possible | 0.981145 |
22 | Do you have to answer the same question every year? | None | they can avoid repetitive questions chosen from a fixed question bank every year. | They can avoid repetitive questions chosen from a fixed question bank every year. | No | 0.99 | Possible | 0.935112 |
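
One possible way to use the last two columns when assembling the final dataset is to keep only rows labelled 'Possible' above some confidence. The sketch below makes two assumptions that the source does not state: that `Probability` is the confidence of the label in the `Possible` column, and that 0.95 is a sensible cut-off. The helper name `keep_answerable` is hypothetical, and rows are plain dictionaries so the example runs without pandas.

```python
def keep_answerable(rows, min_probability=0.95):
    """Keep rows labelled 'Possible' with at least the given probability."""
    return [
        r for r in rows
        if r["Possible"] == "Possible" and r["Probability"] >= min_probability
    ]


# Three rows copied from the table above (only the relevant columns).
rows = [
    {"Questions": "What can teachers create with Questgen?",
     "Possible": "Possible", "Probability": 0.981222},
    {"Questions": "How can I save time and money by having a small in-house team?",
     "Possible": "Impossible", "Probability": 0.935677},
    {"Questions": "How often do people avoid repetitive questions from a fixed question bank?",
     "Possible": "Possible", "Probability": 0.900476},
]
for row in keep_answerable(rows):
    print(row["Questions"])
```

The threshold is a trade-off: a higher value drops more borderline questions, a lower one keeps more data at the cost of noisier pairs.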