Building a Context-based Question Answering System on SQuAD 2.0

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of more than 100,000 questions. The answer to each question is a segment of text from the corresponding passage, and the passages are taken from Wikipedia articles. The dataset is distributed in JSON format and can be downloaded from the following link.

The structure of the dataset is as follows:

i) Title — These are the topics on which the passages are based. There are a total of 442 titles in the data, and each title has multiple passages under it.

ii) Context — These are the passages present under each title. The number of passages per title varies from 10 to 149. The questions are based on these contexts only.

iii) Question and Answers — For each context, a few questions are associated with it, along with the corresponding answers.

iv) Answer start index and is_possible — Each row contains the starting index of the answer within the context and a boolean variable indicating whether the question can be answered.

Flowchart to show the structure of SQuAD dataset

The dataset after a little preprocessing looks as follows. The code for preprocessing can be found in the notebook.

First, we will check the total number of datapoints and the number of unique titles and contexts present in our dataframe.
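This check is a couple of lines of pandas. The sketch below uses a toy dataframe with the same columns assumed for the preprocessed data (title, context, question, answer); the real dataframe comes from the notebook's preprocessing step.

```python
import pandas as pd

# Toy stand-in for the preprocessed SQuAD dataframe.
df = pd.DataFrame({
    "title":    ["Oxygen", "Oxygen", "Paris"],
    "context":  ["Oxygen is a chemical element.",
                 "Oxygen is a chemical element.",
                 "Paris is the capital of France."],
    "question": ["What is oxygen?",
                 "What kind of element is oxygen?",
                 "What is the capital of France?"],
    "answer":   ["a chemical element", "chemical", "Paris"],
})

print("Total datapoints :", len(df))
print("Unique titles    :", df["title"].nunique())
print("Unique contexts  :", df["context"].nunique())
```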

Since we are dealing with text data, we will check the number of words present in the context, question, and answer columns of the dataframe. We will plot the distribution curves of these columns to get an idea of their spread.

The plots above make it clear that the distributions are right-skewed and that outliers are present in the dataset. We can remove the datapoints where the number of words in the context is greater than 200. Similarly, we can remove the datapoints where the number of words in the question and answer are greater than 50 and 20, respectively.
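The word counts behind those plots can be computed directly with pandas string methods. A minimal sketch, assuming a dataframe `df` with the same text columns as the preprocessed data:

```python
import pandas as pd

# Single-row stand-in for the real dataframe.
df = pd.DataFrame({
    "context":  ["the quick brown fox jumps over the lazy dog"],
    "question": ["what does the fox do"],
    "answer":   ["jumps"],
})

# Per-row word counts for each text column; a histogram of these columns
# (e.g. with seaborn.histplot) shows the right-skewed distributions.
for col in ["context", "question", "answer"]:
    df[col + "_words"] = df[col].str.split().str.len()

print(df[["context_words", "question_words", "answer_words"]])
```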

This is the most important step of the project. As part of preprocessing, we will first remove the outliers discussed in the previous section. We will also convert all words to lowercase.
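Both steps can be sketched as a boolean mask plus a lowercase pass, using the word-count thresholds from the previous section (200 / 50 / 20). The dataframe here is a two-row toy example standing in for the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "context":  ["Paris is the CAPITAL of France.",
                 " ".join(["word"] * 250)],        # 250-word outlier context
    "question": ["What is the capital of France?"] * 2,
    "answer":   ["Paris"] * 2,
})

# Keep rows within the word-count limits for context, question, and answer.
mask = (
    (df["context"].str.split().str.len() <= 200)
    & (df["question"].str.split().str.len() <= 50)
    & (df["answer"].str.split().str.len() <= 20)
)
df = df[mask].copy()

# Lowercase every text column.
for col in ["context", "question", "answer"]:
    df[col] = df[col].str.lower()

print(len(df))               # the 250-word context row is dropped
print(df["context"].iloc[0])
```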

We will start vectorization (converting text to numbers) on our preprocessed data. Rather than using the tokenizer methods available in keras.preprocessing directly, we will define a class Vectorization with the following methods:

1. create_vocabulary — As the name suggests, this method takes all the training text as input and creates a vocabulary out of it. We also define max_words to set an upper limit on the number of tokens in the vocabulary.

2. text_to_seq and seq_to_text — These are straightforward methods that convert text to sequences using the vocabulary we created, and perform the reverse operation of converting sequences back to text (during inference).
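The class described above can be sketched as follows. The method names follow the text; the padding/unknown-token indices and frequency-based vocabulary cutoff are assumptions, since the notebook's exact implementation is not shown here.

```python
class Vectorization:
    def __init__(self, max_words=10000):
        self.max_words = max_words              # cap on vocabulary size
        self.word_to_idx = {"<pad>": 0, "<unk>": 1}
        self.idx_to_word = {0: "<pad>", 1: "<unk>"}

    def create_vocabulary(self, texts):
        # Count word frequencies over all training texts.
        counts = {}
        for text in texts:
            for word in text.lower().split():
                counts[word] = counts.get(word, 0) + 1
        # Keep the most frequent words, up to max_words entries.
        for word, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
            if len(self.word_to_idx) >= self.max_words:
                break
            idx = len(self.word_to_idx)
            self.word_to_idx[word] = idx
            self.idx_to_word[idx] = word

    def text_to_seq(self, text):
        # Out-of-vocabulary words map to the <unk> index (1).
        return [self.word_to_idx.get(w, 1) for w in text.lower().split()]

    def seq_to_text(self, seq):
        return " ".join(self.idx_to_word.get(i, "<unk>") for i in seq)


vec = Vectorization(max_words=50)
vec.create_vocabulary(["the cat sat", "the dog sat"])
seq = vec.text_to_seq("the cat sat")
print(seq)
print(vec.seq_to_text(seq))  # round-trips back to "the cat sat"
```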

A wide variety of models have already been trained on the SQuAD dataset, and a few have performed better than humans at answering questions given the context. You can find more details in this link. In this project we will define a model from scratch using Bidirectional GRU units and Attention layers. The architecture of the model is as follows.

We take the context and the question as inputs, and there are three outputs: the start index of the answer, the end index of the answer, and a dense layer with 2 units and softmax activation that predicts whether the answer is possible. We will fit our model after defining a model-checkpoint callback.
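One way to wire this up in the Keras functional API is sketched below. The sizes (sequence lengths, vocabulary, GRU units) are placeholder values, and the particular way the attention output is merged and pooled is an assumption; the notebook's exact architecture may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical sizes; the real values depend on the preprocessing caps.
MAX_CTX, MAX_Q, VOCAB, EMB, UNITS = 200, 50, 10000, 100, 64

ctx_in = layers.Input(shape=(MAX_CTX,), name="context")
q_in = layers.Input(shape=(MAX_Q,), name="question")

# Shared embedding, then a Bidirectional GRU encoder for each input.
emb = layers.Embedding(VOCAB, EMB)
ctx = layers.Bidirectional(layers.GRU(UNITS, return_sequences=True))(emb(ctx_in))
q = layers.Bidirectional(layers.GRU(UNITS, return_sequences=True))(emb(q_in))

# Attend from the context over the question, then merge the two views.
att = layers.Attention()([ctx, q])
merged = layers.Concatenate()([ctx, att])

# Three heads: start index, end index, and the answer-possible flag.
start = layers.Softmax(name="start")(layers.Flatten()(layers.Dense(1)(merged)))
end = layers.Softmax(name="end")(layers.Flatten()(layers.Dense(1)(merged)))
pooled = layers.GlobalMaxPooling1D()(merged)
possible = layers.Dense(2, activation="softmax", name="is_possible")(pooled)

model = Model([ctx_in, q_in], [start, end, possible])
model.compile(optimizer="adam",
              loss=["sparse_categorical_crossentropy"] * 3)
model.summary()
```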

Performance metrics for the project

Exact Match — The percentage of predicted answers that match the ground truth exactly, word for word. The ground truth and the predicted answer must have the same starting and ending index. The higher the percentage, the better the model is at understanding the context and the questions and giving exact answers.
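A minimal sketch of this metric, assuming answers are compared after lowercasing and whitespace normalization (the official evaluation script also strips punctuation and articles):

```python
def exact_match(prediction, ground_truth):
    # 1 if the prediction matches the ground truth word for word, else 0.
    return int(prediction.lower().split() == ground_truth.lower().split())

preds = ["Paris", "a chemical element"]
golds = ["paris", "chemical element"]
em = 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
print(em)  # 50.0 — one of the two predictions is an exact match
```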

F1 Score — We use the macro-averaged F1 score. It captures, via precision and recall, whether the words chosen as part of the answer are actually part of the answer, measuring the average overlap between the prediction and the ground truth. We treat the prediction and ground truth as bags of tokens and compute their F1, take the maximum F1 over all ground-truth answers for a given question, and then average over all questions.
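The bag-of-tokens F1 described above can be sketched like this (again omitting the punctuation/article stripping the official script performs):

```python
from collections import Counter

def f1_score(prediction, ground_truth):
    # Treat both answers as bags of tokens and score their overlap.
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Take the maximum F1 over all ground-truth answers for one question.
best = max(f1_score("a chemical element", g)
           for g in ["chemical element", "the element oxygen"])
print(best)  # 0.8
```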

After training the model, we save the weights and architecture as a .hdf5 file so that we can load the model and check our predictions. We will directly load the model and preprocess the test data to predict the answers for the given questions and contexts.

After running the evaluation script provided on the SQuAD website to check the performance of the model, we got the following results.

Predicted output for a few datapoints

We can get decent performance even when training the model from scratch. The model is able to predict one- or two-word answers but does not generalize to all answer lengths. For better performance, we can train the model for more epochs, use more GRU units, or increase the maximum sequence length for the context and question.

We can also use pre-trained models such as BiDAF, bert-large-uncased, or DrQA and fine-tune them to get better performance on our data.

  1. The Colab notebook for the base model code –
  2. SQuAD dataset link —
  3. HuggingFace model made specifically for SQuAD —
