This is a compilation of the text of Mahabharata from the following sources
- Complete translation by K. M. Ganguli: the complete text, taken from this GitHub repository
- Laura Gibbs' Tiny Tales: a retelling of the Mahabharata in two hundred episodes of 100 words each
- Kaggle data repo by Tilak: all 18 parvas of the Mahabharata in .txt format for NLP
- Wikipedia Parva Summaries
The text was copied from the sources listed above.
Python notebooks for parsing the data into CSV files.
Notebooks for processing the data. This directory contains the NER notebook for computing named entities for the text chunks. I am using the following model for Named Entity Recognition:
2rtl3/mn-xlm-roberta-base-named-entity, via the Hugging Face transformers library.
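A minimal sketch of how the NER tagging step might look; the helper names and output fields here are illustrative assumptions, not the notebook's actual code:

```python
def build_ner(model_name="2rtl3/mn-xlm-roberta-base-named-entity"):
    """Load the Hugging Face NER pipeline (downloads the model on first use)."""
    from transformers import pipeline  # heavy import kept local

    return pipeline("ner", model=model_name, aggregation_strategy="simple")


def tag_chunk(ner, chunk_id, text):
    """Run NER over one text chunk and key each entity by the chunk's UUID."""
    return [
        {
            "chunk_id": chunk_id,
            "entity": ent["word"],
            "label": ent["entity_group"],
            "score": float(ent["score"]),
        }
        for ent in ner(text)
    ]
```

`tag_chunk` takes the pipeline as a parameter, so the same keying logic can be reused (or unit-tested) with any callable that returns entity dictionaries.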
Contains the final output of the data-parsing notebooks: pandas dataframes saved as |-delimited CSV files. All the metadata, including the source, chapter, section, etc., is maintained as columns in the CSV. Each CSV has a text column containing a text chunk of 100 to 500 tokens. Each row also has a chunk_id, which is a UUID; this chunk_id is used to index the named entities in the named-entities dataframes.
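Loading one of these CSVs with pandas might look like the following (the column names and sample row are illustrative; check the actual header of each file):

```python
import io
import uuid

import pandas as pd

# Stand-in for one of the |-delimited data CSVs (columns are illustrative).
raw = (
    "chunk_id|source|parva|section|text\n"
    f"{uuid.uuid4()}|ganguli|1|2|Dhritarashtra said: O Sanjaya, tell me of the battle.\n"
)

# The pipe delimiter means commas inside the text column need no quoting.
df = pd.read_csv(io.StringIO(raw), sep="|")
print(df.columns.tolist())
```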
Each data CSV also has a corresponding named-entities CSV. The chunk_id is used as an index for tagging named entities to their corresponding chunks.
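Because chunk_id is shared between the two files, entities can be joined back to their chunks with a standard merge. A sketch with made-up rows:

```python
import pandas as pd

chunks = pd.DataFrame({
    "chunk_id": ["id-1", "id-2"],
    "text": ["Arjuna strung his bow.", "Bhima roared."],
})
entities = pd.DataFrame({
    "chunk_id": ["id-1", "id-2"],
    "entity": ["Arjuna", "Bhima"],
    "label": ["PER", "PER"],
})

# Left-join so chunks with no tagged entities are still kept.
joined = chunks.merge(entities, on="chunk_id", how="left")
```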
Note: If you regenerate the data CSV files, you must also regenerate the named entities; otherwise the chunk_id values in the named-entity dataframes will not correspond to the regenerated CSV rows.
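One way to catch stale chunk_ids after a partial regeneration is a simple set check (a sketch; the dataframe names and rows are placeholders):

```python
import pandas as pd

def stale_chunk_ids(data_df, ner_df):
    """Return chunk_ids referenced by the named-entities frame but absent
    from the data frame -- a sign the data CSVs were regenerated alone."""
    return set(ner_df["chunk_id"]) - set(data_df["chunk_id"])

# "old-1" was issued before regeneration, so it no longer matches any row.
data_df = pd.DataFrame({"chunk_id": ["new-1", "new-2"]})
ner_df = pd.DataFrame({"chunk_id": ["old-1", "new-2"]})
```

An empty result means every tagged entity still resolves to a chunk.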