wikiNLPpy: Training Word2Vec model on wikipedia data

A python library to train and store a word2vec model trained on wiki data. Model includes most common bigrams.

Getting Started

Prerequisites

Python 3.x with Anaconda
Gensim
Wikipedia .xml.bz2 file

Wikipedia file can be downloaded from https://dumps.wikimedia.org/backup-index.html File needs to be of type .xml.bz2 (for storage)

For instance, file can be downloaded with

wget 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'

Training Model

In the trainModel folder, runAll.sh is the script that will launch all relevant python files. Only input needed is the name of the wikipedia file

# INPUT
xmlFile="wikiIni.xml.bz2"

# OUTPUT
wikiTextDir=$(python genWikiText.py $xmlFile)
bigramDir=$(python findBigrams.py $wikiTextDir)
python trainWikiText.py $wikiTextDir $bigramDir

genWikiText.py uses WikiCorpus from gensim to extract text from wikipedia file. All articles are joined and written to text file.
findBigrams.py goes through text file and finds the most common bigrams in the text. Bigram model is saved in vector format.
trainWikiText.py trains wikiCorpus on Word2Vec after passing model through bigrams.

End result is a Word2Vec model, saved in a vector format to be loaded and used in another instance.

Using model

Here we use the model to determine the similarity of a sentence to a particular theme using the Word2Vec model.

This is done by using the runAll.py file in the usingModel folder. The inputs are the wiki file, the bigram model, a text file of themes to examine and a folder that contains csvs of data.

Main Inputs:

A trained Word2Vec model
A trained phrase colocation model (trained to find bigrams)
A .txt file that has a list of all the themes that need to be investigated
A folder with CSVs, where each CSV represents a key and a sample of text to analyse

The interests (or themes) are defined in a seperate text file and the similarity of each word in the second column of the csv is used.

# INPUTS
modelFile='[/path/to/wikiModel].vector'
bigramFile='[/path/to/bigramModel].model'
themesFile='[/path/to/list/of/themes/].txt'
folderName='[folder_containts_csv_data]'

# COMMAND
python findTextSim.py $modelFile $bigramFile $themesFile $folderName

Outputs is a folder for each csv that contains affinity of sample text for every theme.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

wikiNLPpy: Training Word2Vec model on wikipedia data

Getting Started

Prerequisites

Training Model

Using model

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
trainModel		trainModel
usingModel		usingModel
README.md		README.md

KeepFloyding/wikiNLPpy

Folders and files

Latest commit

History

Repository files navigation

wikiNLPpy: Training Word2Vec model on wikipedia data

Getting Started

Prerequisites

Training Model

Using model

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages