_ _
__ __| |_ _ _ __ _ __ | |_ ___ _ _
\ \ /| _|| '_|/ _` |/ _|| _|/ _ \| '_|
/_\_\ \__||_| \__,_|\__| \__|\___/|_|
xtractor
Topic extractor with the idea of generating labels using gensim's n_similarity
by Peter Nagy
xtractor is a little package that aims to label text automatically, harnessing the power of pre-trained word vectors.
The idea is the following:
- You must provide one or more gensim compatible pre-trained word vectors
- You must define categories with keywords
- You must provide the tokenized text features you want to label
- Run the extractor to label input text
- The extractor computes the cosine similarity between the word vectors in the sentence and the keyword vectors of each category
- Then it chooses the most "similar" category as the label
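The steps above can be sketched in plain Python. This is only an illustration of the idea, not xtractor's actual code: the toy 3-dimensional vectors, the layout of the `CATEGORIES` dictionary, and the helper names `mean_vector`, `cosine_similarity`, and `label` are all assumptions made for this example. It mirrors what gensim's `n_similarity` does, which is to compare the mean vector of one word set with the mean vector of another using cosine similarity.

```python
import math

# Toy word vectors standing in for a pre-trained model (hypothetical values).
VECTORS = {
    "bank":   [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "money":  [1.0, 0.0, 0.1],
    "goal":   [0.1, 0.9, 0.0],
    "match":  [0.0, 1.0, 0.2],
    "team":   [0.2, 0.8, 0.1],
}

# Hypothetical category definition: category name -> list of keywords.
CATEGORIES = {
    "economy": ["money", "market"],
    "sport": ["match", "team"],
}


def mean_vector(words):
    """Average the vectors of the words the toy model knows, or return None."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label(tokens):
    """Return the category whose keywords are most similar to the tokens."""
    sentence = mean_vector(tokens)
    if sentence is None:
        return None  # none of the tokens is in the vocabulary
    scores = {}
    for name, keywords in CATEGORIES.items():
        keyword_vector = mean_vector(keywords)
        if keyword_vector is not None:
            scores[name] = cosine_similarity(sentence, keyword_vector)
    return max(scores, key=scores.get) if scores else None


print(label(["bank", "money"]))  # economy
print(label(["goal", "team"]))   # sport
```

Words missing from the vocabulary are simply skipped in this sketch; how xtractor itself treats out-of-vocabulary words is not stated here.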
$ pip install xtractor
See example.py for a more detailed example.
from xtractor import TopicExtractor as te
extractor = te.TopicExtractor(models=models, categories=categories)
labels = extractor.extract(pandas_data_frame)
- models: list of gensim compatible models
- categories: list of categories (each category is defined with keywords)
- X: input pandas data frame or Python list
- in case X is a pandas dataframe, it must have only one column (the feature column)
- X can be a regular Python list
- the features are expected to be tokenized strings (e.g. in the following format: ['Tokenized', 'string'])
- the return value is a regular list containing the category names (labels) for each input row respectively (e.g. for a 2-row input: ['economy', 'sport'])
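As a rough illustration of the expected input shape, raw strings can be turned into token lists before they are handed to the extractor. The regex-based `tokenize` helper below is an assumption made for this example, not part of xtractor; any tokenizer that yields a list of strings per row should fit the format described above.

```python
import re


def tokenize(text):
    """Split raw text into word tokens with a simple regex (illustrative only)."""
    # \w is Unicode-aware in Python 3, so accented letters stay inside tokens.
    return re.findall(r"\w+", text)


raw_rows = ["Tokenized string", "The match ended 2-1."]

# One list of tokens per input row, e.g. ['Tokenized', 'string'] for the first row.
X = [tokenize(row) for row in raw_rows]
print(X)
```

A list built this way has the same shape as the regular Python list input described above.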
The results really depend on the quality of your pre-trained word vectors and on the quality of your intuitively defined category keywords.
In my use case I used these vectors and played with several iterations of keywords.
I reached around 69% precision, which is not bad. With more carefully picked keywords it can be improved.
- Q: Why did you make this? A: Because I was looking for a way to automatically label huge amounts of (Hungarian) text and found no simple way.
- peter nagy | [email protected]