I am new to NLP and this is the first time I am using the sklearn vectorizers. I am following a tutorial, but with a different corpus, for sentiment analysis. For some reason the resulting arrays are almost all zeros (a 1 here and there, but very few of them).
Here is the code I used to preprocess the corpus:
```python
import re
from collections import Counter

from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

normalizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # cache once instead of re-reading per token

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def preprocess_text(text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    lemmatized = [normalizer.lemmatize(token, get_part_of_speech(token))
                  for token in tokenized if token not in stop_words]
    return ' '.join(lemmatized)
```
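One thing I noticed while debugging: when WordNet has no synsets for a token, all four POS counts stay at zero, and `Counter.most_common` then breaks the tie by insertion order, so the lemmatizer always receives `"n"` for unknown words. A stdlib-only sketch of that tie-breaking behaviour (no NLTK needed, the zero counts just mimic an unknown token):

```python
from collections import Counter

# Mimic get_part_of_speech for a token with no WordNet synsets:
# every POS bucket ends up with count 0.
pos_counts = Counter()
pos_counts["n"] = 0
pos_counts["v"] = 0
pos_counts["a"] = 0
pos_counts["r"] = 0

# most_common uses a stable sort, so ties keep insertion order (Python 3.7+)
# and "n" is returned even though no POS actually matched.
most_likely = pos_counts.most_common(1)[0][0]
print(most_likely)  # → n
```

I am not sure whether this matters for my problem, but it means unknown tokens are always lemmatized as nouns.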
And here is the vectorizer code (passing the full vocabulary via `vocabulary=` is one of the fixes I found in other threads, but it still does not work):
```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

pos = open('NLTK/short_reviews/positive.txt', 'r', encoding='latin-1').read()
neg = open('NLTK/short_reviews/negative.txt', 'r', encoding='latin-1').read()

pos_clean = [preprocess_text(sen) for sen in pos.split('\n') if sen != '']
neg_clean = [preprocess_text(sen) for sen in neg.split('\n') if sen != '']

x_clean = pos_clean + neg_clean
labels = [1] * len(pos_clean) + [0] * len(neg_clean)

# Build the full vocabulary from every distinct word in the corpus.
vocab = []
for sentence in x_clean:
    for word in sentence.split(' '):
        if word not in vocab:
            vocab.append(word)

x_train, x_test, y_train, y_test = train_test_split(
    x_clean, labels, test_size=0.2, random_state=42
)

vectorizer = CountVectorizer(vocabulary=vocab)
x_vec = vectorizer.fit_transform(x_train).toarray()
# xt_vec = vectorizer.transform(x_test).toarray()

with numpy.printoptions(threshold=numpy.inf):
    print(x_vec[0])
```
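To show what I mean by "almost only zeros", here is a stdlib-only sketch (no sklearn, with a tiny made-up corpus) of the same bag-of-words idea: each row is as long as the full vocabulary, so a short sentence can only produce non-zero counts at a handful of positions.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for my cleaned reviews.
corpus = [
    "good movie love it",
    "bad plot terrible acting",
    "good acting bad plot",
]

# Build the vocabulary the same way as in my code: every distinct word.
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# Count-vectorize the first sentence against the full vocabulary.
counts = Counter(corpus[0].split())
row = [counts[word] for word in vocab]
print(row)  # → [1, 1, 1, 1, 0, 0, 0, 0]

zeros = sum(1 for v in row if v == 0)
print(f"{zeros}/{len(row)} entries are zero")
```

With my real corpus the vocabulary has thousands of entries while each review has only a dozen or so words, so maybe this sparsity is expected? That is exactly what I would like to confirm.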
Thanks a lot in advance, and please do not hesitate to ask if any information is missing!
I hope I can figure out what is going on...