** I/O Files **
INPUT files (the files to evaluate) are ALREADY in ./input/
You can nevertheless modify these files directly.
Alternatively, you can open an interactive Python shell and call the function getResult() (line 306 in sent_analysis.py) with a list of strings as input.
Currently, the interactive Python shell only supports text (not tweets).
OUTPUT files
- a new eval/hxt*.txt file is created for tweets (the * depends on the chosen option)
- a new eval/review-z.txt file is created for text
----------------------------
** Running **
TAKE CARE: TRAINING IS MANDATORY BEFORE THE FIRST USE ON TWEETS.
Echo offers sentiment analysis on tweets and on text (especially text from SHS article reviews).
To start sentiment analysis from the command line, run in a Linux console:
$python sent_analysis.py
- Choose which type of corpus you want to use (text or tweet)
- Choose whether to train or not (mandatory the first time you run Echo)
- Choose the type of process used to run the sentiment analysis.
----------------------------
** Training **
If you choose the tweet option:
- The first time, you need to train on tweets (at least once for each option); after that you can run sentiment analysis at any time.
Training creates your model files.
3 options are available for training: z-score, polarity and twitter Dictionary; the POS option does not currently work properly.
If you choose the text option:
- Training is included. You don't need to train; you can run sentiment analysis directly.
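
Since tweet training has to be run at least once per option, the three runs can be prepared in one go. The sketch below only builds the command lines; it uses the options documented in the CLI section of this README. The helper name training_commands is hypothetical, and passing both -train and -test on a training run is an assumption based on the example command in that section.

```python
# Feature codes taken from the CLI help in this README.
TWEET_FEATURES = ("zs", "pol", "dic")


def training_commands(train_path, test_path, python_exe="python"):
    """Return one argv list per tweet feature option, each with training enabled.

    Hypothetical helper: it only assembles commands, it does not run them.
    """
    commands = []
    for feature in TWEET_FEATURES:
        commands.append([
            python_exe, "sent_analysis.py",
            "-c", "tw",            # tweet corpus
            "-f", feature,         # one feature option per run
            "-t",                  # enable training
            "-train", train_path,
            "-test", test_path,    # assumption: a test file is also expected
        ])
    return commands


if __name__ == "__main__":
    for argv in training_commands("corpus/twitter-train-cleansed-B.txt",
                                  "input/semeval-tweet-test-B-input.txt"):
        print(" ".join(argv))
```

Each printed line can be pasted into the console, or the argv lists can be handed to subprocess.call if you prefer to launch them from Python.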
----------------------------
** Analysing **
- Tweet: 3 processing options are available (Z-score; Polarity; Twitter)
- Text: the z-score option is chosen by default. There is no other option for text.
It can take a while, especially if the input file is large.
RETURN: name of the output file.
----------------------------
** Evaluation **
Evaluation is done automatically after the analysis.
- Tweet: evaluation is done against a reference file (see the Perl program eval/score-semeval2014-task9-subtaskB.pl)
- Text: evaluation is done by cross-validation
The evaluation is displayed at the end of the process.
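
To give an idea of what the tweet score means, here is a minimal sketch of the average of the F1 scores for the positive and negative classes. It assumes the Perl script follows the official SemEval-2014 Task 9 subtask B metric; it is not a port of that script, and the function names and label strings are illustrative.

```python
def f1_for_label(gold, predicted, label):
    """F1 score of one label, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this label
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1_pos_neg(gold, predicted):
    """Mean of the positive and negative F1 scores (neutral is not averaged in)."""
    return (f1_for_label(gold, predicted, "positive")
            + f1_for_label(gold, predicted, "negative")) / 2.0


if __name__ == "__main__":
    gold = ["positive", "negative", "neutral", "positive"]
    predicted = ["positive", "neutral", "neutral", "negative"]
    print(round(100 * average_f1_pos_neg(gold, predicted), 2))
```

The neutral class still influences the result indirectly, because a neutral prediction on a positive or negative tweet counts as a miss for that class.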
----------------------------
** In an interactive shell (only for txt option) **
You must have RUN the text option AT LEAST ONCE from the CLI (to create the various configuration files).
In your virtualenv:
$python
>>> import sent_analysis
>>> my_list=[u'mysentence0', u'mysentencei', ....]
>>> sent_analysis.getResult(my_list)
OUTPUT: list = [(opinioni, sentencei)]
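
Once you have the list of (opinion, sentence) pairs, it can be summarised with a few lines of plain Python. This is a minimal sketch: the helper names are hypothetical, and the sample data only stands in for the return value of getResult(); the label strings are placeholders, not necessarily the labels Echo produces.

```python
from collections import Counter


def count_opinions(results):
    """Count how many sentences received each opinion label."""
    return Counter(opinion for opinion, _sentence in results)


def sentences_with(results, wanted):
    """Return the sentences whose opinion equals `wanted`, in original order."""
    return [sentence for opinion, sentence in results if opinion == wanted]


# Placeholder data standing in for sent_analysis.getResult(my_list);
# the label strings are illustrative only.
sample = [
    ("positive", u"mysentence0"),
    ("negative", u"mysentence1"),
    ("positive", u"mysentence2"),
]

if __name__ == "__main__":
    print(count_opinions(sample))
    print(sentences_with(sample, "positive"))
```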
----------------------------
** Using the Command Line Interface (CLI) **
You can see all the details by typing:
```
python sent_analysis.py -h
```
This prints the following information:
usage: sent_analysis.py [-h] [-train TRAIN] [-c MODES] [-test TEST] [-f FEATS]
                        [-t] [-v] [-o OUTPUT]

echo by Hussam Hamdam. Forked by Gaël Guibon in order to add a CLI and speed
optimization Sentiment analysis classifier by polarity.

optional arguments:
  -h, --help            show this help message and exit
  -train TRAIN, --train TRAIN
                        train file path
  -c MODES, --corpus MODES
                        modes; "txt" for text or "tw" for tweets)
  -test TEST, --test TEST
                        test file path
  -f FEATS, --feature FEATS
                        type of features : "zs" for z-score, "pol" for polarity or "dic" for twitterDictionary or combine them "zs+pol+dic"
  -t, --trainingFlag    use this flag to enable training
  -v, --verbose         use this flag to enable progressionBar (will slightly slow computation)
  -o OUTPUT, --output OUTPUT
                        output file path
This example command:
$python sent_analysis.py -c tw -f pol+zs+dic -t -train corpus/twitter-train-cleansed-B.txt -test input/semeval-tweet-test-B-input.txt
Will give you these results:
eval/hxt-z-pol-dic.txt
    LiveJournal2014      61.86
    SMS2013              56.30
    Twitter2013          59.51
    Twitter2014          61.42
    Twitter2014Sarcasm   42.11
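
The options listed in the help text above could be declared with argparse roughly as follows. This is a sketch for readers who want to script around the CLI, not the code found in sent_analysis.py: the function names, the dest names and the validation of the feature string are assumptions.

```python
import argparse

# Feature codes documented in the help text.
KNOWN_FEATURES = ("zs", "pol", "dic")


def parse_features(spec):
    """Split a spec such as "pol+zs+dic" into a validated list of feature codes."""
    features = [part.strip() for part in spec.split("+") if part.strip()]
    unknown = [f for f in features if f not in KNOWN_FEATURES]
    if not features or unknown:
        raise argparse.ArgumentTypeError("unknown feature spec: %r" % spec)
    return features


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="sent_analysis.py")
    parser.add_argument("-train", "--train", help="train file path")
    parser.add_argument("-c", "--corpus", dest="modes", choices=("txt", "tw"),
                        help='"txt" for text or "tw" for tweets')
    parser.add_argument("-test", "--test", help="test file path")
    parser.add_argument("-f", "--feature", dest="feats", type=parse_features,
                        help='"zs", "pol", "dic" or a "+"-joined combination')
    parser.add_argument("-t", "--trainingFlag", action="store_true",
                        help="enable training")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable the progress bar")
    parser.add_argument("-o", "--output", help="output file path")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Validating the feature string at parse time means a typo such as "zs+pool" is rejected before any training or analysis starts.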
----------------------------
** About the different files **
Some resource and input-data files are used during the process:
./lexicon: -> Useful only for the POLARITY OPTION:
-negative-words.txt: Dictionary of negative words for tweets.
-positive-words.txt: Dictionary of positive words for tweets.
-subjclueslen1-HLTEMNLP05.tff: The Subjectivity Lexicon (list of subjectivity clues) that is part of OpinionFinder
-SentiWordNet_3.0.0_20130122.txt: SentiWordNet is a lexical resource for opinion mining (http://sentiwordnet.isti.cnr.it/)
./data:
-twitterDic.txt is a lexicon of expressions and emoticons (only for the twitter Dictionary method)
-frstopwords.txt is a corpus used during pre-processing to tokenize text. It is used with the nltk library (to fit the input).
-cr*.txt is a z-score reference built for TEXT during the training phase. Normally these files are rebuilt each time the training process is launched (not implemented yet).
-z*.txt is a z-score reference built for TWEET during the training phase. Normally these files are rebuilt each time the training process is launched (not implemented yet).
./input:
-semeval-tweet-test-B-input.txt is an example input file for tweets.
-review-input.txt is an input file for review text. The opinion tags are present but not used for evaluation.
./eval: -> used to evaluate the result of the sentiment analysis on tweets:
-semeval-tweet-test-B-reference.txt (reference file against which the tweet sentiment analysis is scored)
-score-semeval2014-task9-subtaskB.pl (Perl code to evaluate the result)
./corpus:
-review.txt is a hand-annotated corpus of reviews
-twitter-train-cleansed-B.txt is a hand-annotated corpus of tweets