Hello, while using the library I ran into some questions about the parameters 'corpus_files' and 'quality_phrase_files'.
My code is as follows:

from autophrasex import *

# Build the AutoPhrase instance
autophrase = AutoPhrase(
    reader=DefaultCorpusReader(tokenizer=JiebaTokenizer()),
    selector=DefaultPhraseSelector(),
    extractors=[
        NgramsExtractor(N=4),
        IDFExtractor(),
        EntropyExtractor()
    ]
)

# Start mining
predictions = autophrase.mine(
    corpus_files=['answers.txt'],
    quality_phrase_files='userDic.txt',  # quality_phrase_files?? It looks like a stop-word list
    callbacks=[
        LoggingCallback(),
        ConstantThresholdScheduler(),
        EarlyStopping(patience=2, min_delta=3)
        # EarlyStopping()
    ]
)

# Print the mined results
for pred in predictions:
    print(pred)
Thank you all very much for your help!
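I'm not the author, but I can answer your second question. This is what the source code does: it first collects all n-grams from the entire corpus to be mined and filters them, then matches those n-grams against the entries in quality_phrase_files. An n-gram that appears in the quality phrases goes into the positive pool; every other n-gram goes into the negative pool. At extraction time, only the phrases from the original negative pool are scored. A phrase that was placed in the positive pool is never predicted again! So quality_phrase_files is not a stop-word list: it supplies the positive examples.

By the way, the way this code works differs enormously from AutoPhrase; you could even say the two are unrelated. Perhaps the author has not finished it yet. It simply uses IDF, PMI, and left/right entropy as features and scores candidates with a random forest. The POS part is missing entirely, the positive examples do not include the full knowledge-base vocabulary, the training-set size is not limited, and there is no multithreading option. I suggest the author keep improving the code...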
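The pooling step described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the autophrasex source: the function names `extract_ngrams` and `split_pools` and the `min_freq` filter are assumptions made up for this example, and the real library's filtering rules may differ.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=4):
    """Count every contiguous n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def split_pools(ngram_counts, quality_phrases, min_freq=1):
    """Partition candidate phrases into a positive and a negative pool.

    A candidate is positive when it appears in the quality-phrase set,
    negative otherwise. `min_freq` is a hypothetical stand-in for whatever
    filtering the real library applies.
    """
    positive, negative = set(), set()
    for gram, freq in ngram_counts.items():
        if freq < min_freq:
            continue
        phrase = "".join(gram)  # Chinese tokens are joined without spaces
        if phrase in quality_phrases:
            positive.add(phrase)
        else:
            negative.add(phrase)
    return positive, negative


tokens = ["机器", "学习", "很", "有趣"]
quality = {"机器学习"}
positive, negative = split_pools(extract_ngrams(tokens, max_n=2), quality)
# Only the candidates in `negative` would be scored later;
# anything in `positive` is treated as a known-good training example.
```

In this sketch a phrase listed in the quality file can never show up among the scored candidates, which matches the behavior described in the reply.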
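Of course, a plain for loop would also do the job easily: split the data to be segmented into N parts, then call mine again on each part inside the loop. There shouldn't be anything difficult about it...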
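A rough sketch of that loop, under stated assumptions: `split_into_chunks` and `mine_in_chunks` are hypothetical helpers written for this example, and `mine_fn` is a placeholder for whatever actually runs the mining on one part (for instance, writing the chunk to a temporary file and passing its path to `autophrase.mine` via `corpus_files`).

```python
def split_into_chunks(lines, n_chunks):
    """Split a list of lines into n_chunks consecutive, nearly equal parts."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional line.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def mine_in_chunks(lines, n_chunks, mine_fn):
    """Run mine_fn on each non-empty chunk and concatenate the results.

    mine_fn is a hypothetical callable: it takes a list of lines and
    returns a list of predictions.
    """
    results = []
    for chunk in split_into_chunks(lines, n_chunks):
        if chunk:
            results.extend(mine_fn(chunk))
    return results
```

Note that each call mines its part independently, so corpus-level statistics such as IDF would be computed per chunk rather than over the whole corpus.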