I configured everything according to the Readme.md, but the tokenizer still emits " " (space), "的", and punctuation as tokens. How can I filter these tokens out?
https://github.com/hankcs/HanLP/blob/master/src/test/java/com/hankcs/demo/DemoStopWord.java
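The linked demo shows HanLP's stopword filtering applied after segmentation. As a rough illustration of that post-filtering idea (a minimal sketch in plain Java, not HanLP's actual API — the stopword set here is hypothetical; in HanLP it would come from the stopword dictionary), dropping whitespace-only tokens, stopwords, and pure punctuation looks like:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class TokenFilterSketch {
    // Hypothetical stopword set; HanLP would load this from its
    // stopword dictionary file instead.
    static final Set<String> STOPWORDS = Set.of("的", "了", "是");

    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.isBlank()) continue;            // drop whitespace-only tokens
            if (STOPWORDS.contains(t)) continue;  // drop stopwords
            // drop tokens made up entirely of punctuation/symbols
            if (t.chars().noneMatch(Character::isLetterOrDigit)) continue;
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("商品", " ", "的", ",", "服务");
        System.out.println(filter(tokens)); // [商品, 服务]
    }
}
```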
After setting stopWordDictionaryPath to stopwords_hanlp.txt, only a single space gets filtered; with two consecutive spaces, a [2020] token appears. The configuration is:

<analyzer type="index">
  <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
             enableIndexMode="true"
             stopWordDictionaryPath="/var/solr/stopwords_hanlp.txt" />
</analyzer>
The incorrect result is shown in the screenshot: [2020] appears in the middle. What character is "[2020]"?
I also tried filtering these characters with a solr.StopFilterFactory filter, but the problem remains: neither [20] nor [2020] is filtered out. The configuration is:

<analyzer type="index">
  <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
             enableIndexMode="true" />
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="/var/solr/stopwords_hanlp.txt" />
</analyzer>
In the end, the "space" token becomes the most frequent term in the index:
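One possible workaround (a sketch only, not verified against this setup): normalize whitespace before the tokenizer ever sees it, using Solr's PatternReplaceCharFilterFactory to collapse runs of whitespace into a single space:

```xml
<analyzer type="index">
  <!-- collapse runs of whitespace into one space before tokenization -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\s+" replacement=" " />
  <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
             enableIndexMode="true"
             stopWordDictionaryPath="/var/solr/stopwords_hanlp.txt" />
</analyzer>
```

With runs of spaces collapsed, the single-space case the stopword dictionary already handles should cover the rest.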