
ValueError: empty vocabulary; perhaps the documents only contain stop words #2

Open
drdlfy opened this issue Apr 20, 2018 · 12 comments


@drdlfy

drdlfy commented Apr 20, 2018

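Why do I get this error when I run the baseline as-is, without any changes? Thanks a lot for your answer!
The error is raised at cv.fit(data[feature]), when building the term-count vectors.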

@YouChouNoBB
Owner

Rerun it with Python 3.

@drdlfy
Author

drdlfy commented Apr 21, 2018

@YouChouNoBB OK, thanks a lot. I ran it several times last night and hit this problem every time; this morning I reran it and it worked. Why would that be?

@YouChouNoBB
Owner

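I looked into it.
For example, passing data like ['1 3','2 4'] triggers the problem: the standalone single digits get treated as stop words.
Passing data like ['12 34','265 3'] does not trigger it.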
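
A note on the likely mechanism: scikit-learn's `CountVectorizer` documents a default `token_pattern` of `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. Assuming `cv` here is a `CountVectorizer` with default settings (the thread does not show how it is constructed), single-character tokens would be dropped during tokenization rather than by a stop-word list. A minimal sketch that applies that pattern to the two example inputs using only Python's `re` module; the `tokenize` helper is illustrative and not part of sklearn:

```python
import re

# Default token pattern documented for sklearn's CountVectorizer:
# a token is two or more consecutive word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(doc, pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical helper: lowercase the document and extract tokens."""
    return re.findall(pattern, doc.lower())


# Every token is a single character, so nothing survives tokenization
# and the vocabulary would end up empty.
single = [tokenize(d) for d in ['1 3', '2 4']]

# Multi-character tokens survive; the lone '3' is still dropped.
multi = [tokenize(d) for d in ['12 34', '265 3']]

print(single)
print(multi)
```

With the first input no tokens remain at all, which is consistent with the "empty vocabulary" error; with the second, enough tokens remain for a vocabulary to be built.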

@drdlfy
Author

drdlfy commented Apr 23, 2018

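@YouChouNoBB OK, thanks a lot, but I still have one question: the data read in is the same every time, so why did the error show up in the earlier runs and not in the later ones?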

@YouChouNoBB
Owner

I'm not sure about that one.

@drdlfy
Author

drdlfy commented Apr 23, 2018

@YouChouNoBB OK, thanks a lot. I haven't run into the problem since.

@qioooo

qioooo commented Apr 24, 2018

I get the same thing running under Anaconda3 with Python 3.6. Could this be fixed in the code?

@klvn930815

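@YouChouNoBB Hi, a related question: the code filters out standalone single digits as stop words, but those digits carry real meaning in the data, so they shouldn't be filtered out, right?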

@YouChouNoBB
Owner

Yes. For data like this, I'd suggest handling it separately.
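
One possible way to handle such data, assuming `cv` is a scikit-learn `CountVectorizer` (the thread does not confirm how the baseline builds it): pass a `token_pattern` that also accepts single-character tokens. This is a sketch under that assumption, not necessarily what the baseline does:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the features are whitespace-separated IDs such as '1 3'.
docs = ['1 3', '2 4']

# r"(?u)\b\w+\b" matches tokens of one or more word characters,
# so standalone single digits are kept in the vocabulary.
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cv.fit(docs)

# Tokens learned by the vectorizer, sorted for a stable display order.
vocab = sorted(cv.vocabulary_)
print(vocab)

# Sparse document-term count matrix, one row per input document.
counts = cv.transform(docs)
print(counts.shape)
```

The design choice here is to change only the tokenization rule and leave everything else at its defaults, so that multi-character tokens are counted exactly as before.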

@drdlfy
Author

drdlfy commented Apr 25, 2018

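@YouChouNoBB @klvn930815 Hi both, why does the one-hot encoding treat standalone single digits as stop words? I couldn't find anything on this. Could you share a link to relevant material? Thanks a lot for your help.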

@YouChouNoBB
Owner

Try upgrading the sklearn package.

@drdlfy
Author

drdlfy commented Apr 26, 2018

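@YouChouNoBB Hi, I didn't quite follow. Would upgrading the sklearn package explain why the one-hot encoding treats standalone single digits as stop words? Is there a link to relevant material? Thanks a lot.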

4 participants