Several questions about data_pro.py #5

Open
zhhhzhang opened this issue Jun 15, 2020 · 8 comments

Comments

@zhhhzhang

Hello, and thank you for sharing this! I ran data_pro.py on other datasets, for example reviews_Amazon_Instant_Video_data, and it raised an error. I have a few questions:

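  1. index = data_test.index[data_test['item_id'] == iid].tolist()[0]: why does this take only element [0]? It does not remove every matching row from the test set; it removes only the first row with that item_id.
  2. data_train = pd.concat([data_train, data_test.iloc[uid_concat_index]]): iloc looks rows up by position, but from reading the code it seems the intent is to select rows by index label. This line raises an error; should it use loc instead?
  3. The data is split into train, test and val, but val is never used. What was the reasoning behind that?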
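
To make questions 1 and 2 concrete, here is a minimal sketch on a toy frame. The frame and its index labels are made up for illustration and are not taken from data_pro.py; only the two indexing expressions mirror the lines quoted above.

```python
import pandas as pd

# Toy test split whose index labels are NOT 0..n-1, as is typical after
# splitting a larger frame. All values here are illustrative.
data_test = pd.DataFrame(
    {"user_id": ["u1", "u2", "u3", "u4"],
     "item_id": ["i9", "i7", "i9", "i8"]},
    index=[10, 42, 57, 63],
)

iid = "i9"
mask = data_test["item_id"] == iid

# Question 1: [0] keeps only the first matching label.
first_label = data_test.index[mask].tolist()[0]
# Dropping the [0] keeps every matching label.
all_labels = data_test.index[mask].tolist()

# Question 2: these are index labels, so .loc is the matching accessor.
moved = data_test.loc[all_labels]

# .iloc treats the same numbers as positions; positions 10 and 57 do not
# exist in a 4-row frame, so pandas raises IndexError.
try:
    data_test.iloc[all_labels]
    iloc_failed = False
except IndexError:
    iloc_failed = True

# Remove the moved rows from the test split by label.
remaining = data_test.drop(index=all_labels)
```

With a default 0..n-1 index, labels and positions coincide and iloc appears to work, which may be why the problem only shows up on some datasets.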

Thanks!

@ShomyLiu
Owner

ShomyLiu commented Jun 15, 2020

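Hello, and thanks for raising these.
(1, 2) I will check these two places again. In the meantime, I suggest using the code from before PR 1:
https://github.com/ShomyLiu/Neu-Review-Rec/tree/9313266f307acbe504759fa0eddf0c562c524748

(3) val and test are split 50/50, so hyperparameter tuning and model validation were done on val. However, test was not then used for any further evaluation.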
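
For readers unfamiliar with this kind of split, a minimal sketch follows. Only the 50/50 division of the held-out data into val and test comes from the reply above; the function name, the 0.8 training fraction and the seed are assumptions for illustration and are not taken from data_pro.py.

```python
import pandas as pd


def split_train_val_test(data, train_frac=0.8, seed=2020):
    """Shuffle, take train_frac for training, halve the rest into val/test.

    train_frac and seed are illustrative defaults, not the repository's.
    """
    shuffled = data.sample(frac=1.0, random_state=seed)
    n_train = int(len(shuffled) * train_frac)
    held_out = shuffled.iloc[n_train:]
    n_val = len(held_out) // 2
    # Positional slicing is correct here: we slice the shuffled frame itself.
    return shuffled.iloc[:n_train], held_out.iloc[:n_val], held_out.iloc[n_val:]


ratings = pd.DataFrame({"user_id": range(100), "rating": [3] * 100})
train, val, test = split_train_val_test(ratings)
```

The usual practice is to tune on val and report the final number on test once.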

@ShomyLiu
Owner

ShomyLiu commented Jul 7, 2020

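@zhhhzhang Hello!
The newly released code has been revised, and the data processing problems are fixed.
In addition, evaluation on test has been added; val is used for validation.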

@zhhhzhang
Author

zhhhzhang commented Jul 7, 2020 via email

@PanTings

Hello, a beginner here with a probably silly question: which file or text corpus do the word embeddings in the pretrained word2vec.bin come from?

@ShomyLiu
Owner

@PanTings It is the commonly used Google word2vec.bin. The loading code is here:
https://github.com/ShomyLiu/Neu-Review-Rec/blob/master/pro_data/data_pro.py#L472

The word2vec.bin file itself can be downloaded online, for example from:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
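
As a rough illustration of loading such a file, the sketch below uses gensim's KeyedVectors.load_word2vec_format. It assumes gensim is installed; the helper name and the file path in the usage note are placeholders, and the linked line in data_pro.py may load the vectors differently.

```python
def load_pretrained_vectors(path, limit=None):
    """Load word2vec vectors stored in the binary word2vec format.

    `path` is a placeholder for wherever the downloaded file lives.
    `limit` optionally caps how many vectors are read, to save memory.
    """
    # Imported lazily so that this module still imports without gensim.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

A call would look like load_pretrained_vectors("path/to/word2vec.bin", limit=100000); the returned object can then be indexed by word to get that word's vector.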

@PanTings

Hello. Also, running data_pro raises an error while executing numerize: KeyError: '.....' (the reviewerID of the first record in the JSON file).

@ShomyLiu
Owner

@PanTings Hello, I cannot reproduce this on my side. Could you give more detail, such as which dataset you used and your runtime environment?

@PanTings

@ShomyLiu Good afternoon. I am using the Amazon review data and tried two categories, musical_instrument_5.json and digital_music_5.json; both fail in the same way. Environment: Windows 10, PyCharm 2019, Python 3.7, tf 2.2.0 (this function does not seem to use tf...).
===============End: rawData size========================
Traceback (most recent call last):
  File "data_pro.py", line 224, in <module>
    data = numerize(data)
  File "data_pro.py", line 34, in numerize
    uid = list(map(lambda x: user2id[x], data['user_id']))
  File "data_pro.py", line 34, in <lambda>
    uid = list(map(lambda x: user2id[x], data['user_id']))
KeyError: 'A3EBHHCZO6V2A4'  # this is the reviewerID of the first record in digital_music_5.json
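
The thread does not establish the cause of this KeyError. One way to narrow it down is to check whether the mapping was built from the same column it is later applied to. The sketch below assumes that user2id is meant to map every distinct value of data['user_id'] to an integer; the helper names and the toy frame are hypothetical and not taken from data_pro.py.

```python
import pandas as pd


def build_id_map(values):
    """Map each distinct value of a Series to a consecutive integer id."""
    return {v: i for i, v in enumerate(values.unique())}


def numerize_column(values, mapping):
    """Replace raw ids with integers, reporting every id missing from mapping."""
    missing = [v for v in values.unique() if v not in mapping]
    if missing:
        raise KeyError(
            f"{len(missing)} ids missing from mapping, e.g. {missing[:3]}"
        )
    return [mapping[v] for v in values]


# Toy data; only the first id is copied from the traceback above.
data = pd.DataFrame({"user_id": ["A3EBHHCZO6V2A4", "U2", "A3EBHHCZO6V2A4"]})
user2id = build_id_map(data["user_id"])
uid = numerize_column(data["user_id"], user2id)
```

If the very first record's id is missing, as in the traceback, the mapping was probably built from different data than the frame being numerized, for example another file, another column, or an empty result from an earlier step.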
