feature need: wechat or qq support need #2

Open
turswiming opened this issue Mar 28, 2023 · 5 comments

Comments

@turswiming

Thank you! I really enjoy it and can't wait to train a version of myself!
Could you add a feature to use message logs from WeChat or QQ?
Most of us in mainland China use these apps more frequently, so this feature would make the project usable for more people.

@ljsabc
Owner

ljsabc commented Mar 28, 2023

To be frank, this was initially a PoC project to demonstrate the ability to extrapolate (or interpolate) from general LLM knowledge to enjoy "personalities".

From my point of view, it's a good idea to parse any sort of text (or conversations), but I really don't have much time to build additional parsers.
Would you (or someone else) mind showing a proof of concept, or a repo that can parse WeChat/QQ dialogues, so that we can consider integrating it into this project?

I will leave this issue open and, if I get the chance, do some investigation myself.

@nahakyuu

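QQ chat logs can be exported with QQ's built-in export feature, which supports several formats.
The question is: what format do they need to be converted into?
Also, how should context in group chats be handled?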

@ljsabc
Owner

ljsabc commented Mar 29, 2023

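Essentially, only two things need to be done:

  • Determine which run of messages belongs to the same topic. You can split by time, or simply consider only the previous N messages; even if this produces false positives, it's not a big problem.
  • Know which messages were sent by you and which were sent by others.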
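
As a rough illustration of the first point, here is a minimal sketch of time-based splitting. The `Message` fields, the `split_sessions` helper, and the 30-minute gap are all assumptions made for this example; they are not part of this project or of any QQ/WeChat export format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    sender: str      # who sent the message (hypothetical field name)
    time: datetime   # when it was sent
    text: str        # message body


def split_sessions(messages, max_gap=timedelta(minutes=30)):
    """Group messages into sessions, starting a new one after a long silence.

    The 30-minute default is an arbitrary assumption; tune it for your data.
    """
    sessions = []
    for msg in sorted(messages, key=lambda m: m.time):
        # Continue the current session if the gap to the previous message is small.
        if sessions and msg.time - sessions[-1][-1].time <= max_gap:
            sessions[-1].append(msg)
        else:
            sessions.append([msg])
    return sessions
```

Since each message keeps its `sender`, the second point reduces to comparing `sender` against your own account name.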

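Once these two things are settled, the problem becomes much easier. Within each group of messages that belongs to the same topic:

  1. Randomly select one of your own messages.
  2. Randomly select N chat messages before that message (N should not be too large; I suggest drawing it from a Poisson distribution, since too large an N also makes the network hard to train) and join them with "\n" to form the instruction. If you have a better way to generate the instruction or to concatenate the strings, even better.
  3. The preceding N messages may include your own messages.
  4. Use the randomly selected message of yours as the response.
  5. Prepare the dataset according to the requirements in Sample file requirements #1.

Note that if you build the instruction with something other than "\n" concatenation, you must build it the same way at test (inference) time.
That's the general idea. As for which export format to use, it depends on which one is easiest to parse while still satisfying the two requirements above.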
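
The sampling steps above could be sketched roughly as follows. This is only an illustration under several assumptions: a session is a chronological list of `(sender, text)` tuples, "N messages before" is read as the N immediately preceding messages, and the `instruction`/`response` keys and helper names are hypothetical. The actual file layout should follow Sample file requirements #1.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (stdlib only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def build_sample(session, me, rng, mean_context=3.0, sep="\n"):
    """Build one instruction/response pair from a single session.

    session: chronological list of (sender, text) tuples (assumed shape).
    mean_context: Poisson mean for N; 3.0 is an arbitrary choice.
    """
    # Only own messages with at least one earlier message can be a response.
    candidates = [i for i, (sender, _) in enumerate(session)
                  if sender == me and i > 0]
    if not candidates:
        return None
    target = rng.choice(candidates)
    # Clamp N to [1, number of messages available before the target].
    n = max(1, min(target, sample_poisson(mean_context, rng)))
    context = session[target - n:target]  # may include own messages
    return {
        "instruction": sep.join(text for _, text in context),
        "response": session[target][1],
    }
```

Whatever separator is passed as `sep` here should also be used when building the prompt at inference time.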

@F1Justin

F1Justin commented Jun 9, 2023

I'd like to ask: would it be much easier to process if a SQL file of the QQ group chat records were provided?

@turswiming
Author

turswiming commented Jun 9, 2023 via email
