[Bug]: RR22_LOWCOMM_PSI_2PC is highly sensitive to jitter #137

Open
Fissure45 opened this issue May 30, 2024 · 8 comments

Fissure45 commented May 30, 2024

Issue Type

Usability

Modules Involved

PSI

Have you reproduced the bug with SPU HEAD?

No

Have you searched existing issues?

Yes

SPU Version

spu 0.8.0b0

OS Platform and Distribution

CentOS

Python Version

3.9

Compiler Version

No response

Current Behavior?

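While testing how different PSI algorithms behave under constrained network conditions, I found that RR22_LOWCOMM appears to be sensitive to jitter. At 10 Mbps, 180 ms of latency plus 45 ms of jitter lets the intersection task establish its connection, but the data transfer then runs into problems. Is this an implementation issue or a limitation of the algorithm?
I will run further tests with different latency and jitter values.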

Standalone code to reproduce the issue

See above.

Relevant log output

See above.

6fj (Member) commented Jun 3, 2024

Hi @Fissure45,

By jitter, do you mean that the maximum latency can reach 225 ms? Could you post the error log? Thanks.
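
As a quick check of the 225 ms figure in this question, here is a minimal sketch, assuming the jitter is applied symmetrically around the base delay (which is how tc netem treats its jitter argument by default). The function name `delay_range_ms` is hypothetical and only illustrates the arithmetic.

```python
def delay_range_ms(base_ms, jitter_ms):
    """Return (min, max) per-packet delay, assuming delay = base +/- jitter."""
    # Delay cannot go below zero, so clamp the lower bound.
    return (max(0, base_ms - jitter_ms), base_ms + jitter_ms)


low, high = delay_range_ms(180, 45)
print(f"per-packet delay between {low} ms and {high} ms")
```

Under that assumption, 180 ms latency with 45 ms jitter gives per-packet delays from 135 ms up to 225 ms.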

6fj (Member) commented Jun 3, 2024

Could you post the complete logs from both sides? Thanks.

anakinxc transferred this issue from secretflow/spu Jun 3, 2024

Fissure45 (Author) commented

Receiver:
[2024-05-30 14:52:07.199] [info] [launch.cc:164] LEGACY PSI config: {"psi_type":"RR22_LOWCOMM_PSI_2PC","receiver_rank":1,"broadcast_result":true,"input_params":{"path":"/opt/1000w.csv","select_fields":["id"]},"output_params":{"path":"/opt/tmp/1000w.csv","need_sort":true},"curve_type":"CURVE_25519","bucket_size":1048576}
[2024-05-30 14:52:07.199] [info] [bucket_psi.cc:400] bucket size set to 1048576
[2024-05-30 14:52:07.595] [info] [bucket_psi.cc:293] begin progress callback loop thread, interval:5000
[2024-05-30 14:52:07.595] [info] [bucket_psi.cc:252] Begin sanity check for input file: /opt/1000w.csv, precheck_switch:false
bucket psi config is protocol: RR22_LOWCOMM_PSI_2PC, broadcast_result: True, receiver_rank: 1, selected_fields: ['id'], precheck_input: False, output_sort: True, bucket_size: 1048576
id_0 = 10.218.184.238:1213
id_1 = 0.0.0.0:1213
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
progress callback ---- percentage: 0, total: 3, finished: 0, running: 0, description: Precheck, 0%
(Several more identical callback lines are omitted here.)

Sender:
[2024-05-30 08:52:07.418] [info] [launch.cc:164] LEGACY PSI config: {"psi_type":"RR22_LOWCOMM_PSI_2PC","receiver_rank":1,"broadcast_result":true,"input_params":{"path":"/opt/1000w.csv","select_fields":["id"]},"output_params":{"path":"/opt/tmp/1000w.csv","need_sort":true},"curve_type":"CURVE_25519","bucket_size":1048576}
[2024-05-30 08:52:07.418] [info] [bucket_psi.cc:400] bucket size set to 1048576
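
When comparing logs from the two parties, it can help to confirm that both sides launched with the same settings. The following is a minimal sketch, assuming each log contains a `LEGACY PSI config:` line in the format shown above. The helper name `extract_psi_config` and the abbreviated sample lines are illustrative only and are not part of the PSI library.

```python
import json
import re

# Assumes the "LEGACY PSI config: {...}" line format seen in the logs above.
CONFIG_RE = re.compile(r"LEGACY PSI config: (\{.*\})\s*$")


def extract_psi_config(log_text):
    """Return the first PSI config found in a log as a dict, or None."""
    for line in log_text.splitlines():
        match = CONFIG_RE.search(line)
        if match:
            return json.loads(match.group(1))
    return None


# Abbreviated sample lines (most config fields trimmed for brevity).
receiver_log = (
    '[2024-05-30 14:52:07.199] [info] [launch.cc:164] LEGACY PSI config: '
    '{"psi_type":"RR22_LOWCOMM_PSI_2PC","receiver_rank":1,"bucket_size":1048576}'
)
sender_log = (
    '[2024-05-30 08:52:07.418] [info] [launch.cc:164] LEGACY PSI config: '
    '{"psi_type":"RR22_LOWCOMM_PSI_2PC","receiver_rank":1,"bucket_size":1048576}'
)

receiver_cfg = extract_psi_config(receiver_log)
sender_cfg = extract_psi_config(sender_log)
print("configs match:", receiver_cfg == sender_cfg)
```

Running the same helper over the full log files from each side would show whether the two parties disagree on any field before the transfer starts.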

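lq0404510 commented

I simulated 10 Mbps with 180 ms latency plus 45 ms jitter and could not reproduce this on my side. Without the jitter constraint, does this algorithm complete the task normally for you?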

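Fissure45 (Author) commented

Yes. In the tests two days ago, with no jitter configured, the task completed normally at 10 Mbps + 180 ms latency. After adding jitter of 90 ms and of 45 ms, both runs failed. Note that because the data sets are large (100 million vs 10 million records), the 100 million records were split into 10 parts when the task was created, and the parts were launched automatically in sequence. With 90 ms jitter, the first subtask failed outright. With 45 ms jitter, the first subtask succeeded and the second failed, so the failure at 45 ms may be caused by jitter interfering with scheduling. If you are running a single task, you could try 90 ms or more of jitter together with a larger data set to reproduce it.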

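lq0404510 commented

Hi, with 180 ms latency plus 90 ms jitter and data sets of 100 million vs 1 million records, I still could not reproduce it. I use tc to simulate jitter; which tool do you use?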

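Fissure45 (Author) commented

I also use tc to simulate jitter, with jitter smoothing set to 25%. I have no further information to add for now; I will report back if the test conditions or results change.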
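
For anyone trying to reproduce this, the setup described in this thread can be sketched as a tc netem command. This is only a sketch under several assumptions: that the "jitter smoothing" of 25% corresponds to netem's delay correlation argument, that the 10 Mbps limit is applied through netem's `rate` option (the thread does not say how bandwidth was limited), and that the interface is named `eth0`. The helper `netem_command` is hypothetical, and the code only prints the command rather than running it.

```python
import shlex


def netem_command(dev, delay_ms, jitter_ms, correlation_pct, rate_mbit):
    """Build a `tc qdisc add ... netem` argument list (not executed here)."""
    return [
        "tc", "qdisc", "add", "dev", dev, "root", "netem",
        # netem delay syntax: base delay, jitter, optional correlation
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms", f"{correlation_pct}%",
        # Assumed way of limiting bandwidth; the thread does not specify one.
        "rate", f"{rate_mbit}mbit",
    ]


# Values from this thread: 180 ms latency, 45 ms jitter, 25%, 10 Mbps.
cmd = netem_command("eth0", 180, 45, 25, 10)
print(shlex.join(cmd))
```

Applying the printed command requires root privileges, and `tc qdisc del dev eth0 root` removes the rule afterwards. Swapping 45 for 90 gives the second configuration mentioned above.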

lq0404510 commented

OK, we will keep following this issue.
