PSI (private set intersection) fails in P2P mode #163

Open
zengjunjie525 opened this issue Nov 19, 2024 · 32 comments

Comments

@zengjunjie525

Issue Type

Running

Have you searched for existing documents and issues?

Yes

OS Platform and Distribution

Linux Ubuntu

All_in_one Version

kuscia:0.12.0b0

Module type

secretpad

Module version

1.10.0b1

What happened and what you expected to happen.

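In P2P mode, when the two parties run PSI (private set intersection) in a joint project, the task fails, but the page shows no error message. I need help locating the problem.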

Log output.

The page shows no error feedback.

@zengjunjie525
Author

0.log: this is the Kuscia pods log.
image

@zengjunjie525
Author

image

@lanyy9527

Using the task ID, please provide the engine logs from both parties: /home/kuscia/var/stdout/pods/alice_xxxx/xxx/*.log
For how to collect the logs, see: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.12.0b0/deployment/logdescription#id3
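
The lookup described above can be sketched in Python. This is only an illustration, not an official tool: it assumes the pod directory name (the alice_xxxx part of the path) contains the task ID, and the helper name find_task_logs is hypothetical.

```python
from pathlib import Path


def find_task_logs(pods_root: str, task_id: str) -> list[Path]:
    """Return *.log files whose pod directory name contains task_id.

    Assumption: the pod directory name (alice_xxxx in the path above)
    contains the task ID. Adjust the match if your layout differs.
    """
    root = Path(pods_root)
    if not root.is_dir():
        return []
    matches = []
    for pod_dir in sorted(root.iterdir()):
        if pod_dir.is_dir() and task_id in pod_dir.name:
            # Layout from the comment above: <pod_dir>/<container>/*.log
            matches.extend(sorted(pod_dir.glob("*/*.log")))
    return matches


if __name__ == "__main__":
    # Hypothetical task ID; replace it with the one shown in the UI.
    for log in find_task_logs("/home/kuscia/var/stdout/pods", "your-task-id"):
        print(log)
```

If the function returns nothing, there is no directory for that task on this node, which is itself worth reporting.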

@zengjunjie525
Author

I have uploaded the log. Could you take a look?

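@lanyy9527

What you provided above is not the SecretFlow (sf) log. To get the sf log, see:

Using the task ID, please provide the engine logs from both parties: /home/kuscia/var/stdout/pods/alice_xxxx/xxx/*.log
For how to collect the logs, see: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.12.0b0/deployment/logdescription#id3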

@zengjunjie525
Author

The log I provided came from this path: /home/kuscia/var/stdout/pods/alice_xxxx/xxx/*.log

@lanyy9527

The 0.log provided above is the dataproxy log. You need to use the task ID to find the sf log under the corresponding path.

@zengjunjie525
Author

Under /home/kuscia/var/stdout/pods there is only one entry. Inside it there is only dataproxy, which contains three log files. I cannot find any file matching the ID shown on the page.
image

@lanyy9527

Rerun the task and check whether the corresponding log appears.

@zengjunjie525
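Author

Rerunning produced no new files. /home/kuscia/var/stdout/pods still has only one entry, and it contains only dataproxy. The task I started will not finish: half an hour has passed with no result. Previously it failed after 5 minutes; now it never ends. This is only a test, with very little data and few columns.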
image

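@lanyy9527

lanyy9527 commented Nov 20, 2024

  1. In "My Organization" (我的机构), check whether both nodes show as available.
  2. In "Partner Nodes" (合作节点), check whether the communication status between the two partner nodes shows as available.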

@zengjunjie525
Author

Communication is fine. After rerunning, I still cannot find the log for the task you mentioned; there is still only the dataproxy directory.
image

@wangzul

wangzul commented Nov 20, 2024

Run kubectl get appimage and share the output.

@zengjunjie525
Author

image

@wangzul

wangzul commented Nov 21, 2024

image

Please also provide the kubectl get appimage output from the other party, and check whether logs and pods exist on that side. If they do, please add them here.

@zengjunjie525
Author

This is the other node:
8434e835ea937074eedd583e2fdcde97

@zengjunjie525
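Author

I checked the other side and it is the same: /home/kuscia/var/stdout/pods has only one entry, which contains only dataproxy with three log files. Do you need those log files? The dataproxy logs are not what you are after, right?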

@zimu-yuxi

Please paste the job details from both parties. Inside the container, run: kubectl get kj ddnc -oyaml -n cross-domain

@zengjunjie525
Author

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaJob
metadata:
  annotations:
    kuscia.secretflow/initiator: gausscode
    kuscia.secretflow/interconn-kuscia-parties: alice
    kuscia.secretflow/interconn-self-parties: gausscode
    kuscia.secretflow/self-cluster-as-initiator: "true"
  creationTimestamp: "2024-11-20T07:37:31Z"
  generation: 1
  name: ddnc
  namespace: cross-domain
  resourceVersion: "2077965"
  uid: 109d5f11-3f09-4719-9641-9616f869889c
spec:
  initiator: gausscode
  maxParallelism: 1
  scheduleMode: BestEffort
  tasks:
  - alias: ddnc-deaxvkkf-node-35
    appImage: secretflow-image
    parties:
    - domainID: alice
    - domainID: gausscode
    taskID: ddnc-deaxvkkf-node-35
    taskInputConfig: |-
      {
      "sf_datasource_config": {
      "alice": {
      "id": "default-data-source"
      },
      "gausscode": {
      "id": "default-data-source"
      }
      },
      "sf_cluster_desc": {
      "parties": ["alice", "gausscode"],
      "devices": [{
      "name": "spu",
      "type": "spu",
      "parties": ["alice", "gausscode"],
      "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
      }, {
      "name": "heu",
      "type": "heu",
      "parties": ["alice", "gausscode"],
      "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
      }],
      "ray_fed_config": {
      "cross_silo_comm_backend": "brpc_link"
      }
      },
      "sf_node_eval_param": {
      "domain": "data_prep",
      "name": "psi",
      "version": "0.0.8",
      "attr_paths": ["input/input_table_1/key", "input/input_table_2/key", "protocol", "sort_result", "allow_empty_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "allow_duplicate_keys/no/receiver_parties", "ecdh_curve"],
      "attrs": [{
      "is_na": false,
      "ss": ["secret_id"]
      }, {
      "is_na": false,
      "ss": ["id"]
      }, {
      "is_na": false,
      "s": "PROTOCOL_RR22"
      }, {
      "b": true,
      "is_na": false
      }, {
      "is_na": true
      }, {
      "is_na": false,
      "s": "no"
      }, {
      "is_na": true
      }, {
      "is_na": false,
      "ss": ["gausscode"]
      }, {
      "is_na": false,
      "s": "CURVE_FOURQ"
      }],
      "inputs": [{
      "type": "sf.table.individual",
      "meta": {
      "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
      "line_count": "-1"
      },
      "data_refs": [{
      "uri": "高科地址反欺诈结果_20240823_result_test_v3_1786516806.csv",
      "party": "alice",
      "format": "csv"
      }]
      }, {
      "type": "sf.table.individual",
      "meta": {
      "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
      "line_count": "-1"
      },
      "data_refs": [{
      "uri": "mayi_test_data1_810807131.csv",
      "party": "gausscode",
      "format": "csv"
      }]
      }],
      "checkpoint_uri": "ckddnc-deaxvkkf-node-35-output-0"
      },
      "sf_output_uris": ["ddnc_deaxvkkf_node_35_output_0"],
      "sf_input_ids": ["pttfunwz", "yzanligj"],
      "sf_input_partitions_spec": ["", ""],
      "sf_output_ids": ["ddnc-deaxvkkf-node-35-output-0"],
      "table_attrs": [{
      "table_id": "yzanligj",
      "column_attrs": [{
      "col_name": "id",
      "col_type": "id"
      }, {
      "col_name": "current_ranking",
      "col_type": "feature"
      }, {
      "col_name": "risk_score",
      "col_type": "feature"
      }]
      }, {
      "table_id": "pttfunwz",
      "column_attrs": [{
      "col_name": "secret_id",
      "col_type": "id"
      }, {
      "col_name": "lr_conversion",
      "col_type": "label"
      }]
      }]
      }
    tolerable: false
  - alias: ddnc-deaxvkkf-node-36
    appImage: secretflow-image
    parties:
    - domainID: alice
    taskID: ddnc-deaxvkkf-node-36
    taskInputConfig: |-
      {
      "sf_datasource_config": {
      "alice": {
      "id": "default-data-source"
      }
      },
      "sf_cluster_desc": {
      "parties": ["alice"],
      "devices": [{
      "name": "spu",
      "type": "spu",
      "parties": ["alice"],
      "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
      }, {
      "name": "heu",
      "type": "heu",
      "parties": ["alice"],
      "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
      }],
      "ray_fed_config": {
      "cross_silo_comm_backend": "brpc_link"
      }
      },
      "sf_node_eval_param": {
      "domain": "stats",
      "name": "groupby_statistics",
      "version": "1.0.0",
      "attr_paths": ["input/input_ds/by", "aggregation_config", "max_group_size"],
      "attrs": [{
      "is_na": false,
      "ss": ["secret_id"]
      }, {
      "s": "{\n \"column_queries\": [{\n \"column_name\": \"lr_conversion\",\n \"function\": \"MEAN\"\n }, {\n \"column_name\": \"lr_conversion\",\n \"function\": \"SUM\"\n }]\n}"
      }, {
      "i64": 10000.0,
      "is_na": false
      }],
      "inputs": [{
      "type": "sf.table.individual",
      "meta": {
      "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
      "line_count": "-1"
      },
      "data_refs": [{
      "uri": "高科地址反欺诈结果_20240823_result_test_v3_1786516806.csv",
      "party": "alice",
      "format": "csv"
      }]
      }],
      "checkpoint_uri": "ckddnc-deaxvkkf-node-36-output-0"
      },
      "sf_output_uris": ["ddnc_deaxvkkf_node_36_output_0"],
      "sf_input_ids": ["pttfunwz"],
      "sf_input_partitions_spec": [""],
      "sf_output_ids": ["ddnc-deaxvkkf-node-36-output-0"],
      "table_attrs": [{
      "table_id": "yzanligj",
      "column_attrs": [{
      "col_name": "id",
      "col_type": "id"
      }, {
      "col_name": "current_ranking",
      "col_type": "feature"
      }, {
      "col_name": "risk_score",
      "col_type": "feature"
      }]
      }, {
      "table_id": "pttfunwz",
      "column_attrs": [{
      "col_name": "secret_id",
      "col_type": "id"
      }, {
      "col_name": "lr_conversion",
      "col_type": "label"
      }]
      }]
      }
    tolerable: false
status:
  approveStatus:
    gausscode: JobAccepted
  conditions:
  - lastTransitionTime: "2024-11-20T07:37:31Z"
    status: "True"
    type: JobValidated
  lastReconcileTime: "2024-11-20T07:37:31Z"
  phase: AwaitingApproval
  stageStatus:
    gausscode: JobCreateStageSucceeded
  startTime: "2024-11-20T07:37:31Z"
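
One way to read the status block above is to compare the parties named in spec.tasks with the entries in status.approveStatus. The Python sketch below does that on a trimmed copy of the pasted fields. Treating a missing approveStatus entry as "not yet accepted" is an assumption of this sketch rather than documented Kuscia behaviour, and the function name is hypothetical.

```python
def parties_pending_approval(job: dict) -> list[str]:
    """List parties named in spec.tasks that have not recorded JobAccepted."""
    parties = []
    for task in job.get("spec", {}).get("tasks", []):
        for party in task.get("parties", []):
            domain = party.get("domainID")
            if domain and domain not in parties:
                parties.append(domain)
    approvals = job.get("status", {}).get("approveStatus", {})
    # Assumption: a party without a JobAccepted entry has not accepted yet.
    return [p for p in parties if approvals.get(p) != "JobAccepted"]


# Trimmed copy of the fields from the output above.
job = {
    "spec": {"tasks": [
        {"alias": "ddnc-deaxvkkf-node-35",
         "parties": [{"domainID": "alice"}, {"domainID": "gausscode"}]},
        {"alias": "ddnc-deaxvkkf-node-36",
         "parties": [{"domainID": "alice"}]},
    ]},
    "status": {"phase": "AwaitingApproval",
               "approveStatus": {"gausscode": "JobAccepted"}},
}

print(job["status"]["phase"], parties_pending_approval(job))
# prints: AwaitingApproval ['alice']
```

For this job the sketch reports alice, the party on whose side no ddnc job or engine pod was found.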

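@zimu-yuxi

1. What does the network topology between the two parties look like? Is there a gateway or proxy? You can refer to the linked guide to check whether the network has a problem.
2. Check resource usage with docker stats, or with top inside the container. You can delete some jobs in the failed, running, pending, or AwaitingApproval state, along with their pods: kubectl delete kj <job-name> -n cross-domain and kubectl delete pod <pod-name> (take care not to delete the dataproxy pod).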
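
The cleanup in step 2 could be prepared as a dry run before anything is deleted. The Python sketch below only builds the kubectl delete kj commands and prints them. It assumes the job list was exported with kubectl get kj -n cross-domain -o json and that each item carries metadata.name and status.phase as in the YAML pasted earlier. The sample input and the function name are hypothetical.

```python
import json

# Phases named in the comment above, compared case-insensitively.
STALE_PHASES = {"failed", "running", "pending", "awaitingapproval"}


def delete_commands(kj_list: dict, namespace: str = "cross-domain") -> list[str]:
    """Build (but do not run) one kubectl delete command per stale job."""
    commands = []
    for item in kj_list.get("items", []):
        name = item.get("metadata", {}).get("name")
        phase = str(item.get("status", {}).get("phase", "")).lower()
        if name and phase in STALE_PHASES:
            commands.append(f"kubectl delete kj {name} -n {namespace}")
    return commands


# Hypothetical input shaped like `kubectl get kj -n cross-domain -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "ddnc"}, "status": {"phase": "AwaitingApproval"}},
  {"metadata": {"name": "example-done"}, "status": {"phase": "Succeeded"}}
]}
""")

for cmd in delete_commands(sample):
    print(cmd)
# prints: kubectl delete kj ddnc -n cross-domain
```

Printing the commands first leaves room to review the list, for example to keep a job that is still meant to be running.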

@zengjunjie525
Author

On the other party, there is no ddnc job in cross-domain.

@zimu-yuxi

Run kubectl get kj -A on both sides and share screenshots.

@zengjunjie525
Author

image

@zengjunjie525
Author

The other party's:
01b4e7c91a0de2a609293200065a2bc4

@zimu-yuxi

1. Are the two parties' networks directly connected?
2. On the side that does not have ddnc, check the output of top.

@zengjunjie525
Author

1. The two servers are on the same LAN.
2. 2f901432ed41659c89ae0cbf09cae201

@zimu-yuxi

On both parties' Kuscia containers, run docker update --memory 12g --memory-swap 12g, then docker restart both containers.

@zengjunjie525
Author

Do both of these containers need the update?
image

@zengjunjie525
Author

I updated both containers and reran the task. It still fails.

@zimu-yuxi

What is the current task status?

@zengjunjie525
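Author

What is the current task status?

The same as in the screenshot above. The UI marks the task as failed after 5 minutes. I also added a group-by statistics component, and that did not succeed either.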

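@lanyy9527

Your problem may involve complex network and system factors that are hard to pinpoint here. For more efficient support, please add the official WeChat account (secretflow8) so we can discuss and resolve it in detail.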
