fix bug error do not import #647

seiriosPlus · 2018-03-19T03:57:49Z

Yancey1989 · 2018-03-19T05:33:33Z

doc/pipe_reader_cn.md

@@ -0,0 +1,66 @@
+# PIPE_READER: A Guide to Reading HDFS Data
+> pipe_reader reads data as a stream, uses a predefined parser to convert the data into the required format, and returns the data through Python's yield


It's an introduction section, so please remove >.

Yancey1989 · 2018-03-19T05:34:13Z

doc/pipe_reader_cn.md

@@ -0,0 +1,66 @@
+# PIPE_READER: A Guide to Reading HDFS Data
+> pipe_reader reads data as a stream, uses a predefined parser to convert the data into the required format, and returns the data through Python's yield


parser => `parser`

Yancey1989 · 2018-03-19T05:38:45Z

doc/pipe_reader_cn.md

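+### Data preparation
+1. Follow the cluster version's data preparation method to [prepare the data](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E5%87%86%E5%A4%87%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE)
+2. Prepare data files that control the concurrency of the cluster job; **the number of these files must match the number prepared in step 1**
+When training on a cluster, data is distributed across machines with the file as the smallest unit, and the cluster version of paddle decides how many nodes to start for training based on the number of files. paddle's current data reading scheme downloads the user-specified data, file by file, to the nodes in the cluster and then reads it locally, but with pipe_reader we can read data directly from HDFS as a stream, so we need to prepare two copies of the training data. One copy is given to paddle so it can allocate cluster execution nodes (paddle downloads this copy to the node's local disk); the other copy is the actual training data, which is handed to pipe_reader to read.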


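the cluster version of paddle decides how many nodes to start for training based on the number of files

PaddleCloud uses the parallelism parameter to decide how many trainer nodes to start. Preparing two copies of the data should also no longer be necessary.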

Yancey1989 · 2018-03-19T05:39:42Z

doc/pipe_reader_cn.md

+        event_handler=event_handler)
+```
+### Submit the cluster training job
+Same as [submitting a cluster training job via Receiver](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1)


This is not submitting a job through a receiver; PaddleCloud has no concept of a receiver. You can simply write: submit the job with paddlectl.

Yancey1989 · 2018-03-19T05:40:01Z

doc/pipe_reader_cn.md

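+### Submit the cluster training job
+Same as [submitting a cluster training job via Receiver](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E6%8F%90%E4%BA%A4%E4%BB%BB%E5%8A%A1)
+**Note:**
+--train_data_path ${HDFS_train_path}: this parameter must point to the data directory that is used to determine how many cluster nodes execute the job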


There is no train_data_path parameter.

seiriosPlus · 2018-03-19T07:38:49Z

I have fixed all the mistakes you mentioned before @Yancey1989

Yancey1989 · 2018-03-19T07:43:41Z

Thanks @seiriosPlus, I'll take a look right now.

Yancey1989 · 2018-03-19T07:44:27Z

doc/pipe_reader_cn.md

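+### Data preparation
+1. Follow the cluster version's data preparation method to [prepare the data](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E5%87%86%E5%A4%87%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE)
+### Code preparation
+1. Add the pipe_reader implementation to your code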


Judging from the code, shouldn't this be PipeReader rather than pipe_reader?

Yancey1989 · 2018-03-19T07:47:04Z

doc/pipe_reader_cn.md

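+### Code preparation
+1. Add the pipe_reader implementation to your code
+**Because of the paddle version on the cluster**, the cluster version of paddle does not yet include a pipe_reader implementation; you can add the pipe_reader implementation to your code yourself
+The pipe_reader code that needs to be added to **your own** code can be found in **api_train_pipe_reader.py**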


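The cloud repo doesn't contain api_train_pipe_reader.py either, does it?

Also, a paddle version problem should not exist on PaddleCloud, because paddle can be upgraded by updating the image, which is fairly convenient.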

Yancey1989 · 2018-03-19T07:59:13Z

doc/pipe_reader_cn.md

@@ -0,0 +1,47 @@
+# PIPE_READER: A Guide to Reading HDFS Data
+pipe_reader reads data as a stream, uses a predefined `parser` to convert the data into the required format, and returns the data through Python's yield


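pipe_reader reads data as a stream, uses a predefined parser to convert the data into the required format, and returns the data through Python's yield

=>

PaddlePaddle provides Readers for reading training data. On PaddlePaddle Cloud, you can use PipeReader to read data as a stream; through the left_cmd and parser parameters, you can customize your command-line arguments and your data processing logic.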
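For readers of this thread, the streaming pattern under discussion can be sketched in plain Python. This is only an illustration under stated assumptions: `pipe_reader`, `parse_line`, the `bufsize` argument, and the tab-separated record format are hypothetical names and choices made for the example, not the PipeReader implementation in PaddlePaddle. The sketch runs a command, reads its standard output in chunks, and yields one parsed record per line.

```python
import subprocess
import sys


def pipe_reader(command, parser, bufsize=8192):
    """Return a reader creator that streams `command`'s stdout through `parser`.

    Hypothetical stand-in for illustration; not PaddlePaddle's PipeReader.
    """
    def reader():
        proc = subprocess.Popen(command, stdout=subprocess.PIPE)
        remained = b""
        try:
            while True:
                chunk = proc.stdout.read(bufsize)
                if not chunk:
                    break
                lines = (remained + chunk).split(b"\n")
                remained = lines.pop()  # last piece may be an incomplete line
                for line in lines:
                    yield parser(line.decode("utf-8").rstrip("\r"))
            if remained:
                # Final record when the stream does not end with a newline.
                yield parser(remained.decode("utf-8").rstrip("\r"))
        finally:
            proc.stdout.close()
            proc.wait()
    return reader


def parse_line(line):
    """Example parser: split a tab-separated 'label<TAB>text' record."""
    label, text = line.split("\t", 1)
    return int(label), text


# Any command that writes records to stdout works here; when reading from
# HDFS, a `hadoop fs -cat ...` argument list would take its place.
demo_command = [sys.executable, "-c",
                "print('1\\tgood movie'); print('0\\tbad movie')"]
samples = list(pipe_reader(demo_command, parse_line)())
```

Because the reader is a generator, records are parsed as they arrive and the full file never has to be held in memory or downloaded to local disk first.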

Yancey1989 · 2018-03-19T08:07:00Z

doc/pipe_reader_cn.md

+        return ret
+```
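+3. Specify the pipe_reader parameters
+To read a text-format HDFS file, besides specifying pipe_reader you also need to specify left_cmd, which is used to read the data stream. Every node in the cluster is assigned a node_id, numbered in order starting from 0; for example, with 3 training files the node_ids are 0, 1, and 2.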


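To read a text-format HDFS file, besides specifying pipe_reader you also need to specify left_cmd, which is used to read the data stream

Looking at the code, there is no cmd_left parameter either; should it be command?

Also, this sentence does not read very smoothly. Could it be changed to:

You can read data on HDFS by specifying the command parameter?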

Yancey1989 · 2018-03-19T08:08:46Z

doc/pipe_reader_cn.md

+        return ret
+```
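+3. Specify the pipe_reader parameters
+To read a text-format HDFS file, besides specifying pipe_reader you also need to specify left_cmd, which is used to read the data stream. Every node in the cluster is assigned a node_id, numbered in order starting from 0; for example, with 3 training files the node_ids are 0, 1, and 2.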


It would be best to explain how to obtain node_id.
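One way the node_id-to-file mapping could look is sketched below. It assumes the cluster exports each trainer's zero-based index in an environment variable; the name `NODE_ID` and all three helper functions are hypothetical and should be replaced with whatever the platform actually provides. The `part-00000` style file naming and the data directory are taken from the `hadoop fs -cat` example in the document under review.

```python
import os


def node_id_from_env(var_name="NODE_ID", default=0):
    """Read this trainer's zero-based index from an environment variable.

    `NODE_ID` is a hypothetical variable name; substitute the one the
    cluster actually exports.
    """
    return int(os.environ.get(var_name, default))


def part_file_for_node(data_dir, node_id):
    """Map node 0, 1, 2, ... to part-00000, part-00001, part-00002, ..."""
    return "{}/part-{:05d}".format(data_dir.rstrip("/"), node_id)


def cat_command(hdfs_path):
    """Build the argument list that streams one HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", hdfs_path]


train_dir = "/paddle/cluster_demo/text_classification/data/train"
command = cat_command(part_file_for_node(train_dir, node_id_from_env()))
```

With three training files, nodes 0, 1, and 2 would each stream a different part file, so no two trainers read the same data.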

Yancey1989 · 2018-03-19T08:11:37Z

doc/pipe_reader_cn.md

+```
+hadoop fs -Dfs.default.name=hdfs://hadoop.com/:54310 -Dhadoop.job.ugi=name,password -cat /paddle/cluster_demo/text_classification/data/train/part-00000
+```
+4. Specify the reader for train


Use PipeReader in your code
You can pass PipeReader to trainer.train or trainer.test, or use it as an argument to paddle.batch, for example:

...
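As a rough illustration of the batching pattern mentioned in this comment, the sketch below wraps a reader creator so that it yields lists of samples. It is a plain-Python stand-in: the `batch` function here is not `paddle.batch` itself, and `toy_reader` is a hypothetical placeholder for a PipeReader-style reader.

```python
def batch(reader, batch_size):
    """Wrap a reader creator so it yields lists of up to `batch_size` samples.

    Plain-Python illustration of the batching pattern; not paddle.batch itself.
    """
    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # emit the final, possibly smaller, batch
            yield current
    return batch_reader


def toy_reader():
    """Hypothetical stand-in for a PipeReader-style reader creator."""
    for i in range(5):
        yield (i % 2, "sample-%d" % i)


batches = list(batch(toy_reader, 2)())
```

The wrapper only needs a callable that returns an iterator of samples, which is why a streaming reader can be substituted for a file-based one without changing the training loop.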

Yancey1989 · 2018-03-19T08:12:01Z

doc/pipe_reader_cn.md

+        num_passes=30,
+        event_handler=event_handler)
+```
+### Submit the cluster training job


This section can be removed; add a link to this document in the usage doc instead.

seiriosPlus · 2018-03-19T10:34:08Z

I have fixed some mistakes in the PipeReader Chinese guide. Thanks for the review, @Yancey1989.

CLAassistant · 2020-03-23T12:45:32Z

All committers have signed the CLA.

seiriosPlus added 2 commits March 19, 2018 11:54

fix bug error do not import

b9d68f8

pipe_reader chinese doc

1a575a2

seiriosPlus requested a review from Yancey1989 March 19, 2018 05:30

Yancey1989 reviewed Mar 19, 2018


pipe_reader chinese doc

14a3902

Yancey1989 reviewed Mar 19, 2018


pipe_reader chinese doc optimized

6b06ce5

seiriosPlus added 2 commits March 20, 2018 11:10

fix PaddlePaddle#646

fda893f

fix some format mistakes

5d49eec

seiriosPlus closed this Aug 20, 2020
