Uniform documentation and example Notebooks for all transforms! #753

shahrokhDaijavad · 2024-10-29T20:51:11Z

Search before asking

I searched the issues and found no similar issues.

Component

Other

Feature

This is a "super-issue" that will affect all transforms! Each transform owner will be assigned two tasks for the transform they own:

Better documentation of each transform, based on a given template (higher priority task)
An example notebook for every transform. The notebooks should be simple to use by taking in the user data, calling the API, and showing the output result. All other code (extra imports, parameter settings) should be hidden away. This will be done based on a notebook template.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

touma-I · 2024-11-05T16:50:35Z

@shahrokhDaijavad can you please follow-up with the individual transform owners.

Bytes-Explorer · 2024-11-06T12:35:43Z

shahrokhDaijavad · 2024-11-06T15:50:40Z

This is a good first batch, @Bytes-Explorer. The owners of these transforms are: Boris, Michele, Sung, Tsuzuku-san, and Yang Zhao. We can go ahead and assign them, but we need to discuss what (and a timeline) to expect if they are doing other projects at the moment.

Bytes-Explorer · 2024-11-06T15:59:14Z

I believe you already have the documentation template, which should help all the owners fill it in. Btw, Constantin has taken over all the work from Boris.

shahrokhDaijavad · 2024-11-06T21:59:47Z

@cmadam, @dolfim-ibm, @sungeunan-ibm, @dtsuzuku-ibm and @ian-cho: We have started a significant effort to simplify the use of DPK for first-time users.
A high-priority item is to have better and more unified documentation (README files) for each transform. Beyond that, we want an example Jupyter notebook for each transform. For the second step, we will develop a template notebook and share it with you later. But first things first!

For the first step, we have a template for what we think should be in the README for each transform (attached).
DPK_Documentation_template.docx

You are the owners of the first batch of transforms (list below), so I am going to assign each of you the documentation task for your transform, with a target date of Nov. 22. Please use your good judgment to do this, based on what the current README has and what the common template is trying to achieve. If current work commitments prevent you from doing this, please comment and suggest a way forward (e.g., a later date, a different person to assign this task to, etc.).

Owners:
Exact dedup, Fuzzy dedup, doc_id: Constantin
PDF2Parquet, Doc_chunk, text_encoder: Michele
HTML2Parquet: Sung
Doc_quality: Tsuzuku-san
HAP: Yang Zhao

shahrokhDaijavad · 2024-11-11T18:05:12Z

Hi, @cmadam, @dolfim-ibm, @sungeunan-ibm and @ian-cho . I just looked at the PR @dtsuzuku-ibm has submitted for the documentation of Doc_quality (PR #790) and if you haven't started doing this, you can use that README as a model (easier than the template above).

shahrokhDaijavad · 2024-11-11T18:09:57Z

BTW, we are working towards a template for the Jupyter notebook in this issue #754, and we will make it more solid in the next couple of days as a model to follow.

dolfim-ibm · 2024-11-13T17:05:46Z

@shahrokhDaijavad here we go #800

shahrokhDaijavad · 2024-11-14T23:27:54Z

@dtsuzuku-ibm, @cmadam, @dolfim-ibm, @sungeunan-ibm and @ian-cho (cc: @agoyal26 and @touma-I):
We have been discussing with @dtsuzuku-ibm and @dolfim-ibm, who have finished the documentation of their transforms in PR #790 and PR #800, whether to add some example code to the README file. Based on that discussion, we think we should add a simple example Notebook and link to it from the README now (combining steps 1 and 2 above) rather than wait for a "perfect" Notebook template. Having this Notebook example obviates the need for code in the README. The template for this "minimal" Notebook is this example Notebook that Maroun did for html2parquet: https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb. You should adapt this notebook to your transform and add some explanation of what each cell does. The notebook goes in a directory named "notebooks", alongside the python, ray, ... directories for your transform.
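To track which transforms still lack a notebook, something like the following could be used. This is only a minimal sketch, not an existing DPK tool: the function name `find_missing_notebooks` is hypothetical, and it assumes the `transforms/<category>/<transform>/notebooks/` layout seen in the html2parquet link above.

```python
from pathlib import Path


def find_missing_notebooks(transforms_root: Path) -> list[str]:
    """Return transform directories that have no notebooks/*.ipynb file.

    Assumption: transforms live at transforms_root/<category>/<transform>/,
    as in the html2parquet example path. Adjust the glob if the layout differs.
    """
    missing = []
    for transform_dir in sorted(transforms_root.glob("*/*")):
        if not transform_dir.is_dir():
            continue
        notebooks = transform_dir / "notebooks"
        # A transform passes only if notebooks/ exists and holds an .ipynb file.
        has_notebook = notebooks.is_dir() and any(notebooks.glob("*.ipynb"))
        if not has_notebook:
            missing.append(transform_dir.relative_to(transforms_root).as_posix())
    return missing
```

Calling `find_missing_notebooks(Path("transforms"))` from the repository root would then list the transforms that still need an example notebook.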

Bytes-Explorer · 2024-11-15T05:02:22Z

@shahrokhDaijavad Should we not keep all notebooks in the example folder? Readme can have the link.

shahrokhDaijavad · 2024-11-15T06:04:55Z

@Bytes-Explorer I think we should use the example folder for "use cases" that use a sequence of transforms to showcase that use case, e.g., RAG, fine-tuning, etc. IMHO, a single-function Notebook that only shows how to use the transform belongs to the directory of that transform, as it complements the README file of that transform with some real code. If you have a good reason for putting all these single-function notebooks in the examples folder, I change my opinion easily!

dolfim-ibm · 2024-11-15T08:23:20Z

Do I get it right that such a notebook will run only when the transform (in its latest state) is published to PyPI?

When changing the transform, we need to wait for the (pre)release before we can update the notebook, right?

Bytes-Explorer · 2024-11-15T09:08:55Z

@shahrokhDaijavad I see where you are coming from. My point of view is that if all examples are in the same folder like this one, a beginner has one place to look for things and get started. Application-specific examples can be in subfolders, like this one and this one.

touma-I · 2024-11-15T12:05:51Z

@dolfim-ibm We could set up the venv environment for running the notebook based on either pip install or make venv. I have the feeling most developers will want to use make venv to do any testing of their notebooks and won't want to hit possible issues that the packaging may introduce. I would keep it confined to the specific transform, using make venv for that transform.

touma-I · 2024-11-15T12:10:54Z

@shahrokhDaijavad @Bytes-Explorer I would lean toward keeping this in the transform folder and not requiring the developer to make it work with Colab. What we are asking here is very specific: the transform owner should think through how folks use their transform in a notebook and reduce as much as possible the number of variables that developers need to deal with.

shahrokhDaijavad added the enhancement New feature or request label Oct 29, 2024

shahrokhDaijavad mentioned this issue Oct 29, 2024

Template for single transform notebook examples #754

Closed

2 tasks

touma-I assigned shahrokhDaijavad Nov 5, 2024

touma-I added the simplify-DPK label Nov 6, 2024

Bytes-Explorer assigned agoyal26 Nov 6, 2024

shahrokhDaijavad assigned cmadam, dolfim-ibm, sungeunan-ibm, dtsuzuku-ibm and ian-cho Nov 6, 2024

dtsuzuku-ibm added a commit that referenced this issue Nov 11, 2024

update readme following template #753 (comment)

f2c5a83

dtsuzuku-ibm mentioned this issue Nov 11, 2024

Doc Quality Transform: update readme and add sample notebook #790

Merged

dtsuzuku-ibm added a commit that referenced this issue Nov 11, 2024

update readme following template #753 (comment)

1a70530

Signed-off-by: Daiki Tsuzuku <[email protected]>

dolfim-ibm mentioned this issue Nov 13, 2024

Update README docs for language transforms #800

Merged

sungeunan-ibm mentioned this issue Nov 19, 2024

Html2Parquet Updated README and Added Sample Notebook #815

Merged

ian-cho mentioned this issue Nov 22, 2024

HAP transform: Update README.md and add sample notebook #821

Open

cmadam added a commit that referenced this issue Nov 26, 2024

Update doc to follow template in issue #753

6538218

Signed-off-by: Constantin M Adam <[email protected]>

cmadam mentioned this issue Nov 26, 2024

Update doc for doc_id and ededup to follow template in issue #753 #836

Open

cmadam added a commit that referenced this issue Nov 26, 2024

Update doc to follow template in issue #753

0f96b61

Signed-off-by: Constantin M Adam <[email protected]>
