Create a standardized form testing dataset #287

arinkulshi-skylight · 2024-10-07T05:20:01Z

Description

Here is the actual dataset.
https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd
You can download the data from Google Drive and drag and drop the two folders, images and ground-truth. The script above will also pull the data from the individual datasets on Hugging Face.

Screenshots (if applicable)

Related Issues

[Link any related issues or tasks from your project management system.]

Checklist

The title of this PR is descriptive and concise.
My changes follow the style guidelines of this project.
I have added or updated test cases to cover my changes.
I've let the team know about this PR by linking it in the review channel.

schreiaj

Please add a section in the readme about running this (and maybe interpreting results?)

Additionally, do we need to update the dependencies to include `datasets`? I don't seem to have it installed inside my Poetry shell.

schreiaj · 2024-10-07T19:46:41Z

OCR/ocr/reportvision-dataset-1/medical_report_import.py

+for split in dataset.keys():
+    split_data = dataset[split]
+    for example in split_data:
+        unique_id = generate_unique_random_number()


Why not use the UUID package for this? While collisions are unlikely, they are possible with this implementation.

For that matter, why not just do this sequentially?

That is a good point. I was using the unique IDs because the files kept getting overwritten. I have edited the script to generate IDs sequentially, which should address the issue.
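For reference, a minimal sketch of the two options discussed in this thread: sequential IDs and the standard-library `uuid` module. The helper names and the `report_` prefix are hypothetical and are not taken from the merged script.

```python
import itertools
import uuid


def sequential_names(prefix, suffix, start=1):
    """Yield prefix_000001suffix, prefix_000002suffix, ... so names never repeat."""
    for n in itertools.count(start):
        yield f"{prefix}_{n:06d}{suffix}"


def uuid_name(prefix, suffix):
    """Alternative: build the name from a random UUID4 (32 hex characters)."""
    return f"{prefix}_{uuid.uuid4().hex}{suffix}"


names = sequential_names("report", ".png")
first_three = [next(names) for _ in range(3)]
print(first_three)
# ['report_000001.png', 'report_000002.png', 'report_000003.png']
```

Sequential names are deterministic and sort in creation order, but the counter has to be carried across datasets so that a second import does not overwrite the first.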

arinkulshi-skylight · 2024-10-07T22:55:20Z

Please add a section in the readme about running this (and maybe interpreting results?)

Additionally, do we need to update the dependencies to include `datasets`? I don't seem to have it installed inside my Poetry shell.

I rewrote the script to auto-populate all three datasets with one click. I also created a section in our README pointing to the location of the data in Google Drive. For this ticket I focused only on data generation. Creating and interpreting results can go in the next ticket. I also added `datasets` to Poetry. I completely forgot about that. Thanks for the feedback and help!
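As a rough illustration of what a one-pass export into the images and ground-truth folders could look like, here is a sketch that follows the split loop from the reviewed diff. It is not the merged script: the `export_dataset` name, the `report_` prefix, and the `"image"` and `"ground_truth"` keys are assumptions, and plain dicts stand in for the object returned by `datasets.load_dataset`.

```python
import json
import tempfile
from pathlib import Path


def export_dataset(dataset, out_dir, start=1):
    """Write every example to images/ and ground-truth/ under out_dir.

    `dataset` maps split name -> iterable of examples. The "image" (bytes)
    and "ground_truth" (JSON-serializable) keys are hypothetical field names.
    """
    out = Path(out_dir)
    images_dir = out / "images"
    truth_dir = out / "ground-truth"
    images_dir.mkdir(parents=True, exist_ok=True)
    truth_dir.mkdir(parents=True, exist_ok=True)

    next_id = start
    for split in dataset.keys():
        for example in dataset[split]:
            stem = f"report_{next_id:06d}"
            (images_dir / f"{stem}.png").write_bytes(example["image"])
            (truth_dir / f"{stem}.json").write_text(json.dumps(example["ground_truth"]))
            next_id += 1
    return next_id  # pass as `start` for the next dataset to avoid overwrites


# Stand-in data; the byte strings are placeholders, not valid PNG content.
fake = {
    "train": [
        {"image": b"a", "ground_truth": {"name": "x"}},
        {"image": b"b", "ground_truth": {"name": "y"}},
    ],
    "test": [{"image": b"c", "ground_truth": {"name": "z"}}],
}
with tempfile.TemporaryDirectory() as tmp:
    next_start = export_dataset(fake, tmp)
    written = sorted(p.name for p in (Path(tmp) / "images").iterdir())
```

With real Hugging Face data, image columns typically decode to PIL images rather than raw bytes, so the write step would likely need to call the image's `save` method instead of `write_bytes`.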

schreiaj

Minor nit but nothing blocking. LGTM

schreiaj · 2024-10-08T00:10:10Z

OCR/README.md

+### Test Data Sets
+
+Here is the standarized form testing dataset
+https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd


Nit: This is a public repo and that Google Drive folder shouldn't be public. I'd personally just remove this entirely and let folks run the script to download the data.

I removed it. The script should do the data pull.

created dataset import script

cb6f8e1

arinkulshi-skylight changed the title ~~created standarized dataset~~ Create a standardized form testing dataset Oct 7, 2024

formatting

f1d08b2

arinkulshi-skylight linked an issue Oct 7, 2024 that may be closed by this pull request

[Benchmarking Framework] Create a standardized form testing dataset #254

Closed

schreiaj requested changes Oct 7, 2024


arinkulshi-skylight added 2 commits October 7, 2024 15:46

edited script to run all dataset migration on one click

e633236

edited readme

c0fb7eb

added lock file

2935a0e

arinkulshi-skylight requested a review from schreiaj October 7, 2024 23:01

schreiaj previously approved these changes Oct 8, 2024


edited read me

c8cc635

arinkulshi-skylight dismissed schreiaj’s stale review via c8cc635 October 8, 2024 15:10

arinkulshi-skylight requested a review from schreiaj October 8, 2024 15:21

schreiaj approved these changes Oct 8, 2024


arinkulshi-skylight added this pull request to the merge queue Oct 8, 2024

Merged via the queue into main with commit a3e10c3 Oct 8, 2024
2 checks passed

arinkulshi-skylight deleted the standardized branch October 8, 2024 19:02
