Feedback from DPK workshop @ Llama impact hackathon #796

sujee · 2024-11-12T07:03:34Z


I gave a DPK workshop at the Llama Impact Hackathon (https://lu.ma/1mkqg22a) on Nov 08, 2024.

It drew good interest, with some good questions and feedback.

Can we dedupe entire documents / zip files of docs?

My answer: YES.
(Pending #605)

Going forward, we are going to emphasize deduping entire docs, not chunks.

Question: How are zipfiles handled with pdf2parquet?
@dolfim-ibm ?

Interest in HTML processing

👍

Can we restrict HTML crawler to a certain domain?

So it doesn't follow links to other domains?

I think yes. There is a parameter called domain_focus, but I need to confirm.
@shahrokhDaijavad ?

Can we strip out ads and trackers in HTML files?

Good question. @shahrokhDaijavad ?

Fuzzy Dedupe.

If I make a copy of a document (title: Chapter 1) and then change the title to 'Chapter 2', will fuzzy dedupe catch it?

YES. But which one is eliminated (Chapter 1 or 2)? Not sure.

@Bytes-Explorer ?

dolfim-ibm · 2024-11-12T07:46:26Z

Collaborator

Question: How are zipfiles handled with pdf2parquet?
@dolfim-ibm ?

pdf2parquet can work with zip files. It iterates through all the files in the archive and creates a single parquet file with one doc per row.

Be aware that the DPK data_files_to_use argument filters which files are used as input. So if you want to use .zip, you should add it to the list.
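
To make the two points above concrete, here is a minimal standard-library sketch of the idea. It is not pdf2parquet's actual code: `select_inputs` and `rows_from_zip` are hypothetical helpers, and the assumption that the filter is a list of file extensions is mine, based on the "add it to the list" remark.

```python
import io
import zipfile


def select_inputs(paths, data_files_to_use):
    """Hypothetical extension filter: keep only paths whose suffix is listed."""
    extensions = tuple(ext.lower() for ext in data_files_to_use)
    return [p for p in paths if p.lower().endswith(extensions)]


def rows_from_zip(zip_bytes, allowed_exts=(".pdf",)):
    """Hypothetical helper: return one row (dict) per matching file in the archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries inside the archive
            if not info.filename.lower().endswith(tuple(allowed_exts)):
                continue  # skip files that are not documents we want
            rows.append({"filename": info.filename, "contents": archive.read(info)})
    return rows


# Without ".zip" in the list, the archive would be filtered out before it is opened.
inputs = select_inputs(["a.pdf", "docs.zip", "notes.md"], [".pdf", ".zip"])
```

Each dict returned by `rows_from_zip` stands in for one row of the output table; the real transform also converts each document, which this sketch does not attempt.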

1 reply

agoyal26 Nov 12, 2024
Collaborator

@dolfim-ibm this is a good question. Can we add this to the README?

blublinsky · 2024-11-12T08:10:11Z

Collaborator

If I make a copy of a document (title: Chapter 1) and then change the title to 'Chapter 2', will fuzzy dedupe catch it?

YES. But which one is eliminated (Chapter 1 or 2)? Not sure.

shortest

2 replies

sujee Nov 12, 2024
Author

Thanks @blublinsky .
In this case assume both docs are the same length, except one char is different ('1' vs. '2').
Is there any predictability on which one will be kept?

blublinsky Nov 12, 2024
Collaborator

Not really; it depends on the order in which they are processed.
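
The order dependence can be illustrated with a toy near-duplicate filter. This is only a sketch under my own assumptions (character shingles, exact Jaccard similarity, a "first seen wins" rule, and a 0.8 threshold), not DPK's fuzzy dedupe implementation; all function names are hypothetical.

```python
def shingles(text, k=5):
    """Set of overlapping k-character substrings of the text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_first_seen(docs, threshold=0.8):
    """Keep a doc unless it is a near-duplicate of a doc that was already kept."""
    kept = []
    for name, text in docs:
        signature = shingles(text)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((name, signature))
    return [name for name, _ in kept]


# Two same-length docs that differ in a single character of the title.
body = " ".join(f"word{i}" for i in range(200))
chapter_1 = ("ch1", "Chapter 1\n" + body)
chapter_2 = ("ch2", "Chapter 2\n" + body)

survivor_a = dedupe_first_seen([chapter_1, chapter_2])
survivor_b = dedupe_first_seen([chapter_2, chapter_1])
```

With this rule, whichever copy is processed first is the one that survives, so swapping the input order swaps the survivor.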

shahrokhDaijavad · 2024-11-12T21:18:22Z

Collaborator

@hmtbr Can you please answer the 2 questions above related to the HTML (web) crawler? One is about restricting the crawl so that it does not follow links to other domains (does the subdomain_focus parameter, which is true by default, do this?), and the second is about stripping ads and trackers. Thanks.

2 replies

hmtbr Nov 13, 2024
Maintainer

Can we restrict HTML crawler to a certain domain?

Yes. The crawl is restricted to the domains and subdomains of the input seed URLs by default. Other domains will be crawled only when the user provides a list to the allow_domains argument explicitly.

Can we strip out ads and trackers in HTML files?

Practically, yes, by the domain restriction above.
The DPK connector doesn't load a page in a browser; it fetches it with an HTTP client. It scrapes the page to extract links and then sends GET requests to them. Basically, the URLs of ads and trackers are outside the crawler's target domain, so they end up being ignored because of the domain restriction.
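
As a rough illustration of the link filtering described above, here is a standard-library sketch. It is not the DPK connector's code: `host_in_scope` and `links_to_follow` are hypothetical names, and the rule "same domain or any subdomain of an allowed domain" is my reading of the description.

```python
from urllib.parse import urljoin, urlparse


def host_in_scope(host, allowed_domains):
    """True if host equals an allowed domain or is a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def links_to_follow(page_url, hrefs, allowed_domains):
    """Resolve scraped hrefs against the page URL and keep only in-scope HTTP(S) links."""
    in_scope = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if host_in_scope(parsed.hostname or "", allowed_domains):
            in_scope.append(absolute)
    return in_scope


scraped = [
    "/about",
    "https://blog.example.com/post",
    "https://ads.tracker.net/pixel.gif",
    "mailto:someone@example.com",
]
followed = links_to_follow("https://example.com/docs/", scraped, ["example.com"])
```

In this sketch the tracker URL is never requested because its host is out of scope. Anything served from the target domain itself would still be fetched.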

sujee Nov 13, 2024
Author

Thank you, @hmtbr. That's good to know 👍
