Replies: 3 comments 5 replies
-
pdf2parquet can work with zip files. It iterates through all the files in the archive and creates a single parquet file with one document per row. Be aware that the DPK ...
-
shortest |
-
@hmtbr Can you please answer the 2 questions above, as related to the HTML (web) crawler? One is about restricting the crawl so it does not follow links to other domains (does the default subdomain_focus parameter being true do this?) and the second is about stripping ads and trackers. Thanks.
-
I ran a DPK workshop at the Llama Impact Hackathon (https://lu.ma/1mkqg22a) on Nov 08, 2024.
It drew good interest, with some good questions and feedback.
Can we dedupe entire documents / zip files of docs?
My answer: YES.
(Pending #605)
Going forward, we are going to emphasize deduping entire docs, not chunks.
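To make "deduping entire docs" concrete, here is a minimal sketch of exact whole-document dedupe by content hash. It is not DPK's implementation; the function name `dedupe_documents` and the (name, bytes) input shape are assumptions made for illustration.

```python
import hashlib


def dedupe_documents(docs):
    """Keep the first document seen for each distinct content hash.

    `docs` is a list of (name, content_bytes) pairs (hypothetical shape).
    """
    seen = set()
    unique = []
    for name, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # byte-identical to a document already kept
        seen.add(digest)
        unique.append((name, content))
    return unique


docs = [
    ("a.pdf", b"Chapter 1: Introduction"),
    ("b.pdf", b"Chapter 1: Introduction"),  # exact copy of a.pdf
    ("c.pdf", b"Chapter 2: Methods"),
]
kept_names = [name for name, _ in dedupe_documents(docs)]
print(kept_names)
```

This only catches byte-identical copies; near-duplicates need a fuzzy approach.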
Question: How are zipfiles handled with pdf2parquet?
@dolfim-ibm ?
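For reference, the reply in this thread says pdf2parquet iterates through the files in the archive and emits one doc per row. A rough sketch of that idea using only the standard library follows. It is not the pdf2parquet code: the function `zip_to_rows` and the row fields are hypothetical, and it stops at building rows rather than converting PDFs or writing parquet.

```python
import io
import zipfile


def zip_to_rows(zip_bytes):
    """Return one row (a dict) per file found in a zip archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            rows.append({
                "filename": info.filename,
                "size": info.file_size,
                "contents": archive.read(info),
            })
    return rows


# Build a small in-memory archive to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("doc1.pdf", b"%PDF-1.4 first")
    archive.writestr("doc2.pdf", b"%PDF-1.4 second")

rows = zip_to_rows(buffer.getvalue())
print([row["filename"] for row in rows])
```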
Interest in HTML processing
👍
Can we restrict HTML crawler to a certain domain?
So it doesn't follow links to other domains?
I think yes. There is a parameter called domain_focus; need to confirm. @shahrokhDaijavad ?
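To show what "restrict to a certain domain" means in practice, here is a small sketch of a link filter that keeps only links on the same host as the page being crawled. This is not the DPK crawler's code and says nothing about how its focus parameters behave; `same_domain_links` is a hypothetical helper, and it uses strict host matching (subdomains count as different hosts).

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host http(s) links."""
    allowed_host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.hostname == allowed_host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://example.com/docs/index.html",
    [
        "guide.html",
        "/about",
        "https://other.example.org/page",  # different domain, dropped
        "mailto:me@example.com",           # not a web link, dropped
    ],
)
print(links)
```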
Can we strip out ads and trackers in HTML files?
Good question. @shahrokhDaijavad ?
Fuzzy Dedupe.
If I make a copy of a document (title: Chapter 1) and then change the title to 'Chapter 2', will fuzzy dedupe catch it?
YES. But which one is eliminated (Chapter 1 or 2)? Not sure.
@Bytes-Explorer ?
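A toy illustration of why a title-only change should still be caught: two documents that differ in one word share most of their word shingles, so their Jaccard similarity stays high. This is a sketch of the general idea, not DPK's fuzzy dedupe; the shingle size and the 0.7 threshold are assumptions picked for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


body = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc1 = "Chapter 1 " + body
doc2 = "Chapter 2 " + body  # same body, only the title changed

score = jaccard(doc1, doc2)
THRESHOLD = 0.7  # hypothetical cutoff for "near-duplicate"
print(round(score, 2), score >= THRESHOLD)
```

The longer the shared body relative to the title, the closer the score gets to 1.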