sniff the file type as part of the ingestor #61

denten · 2014-06-12T18:51:19Z

No description provided.

grahamsack · 2014-06-12T18:56:11Z

We also badly need a de-duping mechanism. CiteSeer may already have good code for this. If not, we could write our own or potentially use Google's edit distance.

denten · 2014-06-12T18:58:30Z

De-duping will happen automatically, since we name each file by a hash of its exact bytes. Exact copies will collide on the name and be discarded. Incidentally, this is what the libgen folks do.
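
For illustration, a minimal sketch of hash-based naming as described above. It assumes SHA-256 as the hash and a flat store directory; `content_digest`, `ingest`, and the store layout are hypothetical names, not the project's actual ingestor code.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file's bytes, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def ingest(src: Path, store: Path) -> Tuple[Path, bool]:
    """Copy src into store under its digest; return (target, was_new).

    Assumption: the stored name is the digest alone, so byte-identical
    files map to the same name regardless of their original file names.
    """
    store.mkdir(parents=True, exist_ok=True)
    target = store / content_digest(src)
    if target.exists():
        # Exact copy already ingested: name clash, so discard this one.
        return target, False
    shutil.copyfile(src, target)
    return target, True
```

Note that a single differing byte, such as one extra space, yields a different digest, so this only catches exact copies.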

grahamsack · 2014-06-12T19:22:54Z

We may find that's not sufficient. As I've gone through the files, I've seen a lot of close dupes under different file names, with very minor variations (e.g. extra whitespace). Ideally we'd have a method that can de-dupe when the content of two files is 98% or 99% identical, without requiring an exact match. But maybe that's for a later time.
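
A rough sketch of what such a fuzzy check could look like, assuming the files have already been extracted to plain text. It uses Python's standard `difflib.SequenceMatcher` over whitespace-normalized word lists, which is a stand-in and not the CiteSeer or Google code mentioned earlier; `tokens`, `similarity`, `is_near_duplicate`, and the 0.98 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase and split on any run of whitespace, so spacing and
    line-break differences do not count as content differences."""
    return text.lower().split()


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical token sequences."""
    # autojunk=False disables difflib's popularity heuristic, which can
    # otherwise skew the ratio on long inputs.
    return SequenceMatcher(None, tokens(a), tokens(b), autojunk=False).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.98) -> bool:
    """Assumed cutoff: treat texts at or above the threshold as duplicates."""
    return similarity(a, b) >= threshold
```

Pairwise comparison like this is quadratic in the number of files and slow on long texts, so it would likely need a cheaper pre-filter before running across a whole collection.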

denten · 2014-06-12T19:25:06Z

I think we should save deeper de-duping for later.

denten added this to the 0.2 milestone Jun 12, 2014
