sniff the file type as part of the ingestor #61

Open
denten opened this issue Jun 12, 2014 · 4 comments
@denten
Member

denten commented Jun 12, 2014

No description provided.

@denten denten added this to the 0.2 milestone Jun 12, 2014
@grahamsack
Contributor

We also need a de-duping mechanism pretty badly. CiteSeer may already have good code for this. If not, we could write our own or potentially use Google's edit distance.

@denten
Member Author

denten commented Jun 12, 2014

De-duping will happen automatically, since we name each file by a bit-wise hash of its contents. Exact copies will clash on name and be discarded. Incidentally, this is what the libgen folks do.
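
A minimal sketch of that idea, assuming SHA-256 as the hash and a flat store directory. The function names `content_name` and `ingest` are hypothetical, not taken from the project's ingestor:

```python
import hashlib
import shutil
from pathlib import Path


def content_name(path, chunk_size=1 << 20):
    """Return a hex digest of the file's bytes, used as its stored name.

    SHA-256 is an assumption; the thread only says "bit-wise hashing".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(path, store):
    """Copy `path` into `store` under its content hash.

    Returns the new path, or None when an identical file is already stored.
    """
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    target = store / content_name(path)
    if target.exists():
        # Name clash means the bytes are identical: discard the copy.
        return None
    shutil.copy2(path, target)
    return target
```

Only byte-for-byte copies collide under this scheme; a file that differs by a single byte gets a different name and is kept.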

@grahamsack
Contributor

We may find that's not sufficient. As I've gone through the files, I've seen a lot of close dupes under different file names, with very minor variations (e.g., extra whitespace). I think a method that can de-dupe when the content of two files is 98% or 99% identical, without requiring an exact match, would be ideal. But maybe that's for a later time.
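
One way this could look, as a rough sketch only: collapse whitespace, then compare the extracted text with Python's standard `difflib`. The helper names and the 0.98 default threshold are assumptions for illustration, and `SequenceMatcher` is a stand-in here, not the Google edit distance mentioned earlier:

```python
import difflib
import re


def normalize(text):
    """Collapse runs of whitespace so spacing differences do not count."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b):
    """Return a ratio in [0, 1] describing how alike two texts are."""
    matcher = difflib.SequenceMatcher(
        None, normalize(a), normalize(b), autojunk=False
    )
    return matcher.ratio()


def is_near_duplicate(a, b, threshold=0.98):
    """Treat two texts as duplicates when their ratio meets the threshold."""
    return similarity(a, b) >= threshold
```

Comparing every pair of documents this way gets slow on a large collection, so it would probably need a cheaper first pass to pick candidate pairs.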

@denten
Member Author

denten commented Jun 12, 2014

I think we should save deeper de-duping for later.
