sniff the file type as part of the ingestor #61
Comments
We also need a de-duping mechanism pretty badly. Citeseer may already have good code for this. If not, we could write our own or potentially use Google's edit distance.
On Jun 12, 2014, at 2:51 PM, Dennis Tenen wrote:
De-duping will happen automatically as we are doing bit-wise hashing to name the files. Exact copies will name clash and be discarded. Incidentally, this is what libgen folks do.
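For illustration, hash-based naming along these lines could be sketched as below. This is a minimal sketch, not the project's actual ingestor: the function names `content_hash` and `ingest`, the choice of SHA-256, and the flat store directory are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def content_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(source, store_dir):
    """Store a file under its content hash; return (stored_path, was_new).

    Hypothetical helper: a byte-identical file maps to the same name,
    so the second copy clashes and is skipped.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    target = store / content_hash(source)
    if target.exists():
        return target, False  # exact duplicate, discard
    target.write_bytes(Path(source).read_bytes())
    return target, True
```

Two files with identical bytes end up with the same target name regardless of their original file names, while a single changed byte produces a different name.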
We may find that's not sufficient. As I've gone through the files I've seen a lot of close dupes under different file names, with very minor variations (e.g. extra white space). I think a method that can de-dupe when the content of two files is 98% or 99% identical, without requiring an exact match, would be ideal. But maybe that's for a later time.
On Jun 12, 2014, at 2:58 PM, Dennis Tenen wrote:
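A rough sketch of what such a near-duplicate check might look like, assuming the text has already been extracted from the files. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure rather than Citeseer's code or Google's edit distance; the names `normalize`, `similarity`, and `is_near_duplicate` and the default 0.98 threshold are illustrative only.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so spacing differences do not count."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def is_near_duplicate(a, b, threshold=0.98):
    """Treat two texts as duplicates when they are at least `threshold` similar."""
    return similarity(a, b) >= threshold
```

Pairwise comparison like this scales poorly with long texts and with a large collection, so it would probably only be practical on a shortlist of candidate pairs.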
I think we should save deeper de-duping for later.
On Thu, Jun 12, 2014 at 3:22 PM, grahamsack wrote:
No description provided.