Some URLs can no longer be found (404 Not Found). #3

Open
zhongyy opened this issue Sep 3, 2018 · 15 comments

Comments

@zhongyy

zhongyy commented Sep 3, 2018

I found that about 300 images returned "404 Not Found" when I downloaded the first 2000 images listed in the .csv file. In addition, some links are duplicated. We are interested in your database and would like to use it in our research. Would you be willing to provide the original images?
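For anyone who wants to check the duplicated links before downloading, here is a minimal sketch that drops repeated URLs while keeping first-seen order. The thread does not describe the layout of the .csv file, so the column name `url`, the function name `unique_urls`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def unique_urls(csv_text, url_column="url"):
    """Return the URLs from CSV text in first-seen order, dropping repeats.

    `url_column` is a hypothetical column name; adjust it to the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    ordered = []
    for row in reader:
        url = row[url_column].strip()
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Made-up sample data; example.invalid is a reserved, non-resolving domain.
sample = (
    "id,url\n"
    "nm1,http://example.invalid/a.jpg\n"
    "nm1,http://example.invalid/b.jpg\n"
    "nm2,http://example.invalid/a.jpg\n"
)
urls = unique_urls(sample)
print(len(urls), "unique URLs")
```

Deduplicating first means each URL is requested at most once, so the later 404 counts are not inflated by repeats.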

@shineway14

@fwang91 The URLs return "404". Would you be willing to provide the original images?

@fwang91
Owner

fwang91 commented Sep 4, 2018

I have also noticed this problem. The URLs were crawled last year, and some of them may have been blocked since then. We cannot directly send the original images to the face recognition community because of copyright issues.

If there is a better way to release the dataset and avoid the copyright issues, please contact me.

@HaoLiuHust

Many links cannot be retrieved: sometimes the result is 404 Not Found, sometimes a timeout.
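To tell the two failure modes apart when building a list of invalid URLs, something like the following sketch may help. It uses only the Python standard library; the function name `classify_url`, the result labels, and the injectable `opener` parameter are assumptions for illustration, not anything provided by the dataset authors. Whether automated fetching is permitted at all depends on the terms of the hosting site.

```python
import socket
import urllib.error
import urllib.request


def classify_url(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return 'ok', 'http_<code>', 'timeout', or 'error' for a single URL.

    `opener` defaults to urllib.request.urlopen and can be swapped for a
    stub so the logic can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)  # touch the body to confirm it is readable
        return "ok"
    except urllib.error.HTTPError as exc:
        # Must come before URLError, because HTTPError is a subclass of it.
        return f"http_{exc.code}"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib may wrap a timeout inside URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
```

Keeping "http_404" separate from "timeout" matters here: a 404 suggests the image is gone, while a timeout may succeed on a later retry.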

@hustzeyu

hustzeyu commented Sep 5, 2018

@fwang91 So, why don't you upload the whole clean dataset (in .jpg) and the corresponding landmarks to Baidu Netdisk (百度网盘)?

@fwang91
Owner

fwang91 commented Sep 6, 2018

@hustzeyu We do not hold the copyright to the original images.

@noeagles

noeagles commented Sep 6, 2018

@fwang91 I think it's OK if we only use it for academic purposes?

@hustzeyu

hustzeyu commented Sep 6, 2018

Hmmm ... we have already used the MS1M dataset for academic purposes; is the other part a problem?

@mikeseese

mikeseese commented Sep 6, 2018

Definitely not okay for academic purposes just because. Remember "academic" is another word for "commercial" since universities use research to get grant money, sell licenses, etc. Even if it wasn't recognized as "commercial", IMDb states anything non-personal needs explicit permission.

The IMDb Conditions of Use explicitly state:

All content included on this site in or made available through any IMDb Service, such as text, graphics, logos, button icons, images, audio clips, video clips, digital downloads, data compilations, and software, is the property of IMDb or its content suppliers and protected by United States and international copyright laws.

And further:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

And to get consent:

Licensing IMDb Content; Consent to Use Robots and Crawlers: If you are interested in receiving our express written permission to use IMDb content for your non-personal (including commercial) use, please visit our Content Licensing section or contact our Licensing Department. We do allow the limited use of robots and crawlers, such as those from certain search engines, with our express written consent. If you are interested in receiving our express written permission to use robots or crawlers on our site, please contact our Licensing Department.

So that really means that using this dataset (which requires you to write a script to download the images from the scraped URLs) for anything non-personal (including academic and commercial use) requires explicit written permission if you want to stay above board. Otherwise you are open to a lawsuit from Amazon (which owns IMDb).

In other words, none of us can rehost the images without committing copyright infringement, and fixing the URLs, if they have simply changed, likely requires more effort than I think it is worth for @fwang91 et al.

@fwang91
Owner

fwang91 commented Sep 6, 2018

@seesemichaelj thanks for your detailed explanation.

@hustzeyu

hustzeyu commented Sep 7, 2018

Looking forward to the updated URLs.

@fwang91
Owner

fwang91 commented Sep 9, 2018

@zhongyy Hi, could you tell me the number of invalid URLs?

@zhongyy
Author

zhongyy commented Sep 9, 2018

@ Thanks for your attention. We have downloaded most of the dataset and are making a list of the invalid URLs. May I email it to you at "[email protected]"?

@fwang91
Owner

fwang91 commented Sep 10, 2018

Yes. Thanks a lot.

@beszedes

Here you can find a list of the URLs that returned a 404 error for me:
https://drive.google.com/file/d/0B9JPNVxgMmu6T3FoMWZZUi00TDQ/view?usp=sharing

@jensph

jensph commented Dec 17, 2019

Unfortunately, one year later, I was able to download only 1,180,173 of the 1,662,888 URLs, so there are some 482k invalid URLs. I could generate a list of the invalid URLs, but given that it is almost a third of the total, this may no longer be useful.
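As a quick check of those figures, the arithmetic can be reproduced as below. The two counts are taken from the comment; the variable names are just illustrative.

```python
total_urls = 1_662_888   # URLs attempted, as reported in this comment
downloaded = 1_180_173   # URLs that could still be fetched, as reported

invalid = total_urls - downloaded
invalid_fraction = invalid / total_urls

print(f"invalid URLs: {invalid:,}")               # invalid URLs: 482,715
print(f"share of total: {invalid_fraction:.1%}")  # share of total: 29.0%
```

So roughly 29% of the URLs failed, which matches the "almost a third" estimate.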

As an example, most of the Tom_Paolino images are not available. That's subject ID nm0660057.


9 participants