Some URLs can no longer be found (404 Not Found). #3

zhongyy · 2018-09-03T08:21:48Z

When I downloaded the first 2000 images listed in the .csv file, about 300 of them returned "404 Not Found". In addition, some links are duplicated. We are interested in your database and are trying to use it in our research. Could you provide the original images?

shineway14 · 2018-09-03T12:21:15Z

@fwang91 The URLs return "404". Could you provide the original images?

fwang91 · 2018-09-04T01:57:17Z

I have also noticed this problem. The URLs were crawled last year, and some of them may have been blocked since then. We cannot directly send the original images to the face recognition community because of copyright issues.

If there is a better way to release the dataset that avoids the copyright issues, please contact me.

HaoLiuHust · 2018-09-05T02:46:22Z

Many links cannot be retrieved: sometimes 404 Not Found, sometimes a timeout.
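A minimal sketch of how the failure modes reported in this thread (404, timeout, duplicated links) could be tallied separately. It uses only the Python standard library; the function names `classify_failure`, `check_url`, and `summarize`, the label strings, and the 10-second timeout are illustrative assumptions, not part of the dataset's tooling.

```python
import socket
import urllib.error
import urllib.request
from collections import Counter


def classify_failure(exc):
    """Map an exception raised while fetching a URL to a coarse label."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"  # e.g. "http_404"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, urllib.error.URLError):
        # URLError wraps the underlying cause, which may itself be a timeout.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    return "other"


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return "ok" or a failure label for a single URL."""
    try:
        with opener(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"http_{resp.status}"
    except Exception as exc:
        return classify_failure(exc)


def summarize(urls, **kwargs):
    """Drop duplicate links (keeping first occurrences) and count outcomes."""
    unique = list(dict.fromkeys(urls))
    return Counter(check_url(u, **kwargs) for u in unique)
```

Calling `summarize(urls)` on a list of links returns a `Counter` keyed by label, so 404s and timeouts can be reported as separate numbers. The `opener` parameter exists only so the fetch step can be swapped out, for example in tests.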

hustzeyu · 2018-09-05T08:28:35Z

@fwang91 So why don't you upload the whole clean dataset (as .jpg files) and the corresponding landmarks to Baidu Netdisk (百度网盘)?

fwang91 · 2018-09-06T00:47:29Z

@hustzeyu We do not hold the copyright to the original images.

noeagles · 2018-09-06T02:12:40Z

@fwang91 I think it's OK if we only use it for academic purposes?

hustzeyu · 2018-09-06T02:31:20Z

Hmmm ... we have already used the MS1M dataset for academic purposes; is the other part a problem?

mikeseese · 2018-09-06T02:52:01Z

Definitely not okay for academic purposes just because. Remember "academic" is another word for "commercial" since universities use research to get grant money, sell licenses, etc. Even if it wasn't recognized as "commercial", IMDb states anything non-personal needs explicit permission.

IMDb Conditions of Use explicitly states:

All content included on this site in or made available through any IMDb Service, such as text, graphics, logos, button icons, images, audio clips, video clips, digital downloads, data compilations, and software, is the property of IMDb or its content suppliers and protected by United States and international copyright laws.

And further:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

And to get consent:

Licensing IMDb Content; Consent to Use Robots and Crawlers: If you are interested in receiving our express written permission to use IMDb content for your non-personal (including commercial) use, please visit our Content Licensing section or contact our Licensing Department. We do allow the limited use of robots and crawlers, such as those from certain search engines, with our express written consent. If you are interested in receiving our express written permission to use robots or crawlers on our site, please contact our Licensing Department.

So that really means that using this dataset (which requires you to write a script to download the images from the scraped URLs) for anything non-personal (including academic and commercial) requires explicit written permission if you want to stay above board. Otherwise you're open to a lawsuit from Amazon (who owns IMDb).

In other words, none of us can rehost the images without committing copyright infringement, and if the URLs have simply changed, fixing them likely requires more effort than I think it is worth for @fwang91 et al.

fwang91 · 2018-09-06T03:37:12Z

@seesemichaelj thanks for your detailed explanation.

hustzeyu · 2018-09-07T00:43:05Z

Looking forward to the updated URLs.

fwang91 · 2018-09-09T14:43:17Z

@zhongyy Hi, could you tell me the number of invalid URLs?

zhongyy · 2018-09-09T14:50:15Z

Thanks for your attention. We have downloaded most of the dataset and are making a list of the invalid URLs. May I email it to you at "[email protected]"?

fwang91 · 2018-09-10T05:49:45Z

Yes. Thanks a lot.

beszedes · 2018-09-25T05:56:58Z

Here you can find the list of URLs that returned a 404 error for me:
https://drive.google.com/file/d/0B9JPNVxgMmu6T3FoMWZZUi00TDQ/view?usp=sharing

jensph · 2019-12-17T00:03:16Z

Unfortunately, one year later, I was able to download only 1,180,173 of the 1,662,888 URLs, so some 482k URLs are invalid... I could generate a list of the invalid URLs, but given that it's almost a third of the total (about 29%), this may no longer be useful.

As an example, most of the Tom_Paolino images are not available. That's subject ID nm0660057.
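For reference, the counts quoted above work out as follows. This is just the arithmetic on the two figures in this comment; the variable names are illustrative.

```python
total_urls = 1_662_888   # URLs in the list, as quoted above
downloaded = 1_180_173   # URLs that could still be downloaded

invalid = total_urls - downloaded   # URLs that failed
invalid_share = invalid / total_urls

print(f"{invalid:,} invalid URLs ({invalid_share:.1%} of the total)")
# prints "482,715 invalid URLs (29.0% of the total)"
```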
