Initial URLs #1

Closed
kaisarea opened this issue Nov 5, 2013 · 1 comment

kaisarea commented Nov 5, 2013

First, I would like to say thank you; this is an amazing program!

I have a question.

Do I need to provide URLs of LinkedIn profiles to initialize this application? I have run it with MySQL set up, and the login seems to proceed successfully. I run:

$ scrapy crawl linkedin -a login=True

and get the following output, but no profiles are saved in the MySQL linked_profiles table.

2013-11-05 14:48:03-0800 [scrapy] INFO: Scrapy 0.18.4 started (bot: linkedpy)
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'linkedpy.spiders', 'REDIRECT_MAX_TIMES': 10000, 'CONCURRENT_REQUESTS_PER_DOMAIN': 100, 'SPIDER_MODULES': ['linkedpy.spiders'], 'BOT_NAME': 'linkedpy', 'DOWNLOAD_DELAY': 2}
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Enabled item pipelines:
2013-11-05 14:48:03-0800 [linkedin] INFO: Spider opened
2013-11-05 14:48:03-0800 [linkedin] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-05 14:48:03-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-05 14:48:03-0800 [linkedin] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
{'password': 'password', 'key': '[email protected]'}
2013-11-05 14:48:06-0800 [linkedin] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/?trk=> from <POST https://www.linkedin.com/uas/login-submit>
2013-11-05 14:48:09-0800 [linkedin] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/?trk=> (referer: https://www.linkedin.com/uas/login)
2013-11-05 14:48:09-0800 [scrapy] INFO: Login successful!!!
2013-11-05 14:48:09-0800 [scrapy] INFO: No work available yet, Mission completed...
2013-11-05 14:48:11-0800 [linkedin] DEBUG: Redirecting (301) to <GET https://my.linkedin.com> from <GET http://my.linkedin.com>
2013-11-05 14:48:11-0800 [linkedin] DEBUG: Crawled (200) <GET https://my.linkedin.com> (referer: None)
2013-11-05 14:48:11-0800 [scrapy] INFO: Parsing urls from https://my.linkedin.com
2013-11-05 14:48:11-0800 [linkedin] INFO: Closing spider (finished)
2013-11-05 14:48:11-0800 [linkedin] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2694,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 104035,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 11, 5, 22, 48, 11, 554699),
'log_count/DEBUG': 11,
'log_count/INFO': 6,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2013, 11, 5, 22, 48, 3, 618784)}
2013-11-05 14:48:11-0800 [linkedin] INFO: Spider closed (finished)
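
For context, the "No work available yet, Mission completed..." line above suggests the spider scheduled no seed requests beyond the login flow. In a generic Scrapy project, seed URLs are supplied either through a start_urls list or by overriding start_requests. Here is a minimal sketch using the modern Scrapy API and generic names; it is not linkedpy's actual code, which may read its seeds from MySQL instead:

import scrapy

class LinkedInSpider(scrapy.Spider):
    name = "linkedin"

    # Option 1: a static list of seed URLs crawled at startup.
    start_urls = ["https://www.linkedin.com/directory/people-a/"]

    # Option 2: build seed requests dynamically, e.g. from a database.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Illustrative extraction only; the real selectors depend on
        # LinkedIn's current markup.
        for href in response.css("a::attr(href)").getall():
            self.logger.debug("found link: %s", href)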

cduongvn (Owner) commented Nov 6, 2013

Hi @kaisarea

I'm glad to hear from you about the linkedpy program. Unfortunately, I wrote it a year ago, and LinkedIn's pages have changed significantly since then. That's why my Scrapy spider currently fails to extract directories and profiles, and why there is nothing in your MySQL table.
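
For anyone trying to revive the spider against the current markup, a quick way to confirm that a selector has gone stale is Scrapy's interactive shell. The selectors below are purely illustrative, since linkedpy's real XPath expressions aren't shown in this thread:

$ scrapy shell "https://www.linkedin.com/directory/people-a/"
>>> # An outdated selector simply returns an empty list:
>>> response.xpath("//li[@class='content']/a/@href").getall()
[]
>>> # Inspect the live page, find the new structure, then update
>>> # the spider's parse callbacks accordingly:
>>> response.css("a::attr(href)").getall()[:5]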
