After the crawler has been running for more than 24 hours, spooky things happen #26

Open
ddio opened this issue Dec 9, 2018 · 4 comments

ddio commented Dec 9, 2018

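Just fill in the parts you know; there is no need to fill in everything.

Problem

The crawler is currently launched from crontab, so when a crawl runs for more than 24 hours, several crawler processes end up running at the same time.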
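One way to keep cron from stacking crawler runs is to have each run take an exclusive, non-blocking file lock and return at once if an earlier run still holds it. The sketch below is only an illustration under assumptions: `try_acquire_lock`, `run_crawler_once`, and the lock path are hypothetical names that are not part of this project, and it assumes a Unix host where `fcntl.flock` is available.

```python
import fcntl


def try_acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


def run_crawler_once(lock_path, crawl):
    """Run crawl() only if no other run holds the lock; return True if it ran."""
    lock_file = try_acquire_lock(lock_path)
    if lock_file is None:
        return False  # a previous run is still going, so skip this one
    try:
        crawl()
        return True
    finally:
        # Closing the file also releases the lock; unlock explicitly for clarity.
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```

A cron entry would then call such a wrapper rather than the crawler directly; a run that finds the lock taken returns `False` and does nothing, so a crawl that overruns 24 hours is not joined by a second copy.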

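What is this problem about?

  1. It happens when the rental website barely responds to the crawler.

Solution

Stopgap

TBD

Root-cause fix

TBD

Fixing existing data

Is this problem related to existing data? What are the steps to fix it?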

ddio commented Dec 10, 2018

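Observation: besides the website responding slowly, another reason the crawl is slow is that a running spider sometimes cannot get its next target, so the degree of parallelism drops easily.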

ddio commented Dec 27, 2018

Observation: sometimes request_ts gets stuffed with an extra 1 to 3 times as many duplicate requests XD

ddio commented Mar 11, 2019

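One of the causes:

When the DB is very busy and n detailSpider instances start within a short time, the first one cannot insert its requests into the DB before the others start, so n times as many requests get generated XD
That in turn makes the crawl take even longer, and the DB stays busy all the way into the next day.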
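The race described above could be made harmless by letting the database itself reject duplicates, so that a second spider generating the same requests inserts nothing. The following is a minimal sketch, not the project's code: it uses an in-memory SQLite table as a stand-in for the real DB, and the flattened columns, the unique constraint, and the function names are all assumptions (the real `request_ts` appears to keep `house_id` inside a `seed` JSON column). In PostgreSQL the analogous statement would be `INSERT ... ON CONFLICT DO NOTHING` backed by a unique index.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical, simplified request table with one row per house per day.
    conn.execute(
        """
        CREATE TABLE request_ts (
            id INTEGER PRIMARY KEY,
            year INTEGER, month INTEGER, day INTEGER,
            house_id TEXT,
            UNIQUE (year, month, day, house_id)
        )
        """
    )


def enqueue_requests(conn, date, house_ids):
    """Insert one request per house for the given (year, month, day).

    Rows that already exist are skipped; returns how many rows were added.
    """
    year, month, day = date
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO request_ts (year, month, day, house_id) "
            "VALUES (?, ?, ?, ?)",
            [(year, month, day, house_id) for house_id in house_ids],
        )
    return conn.total_changes - before
```

With a constraint like this, a late-starting spider that repeats the generation step adds zero rows instead of another full copy of the day's requests.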

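Places where we could intervene:

  1. The cross-spider communication mechanism, including how jobs are generated and how they are claimed
  2. Stable DB CPU output; the DB currently runs on a machine with CPU credits XD
  3. Reducing the workload by working out which listings do not need to be updated every day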
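For the first item, one possible shape for claiming jobs is a conditional update: a spider picks an unowned job and marks it as its own only if it is still unowned, so two spiders can never take the same job. This is a sketch under assumptions, with a hypothetical `job` table and function names and SQLite standing in for the real DB; it is not how the project currently hands out work.

```python
import sqlite3


def create_jobs(conn, house_ids):
    # Hypothetical job table; owner is NULL until a spider claims the row.
    conn.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, house_id TEXT UNIQUE, owner TEXT)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO job (house_id) VALUES (?)", [(h,) for h in house_ids]
        )


def claim_next_job(conn, worker):
    """Claim one unowned job for worker; return its house_id, or None if none is left."""
    while True:
        row = conn.execute(
            "SELECT id, house_id FROM job WHERE owner IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, house_id = row
        with conn:
            # The "owner IS NULL" guard makes the claim succeed for one spider only.
            cur = conn.execute(
                "UPDATE job SET owner = ? WHERE id = ? AND owner IS NULL",
                (worker, job_id),
            )
        if cur.rowcount == 1:
            return house_id
        # Another spider claimed this job first; look for the next one.
```

The same guard also addresses the earlier observation about spiders failing to get a next target, since a spider that loses the race simply retries with the next unowned row.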

ddio commented Nov 5, 2019

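A temporary stopgap:

If we can't solve the problem, get rid of the duplicate requests that create it: periodically run the SQL below to delete the duplicated requests...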

delete from request_ts where id in (select id from (select min(id) as id, count(*) as n from request_ts group by year, month, day, (seed->>'house_id')) as t where n > 1);
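To see what one pass of this kind of statement does, the sketch below replays it against an in-memory SQLite table. It is an illustration under assumptions: `house_id` is a plain column here instead of `seed->>'house_id'`, and the helper names and sample rows are made up. Because the subquery selects `min(id)` per group, each pass removes one row from every duplicated group, which fits with running it periodically; the second statement is an untested-on-production variant that keeps only the lowest id of each group in a single pass.

```python
import sqlite3

GROUP_COLS = "year, month, day, house_id"

# Same shape as the statement above: deletes the lowest id of each duplicated group.
DELETE_ONE_PER_GROUP = f"""
DELETE FROM request_ts WHERE id IN (
    SELECT id FROM (
        SELECT MIN(id) AS id, COUNT(*) AS n
        FROM request_ts GROUP BY {GROUP_COLS}
    ) AS t WHERE n > 1
)
"""

# Variant (an assumption, not from the thread): keep only the lowest id per group.
DELETE_ALL_BUT_FIRST = f"""
DELETE FROM request_ts WHERE id NOT IN (
    SELECT MIN(id) FROM request_ts GROUP BY {GROUP_COLS}
)
"""


def make_table(copies):
    """Build a table with `copies` identical requests for h1 and one for h2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE request_ts (id INTEGER PRIMARY KEY, "
        "year INTEGER, month INTEGER, day INTEGER, house_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO request_ts (year, month, day, house_id) VALUES (2019, 11, 5, ?)",
        [("h1",)] * copies + [("h2",)],
    )
    return conn


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM request_ts").fetchone()[0]
```

Whether it matters which copy survives (the earliest or a later one) depends on columns not shown in this thread, so the variant should be checked against the real schema before use.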
