TODO
- incorporate a copy of org.apache.velocity.tools.generic.log.CommonsLogLogSystem to get rid of the dependency on velocity-tools
- event handlers à la Velocity
- #submit directive? #submit(url, ["login":"hello", "password":"toto"]) ... cookies, user agents... Jakarta HttpClient?
- review regex behaviour: beginning and end of string? findFirst()/match()? ...
- integration of selectors mechanism
- more flexible selectors:
  x sequence of positive, negative, ...
  x handling of collections rather than systematic merging of results? (=> when possible, loop on selected items rather than on synchros)
- dynamic testing of selectors and scraping
2020
----
- click actions? (like #follow...)
- ease extraction of URL, value, etc...
- #refine(selector)
- selenide scripts?
- automatic URL detection? meh, often better to work with classes...
- implicit loops (if $items size is 1 or n)
- json content by default (but support map/list)
- #if(#regexp())? How to write it?
- avoid synchro errors for "foo" != "\nfoo", even in unnormalized mode (third mode?)
Competitors:
+ octoparse
+ https://simplescraper.io/
2021
----
# Chrome extensions
- https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=fr
- https://www.promptcloud.com/blog/how-to-scrape-data-with-web-scraper-chrome/
- https://lobstr.io/index.php/2018/04/19/les-meilleurs-outils-web-scraping-gratuits-2018/
- https://chrome.google.com/webstore/detail/spider-a-smart-web-scrapi/hhblpocflefpmmfibmajdfcjdkeafpen
# Firefox addons
- https://addons.mozilla.org/en-US/firefox/addon/web-scraper/
# Scrapy
### Dataflow
1. a spider sends a request to the engine
2. the engine stores the request in the scheduler
3. the scheduler pops requests back to the engine
4. the engine sends requests to the downloader
5. the engine gets responses from the downloader
6. the engine sends the responses back to the spider
7. the spider yields scraped items or further requests
8. items are sent to the item pipeline
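From the spider's side, the loop above reduces to yielding requests and items; a minimal sketch, using the quotes.toscrape.com demo site from the Scrapy tutorial (spider name and selectors are illustrative):
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # illustrative spider following the dataflow above
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # step 1: initial requests to the engine

    def parse(self, response):
        # step 7: yield scraped items, sent on to the item pipeline (step 8)...
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # ...or yield further requests, which go back to the scheduler (step 2)
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```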
### Examples
Handling Single Request and Response:
```python
scrapy.Request(url="http://abc.com/page/1", callback=self.parse_page)
```
where:
```python
def parse_page(self, response):
    # do your data extraction with the response here
    ...
```
Handling Multiple Requests and Responses:
```python
def make_requests(self, urls):
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse_url)
```
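A hedged sketch of how such a helper might be driven from start_requests (the spider name and URLs are made up):
```python
import scrapy

class MultiSpider(scrapy.Spider):
    # hypothetical spider wiring make_requests into start_requests
    name = "multi"

    def start_requests(self):
        urls = [
            "http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
        ]
        yield from self.make_requests(urls)

    def make_requests(self, urls):
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_url)

    def parse_url(self, response):
        # each response is routed back here via the callback
        yield {"url": response.url, "title": response.css("title::text").get()}
```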
### Installing
```shell
conda install -c conda-forge scrapy
```
or:
```shell
pip install Scrapy
```
### Create a new project
```shell
scrapy startproject tutorial
```
Generates:
```
tutorial/
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── __pycache__
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__
```
Create a spider:
```shell
scrapy genspider [-t template] <name> <domain>
```
There are 4 templates available, i.e. 4 types of spiders: basic, crawl, csvfeed and xmlfeed.
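For reference, the basic template yields roughly this skeleton (exact output may vary with the Scrapy version):
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        pass
```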
Extracting items uses CSS or XPath.
(=> idea: use "css:" etc. selector prefixes)
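Both selector flavours live on the response object; a quick sketch (the "h1.title" selector is illustrative):
```python
# inside a spider callback, given a response object:
titles_css = response.css("h1.title::text").getall()                   # CSS, translated to XPath internally
titles_xpath = response.xpath("//h1[@class='title']/text()").getall()  # equivalent XPath
```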
### Scrapy Shell
```shell
scrapy shell
```
To experiment with XPath expressions...
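The shell can also be given a URL; it fetches it and pre-binds response for interactive exploration (the demo URL is an assumption):
```shell
scrapy shell "http://quotes.toscrape.com"
```
```python
>>> response.xpath("//title/text()").get()   # try an XPath expression
>>> response.css("title::text").get()        # or the CSS equivalent
```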