| title | subtitle | author | date |
|---|---|---|---|
| 3.2. Collecting data, web scraping with curl | Linux for Data Scientists<br/>HOGENT toegepaste informatica | Thomas Parmentier, Andy Van Maele, Bert Van Vreckem | 2024-2025 |
Browsers:
CLI utilities:
$ curl icanhazip.com
193.190.172.117
Try this:
curl 'https://icanhazdadjoke.com/'
curl 'https://covid19.mathdro.id/api/countries/BE'
curl 'https://api.coinlore.net/api/ticker/?id=90'
curl 'https://education.thingsflow.eu/IAQ/DeviceByQR?hashedname=5201731f632701e602d31f98be7297e088a94eb38736c452495f02e444d4ba2d'
- Normally, `curl` prints to `stdout`
- When redirected, progress information is printed (on `stderr`):

      $ curl 'https://icanhazdadjoke.com/' > joke.txt
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100    95  100    95    0     0    320      0 --:--:-- --:--:-- --:--:--   319

- Turn off with `-s`/`--silent`
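In a script, `-s` also hides error messages, so a failed download can go unnoticed. A minimal sketch, assuming a hypothetical helper named `fetch_quiet`: it combines `-s` with `-S`/`--show-error` (still print errors) and `-f`/`--fail` (return a non-zero exit status on HTTP error responses).

```shell
# fetch_quiet is a hypothetical helper name, not part of curl.
# Usage: fetch_quiet <url> <output-file>
fetch_quiet() {
    # -s: no progress meter
    # -S: still show an error message when something goes wrong
    # -f: exit with a non-zero status on HTTP errors (4xx/5xx)
    curl -s -S -f -o "$2" "$1"
}
```

It could be used as `fetch_quiet 'https://icanhazdadjoke.com/' joke.txt && cat joke.txt`.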
E.g.
curl -X GET https://httpbin.org/anything
curl -X POST https://httpbin.org/anything
curl -X PUT https://httpbin.org/anything
curl -X DELETE https://httpbin.org/anything
...
Which request is default?
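One way to answer this experimentally: `https://httpbin.org/anything` echoes the request back as JSON, including a `method` field. A small sketch, assuming a hypothetical filter named `sent_method` that extracts that field with `sed`:

```shell
# sent_method is a hypothetical helper: it reads the JSON returned by
# https://httpbin.org/anything on stdin and prints the value of "method".
sent_method() {
    sed -n 's/.*"method": *"\([A-Z]*\)".*/\1/p'
}
```

Try `curl -s https://httpbin.org/anything | sent_method`, then repeat with `-X PUT` or `-X DELETE` and compare.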
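- `-o`, `--output <file>`: write the response to `<file>` instead of `stdout`
- `-O`, `--remote-name`: write the response to a local file named after the remote file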
curl -s -o anything.json https://httpbin.org/anything
curl -s -O https://www.google.com/robots.txt
- `-i`, `--include`: also show the HTTP response headers
- `-I`, `--head`: show only the headers
curl -i http://google.com
curl -H 'User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)' \
https://httpbin.org/anything
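Request headers can also change what a server sends back. As a sketch: icanhazdadjoke.com should reply with JSON instead of plain text when the `Accept` header asks for it. `get_json` is a hypothetical helper name and the User-Agent value is only an example; `-H` may be repeated to send several headers.

```shell
# get_json is a hypothetical helper: request a URL and ask for a JSON reply.
# Usage: get_json <url>
get_json() {
    # Each -H adds one request header; the User-Agent string is an example value.
    curl -s -S \
        -H 'Accept: application/json' \
        -H 'User-Agent: linux-for-data-scientists-demo' \
        "$1"
}
```

For example, compare `curl -s 'https://icanhazdadjoke.com/'` with `get_json 'https://icanhazdadjoke.com/'`.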
- `-L`, `--location`: follow redirects

curl https://www.twitter.com # Empty result!
curl -i https://www.twitter.com # see the location: field in the response header
curl -L https://www.twitter.com # follow the redirect
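To see where a redirect points without following it, the `location:` field can be filtered out of the response headers. A minimal sketch, assuming a hypothetical filter named `redirect_target` that reads the headers on stdin:

```shell
# redirect_target is a hypothetical helper: it reads response headers
# (e.g. from `curl -sI <url>`) on stdin and prints the Location value.
redirect_target() {
    # Header lines end in CRLF, so strip the carriage returns first.
    # Header names are case-insensitive, hence tolower().
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}
```

Usage: `curl -sI https://www.twitter.com | redirect_target`. Nothing is printed when the response is not a redirect.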
- `-d`, `--data 'var=val&var=val'`: send form data in the request body
curl -X POST -d 'penguin=tux&color=blue' https://httpbin.org/anything
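`-d` sends the string as given, so values containing spaces or `&` must already be URL-encoded. A hedged sketch using curl's `--data-urlencode` option, which encodes the value part of each `name=value` pair. The helper name `post_form` and the `CURL` override (useful for a dry run) are hypothetical.

```shell
# post_form is a hypothetical helper.
# Usage: post_form <url> name=value [name=value ...]
# Setting CURL to another command (assumption, for dry runs) replaces curl.
post_form() {
    form_url=$1
    shift
    # Replace every name=value argument by "--data-urlencode name=value".
    # The loop list is expanded once, so changing "$@" inside is safe.
    for pair in "$@"; do
        set -- "$@" --data-urlencode "$pair"
        shift
    done
    # Like -d, --data-urlencode makes curl send a POST request.
    "${CURL:-curl}" -s "$@" "$form_url"
}
```

Example: `post_form https://httpbin.org/anything 'penguin=tux' 'note=hello world'`; the `form` field in the echoed JSON shows what the server received.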
- Collect data over a certain period (e.g. with curl)
- Convert the raw data into a suitable form (e.g. JSON/HTML -> CSV)
- Simulate an analysis of the data (e.g. a chart, basic statistics)
- Generate a report (web page, PDF)
Result to be submitted/demonstrated (= 30% of the exam grade)
- Choose a dataset
- Write a script that downloads the desired data
- The script saves the data in a specific directory (set with a variable)
- The file name contains a timestamp
- Open Data Portal of the city of Ghent
- CO2 meter in room B.4.037
- List of public REST APIs (Postman)
- Public APIs (GitHub repo)
- Free APIs for developers (Rapid API)
- Big list of free and open APIs (Ana Kravitz)
- ...
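The download step described earlier (a script that stores the data in a configurable directory, in a file with a timestamp in its name) could be sketched as follows. The names `DATA_DIR`, `DATA_URL`, `timestamped_name` and `download_snapshot` are hypothetical, and the default URL is simply one of the example URLs from this section; treat it as a starting point under those assumptions, not a finished solution.

```shell
#!/bin/sh
# Hypothetical settings: override them in the environment or edit the defaults.
DATA_DIR=${DATA_DIR:-./data}
DATA_URL=${DATA_URL:-https://api.coinlore.net/api/ticker/?id=90}

# Print a file name such as snapshot-20250101-120000.json
# $1 = prefix, $2 = extension
timestamped_name() {
    printf '%s-%s.%s\n' "$1" "$(date +%Y%m%d-%H%M%S)" "$2"
}

# Download DATA_URL into DATA_DIR and print the path of the new file.
download_snapshot() {
    mkdir -p "$DATA_DIR" || return 1
    snapshot_file=$DATA_DIR/$(timestamped_name snapshot json)
    # -s: quiet, -S: still report errors, -f: fail on HTTP errors
    curl -s -S -f -o "$snapshot_file" "$DATA_URL" || return 1
    printf '%s\n' "$snapshot_file"
}
```

Calling `download_snapshot` repeatedly (for example from a scheduled job) yields one file per run, which covers the "collect data over a certain period" step.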