-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.txt
62 lines (48 loc) · 2.73 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
The webscraping program scrapes the four providers of weather forecasts, accuweather,
openweathermap, weather.com and weatherunderground for for their forecasts for
essentially all cities / weather stations in Germany. For every provider, the
forecast that lasts the longest into the future is selected. The forecasts are
stored in text-form (either json or xml). It is optional to parse the forecasts
into python-pandas data format.
If you want to start the webscraping you should first start your Command Window
and go to this directory. From here you can use the following command (note that
you need to have Python 3 installed):
python3 webscraping.py --provider=my_provider --city=my_city --folder=my_folder
The flags have the following options(for providers and city several options are
possible, just separate them by a comma without spaces):
--provider: accuweather,weatherdotcom,wunderground,openweathermap
(default is all providers)
--city: almost any german city name(if misspelled
or not included you will get an error,
default is all cities)
--folder: specify the folder where you want to
save the forecasts relative to the
webscraping folder(if the folder does
not exist it will be created, default is
forecasts)
--database: database where forecasts are stored. Default is
forecast_database.csv.
--newer-than: only parse forecasts that are newer than this time, provided
in UTC POSIX time. Default is 0, which means all
forecasts newer than 1 January 1970. You can use
http:// www.epochconverter.com/
to convert a human readable date to POSIX time. You need
the time in seconds (not miliseconds).
--verbosity: set the amount of information printed on the command line.
Possible levels are <=0, 1 and 2<=. The larger number
the more verbose is the output. Programming wise these
levels correspond to logging.WARNING, logging.INFO and
logging.DEBUG. Default is 0.
--pandize: parse forecasts from the providers --provider in the folder
--folder that are newer than --newer-than into python
pandas format. The pandized forecasts are inserted
into the database --database.
--pandize_temporary: parse those forecasts that were temporarily saved in the last
webscraping. This is especially useful if scrape the forecasts
every day. You can call pandize_temporary some time after the
scraping finished to put the last forecasts into the master
dataframe.
For example, to get the forecasts from accuweather and wunderground for Berlin,
Bremen and Munich and save them into the folder myForecasts you have to
write:
python3 webscraping.py --provider=accuweather,wunderground --city=berlin,bremen,munich --folder=myForecasts