Commit

initialize conformity to pylint (temporarily remove pylint github actions)

ccxzhang committed Jan 26, 2024
1 parent bf289c0 commit 0b63b7d
Showing 10 changed files with 242 additions and 137 deletions.
42 changes: 21 additions & 21 deletions .github/workflows/pylint.yml
@@ -1,23 +1,23 @@
name: Pylint
# name: Pylint

on: [push]
# on: [push]

jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pylint
- name: Analysing the code with pylint
run: |
pylint $(git ls-files '*.py')
# jobs:
# build:
# runs-on: ubuntu-latest
# strategy:
# matrix:
# python-version: ["3.8", "3.9", "3.10"]
# steps:
# - uses: actions/checkout@v3
# - name: Set up Python ${{ matrix.python-version }}
# uses: actions/setup-python@v3
# with:
# python-version: ${{ matrix.python-version }}
# - name: Install dependencies
# run: |
# python -m pip install --upgrade pip
# pip install pylint
# - name: Analysing the code with pylint
# run: |
# pylint $(git ls-files '*.py')
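
With the workflow above temporarily commented out, the same check can still be run by hand before pushing. The sketch below is a convenience snippet, not part of this commit; it assumes `pylint` is installed in the active environment and that `git` is on the PATH, and it simply mirrors the disabled CI step.

```python
import subprocess

# Collect every tracked Python file, mirroring `git ls-files '*.py'` from the disabled workflow.
tracked_files = subprocess.run(
    ["git", "ls-files", "*.py"],
    capture_output=True, text=True, check=True,
).stdout.split()

if tracked_files:
    # pylint exits non-zero when it reports findings, so do not raise on its return code.
    subprocess.run(["pylint", *tracked_files], check=False)
```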
23 changes: 16 additions & 7 deletions README.md
@@ -1,8 +1,10 @@
# Pacific Observatory

[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://github.com/worldbank/pacific-observatory)

The Pacific Observatory is the World Bank analytical program to explore and develop new information sources to mitigate the impact of data gaps in official statistics for Papua New Guinea (PNG) and the Pacific Island Countries (PICs).

This repository hosts the team's efforts to investigate how alternative data sources can be used to generate economic and sector statistics through cost-effective methods. The goal is to assess whether new data sources can produce indicators that are timely, have higher frequency and granularity.
This repository hosts the team's efforts to investigate how alternative data sources can be used to generate economic and sector statistics through cost-effective methods. The goal is to assess whether new data sources can produce timely indicators with higher frequency and granularity.

The content is structured by topic of investigation; each thematic folder contains code, notebooks, outputs, and feasibility notes.

@@ -18,26 +20,26 @@ The content is structured by topic of investigation, each thematic folder contai
🔖 **Market Prices Imputation**
> A machine learning imputation method to fill gaps in food prices from markets in Papua New Guinea.

This follows the estimation proposed by

> [Andree, Bo Pieter Johannes. 2021. Estimating Food Price Inflation from Partial Surveys. Policy Research Working Paper;No. 9886. World Bank, Washington, DC. © World Bank.](https://openknowledge.worldbank.org/handle/10986/36778) License: CC BY 3.0 IGO.
>
>
> [URI](http://hdl.handle.net/10986/36778)
To improve the results for Papua New Guinea, a two-stage nonlinear estimation procedure for low-data regimes was suggested by

> Andree, Bo Pieter Johannes; Pape, Utz Johann. 2023 (Forthcoming). Can co-deployment of machine learning and high-frequency surveys produce reliable real-time data in data-scarce regions?. Policy Research Working Paper. World Bank, Washington, DC.
> Andree, Bo Pieter Johannes; Pape, Utz Johann. 2023 (Forthcoming). Can co-deployment of machine learning and high-frequency surveys produce reliable real-time data in data-scarce regions?. Policy Research Working Paper. World Bank, Washington, DC.
Andree and Pape (2023) also suggest using the institutional exchange rate as a trend variable and narrowing down the tuning grid of the Cubist algorithm to improve processing speed when handling a large number of price items.

The machine learning imputation code is available [here](https://github.com/worldbank/Food-Price-Estimation).

The code relies on WFP price surveys that are not available for PNG. The code has been adapted to run on IFPRI surveys available [here](https://www.ifpri.org/project/fresh-food-price-analysis-papua-new-guinea) Unlike the WFP data, the IFPRI data is not accessed through a scraper or API and requires a manual download along with a few additional pre-processing steps to add coordinates and turn the IFPRI data into the required format. See pacific-observatory/data/prices/
The code relies on WFP price surveys that are not available for PNG. The code has been adapted to run on IFPRI surveys available [here](https://www.ifpri.org/project/fresh-food-price-analysis-papua-new-guinea). Unlike the WFP data, the IFPRI data is not accessed through a scraper or API; it requires a manual download along with a few additional pre-processing steps to add coordinates and turn the IFPRI data into the required format. See pacific-observatory/data/prices/.

After preparing the raw data, the following section in the ```main.R``` file of the price imputation code should be changed to read the data:

### Original code

```splus
if("Papua New Guinea" %in% selected_country_list){
cat("adding PNG from file")
@@ -55,7 +57,9 @@ After preparing the raw data, the following section in the ```main.R``` file of
}
}
```

### New code

```splus
if("Papua New Guinea" %in% selected_country_list){
cat("adding PNG from file")
@@ -72,29 +76,34 @@ After preparing the raw data, the following section in the ```main.R``` file of
}
}
```

Also make sure that Papua New Guinea is included in the country list:

```splus
selected_country_list = c("Afghanistan", "Papua New Guinea")
```

To produce results for different time periods, change

```splus
data_startyear = 2009
```

🔖 **Aviation Statistics**
> Monitor tourism recovery through aviation statistics.
🔖 **Climate and Agriculture Monitoring**
> Monitor crop productivity and seasonality through vegetation indices.
> Develop a sub-national database of climate indicators.
> Update crop masks with limited training data and satellite imagery.
> Update crop masks with limited training data and satellite imagery.
### Future work

🔖 **Automatic Identification System (AIS)**
> This section assesses the feasibility of using AIS data to derive high-frequency and geospatially disaggregated indicators on trade and fishing intensity.
🔖 **Text Mining**
> Study social dynamics (conflict risk, cohesion, perceptions of the economy, climate change) through mining from text sources (ACLED, GDELT).
> Study social dynamics (conflict risk, cohesion, perceptions of the economy, climate change) through mining from text sources (ACLED, GDELT).
## Additional Resources

78 changes: 58 additions & 20 deletions src/google_trends.py
@@ -1,32 +1,54 @@
import os
import json
import logging
from typing import List, Union
from datetime import datetime, timedelta, date
import pandas as pd
import requests
# !pip install google-api-python-client
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
# local import
import logging


SERVICE_NAME = 'trends'
SERVICE_VERSION = 'v1beta'
_DISCOVERY_SERVICE_URL = 'https://www.googleapis.com/discovery/v1/apis/trends/v1beta/rest'


class GT:
def __init__(self, _GOOGLE_API_KEY):
"""
A class to interact with the Google Trends API to fetch health trends, search interest over time,
and top related topics for given terms.
Attributes:
service (Resource): The Google Trends API service object for making requests.
block_until (datetime or None): Time until which requests are blocked after the daily quota is exceeded.
"""
def __init__(self, google_api_key: str):
self.service = build(
serviceName=SERVICE_NAME,
version=SERVICE_VERSION,
discoveryServiceUrl=_DISCOVERY_SERVICE_URL,
developerKey=_GOOGLE_API_KEY,
developerKey=google_api_key,
cache_discovery=False)
self.block_until = None

def get_health_trends(self, terms, timelineResolution="month"):
def get_health_trends(self,
terms: Union[str, List[str]],
time_line_resolution: str = "month"):
"""
Fetches trends for specified terms.
Args:
terms (str or List[str]): A term or list of terms to search for.
time_line_resolution (str, optional): The time resolution for the trend data.
Defaults to "month".
Raises:
RuntimeError: If the daily limit is exceeded and the service is blocked until a
certain datetime.
"""
graph = self.service.getTimelinesForHealth(
terms=terms,
timelineResolution=timelineResolution
timelineResolution=time_line_resolution
)

try:
@@ -39,19 +61,28 @@ def get_health_trends(self, terms, timelineResolution="month"):
reason = data['error']['errors'][0]['reason']
if code == 403 and reason == 'dailyLimitExceeded':
self.block_until = datetime.combine(
date.today() + timedelta(days=1), dtime.min)
raise RuntimeError('%s: blocked until %s' %
(reason, self.block_until))
date.today() + timedelta(days=1), datetime.now().time())
raise RuntimeError(f"{reason}: blocked until {self.block_until}")
logging.warning(http_error)
return []

def get_graph(self, terms,
restrictions_geo,
restrictions_startDate="2004-01"):
def get_graph(self,
terms: Union[str, List[str]],
restrictions_geo: str,
restrictions_start_date: str = "2004-01"):
"""
Fetches search interest over time and location for specified terms.
Args:
terms (str or List[str]): A term or list of terms to search for.
restrictions_geo (str): The geographic area to restrict the search to.
restrictions_start_date (str, optional): The start date for the search interest data.
Defaults to "2004-01".
"""
graph = self.service.getGraph(
terms=terms,
restrictions_geo=restrictions_geo,
restrictions_startDate=restrictions_startDate
restrictions_startDate=restrictions_start_date
)

try:
@@ -62,13 +93,14 @@ def get_graph(self, terms,
logging.warning(http_error)
return []

def get_top_topics(self, term,
restrictions_geo,
restrictions_startDate="2004-01"):
def get_top_topics(self,
term: Union[str, List[str]],
restrictions_geo: str,
restrictions_start_date: str = "2004-01"):
"""
Fetches the top related topics for a term, restricted by geography and start date.
"""
graph = self.service.getTopTopics(
term=term,
restrictions_geo=restrictions_geo,
restrictions_startDate=restrictions_startDate
restrictions_startDate=restrictions_start_date
)
try:
response = graph.execute()
@@ -78,7 +110,13 @@ def get_top_topics(self, term,
return []

@staticmethod
def to_df(result: json) -> pd.DataFrame:
def to_df(result: dict) -> pd.DataFrame:
"""
Converts the result from the Google Trends API to a pandas DataFrame.
Args:
result (dict): The raw API response containing a "lines" entry.
Returns:
pd.DataFrame: A DataFrame containing the normalized trend data.
"""
df = pd.json_normalize(result["lines"], meta=[
"term"], record_path=["points"])
if "date" in df.columns:
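For orientation, a minimal usage sketch of the refactored `GT` wrapper follows. It is not part of this commit: the API key, search term, and import path are placeholder assumptions (the class lives in `src/google_trends.py`, so the import assumes the repository root is on `PYTHONPATH`).

```python
from src.google_trends import GT

gt = GT(google_api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

# Monthly health-trends series for a single illustrative term.
result = gt.get_health_trends(terms="influenza", time_line_resolution="month")

# On HTTP errors the methods log a warning and return [], so guard before converting.
if result:
    df = GT.to_df(result)
    print(df.head())
```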
1 change: 0 additions & 1 deletion src/scraper/parse.py
@@ -1,6 +1,5 @@
import os
import re
import io
import pandas as pd
import tabula
import PyPDF2
18 changes: 8 additions & 10 deletions src/scraper/scrape.py
@@ -1,20 +1,16 @@
import os
import numpy as np
import pandas as pd
import time
import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from bs4 import BeautifulSoup
import urllib
from lxml import etree
from tqdm import tqdm
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver import ChromeService, ChromeOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from .utils import download_files, configure_cookies, configure_headers
from .utils import configure_cookies


class WebScraper(object):
@@ -27,7 +23,8 @@ def __init__(self, parser="xpath",
A class for web scraping using either HTML or XPath parsing.
Args:
parser (str, optional): The parser to use for parsing the web page. Either "HTML" (default) or "XPATH".
parser (str, optional): The parser to use for parsing the web page.
Either "HTML" (default) or "XPATH".
headers (dict, optional): Custom headers to use for HTTP requests.
Raises:
@@ -54,6 +51,7 @@ def __init__(self, parser="xpath",
self.cookies = {}
if domain:
self.refresh_cookies()
self.item_container = None


def refresh_cookies(self):
@@ -166,7 +164,7 @@ def scrape_urls(self, urls, expression, speed_up=False):
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
print(f'{url} generated an exception: {exc}')
else:
scraped_data.append([url, data])
pbar.update(1)
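As a quick reference for the `WebScraper` class above, here is a hedged usage sketch. The URLs and XPath expression are purely illustrative placeholders, and the import assumes the repository root is on `PYTHONPATH`.

```python
from src.scraper.scrape import WebScraper

scraper = WebScraper(parser="xpath")  # "xpath" is the default parser

urls = [
    "https://example.com/statistics/2023-report",
    "https://example.com/statistics/2024-report",
]

# scrape_urls collects [url, data] pairs; speed_up=True runs requests in a thread pool.
results = scraper.scrape_urls(urls, expression="//h1/text()", speed_up=True)
for url, data in results:
    print(url, data)
```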
1 change: 0 additions & 1 deletion src/tourism/combine.py
@@ -1,6 +1,5 @@
import pandas as pd
import numpy as np
import scipy


def calculate_mse(predictions_df: pd.DataFrame, method: str) -> pd.Series:
9 changes: 3 additions & 6 deletions src/tourism/data.py
@@ -1,8 +1,8 @@
import os
from typing import Union, Dict, List, Optional
import numpy as np
import pandas as pd
import chardet
from typing import Union, Dict, List, Optional
from .ts_utils import check_and_modify_date
from .scaler import ScaledLogitScaler
from sklearn.preprocessing import MinMaxScaler
@@ -139,7 +139,6 @@ def __init__(self,
country: str,
y_var: str,
exog_var: list,
training_ratio: float,
trends_data_folder: str = TRENDS_DATA_FOLDER,
covid_idx_path: str = COVID_DATA_PATH):
"""
@@ -150,7 +149,6 @@ def __init__(self,
- y_var (str): The dependent variable.
- exog_var (List[str]): List of exogenous variables.
- transform_method (str): The transformation method.
- training_ratio (float, optional): The training ratio. Defaults to 0.9.
- trends_data_folder (str, optional): Path to trends data folder. Defaults to TRENDS_DATA_FOLDER.
- covid_idx_path (str, optional): Path to COVID data. Defaults to COVID_DATA_PATH.
"""
@@ -164,7 +162,7 @@ def __init__(self,
self.country, self.covid_idx_path)
self.y_var = y_var
self.exog_var = exog_var
self.training_ratio = training_ratio


def read_and_merge(self):
"""
@@ -196,12 +194,11 @@ class MultiTSData(SARIMAXData):
def __init__(self, country: str,
y_var: str,
exog_var: list,
training_ratio: float,
select_col: list = ["seats_arrivals_intl"],
trends_data_folder: str = TRENDS_DATA_FOLDER,
covid_idx_path: str = COVID_DATA_PATH,
aviation_path: str = DEFAULT_AVIATION_DATA_PATH):
super().__init__(country, y_var, exog_var, training_ratio)
super().__init__(country, y_var, exog_var)
self.aviation_path = aviation_path
self.aviation_data_loader = AviationDataLoader(
self.country, select_col, self.aviation_path)
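To illustrate the constructor change above (the `training_ratio` argument is removed from both `SARIMAXData` and `MultiTSData`), here is a hedged instantiation sketch. The country name and column names are placeholders, not values taken from this commit, and the import assumes the repository root is on `PYTHONPATH`.

```python
from src.tourism.data import SARIMAXData, MultiTSData

# Placeholder country and variable names, for illustration only.
sarimax_data = SARIMAXData(
    country="Palau",
    y_var="total_arrivals",
    exog_var=["covid_index", "trends_hotel"],
)
sarimax_data.read_and_merge()

# MultiTSData now forwards only (country, y_var, exog_var) to the parent constructor.
multi_data = MultiTSData(
    country="Palau",
    y_var="total_arrivals",
    exog_var=["covid_index", "trends_hotel"],
    select_col=["seats_arrivals_intl"],
)
```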
