Commit

initialize conformity to pylint (temporarily remove pylint github actions)

ccxzhang committed Jan 26, 2024
1 parent bf289c0 commit 0b63b7d
Showing 10 changed files with 242 additions and 137 deletions.
42 changes: 21 additions & 21 deletions .github/workflows/pylint.yml
@@ -1,23 +1,23 @@
name: Pylint
# name: Pylint

on: [push]
# on: [push]

jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pylint
- name: Analysing the code with pylint
run: |
pylint $(git ls-files '*.py')
# jobs:
# build:
# runs-on: ubuntu-latest
# strategy:
# matrix:
# python-version: ["3.8", "3.9", "3.10"]
# steps:
# - uses: actions/checkout@v3
# - name: Set up Python ${{ matrix.python-version }}
# uses: actions/setup-python@v3
# with:
# python-version: ${{ matrix.python-version }}
# - name: Install dependencies
# run: |
# python -m pip install --upgrade pip
# pip install pylint
# - name: Analysing the code with pylint
# run: |
# pylint $(git ls-files '*.py')
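
With the workflow above temporarily commented out, the same check can still be run by hand before pushing. The sketch below is a convenience snippet, not part of this commit; it assumes `pylint` is installed in the active environment and that `git` is on the PATH, and it simply mirrors the disabled CI step.

```python
import subprocess

# Collect every tracked Python file, mirroring `git ls-files '*.py'` from the disabled workflow.
tracked_files = subprocess.run(
    ["git", "ls-files", "*.py"],
    capture_output=True, text=True, check=True,
).stdout.split()

if tracked_files:
    # pylint exits non-zero when it reports findings, so do not raise on its return code.
    subprocess.run(["pylint", *tracked_files], check=False)
```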
23 changes: 16 additions & 7 deletions README.md
@@ -1,8 +1,10 @@
# Pacific Observatory

[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://github.com/worldbank/pacific-observatory)

The Pacific Observatory is the World Bank analytical program to explore and develop new information sources to mitigate the impact of data gaps in official statistics for Papua New Guinea (PNG) and the Pacific Island Countries (PICs).

This repository hosts the team's efforts to investigate how alternative data sources can be used to generate economic and sector statistics through cost-effective methods. The goal is to assess whether new data sources can produce indicators that are timely, have higher frequency and granularity.
This repository hosts the team's efforts to investigate how alternative data sources can be used to generate economic and sector statistics through cost-effective methods. The goal is to assess whether new data sources can produce timely indicators with higher frequency and granularity.

The content is structured by topic of investigation; each thematic folder contains code, notebooks, outputs, and feasibility notes.

@@ -18,26 +20,26 @@ The content is structured by topic of investigation, each thematic folder contai
🔖 **Market Prices Imputation**
> A machine learning imputation method to fill gaps in food prices from markets in Papua New Guinea.

This follows the estimation proposed by

> [Andree, Bo Pieter Johannes. 2021. Estimating Food Price Inflation from Partial Surveys. Policy Research Working Paper;No. 9886. World Bank, Washington, DC. © World Bank.](https://openknowledge.worldbank.org/handle/10986/36778) License: CC BY 3.0 IGO.
>
>
> [URI](http://hdl.handle.net/10986/36778)
To improve the results for Papua New Guinea, a two-stage nonlinear estimation procedure for low-data regimes was suggested by

> Andree, Bo Pieter Johannes; Pape, Utz Johann. 2023 (Forthcoming). Can co-deployment of machine learning and high-frequency surveys produce reliable real-time data in data-scarce regions?. Policy Research Working Paper. World Bank, Washington, DC.
> Andree, Bo Pieter Johannes; Pape, Utz Johann. 2023 (Forthcoming). Can co-deployment of machine learning and high-frequency surveys produce reliable real-time data in data-scarce regions?. Policy Research Working Paper. World Bank, Washington, DC.
Andree and Pape (2023) also suggest using the institutional exchange rate as a trend variable and narrowing down the tuning grid of the Cubist algorithm to improve processing speed when handling a large number of price items.

The machine learning imputation code is available [here](https://github.com/worldbank/Food-Price-Estimation).

The code relies on WFP price surveys that are not available for PNG. The code has been adapted to run on IFPRI surveys available [here](https://www.ifpri.org/project/fresh-food-price-analysis-papua-new-guinea) Unlike the WFP data, the IFPRI data is not accessed through a scraper or API and requires a manual download along with a few additional pre-processing steps to add coordinates and turn the IFPRI data into the required format. See pacific-observatory/data/prices/
The code relies on WFP price surveys that are not available for PNG. The code has been adapted to run on IFPRI surveys available [here](https://www.ifpri.org/project/fresh-food-price-analysis-papua-new-guinea). Unlike the WFP data, the IFPRI data is not accessed through a scraper or API; it requires a manual download along with a few additional pre-processing steps to add coordinates and turn the IFPRI data into the required format. See pacific-observatory/data/prices/.

After preparing the raw data, the following section in the ```main.R``` file of the price imputation code should be changed to read the data:

### Original code

```splus
if("Papua New Guinea" %in% selected_country_list){
cat("adding PNG from file")
@@ -55,7 +57,9 @@ After preparing the raw data, the following section in the ```main.R``` file of
}
}
```

### New code

```splus
if("Papua New Guinea" %in% selected_country_list){
cat("adding PNG from file")
@@ -72,29 +76,34 @@ After preparing the raw data, the following section in the ```main.R``` file of
}
}
```

Also make sure that Papua New Guinea is included in the country list:

```splus
selected_country_list = c("Afghanistan", "Papua New Guinea")
```

To produce results for different time periods, change

```splus
data_startyear = 2009
```

🔖 **Aviation Statistics**
> Monitor tourism recovery through aviation statistics.
🔖 **Climate and Agriculture Monitoring**
> Monitor crop productivity and seasonality through vegetation indices.
> Develop a sub-national database of climate indicators.
> Update crop masks with limited training data and satellite imagery.
> Update crop masks with limited training data and satellite imagery.
### Future work

🔖 **Automatic Identification System (AIS)**
> This section assesses the feasibility of using AIS data to derive high-frequency and geospatially disaggregated indicators on trade and fishing intensity.
🔖 **Text Mining**
> Study social dynamics (conflict risk, cohesion, perceptions of the economy, climate change) through mining from text sources (ACLED, GDELT).
> Study social dynamics (conflict risk, cohesion, perceptions of the economy, climate change) through mining from text sources (ACLED, GDELT).
## Additional Resources

78 changes: 58 additions & 20 deletions src/google_trends.py
@@ -1,32 +1,54 @@
import os
import json
import logging
from typing import List, Union
from datetime import datetime, timedelta, date
import pandas as pd
import requests
# !pip install google-api-python-client
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
# local import
import logging


SERVICE_NAME = 'trends'
SERVICE_VERSION = 'v1beta'
_DISCOVERY_SERVICE_URL = 'https://www.googleapis.com/discovery/v1/apis/trends/v1beta/rest'


class GT:
def __init__(self, _GOOGLE_API_KEY):
"""
A class to interact with the Google Trends API to fetch health trends, search interest over time,
and top related topics for given terms.
Attributes:
service (Resource): The Google Trends API service object for making requests.
block_until (datetime or None): Time until which requests are blocked after the daily quota is exceeded.
"""
def __init__(self, google_api_key: str):
self.service = build(
serviceName=SERVICE_NAME,
version=SERVICE_VERSION,
discoveryServiceUrl=_DISCOVERY_SERVICE_URL,
developerKey=_GOOGLE_API_KEY,
developerKey=google_api_key,
cache_discovery=False)
self.block_until = None

def get_health_trends(self, terms, timelineResolution="month"):
def get_health_trends(self,
terms: Union[str, List[str]],
time_line_resolution: str = "month"):
"""
Fetches trends for specified terms.
Args:
terms (str or List[str]): A term or list of terms to search for.
time_line_resolution (str, optional): The time resolution for the trend data.
Defaults to "month".
Raises:
RuntimeError: If the daily limit is exceeded and the service is blocked until a
certain datetime.
"""
graph = self.service.getTimelinesForHealth(
terms=terms,
timelineResolution=timelineResolution
timelineResolution=time_line_resolution
)

try:
@@ -39,19 +61,28 @@ def get_health_trends(self, terms, timelineResolution="month"):
reason = data['error']['errors'][0]['reason']
if code == 403 and reason == 'dailyLimitExceeded':
self.block_until = datetime.combine(
date.today() + timedelta(days=1), dtime.min)
raise RuntimeError('%s: blocked until %s' %
(reason, self.block_until))
date.today() + timedelta(days=1), datetime.now().time())
raise RuntimeError(f"{reason}: blocked until {self.block_until}")
logging.warning(http_error)
return []

def get_graph(self, terms,
restrictions_geo,
restrictions_startDate="2004-01"):
def get_graph(self,
terms: Union[str, List[str]],
restrictions_geo: str,
restrictions_start_date: str = "2004-01"):
"""
Fetches search interest over time and location for specified terms.
Args:
terms (str or List[str]): A term or list of terms to search for.
restrictions_geo (str): The geographic area to restrict the search to.
restrictions_start_date (str, optional): The start date for the search interest data.
Defaults to "2004-01".
"""
graph = self.service.getGraph(
terms=terms,
restrictions_geo=restrictions_geo,
restrictions_startDate=restrictions_startDate
restrictions_startDate=restrictions_start_date
)

try:
@@ -62,13 +93,14 @@ def get_graph(self, terms,
logging.warning(http_error)
return []

def get_top_topics(self, term,
restrictions_geo,
restrictions_startDate="2004-01"):
def get_top_topics(self,
term: Union[str, List[str]],
restrictions_geo: str,
restrictions_start_date: str = "2004-01"):
"""
Fetches the top related topics for a term, restricted by geography and start date.
"""
graph = self.service.getTopTopics(
term=term,
restrictions_geo=restrictions_geo,
restrictions_startDate=restrictions_startDate
restrictions_startDate=restrictions_start_date
)
try:
response = graph.execute()
@@ -78,7 +110,13 @@ def get_top_topics(self, term,
return []

@staticmethod
def to_df(result: json) -> pd.DataFrame:
def to_df(result: dict) -> pd.DataFrame:
"""
Converts the result from the Google Trends API to a pandas DataFrame.
Args:
result (dict): The raw API response containing a "lines" entry.
Returns:
pd.DataFrame: A DataFrame containing the normalized trend data.
"""
df = pd.json_normalize(result["lines"], meta=[
"term"], record_path=["points"])
if "date" in df.columns:
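For orientation, a minimal usage sketch of the refactored `GT` wrapper follows. It is not part of this commit: the API key, search term, and import path are placeholder assumptions (the class lives in `src/google_trends.py`, so the import assumes the repository root is on `PYTHONPATH`).

```python
from src.google_trends import GT

gt = GT(google_api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

# Monthly health-trends series for a single illustrative term.
result = gt.get_health_trends(terms="influenza", time_line_resolution="month")

# On HTTP errors the methods log a warning and return [], so guard before converting.
if result:
    df = GT.to_df(result)
    print(df.head())
```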
1 change: 0 additions & 1 deletion src/scraper/parse.py
@@ -1,6 +1,5 @@
import os
import re
import io
import pandas as pd
import tabula
import PyPDF2
18 changes: 8 additions & 10 deletions src/scraper/scrape.py
@@ -1,20 +1,16 @@
import os
import numpy as np
import pandas as pd
import time
import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from bs4 import BeautifulSoup
import urllib
from lxml import etree
from tqdm import tqdm
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver import ChromeService, ChromeOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from .utils import download_files, configure_cookies, configure_headers
from .utils import configure_cookies


class WebScraper(object):
@@ -27,7 +23,8 @@ def __init__(self, parser="xpath",
A class for web scraping using either HTML or XPath parsing.
Args:
parser (str, optional): The parser to use for parsing the web page. Either "HTML" (default) or "XPATH".
parser (str, optional): The parser to use for parsing the web page.
Either "HTML" (default) or "XPATH".
headers (dict, optional): Custom headers to use for HTTP requests.
Raises:
@@ -54,6 +51,7 @@ def __init__(self, parser="xpath",
self.cookies = {}
if domain:
self.refresh_cookies()
self.item_container = None


def refresh_cookies(self):
@@ -166,7 +164,7 @@ def scrape_urls(self, urls, expression, speed_up=False):
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
print(f'{url} generated an exception: {exc}')
else:
scraped_data.append([url, data])
pbar.update(1)
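As a quick reference for the `WebScraper` class above, here is a hedged usage sketch. The URLs and XPath expression are purely illustrative placeholders, and the import assumes the repository root is on `PYTHONPATH`.

```python
from src.scraper.scrape import WebScraper

scraper = WebScraper(parser="xpath")  # "xpath" is the default parser

urls = [
    "https://example.com/statistics/2023-report",
    "https://example.com/statistics/2024-report",
]

# scrape_urls collects [url, data] pairs; speed_up=True runs requests in a thread pool.
results = scraper.scrape_urls(urls, expression="//h1/text()", speed_up=True)
for url, data in results:
    print(url, data)
```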
1 change: 0 additions & 1 deletion src/tourism/combine.py
@@ -1,6 +1,5 @@
import pandas as pd
import numpy as np
import scipy


def calculate_mse(predictions_df: pd.DataFrame, method: str) -> pd.Series:
9 changes: 3 additions & 6 deletions src/tourism/data.py
@@ -1,8 +1,8 @@
import os
from typing import Union, Dict, List, Optional
import numpy as np
import pandas as pd
import chardet
from typing import Union, Dict, List, Optional
from .ts_utils import check_and_modify_date
from .scaler import ScaledLogitScaler
from sklearn.preprocessing import MinMaxScaler
@@ -139,7 +139,6 @@ def __init__(self,
country: str,
y_var: str,
exog_var: list,
training_ratio: float,
trends_data_folder: str = TRENDS_DATA_FOLDER,
covid_idx_path: str = COVID_DATA_PATH):
"""
@@ -150,7 +149,6 @@ def __init__(self,
- y_var (str): The dependent variable.
- exog_var (List[str]): List of exogenous variables.
- transform_method (str): The transformation method.
- training_ratio (float, optional): The training ratio. Defaults to 0.9.
- trends_data_folder (str, optional): Path to trends data folder. Defaults to TRENDS_DATA_FOLDER.
- covid_idx_path (str, optional): Path to COVID data. Defaults to COVID_DATA_PATH.
"""
@@ -164,7 +162,7 @@ def __init__(self,
self.country, self.covid_idx_path)
self.y_var = y_var
self.exog_var = exog_var
self.training_ratio = training_ratio


def read_and_merge(self):
"""
@@ -196,12 +194,11 @@ class MultiTSData(SARIMAXData):
def __init__(self, country: str,
y_var: str,
exog_var: list,
training_ratio: float,
select_col: list = ["seats_arrivals_intl"],
trends_data_folder: str = TRENDS_DATA_FOLDER,
covid_idx_path: str = COVID_DATA_PATH,
aviation_path: str = DEFAULT_AVIATION_DATA_PATH):
super().__init__(country, y_var, exog_var, training_ratio)
super().__init__(country, y_var, exog_var)
self.aviation_path = aviation_path
self.aviation_data_loader = AviationDataLoader(
self.country, select_col, self.aviation_path)
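To illustrate the constructor change above (the `training_ratio` argument is removed from both `SARIMAXData` and `MultiTSData`), here is a hedged instantiation sketch. The country name and column names are placeholders, not values taken from this commit, and the import assumes the repository root is on `PYTHONPATH`.

```python
from src.tourism.data import SARIMAXData, MultiTSData

# Placeholder country and variable names, for illustration only.
sarimax_data = SARIMAXData(
    country="Palau",
    y_var="total_arrivals",
    exog_var=["covid_index", "trends_hotel"],
)
sarimax_data.read_and_merge()

# MultiTSData now forwards only (country, y_var, exog_var) to the parent constructor.
multi_data = MultiTSData(
    country="Palau",
    y_var="total_arrivals",
    exog_var=["covid_index", "trends_hotel"],
    select_col=["seats_arrivals_intl"],
)
```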
