Final PR from Dev to Main #28

Merged
merged 44 commits into from
Jan 17, 2025
44 commits
700332e
Add files via upload
anngo2 Oct 11, 2024
9492d8d
Add files via upload
cecilymp1 Oct 17, 2024
2f76524
Create googlemaps.ipynb
cecilymp1 Oct 22, 2024
bbc0d43
Create cleaned_alldata_version2.csv
cecilymp1 Oct 24, 2024
6f0ff0c
uploading cleaning ipynb
cecilymp1 Oct 24, 2024
fe116a6
adding visualization folder along with notebook
anngo2 Oct 24, 2024
f30b2f0
Finding NaN Rows
adichopra8 Oct 24, 2024
d7108c3
data analysis viz
cecilymp1 Oct 25, 2024
4735cd6
created base_questions notebook
cecilymp1 Oct 25, 2024
8de9de7
data analysis NON CORRUPTED FILE
cecilymp1 Oct 25, 2024
7102948
Update data_analysis_visuals.ipynb
cecilymp1 Oct 25, 2024
3de3a6c
Merge branch 'data-analysis' of https://github.com/BU-Spark/ds-season…
cecilymp1 Oct 25, 2024
f1b2f06
created top 30 species csv
cecilymp1 Oct 25, 2024
3778537
changes to Data Visualiztion Mango
cecilymp1 Oct 29, 2024
ca01400
top 30 species analysis
cecilymp1 Oct 29, 2024
663ee09
deleted base_questions.ipynb because it is in the data_analysis.ipynb
cecilymp1 Oct 29, 2024
818074a
pushing clustering analysis in Kerala using DBSCAN
anngo2 Nov 6, 2024
6ccc6bd
geographical analysis for the top 30 species
anngo2 Nov 13, 2024
3766d50
cleaned up some data and also focused on kerala more
cecilymp1 Nov 13, 2024
b85ebbc
create avg phenological csv
cecilymp1 Nov 20, 2024
19d0407
powerbi csvs
cecilymp1 Nov 20, 2024
0e80d6e
powerbi code csv
cecilymp1 Nov 20, 2024
14f5e81
csv folder creation for organization
cecilymp1 Dec 3, 2024
18333fa
adding geo analysis code
anngo2 Dec 4, 2024
b218d55
update to powerbi ipynb
cecilymp1 Dec 5, 2024
a1f890f
edit to flourish onset weeks data
cecilymp1 Dec 6, 2024
bac52ad
update
cecilymp1 Dec 7, 2024
c92a6c4
2024 Readme file
cecilymp1 Dec 7, 2024
7549833
Final Deliverable Report Upload
cecilymp1 Dec 7, 2024
fba6a6c
final deliverable report
cecilymp1 Dec 7, 2024
e412500
readme update
cecilymp1 Dec 7, 2024
4cfe4d7
removing -2 values from the markov modeling, deleting mock_anal file
msbrendakim Dec 9, 2024
da4a2ac
final project report - fall 2024
anngo2 Dec 10, 2024
9db1780
removed copy of final deliverable
Dec 10, 2024
88b3508
Delete testing github!
anngo2 Dec 10, 2024
e451020
created dataset doc
Dec 10, 2024
4a7f775
moved readme outside of dataset doc
Dec 10, 2024
ccb6a76
pushing the DATASETDOC
Dec 10, 2024
6b72724
added PowerBI to final report documentation
anngo2 Dec 12, 2024
7a731b2
Delete Final Deliverable Report - FALL 2024 .pdf
anngo2 Dec 12, 2024
fee6c87
Update README.md
sins42 Dec 12, 2024
7ce6a1c
Merge pull request #26 from BU-Spark/data-analysis
sins42 Dec 12, 2024
a562df1
Update DATASETDOC-fa24.md
sins42 Dec 12, 2024
7ca783d
Merge pull request #29 from BU-Spark/sins42-patch-1
funkyvoong Jan 17, 2025
260 changes: 181 additions & 79 deletions README.md
@@ -1,79 +1,181 @@
# PIT-NE SeasonWatch Project Overview

_Further details on the project background, process, and results are in SeasonWatch_project_report._

SeasonWatch, a citizen science organization based in India, provided our PIT-NE team with a citizen database containing daily tree phenology data collected from citizen scientists in India and a reference database containing weekly tree phenology data collected from credible sources (e.g. textbooks).

## Applications

Our team processed and analyzed these databases to support SeasonWatch's climate research efforts:

### Data Processing

- Cleaned and reformatted citizen and reference databases (Made database formatting consistent, handled data with incorrectly reported features, etc.)
- Developed a data validation system for citizen database (Used isolation forests for anomaly detection)

### Data Analysis

- Created visualizations of the citizen and reference data over time (Bar and line charts highlighting discrepancies between the citizen and reference observations over time)
- Developed a process for selecting representative citizen observations over a year to use as up-to-date baselines for any species.
- Designed a scoring function to identify flowering and fruiting stage transitions throughout a given year.

## Repository Structure

### code (Contains Python notebooks used in the final product)
- -2_values (Flags citizen observations with incorrect reports regarding the presence or absence of a phenophase in the reported species)
- data_cleaning (Cleans citizen and reference databases, and validates citizen database)
- mean_transition_times_generation (Creates visualizations and a dataset of probability distributions of phenophase transition times based on a score function)
- selecting_reference_data (Creates visualizations and a dataset of representative citizen observations selected as baselines)
- validation_labels (Flags citizen observations dropped during the data cleaning process and gives reasons for dropping them)
- visualizations (Creates visualizations of the citizen and reference data)
- year_to_year_transition_times_data_generation (Creates the year_to_year_transition_time dataset)
### data (Contains CSV files of original data and data produced by the Python notebooks in code)
- citizen_states_cleaned (Cleaned and reformatted citizen database sorted by states)
- india_map (Geographic data used for finding the Indian state given a set of coordinates)
- original_citizen_data (Citizen database given by SeasonWatch)
- original_reference_data (Reference database given by SeasonWatch)
- reference_states_cleaned (Cleaned and reformatted reference database sorted by states)
- alldata_labeling_-2_all_species (Citizen database given by SeasonWatch with incorrect reports regarding the presence or absence of a phenophase in the reported species flagged)
- average_transition_times (Dataset of probability distributions of phenophase transition times based on a score function)
- cleaned_alldata (Cleaned and reformatted citizen database as one dataset)
- selected_reference_data (Dataset of representative citizen observations selected as baselines)
- species codes (Dataset mapping tree species IDs to names)
- validation_labels_alldata (Citizen database given by SeasonWatch with citizen observations dropped during the data cleaning process flagged and reasons for dropping them given)
- year_to_year_transition_time (Dataset of max and mean transition time and probability of phenophases)
### dev_code (Contains Python notebooks used in the development process)
- jobfiles (Files of jobs submitted to shared cloud computing service)
- scc-config (Config for submitting jobs to shared cloud computing service)
- kmeans_pca_testing (Experimenting with and visualizing data validation methods)
- mean transition times from repeat observations (Experimenting with only using regular citizen observations to find phenophase transition times)
- mean_transition_times_dev (Experimenting with different methods for finding phenophase transition times)
- plotting (Preliminary, experimental visualizations)
- ref_cit_na_comparison (Comparing how much citizen data has associated reference data)
### plots (Contains PNG files depicting plots produced by the Python notebooks in code)

> _Citizen observations are usually depicted as percentages. This measure indicates the percentage of citizen reports observing a phenophase in the given week._
>
> _Plots report information weekly (48 weeks per year) over a year._

- combination_percentage_charts (Compares citizen data and reference data over time; bar charts indicate number of citizen observations that week)
- overlaid_percentage_plots (Compares related phenophases within citizen data over time; bar charts indicate number of citizen observations that week)
- repeat_combination_percentage_charts (Compares regular observations and all observations within citizen data over time; bar charts indicate number of citizen observations that week)
- repeat_observations (Compares differences between regular observations and reference data over time, and between all observations and reference data over time)
- selected_ref_vs_cit (Compares citizen data and selected baselines over time)
- transition_bar_plots (Depicts number of observations reporting a phenophase appearing over time)
- two_values_weighted (Compares percentage presence of a phenophase and the magnitude of the presence of a phenophase within the citizen data over time)
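
The percentage measure described in the note above (the share of citizen reports in a given week that observe a phenophase) can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the function name `weekly_presence_percentage` and the `(week, present)` input shape are assumptions made for this example.

```python
from collections import defaultdict


def weekly_presence_percentage(observations):
    """Return {week: percentage of reports that observed the phenophase}.

    `observations` is an iterable of (week, present) pairs, where `week`
    is a week index (1..48 in the plots) and `present` is a bool.
    This input shape is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(int)     # number of reports per week
    positives = defaultdict(int)  # reports that observed the phenophase
    for week, present in observations:
        totals[week] += 1
        if present:
            positives[week] += 1
    return {week: 100.0 * positives[week] / totals[week] for week in sorted(totals)}


# Made-up sample: four reports in week 1, two reports in week 2.
sample = [(1, True), (1, False), (1, True), (1, True), (2, False), (2, True)]
percentages = weekly_presence_percentage(sample)
```

Weeks with no reports are simply absent from the result, which avoids dividing by zero and keeps "no data" distinct from "0% presence".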

## Usage Guide

### Step 1: Data Cleaning

Data should be cleaned, reformatted, and validated before any other use, so run the data cleaning notebook or script before any visualization or analysis.

> _Edit file paths within the code to any new citizen data or reference data CSV files._

### Step 2: Plotting & Analysis

Any other notebook in the code folder can be run next to update the data and plots. The notebooks provide functions for plotting and producing datasets; adjust the function parameters (states, species, year, etc.) as needed. For example, to get selected reference data on tamarind in Kerala in 2018, set the state, species, and year parameters accordingly.

> _Edit plot and CSV file paths within the code as needed._
# SeasonWatch Project

## Overview
The SeasonWatch Project aims to analyze the phenological changes in tree species in India, with a focus on Kerala, using citizen-science data collected from 2015 to 2023. The project examines the relationship between climate change and tree phenology by identifying trends, seasonal shifts, and geographic variations. Our final deliverables include visualizations, statistical analyses, and code that can be reused for similar ecological studies.

## Objectives
1. **Analyze phenological changes:** Investigate how trees respond to climate change and seasonal transitions.
2. **Identify key patterns:** Focus on the timing of phenological stages (leaves, flowers, fruits) for the top 30 observed tree species.
3. **Visualize shifts:** Create interactive and static visualizations to convey changes in onset weeks, geographic clustering, and seasonal variability.
4. **Provide reusable tools:** Develop and share scripts, cleaned datasets, and workflows for future analysis.

---

## Deliverables
### 1. Data Cleaning and Preparation
- **Input Data:**
  - Citizen-submitted observations from SeasonWatch (2015–2023).
  - ~177 species with detailed phenological stage observations.
- **Cleaned Data:**
  - Filtered dataset with accurate geocoding, standardized state names, and adjusted missing values.
  - Historical (pre-2020) and comparative (post-2020) datasets for reference.

### 2. Visualizations
- **Interactive Visualizations:** Created with Flourish Studio, including:
  - Seasonal shifts in onset weeks.
  - Geographic clustering of phenological stages.
- **Static Visuals:** Heatmaps, time-series plots, and summary tables saved to:
  `/data/VISUALIZATIONS-fall 2024/Kerala Visuals`

### 3. Statistical Analysis
- Summary statistics, regression analysis, and survival modeling to answer base questions:
  - How are trees changing due to climate change?
  - What is the onset timing for flowering and fruiting in tropical species?
  - What is the probability of transitioning between seasonal states?
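
The last question, the probability of transitioning between seasonal states, can be illustrated by counting consecutive state pairs in a sequence of observations. This is a minimal sketch of a first-order transition estimate, not the project's modeling code: the function name `transition_probabilities` and the state labels are assumptions for this example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next state | current state) from one ordered sequence.

    Counts each consecutive (current, next) pair and normalizes the
    counts per current state. State labels are hypothetical.
    """
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Made-up weekly states for a single tree.
weekly_states = ["none", "none", "flowering", "flowering", "fruiting", "none"]
probs = transition_probabilities(weekly_states)
```

In practice the counts would be pooled over many trees and years before normalizing, since a single short sequence gives very coarse estimates.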

### 4. Final Report
- Comprehensive document with:
  - Visualizations.
  - Interpretations of patterns and trends.
  - Recommendations for conservation efforts.
- Delivered in PDF format.

---

## Getting Started
### Prerequisites
1. **Python Environment:**
   - Python 3.9 or higher.
   - Required libraries:
     - `pandas`
     - `numpy`
     - `matplotlib`
     - `geopandas` (v0.9.0)
     - `shapely` (v2.0.1)
     - `googlemaps`
     - `seaborn`
2. **Geospatial Tools:**
   - Shapefiles available in the `india_map` folder for geographic visualizations.
   - Google Maps API key for geocoding (optional).
3. **Data Files:**
   - Cleaned datasets stored in the `/data` directory.

### Installation
1. Clone the repository:
```bash
git clone https://github.com/your-org/seasonwatch-project.git
cd seasonwatch-project
```
2. Set up a virtual environment:
```bash
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
pip install -r requirements.txt
```
3. Download the cleaned datasets and place them in the `/data` directory.

### Running the Code
1. **Data Cleaning:**
```bash
python scripts/data_cleaning.py
```
Cleans raw observation data, standardizes formats, and generates cleaned datasets.

2. **Geocoding:**
If using Google Maps API, update your API key in `config.py`:
```python
GOOGLE_API_KEY = 'your-api-key'
```
Run geocoding script:
```bash
python scripts/geocoding.py
```

3. **Visualization Generation:**
Generate visuals:
```bash
python scripts/visualizations.py
```
Outputs saved in `/data/VISUALIZATIONS-fall 2024`.

4. **Statistical Analysis:**
Perform analyses and generate summary tables:
```bash
python scripts/analysis.py
```
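
As an alternative to hard-coding the key in `config.py` (step 2 above), the key could be read from an environment variable so that it stays out of version control. A minimal sketch, assuming a hypothetical `load_google_api_key` helper and an environment variable named `GOOGLE_API_KEY`; neither is part of the project's scripts.

```python
import os


def load_google_api_key(env=os.environ):
    """Return the API key from the environment, or None if it is not set.

    The variable name GOOGLE_API_KEY is an assumption that mirrors the
    constant in config.py. Returning None lets callers skip geocoding,
    since the README lists the key as optional.
    """
    key = env.get("GOOGLE_API_KEY", "").strip()
    return key or None


# Demonstration with an explicit mapping instead of the real environment.
api_key = load_google_api_key({"GOOGLE_API_KEY": " demo-key "})
missing_key = load_google_api_key({})
```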

---

## Directory Structure
```
seasonwatch-project/
├── data/
│   ├── raw_data.csv
│   ├── cleaned_data.csv
│   ├── VISUALIZATIONS-fall 2024/
│   │   ├── Kerala Visuals/
│   │   └── ...
├── scripts/
│   ├── data_cleaning.py
│   ├── geocoding.py
│   ├── visualizations.py
│   └── analysis.py
├── india_map/
│   ├── shapefiles/
│   └── ...
├── README.md
├── requirements.txt
└── config.py
```

---

## Blockers Faced and Solutions
### Blockers
1. **Google API Costs:**
   - Budget constraints limited the number of geocoding requests.
   - **Solution:** Implemented a caching mechanism to minimize API calls and explored free alternatives like OpenStreetMap.

2. **Anomalies in Data:**
   - Several observations had missing or incorrect data.
   - **Solution:** Used statistical imputation techniques and cross-referenced the SeasonWatch tree phenology handbook to standardize missing values.

3. **Processing Speed:**
   - Geocoding and data cleaning scripts were slow due to large datasets.
   - **Solution:** Optimized scripts with parallel processing where possible and added `try-except` blocks for error handling.

4. **Prior Data Loss:**
   - Previous teams dropped too many rows during data cleaning.
   - **Solution:** Reviewed the raw data meticulously and found only 20,000 rows with missing values, preserving as much data as possible.
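
The caching approach mentioned under Google API Costs can be sketched as a small wrapper that only calls the geocoder for place names it has not seen before. This is an illustration under stated assumptions, not the team's implementation: `GeocodeCache`, `fake_lookup`, and the JSON cache file are hypothetical, and the coordinates are placeholders rather than real values.

```python
import json
from pathlib import Path


class GeocodeCache:
    """Cache geocoding results so repeated place names cost one lookup."""

    def __init__(self, lookup, path=None):
        self.lookup = lookup                 # function: place name -> [lat, lon]
        self.path = Path(path) if path else None
        self.store = {}
        self.misses = 0                      # number of real lookups performed
        if self.path and self.path.exists():
            self.store = json.loads(self.path.read_text())

    def get(self, place):
        # Normalize case and whitespace so near-duplicates share one entry.
        key = " ".join(place.lower().split())
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.lookup(place)
            if self.path:
                self.path.write_text(json.dumps(self.store))
        return self.store[key]


def fake_lookup(place):
    """Stand-in for a paid geocoding call; returns placeholder coordinates."""
    return [10.0, 76.0]


cache = GeocodeCache(fake_lookup)            # in-memory only; pass a path to persist
first = cache.get("Thrissur, Kerala")
second = cache.get("  thrissur,  kerala ")   # served from the cache
```

Passing a file path makes the cache survive between runs, which matters most when the same locations recur across many observations.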

---

## Next Steps for Future Teams
1. **Enhance Visualizations:**
   - Improve interactivity in Flourish Studio and integrate additional features, such as user filtering by state or species.
   - Explore advanced visualization libraries like Plotly or D3.js for greater customization.

2. **Expand Analysis:**
   - Incorporate additional years of data beyond 2023 to track long-term trends.
   - Perform deeper survival and Markov modeling to understand tree state transitions.

3. **Climate Correlation:**
   - Integrate external climate datasets (e.g., rainfall, temperature) to analyze correlations with phenological changes.

4. **Optimize Geocoding:**
   - Automate retries for failed geocoding requests and further explore OpenStreetMap for cost-free options.

5. **Machine Learning Models:**
   - Apply machine learning techniques to predict phenological stages based on climate and temporal data.

6. **Documentation:**
   - Update README and inline comments regularly for new tools or methods added to the project.
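
For the geocoding retries suggested in item 4, one common pattern is to retry a failed request a fixed number of times with an increasing delay. The sketch below assumes a hypothetical `geocode_with_retries` helper and a generic `lookup` callable; it does not use any real geocoding client, and the coordinates are placeholders.

```python
import time


def geocode_with_retries(lookup, place, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call lookup(place), retrying on failure with exponential backoff.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The last failure is re-raised. Catching Exception is deliberately
    broad for this sketch; real code should catch the client's own
    error types instead.
    """
    for attempt in range(attempts):
        try:
            return lookup(place)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = []


def flaky_lookup(place):
    """Hypothetical lookup that fails twice before succeeding."""
    calls.append(place)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return [10.0, 76.0]  # placeholder coordinates


delays = []  # record the delays instead of actually sleeping
result = geocode_with_retries(flaky_lookup, "Kochi, Kerala", sleep=delays.append)
```

Injecting `sleep` keeps the example fast to run and makes the backoff schedule easy to inspect.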


## Contributors
- **Team Members:** Cecily Wang-Munoz, Aditya Chopra, An Ngo (Sue), Brenda Kim

---

## Acknowledgments
This project uses data from SeasonWatch and insights from the SeasonWatch tree phenology handbook. Special thanks to BU SPARK for supporting the project.
Binary file added SeasonWatch_Final_Report_fa24.pdf
Binary file not shown.