- This is the GitHub repository for IATTC's regression tree algorithm for length frequency data.
- The R code for the regression tree analysis was originally developed by Cleridy Lennert-Cody (https://doi.org/10.1016/j.fishres.2009.11.014) and then modified by Haikun Xu to automate it as an R package.
- Please contact Haikun ([email protected]) with any questions about the package.
- How to install the package: devtools::install_github('HaikunXu/RegressionTree',ref='main')
- User manual of the package: https://github.com/HaikunXu/FishFreqTree/blob/main/manual/Manual.pdf
The input data frame must include at least four columns named exactly "lat", "lon", "year", and "quarter". The columns "lat" and "lon" give the latitude and longitude of the grid cell centers, respectively. The data frame must also include one column per length bin holding the length frequency information, with each column named after its length bin. The package works with length frequencies (proportions), so please make sure the values in each row sum to 1 across length bins. An example of the input data can be found here.
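As a rough illustration of the layout described above, the following base R sketch builds a tiny input table and normalizes it. The values, the bin labels (20, 30, 40), and the assumption that the raw data start out as counts are all hypothetical and are not taken from the package's example data.

```r
# Hypothetical toy input: two grid cells, three length bins (20, 30, 40).
LF <- data.frame(
  lat     = c(-2.5, 2.5),      # latitude of grid cell centers
  lon     = c(-112.5, -107.5), # longitude of grid cell centers
  year    = c(2015, 2015),
  quarter = c(1, 2),
  `20`    = c(5, 10),          # raw counts per length bin (assumed)
  `30`    = c(15, 30),
  `40`    = c(30, 10),
  check.names = FALSE          # keep the bin labels as column names
)

bin_cols <- c("20", "30", "40")

# Convert counts to proportions so each row sums to 1 across length bins.
LF[bin_cols] <- LF[bin_cols] / rowSums(LF[bin_cols])

# Check the two requirements: required column names and row sums of 1.
stopifnot(all(c("lat", "lon", "year", "quarter") %in% names(LF)))
stopifnot(all(abs(rowSums(LF[bin_cols]) - 1) < 1e-8))
```

Running the two checks before calling the package functions is an inexpensive way to catch a misnamed column or unnormalized frequencies early.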
This package finds the best multi-cell combination for a length frequency dataset based on the proportion of variance explained. The splitting variables currently considered in the code are latitude, longitude, quarter/cyclic quarter, and year (turned on with year=TRUE). If you do not want quarter as a splitting dimension (e.g., your model has a time step of one year), please still add a column named "quarter" to the input data with all values = 1. In the main functions this package provides (run_regression_tree and loop_regression_tree), you can manually turn off the quarter dimension by adding "quarter = FALSE" as a function argument.
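For annual data, the placeholder quarter column can be added in one line. The sketch below uses a made-up data frame (LF_annual is a hypothetical name) and does not call the package itself.

```r
# Hypothetical annual data with no quarter information.
LF_annual <- data.frame(
  lat  = c(-2.5, 2.5),
  lon  = c(-112.5, -107.5),
  year = c(2015, 2016),
  `20` = c(0.4, 0.7),
  `30` = c(0.6, 0.3),
  check.names = FALSE
)

# Add the required "quarter" column with a constant value of 1.
LF_annual$quarter <- 1
```

When calling run_regression_tree or loop_regression_tree on such data, quarter = FALSE would then be passed alongside the other arguments documented in ?run_regression_tree.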
- run_regression_tree (type ?run_regression_tree in the console for more info): run the regression tree
- loop_regression_tree (type ?loop_regression_tree in the console for more info): loop the regression tree
- evaluate_regression_tree (type ?evaluate_regression_tree in the console for more info): evaluate a pre-specified regression tree
For the nth best split, the code first loops over all existing n cells that are defined by the previous n-1 splits, to find the best split (the one that leads to the maximum variance explained) for every cell. Then those best cell-specific splits are compared to find the split that results in the maximum variance explained. This split is the nth best split. This process is iterated until reaching the maximum number of splits specified by the user.
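The greedy search described above can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: it assumes variance is measured as the sum of squared deviations of the length frequency rows from their cell mean, it treats quarter as a plain (non-cyclic) variable, and all function names (cell_ss, best_split, greedy_tree) are hypothetical.

```r
# Within-cell sum of squares of length-frequency rows around the cell mean.
cell_ss <- function(P) sum(sweep(P, 2, colMeans(P))^2)

# Best binary split of one cell: try every threshold on every variable.
best_split <- function(X, P, idx) {
  best <- list(gain = 0, var = NA, cut = NA)
  parent_ss <- cell_ss(P[idx, , drop = FALSE])
  for (v in colnames(X)) {
    vals <- sort(unique(X[idx, v]))
    if (length(vals) < 2) next                 # nothing to split on
    for (thr in vals[-length(vals)]) {         # value <= thr vs value > thr
      left  <- idx[X[idx, v] <= thr]
      right <- idx[X[idx, v] >  thr]
      gain <- parent_ss - cell_ss(P[left, , drop = FALSE]) -
        cell_ss(P[right, , drop = FALSE])
      if (gain > best$gain) {
        best <- list(gain = gain, var = v, cut = thr,
                     left = left, right = right)
      }
    }
  }
  best
}

# Greedy tree: at each step, find the best split within every existing cell,
# then keep the single split that explains the most variance overall.
greedy_tree <- function(X, P, n_splits) {
  cells <- list(seq_len(nrow(X)))              # start with one cell
  total_ss <- cell_ss(P)
  explained <- 0
  splits <- data.frame()
  for (k in seq_len(n_splits)) {
    candidates <- lapply(cells, function(idx) best_split(X, P, idx))
    gains <- vapply(candidates, function(s) s$gain, numeric(1))
    if (max(gains) <= 0) break                 # no split improves the fit
    w <- which.max(gains)
    s <- candidates[[w]]
    cells <- c(cells[-w], list(s$left, s$right))
    explained <- explained + s$gain
    splits <- rbind(splits, data.frame(split = k, var = s$var, cut = s$cut,
                                       var_explained = explained / total_ss))
  }
  splits
}

# Toy data: two latitudes, four quarters, two length bins (all hypothetical).
X <- cbind(lat = rep(c(-5, 5), each = 4), quarter = rep(1:4, times = 2))
bin1 <- c(0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.4, 0.4)
P <- cbind(bin20 = bin1, bin30 = 1 - bin1)

result <- greedy_tree(X, P, n_splits = 2)
print(result)
```

On this toy data the first split separates the two latitudes, and the second divides the northern cell between quarters 2 and 3, after which all of the variance is explained. The package itself may use a different variance measure and additional constraints, so the sketch only mirrors the order of operations.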
Users should combine the output figures with output tables to understand the best splits in order. Also, the advanced feature (see the example code for more details) in the package allows users to manually specify some or all splits.