# packages
source("../functions/required_packages.R")
# functions
source("../functions/extract.R") # predictor categories
source("../functions/categories.R") # predictor categories
Example use of the Cephaloplot extract functions
Data
Cephalopod outputs 3 R objects:
- CALL at the root of the folder: the parameters passed to run_init in step 2
- QUERY in each subfolder: the data associated with this subfolder
- MODEL in each subfolder: the models associated with this subfolder
# path to the data
<- "../data/" # note that the output folder needs to be included in the project_wd path if that's the arborescence of your files project_wd
Observations
You can extract the original observations with extract_observations
.
extract_observations(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all") %>%
head(10)
measurementvalue decimallongitude decimallatitude month genid year
1 194.3120 -179.5 -20.5 11 94 2019
2 189.7048 -177.5 -20.5 11 92;95 2019
3 186.3668 -177.5 -19.5 11 98 2019
4 190.6600 -175.5 -21.5 11 87;88;89 2019
5 190.6955 -175.5 -20.5 11 91 2019
6 191.7448 -175.5 -19.5 11 99 2019
7 190.6922 -174.5 -19.5 11 100;101 2019
8 195.8903 -170.5 -20.5 11 96 2019
9 194.7133 -166.5 -20.5 11 97 2019
10 187.7988 -159.5 31.5 10 246;247 2011
measurementunit depth varfactor ID
1 0-255 200 Night 1
2 0-255 200 Day 2
3 0-255 200 Day 3
4 0-255 200 Day 4
5 0-255 200 Day 5
6 0-255 200 Day 6
7 0-255 200 Day 7
8 0-255 200 Day 8
9 0-255 200 Day 10
10 0-255 200 Day;Night 11
Model performance
You can extract the different performance metrics used in the QC with the extract_performance
function:
- R2: good QC if > 0.25 (continuous) or 0.5 (presence-absence)
- CUM_VIP: cumulative prediction from the top 3 predictors, goof QC if > 50%
- NSD: normalized standard deviation, good QC if < 0.5
# for all models
extract_performance(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all")
# A tibble: 6 × 4
R2 CUM_VIP NSD model
<dbl> <dbl> <dbl> <chr>
1 0.104 93.7 0.00774 GLM
2 0.05 80.6 0.0377 MLP
3 0.141 59.6 0.0303 BRT
4 0.198 84.9 0.0108 GAM
5 0.159 60.3 0.00768 SVM
6 0.178 60.5 0.0120 RF
# for a specific model
extract_performance(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
model="GAM")
# A tibble: 1 × 4
R2 CUM_VIP NSD model
<dbl> <dbl> <dbl> <chr>
1 0.198 84.9 0.0108 GAM
Predictors
List and predictor importance
With extract_predictors
, you can retrieve all the predictors used as input (no subfolder indicated), or just the selected predictors for each model (subfolder indicated). The later will also retrieve the mean, median and sd of the predictors importance in that model (calculated based on the n folds x m bootstraps values).
# extract the input predictors
<- extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey")
inputpreds head(inputpreds,10)
# A tibble: 10 × 1
variable
<chr>
1 climatology_A_0_50
2 climatology_A_200_300
3 climatology_A_ADG_regridded
4 climatology_A_APH_regridded
5 climatology_A_BBP_regridded
6 climatology_A_CHLA_regridded
7 climatology_A_K490_regridded
8 climatology_A_PAR_regridded
9 climatology_co2_SODA
10 climatology_co3_SODA
# extract the selected predictors for all models
<- extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all")
modpreds head(modpreds,10)
# A tibble: 10 × 5
variable mean median sd model
<fct> <dbl> <dbl> <dbl> <chr>
1 climatology_A_200_300 73.0 72.9 9.60 GLM
2 climatology_DIATO_1998_2020_CMEMS 13.7 12.2 9.96 GLM
3 climatology_co2_SODA 7.03 4.78 7.09 GLM
4 climatology_PICO_1998_2020_CMEMS 1.82 1.35 3.41 GLM
5 climatology_ph_total_SODA 1.85 1.26 2.84 GLM
6 climatology_n_200_300 2.65 0.472 4.14 GLM
7 climatology_A_200_300 36.1 35.6 10.3 MLP
8 climatology_n_200_300 28.7 29.8 8.75 MLP
9 climatology_co2_SODA 15.9 14.8 8.14 MLP
10 climatology_DIATO_1998_2020_CMEMS 10.6 5.50 9.66 MLP
# extract the selected predictors for a specific model
extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey",
SUBFOLDER_NAME="all",model="RF") %>%
head(10)
# A tibble: 6 × 5
variable mean median sd model
<fct> <dbl> <dbl> <dbl> <chr>
1 climatology_A_200_300 24.0 24.0 2.90 RF
2 climatology_ph_total_SODA 21.6 21.6 2.15 RF
3 climatology_DIATO_1998_2020_CMEMS 14.9 14.4 2.55 RF
4 climatology_PICO_1998_2020_CMEMS 13.5 14.0 2.23 RF
5 climatology_n_200_300 13.8 13.8 1.77 RF
6 climatology_co2_SODA 12.2 12.0 1.50 RF
Predictor category
Some of the predictors have similar patterns, or relate to forcings in similar families These are subjective (some could be in several cateries like co2 in carbonate or longitude, or aph in optics or productivity) and you are welcome to change them, an example is in the file predictors_categories.csv, and you can retrieve them with the function predictor_categories
.
# on the 6 variables select
predictor_categories(modpreds) %>% distinct(variable,category) %>%
arrange(category)
# A tibble: 6 × 2
variable category
<fct> <chr>
1 climatology_co2_SODA Carbonate
2 climatology_n_200_300 Chemistry
3 climatology_A_200_300 Latitude
4 climatology_ph_total_SODA Latitude
5 climatology_DIATO_1998_2020_CMEMS Productivity
6 climatology_PICO_1998_2020_CMEMS Productivity
# on all of the input variables
<- predictor_categories(inputpreds)
inputpreds head(inputpreds,10)
# A tibble: 10 × 2
variable category
<chr> <chr>
1 climatology_A_0_50 Chemistry
2 climatology_A_200_300 Latitude
3 climatology_A_ADG_regridded Optics
4 climatology_A_APH_regridded Optics
5 climatology_A_BBP_regridded Optics
6 climatology_A_CHLA_regridded Productivity
7 climatology_A_K490_regridded Optics
8 climatology_A_PAR_regridded Latitude
9 climatology_co2_SODA Carbonate
10 climatology_co3_SODA Carbonate
Predictions
You can extract the predictions using the extract_prediction
function, it takes as input: * project_wd, FOLDER_NAME, SUBFOLDER_NAME * ensemble: boolean, if FALSE returns all the mean, sd, normalized standard deviation and coefficient of variation across bootstraps for each model, month and factor, if TRUE returns an ensemble model so the average values across models * model: vector, models to retrieve, default to NULL (all models if ensemble = FALSE, and models that pass the QC if ensemble = TRUE) * m: vector, months to retrieve, default to NULL (all months) * f: vector, factor levels to retrieve, default to NULL (all levels). If Cephalopod wasn’t used with a factor variable, the level will be 1
Note 1: the projection for the ensemble is not stored in the MODEL output but is recalculated in the function if you indicate ENSEMBLE = TRUE (based on the MODELS that pass the QC or based on the models indicated in the vector model), otherwise it returns a data frame of all the models projections
Note 2: for the Cephalopod-factor variable, the output for the last dimension is in the order of appearance in the data (i.e. if in varfactor the first sample is a Night sample, then the first output is Night and the second is Day).
Note 3: the function calculates the standard deviation, normalized standard deviation and coefficient of variation. The normalized standard deviation is calculated with the max value within models (i.e. taking the max average of the bootstraps per cell, month and factor for a specific color), whatever the subset list of months and factors you indicate it, so it is inter-comparable intra-models (e.g. month 9 at night and month 9 at day of RF model) but not inter-models. This is to avoid bias in the normalization due to a model being wildly different, however, that’s why the max used in the normalization for each model is also recorded and returned when ensemble is not TRUE so that you can calculate an inter-models comparadle NSD. When ensemble is TRUE, the averages NSD per cell across model is retrieved.
# extract all model predictions
<- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
df ensemble=FALSE,model=NULL,m=NULL,f=NULL)
Retrieving data for models: GLM MLP BRT GAM SVM RF
Processing model: GLM
Normalized standard deviation for GLM was calculated with: 238.0467
Processing model: MLP
Normalized standard deviation for MLP was calculated with: 196.59
Processing model: BRT
Normalized standard deviation for BRT was calculated with: 200.3443
Processing model: GAM
Normalized standard deviation for GAM was calculated with: 277.665
Processing model: SVM
Normalized standard deviation for SVM was calculated with: 188.4642
Processing model: RF
Normalized standard deviation for RF was calculated with: 193.5838
head(df,10)
# A tibble: 10 × 8
month varfactor predictedvalue predictedsd predictednsd predictedcv model
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1 188. 2.82 0.0118 0.0150 GLM
2 1 1 188. 2.75 0.0115 0.0146 GLM
3 1 1 188. 2.67 0.0112 0.0142 GLM
4 1 1 188. 2.59 0.0109 0.0138 GLM
5 1 1 188. 2.44 0.0103 0.0130 GLM
6 1 1 188. 2.36 0.00992 0.0125 GLM
7 1 1 188. 2.30 0.00967 0.0122 GLM
8 1 1 189. 2.26 0.00949 0.0120 GLM
9 1 1 189. 2.26 0.00951 0.0120 GLM
10 1 1 189. 2.18 0.00918 0.0116 GLM
# ℹ 1 more variable: max_nsd <dbl>
# extract certain months, month and factor variable
<- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
df ensemble=FALSE,model="RF",m=c(1,2),f=1)
Retrieving data for models: RF
Processing model: RF
Normalized standard deviation for RF was calculated with: 193.5838
head(df,10)
# A tibble: 10 × 8
month varfactor predictedvalue predictedsd predictednsd predictedcv model
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1 188. 2.79 0.0144 0.0148 RF
2 1 1 188. 2.76 0.0142 0.0146 RF
3 1 1 188. 2.76 0.0143 0.0147 RF
4 1 1 186. 4.12 0.0213 0.0222 RF
5 1 1 185. 4.41 0.0228 0.0239 RF
6 1 1 184. 3.48 0.0180 0.0189 RF
7 1 1 186. 3.04 0.0157 0.0163 RF
8 1 1 186. 3.15 0.0163 0.0169 RF
9 1 1 186. 2.98 0.0154 0.0160 RF
10 1 1 186. 3.11 0.0161 0.0167 RF
# ℹ 1 more variable: max_nsd <dbl>
# extract ensemble model => case where no model passed the QC
<- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
df ensemble=TRUE,model=NULL,m=NULL,f=NULL)
Error in extract_prediction(project_wd, FOLDER_NAME = "zooscan_grey", : No model passed the QC so we can't create a default ensemble. If you want a specific ensemble, indicate a vector of models with the model argument.
# if no model passed the QC, if you want an ensemble you will have to specify it
<- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
df ensemble=TRUE,model=c("GAM","RF"),m=NULL,f=NULL)
Retrieving data for models: GAM RF
Processing model: GAM
Processing ensemble model
Processing model: RF
Processing ensemble model
Normalized standard deviation for the ensemble model was calculated with: 217.8235
head(df,10)
# A tibble: 10 × 10
# Groups: latitude, longitude, varfactor, month [10]
latitude longitude predictedvalue predictedsd varfactor month predictednsd
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 81.5 6.5 181. 3.63 1 1,2,3,4… 0.0167
2 81.5 7.5 182. 3.53 1 1,2,3,4… 0.0162
3 81.5 8.5 182. 3.47 1 1,2,3,4… 0.0159
4 81.5 9.5 182. 3.43 1 1,2,3,4… 0.0158
5 81.5 10.5 182. 3.28 1 1,2,3,4… 0.0151
6 81.5 11.5 182. 3.26 1 1,2,3,4… 0.0150
7 81.5 12.5 182. 3.13 1 1,2,3,4… 0.0144
8 81.5 13.5 182. 3.08 1 1,2,3,4… 0.0141
9 81.5 14.5 182. 3.08 1 1,2,3,4… 0.0141
10 81.5 15.5 182. 3.02 1 1,2,3,4… 0.0139
# ℹ 3 more variables: predictedcv <dbl>, model <chr>, max_nsd <dbl>
# ensemble model where some models passed the QC
<- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey_varfactor",SUBFOLDER_NAME="all",
df ensemble=TRUE,model=NULL,m=NULL,f=NULL)
Retrieving data for models: BRT RF
Processing model: BRT
Processing ensemble model
Processing model: RF
Processing ensemble model
Normalized standard deviation for the ensemble model was calculated with: 191.1641
head(df,10)
# A tibble: 10 × 10
# Groups: latitude, longitude, varfactor, month [10]
latitude longitude predictedvalue predictedsd varfactor month predictednsd
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 81.5 6.5 188. 4.53 1,2 1,2,3,4… 0.0237
2 81.5 7.5 189. 5.28 1,2 1,2,3,4… 0.0276
3 81.5 8.5 189. 5.32 1,2 1,2,3,4… 0.0278
4 81.5 9.5 189. 5.31 1,2 1,2,3,4… 0.0278
5 81.5 10.5 190. 5.71 1,2 1,2,3,4… 0.0299
6 81.5 11.5 188. 4.75 1,2 1,2,3,4… 0.0248
7 81.5 12.5 188. 4.78 1,2 1,2,3,4… 0.0250
8 81.5 13.5 188. 4.81 1,2 1,2,3,4… 0.0251
9 81.5 14.5 189. 4.77 1,2 1,2,3,4… 0.0250
10 81.5 15.5 188. 4.88 1,2 1,2,3,4… 0.0255
# ℹ 3 more variables: predictedcv <dbl>, model <chr>, max_nsd <dbl>
Land
You can extract a land mask from the first climatology used as predictor with extract_land
.
<- extract_land(project_wd,FOLDER_NAME="zooscan_grey")
land_df head(land_df,10)
longitude latitude land
1 -179.5 89.5 9999
2 -178.5 89.5 9999
3 -177.5 89.5 9999
4 -176.5 89.5 9999
5 -175.5 89.5 9999
6 -174.5 89.5 9999
7 -173.5 89.5 9999
8 -172.5 89.5 9999
9 -171.5 89.5 9999
10 -170.5 89.5 9999