Example use of the Cephaloplot extract functions

This tutorial shows how to use the functions in Cephaloplot to extract data from the model output.

Author

Virginie Sonnet

Published

July 17, 2025

# packages
source("../functions/required_packages.R")

# functions 
source("../functions/extract.R") # predictor categories
source("../functions/categories.R") # predictor categories

Data

Cephalopod outputs 3 R objects:

CALL at the root of the folder: the parameters passed to run_init in step 2
QUERY in each subfolder: the data associated with this subfolder
MODEL in each subfolder: the models associated with this subfolder

# path to the data 
project_wd <- "../data/" # note that the output folder needs to be included in the project_wd path if that's the arborescence of your files

Observations

You can extract the original observations with extract_observations.

extract_observations(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all") %>% 
  head(10)

   measurementvalue decimallongitude decimallatitude month    genid year
1          194.3120           -179.5           -20.5    11       94 2019
2          189.7048           -177.5           -20.5    11    92;95 2019
3          186.3668           -177.5           -19.5    11       98 2019
4          190.6600           -175.5           -21.5    11 87;88;89 2019
5          190.6955           -175.5           -20.5    11       91 2019
6          191.7448           -175.5           -19.5    11       99 2019
7          190.6922           -174.5           -19.5    11  100;101 2019
8          195.8903           -170.5           -20.5    11       96 2019
9          194.7133           -166.5           -20.5    11       97 2019
10         187.7988           -159.5            31.5    10  246;247 2011
   measurementunit depth varfactor ID
1            0-255   200     Night  1
2            0-255   200       Day  2
3            0-255   200       Day  3
4            0-255   200       Day  4
5            0-255   200       Day  5
6            0-255   200       Day  6
7            0-255   200       Day  7
8            0-255   200       Day  8
9            0-255   200       Day 10
10           0-255   200 Day;Night 11

Model performance

You can extract the different performance metrics used in the QC with the extract_performance function:

R2: good QC if > 0.25 (continuous) or 0.5 (presence-absence)
CUM_VIP: cumulative prediction from the top 3 predictors, goof QC if > 50%
NSD: normalized standard deviation, good QC if < 0.5

# for all models 
extract_performance(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all")

# A tibble: 6 × 4
     R2 CUM_VIP     NSD model
  <dbl>   <dbl>   <dbl> <chr>
1 0.104    93.7 0.00774 GLM  
2 0.05     80.6 0.0377  MLP  
3 0.141    59.6 0.0303  BRT  
4 0.198    84.9 0.0108  GAM  
5 0.159    60.3 0.00768 SVM  
6 0.178    60.5 0.0120  RF

# for a specific model 
extract_performance(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
                    model="GAM")

# A tibble: 1 × 4
     R2 CUM_VIP    NSD model
  <dbl>   <dbl>  <dbl> <chr>
1 0.198    84.9 0.0108 GAM

Predictors

List and predictor importance

With extract_predictors, you can retrieve all the predictors used as input (no subfolder indicated), or just the selected predictors for each model (subfolder indicated). The later will also retrieve the mean, median and sd of the predictors importance in that model (calculated based on the n folds x m bootstraps values).

# extract the input predictors 
inputpreds <- extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey") 
head(inputpreds,10)

# A tibble: 10 × 1
   variable                    
   <chr>                       
 1 climatology_A_0_50          
 2 climatology_A_200_300       
 3 climatology_A_ADG_regridded 
 4 climatology_A_APH_regridded 
 5 climatology_A_BBP_regridded 
 6 climatology_A_CHLA_regridded
 7 climatology_A_K490_regridded
 8 climatology_A_PAR_regridded 
 9 climatology_co2_SODA        
10 climatology_co3_SODA

# extract the selected predictors for all models 
modpreds <- extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all") 
head(modpreds,10)

# A tibble: 10 × 5
   variable                           mean median    sd model
   <fct>                             <dbl>  <dbl> <dbl> <chr>
 1 climatology_A_200_300             73.0  72.9    9.60 GLM  
 2 climatology_DIATO_1998_2020_CMEMS 13.7  12.2    9.96 GLM  
 3 climatology_co2_SODA               7.03  4.78   7.09 GLM  
 4 climatology_PICO_1998_2020_CMEMS   1.82  1.35   3.41 GLM  
 5 climatology_ph_total_SODA          1.85  1.26   2.84 GLM  
 6 climatology_n_200_300              2.65  0.472  4.14 GLM  
 7 climatology_A_200_300             36.1  35.6   10.3  MLP  
 8 climatology_n_200_300             28.7  29.8    8.75 MLP  
 9 climatology_co2_SODA              15.9  14.8    8.14 MLP  
10 climatology_DIATO_1998_2020_CMEMS 10.6   5.50   9.66 MLP

# extract the selected predictors for a specific model 
extract_predictors(project_wd,FOLDER_NAME = "zooscan_grey",
                   SUBFOLDER_NAME="all",model="RF") %>% 
  head(10)

# A tibble: 6 × 5
  variable                           mean median    sd model
  <fct>                             <dbl>  <dbl> <dbl> <chr>
1 climatology_A_200_300              24.0   24.0  2.90 RF   
2 climatology_ph_total_SODA          21.6   21.6  2.15 RF   
3 climatology_DIATO_1998_2020_CMEMS  14.9   14.4  2.55 RF   
4 climatology_PICO_1998_2020_CMEMS   13.5   14.0  2.23 RF   
5 climatology_n_200_300              13.8   13.8  1.77 RF   
6 climatology_co2_SODA               12.2   12.0  1.50 RF

Predictor category

Some of the predictors have similar patterns, or relate to forcings in similar families These are subjective (some could be in several cateries like co2 in carbonate or longitude, or aph in optics or productivity) and you are welcome to change them, an example is in the file predictors_categories.csv, and you can retrieve them with the function predictor_categories.

# on the 6 variables select
predictor_categories(modpreds) %>% distinct(variable,category) %>% 
  arrange(category)

# A tibble: 6 × 2
  variable                          category    
  <fct>                             <chr>       
1 climatology_co2_SODA              Carbonate   
2 climatology_n_200_300             Chemistry   
3 climatology_A_200_300             Latitude    
4 climatology_ph_total_SODA         Latitude    
5 climatology_DIATO_1998_2020_CMEMS Productivity
6 climatology_PICO_1998_2020_CMEMS  Productivity

# on all of the input variables 
inputpreds <- predictor_categories(inputpreds)
head(inputpreds,10)

# A tibble: 10 × 2
   variable                     category    
   <chr>                        <chr>       
 1 climatology_A_0_50           Chemistry   
 2 climatology_A_200_300        Latitude    
 3 climatology_A_ADG_regridded  Optics      
 4 climatology_A_APH_regridded  Optics      
 5 climatology_A_BBP_regridded  Optics      
 6 climatology_A_CHLA_regridded Productivity
 7 climatology_A_K490_regridded Optics      
 8 climatology_A_PAR_regridded  Latitude    
 9 climatology_co2_SODA         Carbonate   
10 climatology_co3_SODA         Carbonate

Predictions

You can extract the predictions using the extract_prediction function, it takes as input: * project_wd, FOLDER_NAME, SUBFOLDER_NAME * ensemble: boolean, if FALSE returns all the mean, sd, normalized standard deviation and coefficient of variation across bootstraps for each model, month and factor, if TRUE returns an ensemble model so the average values across models * model: vector, models to retrieve, default to NULL (all models if ensemble = FALSE, and models that pass the QC if ensemble = TRUE) * m: vector, months to retrieve, default to NULL (all months) * f: vector, factor levels to retrieve, default to NULL (all levels). If Cephalopod wasn’t used with a factor variable, the level will be 1

Note 1: the projection for the ensemble is not stored in the MODEL output but is recalculated in the function if you indicate ENSEMBLE = TRUE (based on the MODELS that pass the QC or based on the models indicated in the vector model), otherwise it returns a data frame of all the models projections

Note 2: for the Cephalopod-factor variable, the output for the last dimension is in the order of appearance in the data (i.e. if in varfactor the first sample is a Night sample, then the first output is Night and the second is Day).

Note 3: the function calculates the standard deviation, normalized standard deviation and coefficient of variation. The normalized standard deviation is calculated with the max value within models (i.e. taking the max average of the bootstraps per cell, month and factor for a specific color), whatever the subset list of months and factors you indicate it, so it is inter-comparable intra-models (e.g. month 9 at night and month 9 at day of RF model) but not inter-models. This is to avoid bias in the normalization due to a model being wildly different, however, that’s why the max used in the normalization for each model is also recorded and returned when ensemble is not TRUE so that you can calculate an inter-models comparadle NSD. When ensemble is TRUE, the averages NSD per cell across model is retrieved.

# extract all model predictions
df <- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
                          ensemble=FALSE,model=NULL,m=NULL,f=NULL)

Retrieving data for models: GLM MLP BRT GAM SVM RF
 Processing model: GLM
 Normalized standard deviation for GLM was calculated with:  238.0467
 Processing model: MLP
 Normalized standard deviation for MLP was calculated with:  196.59
 Processing model: BRT
 Normalized standard deviation for BRT was calculated with:  200.3443
 Processing model: GAM
 Normalized standard deviation for GAM was calculated with:  277.665
 Processing model: SVM
 Normalized standard deviation for SVM was calculated with:  188.4642
 Processing model: RF
 Normalized standard deviation for RF was calculated with:  193.5838

head(df,10)

# A tibble: 10 × 8
   month varfactor predictedvalue predictedsd predictednsd predictedcv model
   <chr> <chr>              <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1 1     1                   188.        2.82      0.0118       0.0150 GLM  
 2 1     1                   188.        2.75      0.0115       0.0146 GLM  
 3 1     1                   188.        2.67      0.0112       0.0142 GLM  
 4 1     1                   188.        2.59      0.0109       0.0138 GLM  
 5 1     1                   188.        2.44      0.0103       0.0130 GLM  
 6 1     1                   188.        2.36      0.00992      0.0125 GLM  
 7 1     1                   188.        2.30      0.00967      0.0122 GLM  
 8 1     1                   189.        2.26      0.00949      0.0120 GLM  
 9 1     1                   189.        2.26      0.00951      0.0120 GLM  
10 1     1                   189.        2.18      0.00918      0.0116 GLM  
# ℹ 1 more variable: max_nsd <dbl>

# extract certain months, month and factor variable 
df <- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
                          ensemble=FALSE,model="RF",m=c(1,2),f=1)

Retrieving data for models: RF
 Processing model: RF
 Normalized standard deviation for RF was calculated with:  193.5838

head(df,10)

# A tibble: 10 × 8
   month varfactor predictedvalue predictedsd predictednsd predictedcv model
   <chr> <chr>              <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1 1     1                   188.        2.79       0.0144      0.0148 RF   
 2 1     1                   188.        2.76       0.0142      0.0146 RF   
 3 1     1                   188.        2.76       0.0143      0.0147 RF   
 4 1     1                   186.        4.12       0.0213      0.0222 RF   
 5 1     1                   185.        4.41       0.0228      0.0239 RF   
 6 1     1                   184.        3.48       0.0180      0.0189 RF   
 7 1     1                   186.        3.04       0.0157      0.0163 RF   
 8 1     1                   186.        3.15       0.0163      0.0169 RF   
 9 1     1                   186.        2.98       0.0154      0.0160 RF   
10 1     1                   186.        3.11       0.0161      0.0167 RF   
# ℹ 1 more variable: max_nsd <dbl>

# extract ensemble model => case where no model passed the QC
df <- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
                          ensemble=TRUE,model=NULL,m=NULL,f=NULL)

Error in extract_prediction(project_wd, FOLDER_NAME = "zooscan_grey", : No model passed the QC so we can't create a default ensemble. If you want a specific ensemble, indicate a vector of models with the model argument.

# if no model passed the QC, if you want an ensemble you will have to specify it 
df <- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey",SUBFOLDER_NAME="all",
                          ensemble=TRUE,model=c("GAM","RF"),m=NULL,f=NULL)

Retrieving data for models: GAM RF
 Processing model: GAM
 Processing ensemble model
 Processing model: RF
 Processing ensemble model
 Normalized standard deviation for the ensemble model was calculated with:  217.8235

head(df,10)

# A tibble: 10 × 10
# Groups:   latitude, longitude, varfactor, month [10]
   latitude longitude predictedvalue predictedsd varfactor month    predictednsd
      <dbl>     <dbl>          <dbl>       <dbl> <chr>     <chr>           <dbl>
 1     81.5       6.5           181.        3.63 1         1,2,3,4…       0.0167
 2     81.5       7.5           182.        3.53 1         1,2,3,4…       0.0162
 3     81.5       8.5           182.        3.47 1         1,2,3,4…       0.0159
 4     81.5       9.5           182.        3.43 1         1,2,3,4…       0.0158
 5     81.5      10.5           182.        3.28 1         1,2,3,4…       0.0151
 6     81.5      11.5           182.        3.26 1         1,2,3,4…       0.0150
 7     81.5      12.5           182.        3.13 1         1,2,3,4…       0.0144
 8     81.5      13.5           182.        3.08 1         1,2,3,4…       0.0141
 9     81.5      14.5           182.        3.08 1         1,2,3,4…       0.0141
10     81.5      15.5           182.        3.02 1         1,2,3,4…       0.0139
# ℹ 3 more variables: predictedcv <dbl>, model <chr>, max_nsd <dbl>

# ensemble model where some models passed the QC
df <- extract_prediction(project_wd,FOLDER_NAME = "zooscan_grey_varfactor",SUBFOLDER_NAME="all",
                          ensemble=TRUE,model=NULL,m=NULL,f=NULL)

Retrieving data for models: BRT RF
 Processing model: BRT
 Processing ensemble model
 Processing model: RF
 Processing ensemble model
 Normalized standard deviation for the ensemble model was calculated with:  191.1641

head(df,10)

# A tibble: 10 × 10
# Groups:   latitude, longitude, varfactor, month [10]
   latitude longitude predictedvalue predictedsd varfactor month    predictednsd
      <dbl>     <dbl>          <dbl>       <dbl> <chr>     <chr>           <dbl>
 1     81.5       6.5           188.        4.53 1,2       1,2,3,4…       0.0237
 2     81.5       7.5           189.        5.28 1,2       1,2,3,4…       0.0276
 3     81.5       8.5           189.        5.32 1,2       1,2,3,4…       0.0278
 4     81.5       9.5           189.        5.31 1,2       1,2,3,4…       0.0278
 5     81.5      10.5           190.        5.71 1,2       1,2,3,4…       0.0299
 6     81.5      11.5           188.        4.75 1,2       1,2,3,4…       0.0248
 7     81.5      12.5           188.        4.78 1,2       1,2,3,4…       0.0250
 8     81.5      13.5           188.        4.81 1,2       1,2,3,4…       0.0251
 9     81.5      14.5           189.        4.77 1,2       1,2,3,4…       0.0250
10     81.5      15.5           188.        4.88 1,2       1,2,3,4…       0.0255
# ℹ 3 more variables: predictedcv <dbl>, model <chr>, max_nsd <dbl>

Land

You can extract a land mask from the first climatology used as predictor with extract_land.

land_df <- extract_land(project_wd,FOLDER_NAME="zooscan_grey")
head(land_df,10)

   longitude latitude land
1     -179.5     89.5 9999
2     -178.5     89.5 9999
3     -177.5     89.5 9999
4     -176.5     89.5 9999
5     -175.5     89.5 9999
6     -174.5     89.5 9999
7     -173.5     89.5 9999
8     -172.5     89.5 9999
9     -171.5     89.5 9999
10    -170.5     89.5 9999