Subsampling of Plankton Imagery Datasets and Reprojection in a Reduced Space
Introduction
Large morphological datasets
Plankton imagers can image planktonic organisms from 300 nm (FlowCam nano) up to 10 cm (ISIIS) (Lombard et al., 2019). Some run in a lab, but more and more are automated and can be deployed on cruises or on fixed platforms, generating millions of images. Those datasets are mostly used for taxonomic purposes. However, a number of morphological characteristics associated with each individual are often generated for use within machine learning algorithms. While those features form the basis of plankton image recognition, they are also interesting measures of variability within the community that do not depend on the species level.
Such morphological datasets, which combine numerous columns of morphological features and millions of observations, cannot easily be opened in any of the widely used programming environments (Python, R or MATLAB, for example) for further exploration with multivariate data analysis. Even when it is possible, running a Principal Component Analysis or a t-SNE on millions of lines takes time and computing power. Methods and resources specific to big data analysis, such as MapReduce with Hadoop or High Performance Computing, help in those cases, but these resources are not available to everyone and the tools have a rather steep learning curve.
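Plain command-line tools already go a long way here. As a minimal sketch of the idea behind the workflow, here is how a random subsample can be drawn from a large csv without ever loading it into R or Python (the file `features.csv` and its columns are made up for illustration):

```shell
# Build a toy "large" features table (hypothetical stand-in for a real morphological export)
printf 'id,area,perim\n' > features.csv
seq 1 10000 | awk '{print $1","$1*2","$1*3}' >> features.csv

# Keep the header, then draw 500 random rows with shuf (GNU coreutils; gshuf on macOS)
head -1 features.csv > subsample.csv
tail -n +2 features.csv | shuf -n 500 >> subsample.csv

wc -l < subsample.csv   # 501 lines: header + 500 rows
```

The subsample is small enough to analyze comfortably, which is exactly the strategy developed below: subsample, analyze, then reproject the full dataset.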
A methodology based on subsampling and dimension reduction
Here, we detail a workflow to explore morphological variation of planktonic organisms based on subsampling, Principal Component Analysis and reprojection (Figure 1). Note that the schema includes an extra step, not presented here, which consists of analyzing the morphological diversity through clustering and diversity indices.
Figure 1. Workflow of the methodology based on subsampling, dimension reduction and reprojection
Applications and example
The full methodology was applied to a 2-year dataset of phytoplankton imagery from an Imaging FlowCytobot (Sonnet et al., in preparation).
Vilgrain et al. (2021) applied the morphological approach with the PCA on a smaller dataset of Arctic copepods that did not need a subsampling step.
The expanded morphological approach using PCA, k-means and a diversity index (in that case the Shannon-Wiener index), designed by Jean-Olivier Irisson and Sakina-Dorothée Ayata, was also used on marine snow (Trudnowska et al., 2021).
Below is an example with a subset of the data used by Sonnet et al. The graphs come from the analysis of the 2-year dataset, but if you want to try running the analysis, the GitHub folder contains a 2-day extract of the dataset and the files that stem from the analysis applied to it.
Storage
I assume the morphological features are stored for each image within a database. All the queries within this document are only examples and will need to be adapted to the structure of your database. The same applies for the database information and password. For reference, Figure 2 shows the structure of the database I used (designed by Audrey Ciochetto).
If you only have csv files, you can combine them in the shell and work from there; the principles of the analysis are the same. That's what I started with!
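Combining per-sample csv files in the shell can be sketched like this (the file names are hypothetical; the point is to keep the header only once):

```shell
# Two toy per-sample exports with identical headers (file names are made up)
printf 'id,area\n1,10\n2,20\n' > sample_a.csv
printf 'id,area\n3,30\n' > sample_b.csv

# Write the header once, then append the data rows of every file
head -1 sample_a.csv > combined.csv
for f in sample_a.csv sample_b.csv; do
  tail -n +2 "$f" >> combined.csv
done

cat combined.csv   # one header followed by the 3 data rows
```

With a database, the queries below replace this step, but the resulting file has the same shape.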
Figure 2. Diagram of the database structure in the Mouw lab. Main boxes are tables and each table has a first column called id. Smaller boxes on arrows are the column names that relate the right table to the id column of the left table.
Codes
Most of the code has not been turned into functions because there are many possibilities for your column names, morphological characteristics, conditions specific to the deployment of any plankton imager and the parameters you will want to test. However, both within this document and at the beginning of each script, there is an indication of which sections of the code you might need to adapt.
Throughout the document and the associated scripts, I assume you are in an R project and directory that contain 4 folders: data, figures, documents and scripts.
1. Number of images
The number of images to subsample will probably never be completely objective. Any criteria or threshold will be based on your computer power, your time, your dataset, the analysis you are planning, even on you…
Method: I chose to take 5% of my total number of samples, and then to try with 3, 10, 100, 250, 500 and 1000 images per sample and look at the changes in variance explained and in the variables contributing the most to the principal components. I repeat the subsampling 10 times for each case to estimate the mean and standard deviation.
1.1. Random files
I have inherent criteria based on my instrument when I select my samples, your criteria will depend on your instrument and the way your database is organized. In my case:
- a deployment id (deployment_id = 1)
- we have had a few files where the timing is not recorded, so I can't use them quantitatively and I exclude them from the analysis (runTime != 0); this removes 23 samples
- the quality control code is good (qc = 1)
- the minimum number of images in the sample: (n_roi > 250)
- dates between 2017-11-01 and 2019-11-01 (date > ‘2017-11-01’ & date < ‘2019-11-01’)
For the roi or images, I have more criteria:
- the name of my columns: metadata columns (id, roi_number, filename, date, taxonomic id), columns chosen (area, perim…) and an associated shorter name
- add 3 columns to the columns originally in the database: major axis/minor axis, perimeter/major axis and circularity
- the class_id is not one of the ciliate, zooplankton, detritus or mix classes
- if the class_id is unclassified, the area has to be under 90 000 square pixels to avoid images of algae or nematodes
I use these criteria within the R script 1-1_sample_files.R to select a percentage of random samples from the dataset.
- Get the number of files present in the database and meeting your criteria (you might want to change these criteria in the query)
- Calculate the number of files to sample based on a percentage (the percentage is 5 in the code)
- Query the file ids and randomly take the number determined in step 2 (you might want to change these criteria in the query)
- Query the images that meet your criteria from the files id (you might want to change the criteria of the images in the query)
- Extract counts per sample
- Export to csv
Outputs: combined images file (\(\small{\textit{subsp_combined.csv}}\)) and counts per file (\(\small{\textit{subsp_counts_cleaned.csv}}\))
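The percentage calculation in step 2 reduces to simple integer arithmetic. A sketch in the shell (the file count 1387 is made up; here I round up so at least one file is always sampled, though the R script may round differently):

```shell
n_files=1387   # hypothetical total returned by the count query
pct=5

# Ceiling of pct% of n_files using integer arithmetic
n_sample=$(( (n_files * pct + 99) / 100 ))
echo "$n_sample"   # 70
```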
Extract column names in bash:
head -1 data/subsp_combined.csv > data/colnames.csv
1.2. Subsampling
Subsample 3, 10, 100, 250, 500 and 1000 images per sample and repeat 10 times (total: 60 files): bash script 1-2_subsampling.sh
This script calls 2 other scripts, extract.awk and indices.R.
Arguments:
- -d: path where to get and write the data
- -p: path where to get the scripts (no slash at the end)
- -f: filename of the combined images (default: combined.csv)
- -c: filename with the counts per sample after query/cleaning (default: counts_cleaned.csv)
- -s: type of sampling, proportion of the file (proportion) vs. number of images per sample (date, the default)
- -i: replicate number
- -n: number of images to sample per file or proportion of the file to sample
Output: \(\small{\textit{combined_sampled{i}_{n}im.csv}}\)
# bash code
for i in 3 10 100 250 500 1000 # images subsampled per sample (you might want to try other values)
do
  for j in {1..10} # replicates
  do
    bash scripts/1-2_subsampling.sh -d data/ -p scripts -f subsp_combined.csv \
      -c subsp_counts_cleaned.csv -s date -i $j -n $i
  done
done
rm data/idxToSample.csv # remove the csv file with the random indices generated at each iteration
Sometimes, a couple of the subsampled files keep a header. To make sure none of them has one, remove lines that contain the header pattern (make sure this pattern does not appear in your filenames!).
grep "id" data/combined_sampled* | cut -d: -f1 | xargs -n1 sed -i .bak '1d' # select the files with a header and remove this header
rm data/*.bak # remove the temporary .bak files
grep "id" data/combined_sampled* # verify there are none left
1.3. Subsample Analysis
I chose to perform a weighted-PCA with Box-Cox transformation for the non-normal variables. If that’s also your case, you’ll need:
- the minimum value that each column can take in the database, to shift the columns so that their minimum is above 1
- the concentration of organisms per sample
- run the weighted-PCA with Box-Cox transformation
For all of the queries you can use the following bash command with your database information:
- -h: server
- -D: database
- -u: username
- -p: password (no space between -p and the password)
- -e: query
mysql -h images.gso.uri.edu -D plankton_images -u student -pRestmonami -e "query;" > data/mincols.csv
Minimum per column
Below is the query for the minimum. Note that I add 3 columns to the columns originally in the database: major axis/minor axis, perimeter/major axis and circularity.
SELECT MIN(Area) AS area, MIN(Biovolume) AS vol, MIN(ConvexArea) AS 'c.area',
MIN(ConvexPerimeter) AS 'c.perim', MIN(Eccentricity) AS ecc,
MIN(EquivDiameter) AS 'eq.diam', MIN(Extent) AS extent, MIN(H180) AS h180, MIN(H90) AS h90,
MIN(Hflip) AS hflip, MIN(MajorAxisLength) AS 'maj.ax',
MIN(MinorAxisLength) AS 'min.ax', MIN(Perimeter) AS perim, MIN(Solidity) AS solid,
MIN(numBlobs) AS 'nb.blobs', MIN(shapehist_kurtosis_normEqD) AS 'kurt.dist',
MIN(shapehist_mean_normEqD) AS 'mean.dist', MIN(shapehist_median_normEqD) AS 'med.dist',
MIN(shapehist_mode_normEqD) AS 'mode.dist', MIN(shapehist_skewness_normEqD) AS 'skew.dist',
MIN(summedArea) AS 's.area', MIN(summedBiovolume) AS 's.vol', MIN(summedConvexArea) AS 's.c.area',
MIN(summedConvexPerimeter) AS 's.c.perim', MIN(summedMajorAxisLength) AS 's.maj.ax',
MIN(summedMinorAxisLength) AS 's.min.ax', MIN(summedPerimeter) AS 's.perim',
MIN(texture_average_contrast) AS 'tx.contrast', MIN(texture_average_gray_level) AS 'tx.gray',
MIN(texture_entropy) AS 'tx.entropy', MIN(texture_smoothness) AS 'tx.smooth',
MIN(texture_third_moment) AS 'tx.3rd.mom', MIN(texture_uniformity) AS 'tx.unif',
MIN(RotatedBoundingBox_xwidth) AS 'r.bbox.x', MIN(RotatedBoundingBox_ywidth) AS 'r.bbox.y',
MIN(Area_over_PerimeterSquared) AS 'area.perim2', MIN(Area_over_Perimeter) AS 'area.perim',
MIN(H90_over_Hflip) AS 'h90.hflip', MIN(H90_over_H180) AS 'h90.h180',
MIN(summedConvexPerimeter_over_Perimeter) AS 's.c.perim.perim',
MIN(MajorAxisLength/MinorAxisLength) AS 'maj.min',
MIN(Perimeter/MajorAxisLength) AS 'perim.maj',
MIN(4*3.14159265359*Area_over_PerimeterSquared) AS 'circ'
FROM roi JOIN raw_files ON roi.raw_file_id = raw_files.id
JOIN auto_class ON roi.id = auto_class.roi_id
WHERE runTime != 0 AND
qc_file = 1 AND
deployment_id = 1 AND
n_roi > 250 AND
date > '2017-11-01' AND date < '2019-11-01' AND
class_id NOT IN (58,85,88,90,91,94,106,109,110,111,119,121,122,124,127,128,129) AND
!(class_id = 96 AND Area > 90000)
Concentration per sample
Second, for the weighted-PCA you will need the concentration of organisms per sample to use as weight: R script concentration.R.
- Connect to the database and query the counts, runTime, inhibitTime and mL_counted (you might want to change the criteria of the query)
- Compute the concentration: \(mL\_counted = 0.25 \times \frac{(runTime-inhibitTime)}{60}\) and \(concentration = \frac{counts}{mL\_counted}\)
- Export to csv
ATTENTION: This code is specifically designed for data from the Imaging FlowCytobot; you will need to estimate the concentration differently for other instruments.
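The formula above can be checked with a quick awk one-liner (the numbers are made up; the real computation happens in concentration.R):

```shell
# Hypothetical values: 1200 images counted, runTime 1260 s, inhibitTime 60 s
awk 'BEGIN {
  counts = 1200; runTime = 1260; inhibitTime = 60
  mL_counted = 0.25 * (runTime - inhibitTime) / 60   # 0.25 mL/min flow rate
  printf "%.1f images per mL\n", counts / mL_counted
}'
# prints "240.0 images per mL"
```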
Principal Component Analysis
Third, you can run the PCA on each sample and look at how well it performs: R script 1-3_test_nb_im.R.
This code uses the function pcatrans.R. This function is also in the GitHub repository and allows you to transform certain variables of a dataset with a logarithmic or Box-Cox transformation.
- Import files (this will need to be modified based on your instrument and database):
- concentration per sample (\(\small{\textit{concentration.csv}}\))
- name of subsampled files (\(\small{\textit{combined_sampled{i}_{n}im.csv}}\))
- column names (\(\small{\textit{colnames.csv}}\))
- minimum value in the database per column (\(\small{\textit{mincols.csv}}\)) and index of the ones below 0
- define the columns you don’t want to Box-Cox transform
- Run the transformation and the PCA
- shift columns by their minimum
- Box-Cox transform
- weighted-PCA keeping the first 6 components (the weight is the concentration of the sample)
- store information on the variance explained by the PCs and the main contributors in a tibble
- Export to csv
- Graphs
- percentage of variance explained by each of the first 6 PCs depending on the number of images sampled (Figure 3)
- occurrences of variables as 1st, 2nd and 3rd contributor for each of the first 4 PCs (Figure 4)
Output: results of the comparison between the PCAs (\(\small{\textit{subsp_results.csv}}\)) and graphs
Figure 3. Variance explained by each axis depending on the number of images subsampled. The black bars are the standard deviations, also materialized by the grey ribbon, and the green arrows indicate the values for 250 images.
Figure 4. Number of replicates for which a morphological feature is the first, second or third contributor to each of the principal components 1 to 4, depending on the number of images subsampled. The morphological variables are coloured and, for each number of images subsampled, we count how many times they appear as first, second or third contributor on the 4 axes in the 10 replicates. For example, the convex perimeter is the first contributor to principal component 1 for all 10 replicates whatever the number of images we subsample. However, when we take 3 images per sample, the second contributor to the first principal component is the major axis in 3 replicates and the perimeter in 7 replicates.
For my analysis, I decided on 250 images per sample because the variance between replicates is small and the main contributors are similar. You also need to take further analysis down the road into account: in my case, I want to look at the evolution hour by hour, so I want to keep a maximum of detail.
You will only need one of the subsamples (with the number of images you have chosen!) for step 2, so you can delete the others in bash.
find data/combined_sampled* ! -name combined_sampled1_250im.csv | xargs -n1 rm # change your sample name
2. Analysis and reprojection
2.1. Data
As for the subsampling, query the data through bash, but this time query your whole dataset.
mysql -h images.gso.uri.edu -D plankton_images -u student -pRestmonami -e "query;" > data/combined.csv
For reference, here is the query I used; it took 411 minutes (about 7 h) to pull 110 741 191 rows from the database (but I have very specific conditions in my query that make it much slower):
SELECT roi.id, roi_number, filename, date, Area AS area, Biovolume AS vol, ConvexArea AS 'c.area', ConvexPerimeter AS 'c.perim', Eccentricity AS ecc, EquivDiameter AS 'eq.diam', Extent AS extent, H180 AS h180, H90 AS h90, Hflip AS hflip, MajorAxisLength AS 'maj.ax', MinorAxisLength AS 'min.ax', Perimeter AS perim, Solidity AS solid, numBlobs AS 'nb.blobs', shapehist_kurtosis_normEqD AS 'kurt.dist', shapehist_mean_normEqD AS 'mean.dist', shapehist_median_normEqD AS 'med.dist', shapehist_mode_normEqD AS 'mode.dist', shapehist_skewness_normEqD AS 'skew.dist', summedArea AS 's.area', summedBiovolume AS 's.vol', summedConvexArea AS 's.c.area', summedConvexPerimeter AS 's.c.perim', summedMajorAxisLength AS 's.maj.ax', summedMinorAxisLength AS 's.min.ax', summedPerimeter AS 's.perim', texture_average_contrast AS 'tx.contrast', texture_average_gray_level AS 'tx.gray', texture_entropy AS 'tx.entropy', texture_smoothness AS 'tx.smooth', texture_third_moment AS 'tx.3rd.mom', texture_uniformity AS 'tx.unif', RotatedBoundingBox_xwidth AS 'r.bbox.x', RotatedBoundingBox_ywidth AS 'r.bbox.y', Area_over_PerimeterSquared AS 'area.perim2', Area_over_Perimeter AS 'area.perim', H90_over_Hflip AS 'h90.hflip', H90_over_H180 AS 'h90.h180', summedConvexPerimeter_over_Perimeter AS 's.c.perim.perim', MajorAxisLength/MinorAxisLength AS 'maj.min', Perimeter/MajorAxisLength AS 'perim.maj', 4*3.14159265359*Area_over_PerimeterSquared AS 'circ', class_id
FROM roi JOIN raw_files ON roi.raw_file_id = raw_files.id
JOIN auto_class ON roi.id = auto_class.roi_id
WHERE runTime != 0 AND
qc_file = 1 AND
deployment_id = 1 AND
n_roi > 250 AND
date > '2017-11-01' AND date < '2019-11-01' AND
class_id NOT IN (58,85,88,90,91,94,106,109,110,111,119,121,122,124,127,128,129) AND
!(class_id = 96 AND Area > 90000)
ORDER BY date, roi_number
ATTENTION: The subsample file in step 1 was exported from R to csv and, as such, is comma delimited, while the combined file we just queried from bash is tab delimited.
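If you need both files in the same format, a tab-delimited file can be converted with tr. A sketch on a toy file (safe only as long as no field itself contains a comma):

```shell
# Toy tab-delimited extract (stands in for the file queried through mysql)
printf 'id\tarea\n1\t10\n' > combined_tab.csv

# Replace every tab with a comma
tr '\t' ',' < combined_tab.csv > combined_comma.csv

cat combined_comma.csv   # id,area then 1,10
```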
The cleaning left consists of removing lines with NULL (or NA, depending on your database) values, because they correspond to very small cells for which the morphology could not be computed in our case (leaving 110 741 171 rows), and removing the header line. In bash, run:
grep -v "NULL" data/combined.csv > data/combined_cleaned.csv # remove lines with NULL values
head -1 data/combined_cleaned.csv > data/colnames.csv # keep the header in case
sed -i .bak '1d' data/combined_cleaned.csv # remove the header
rm data/combined.csv */*.bak
As an example, here is what the first columns and lines of my dataset look like. As I mentioned, it has 110 741 171 images (rows) and 47 variables (columns).
To subsample, you will need the number of images per sample. The following bash code looks at the third column (filename) and counts the occurrences of each one of them; the second awk just reorganizes the file to have "filename,counts". For my file the code took 45 min.
time awk '{print $3}' data/combined_cleaned.csv | uniq -c | awk 'NR==1; NR > 1 {print $2","$1 | "sort"}' > data/counts.csv # the 2nd awk reorganizes
sed -i .bak '1d' data/counts.csv # remove first line with the count for "filename"
head -1 data/subsp_counts_cleaned.csv | cat - data/counts.csv > data/counts_cleaned.csv # add a header based on the header of the subsampled counts (filename, count)
rm */*.bak data/counts.csv
# if I have removed the header
echo "filename, counts" > counts_cleaned.csv
awk 'NR==1; {print $2","$1}' counts.csv | sed -n 2p >> counts_cleaned.csv
tail -n+2 counts.csv >> counts_cleaned.csv
2.2. PCA on subsampling
Subsample with the number of images you determined in step 1 (in my case, 250).
time bash scripts/1-2_subsampling.sh -d data -p scripts -f combined_cleaned.csv \
-s date -n 250 -c counts_cleaned.csv
Run the PCA and clustering on the subsampled file, using the R script 2-2_pca.R.
- Load the data (you might need to change the name of the file based on the number of images you subsampled)
- weighted-PCA with Box-Cox transformation
- Export
Outputs: a csv file with the Box-Cox coefficients used for the transformation (\(\small{\textit{lambda.csv}}\)) and an R file with the result object of the PCA (\(\small{\textit{res_pca.rds}}\))
2.3. Reprojection
Reproject all images in the PCA space, using the R script 2-3_reprojection.R.
- Load the data
- Define the function to apply on each chunk of the csv file (reprojection+knn)
- use the Box-Cox coefficients from the transformation of the subsample and apply them to each chunk of data
- use the SVD matrix from the weighted-PCA run on the subsample to reproject all images in the morphological space
- Run the reprojection on each chunk
Outputs: a file with the coordinates in the morphological space for each image (\(\small{\textit{full_data_pca.csv}}\))
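Chunked processing of a file too big for memory can be mimicked in the shell with split (the chunk size, file names and the doubling step are all illustrative; the R script handles chunking and the actual reprojection internally):

```shell
# Toy one-column dataset (stands in for the 110-million-row csv)
seq 1 10 > full_data.csv

# Cut it into 4-line chunks: split names them chunk_aa, chunk_ab, chunk_ac
split -l 4 full_data.csv chunk_

# Process each chunk independently (doubling stands in for the reprojection),
# then concatenate the per-chunk results in order
for f in chunk_aa chunk_ab chunk_ac; do
  awk '{print $1 * 2}' "$f" > "$f.out"
done
cat chunk_aa.out chunk_ab.out chunk_ac.out > full_data_processed.csv

wc -l < full_data_processed.csv   # 10, one output row per input row
```

Because the Box-Cox coefficients and the SVD matrix come from the subsample, every chunk is transformed identically, so the order of processing does not matter.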
The data is ready for analysis!!
References
Irisson, J.-O., M. Cailleton, C. Desnos, L. Jalabert, A. Elineau, L. Stemmann, and S.-D. Ayata. 2020. "Morphological Diversity Increases with Oligotrophy along a Zooplankton Time Series." Ocean Sciences Meeting 2020.
Lombard, Fabien, Emmanuel Boss, Anya M. Waite, Meike Vogt, Julia Uitz, Lars Stemmann, Heidi M. Sosik, et al. 2019. “Globally Consistent Quantitative Observations of Planktonic Ecosystems.” Frontiers in Marine Science 6 (April): 196. https://doi.org/10.3389/fmars.2019.00196.
Trudnowska, E., L. Lacour, M. Ardyna, et al. 2021. "Marine Snow Morphology Illuminates the Evolution of Phytoplankton Blooms and Determines Their Subsequent Vertical Export." Nature Communications 12: 2816. https://doi.org/10.1038/s41467-021-22994-4.
Vilgrain, Laure, Frédéric Maps, Marc Picheral, Marcel Babin, Cyril Aubry, Jean‐Olivier Irisson, and Sakina‐Dorothée Ayata. 2021. “Trait‐based Approach Using in Situ Copepod Images Reveals Contrasting Ecological Patterns across an Arctic Ice Melt Zone.” Limnology and Oceanography 66 (4): 1155–67. https://doi.org/10.1002/lno.11672.