Automatic Image Processing to Determine the Community Size Structure of Riverine Macroinvertebrates

Rosa Gurí; Ignasi Arranz; Marc Ordeix; Carmen García-Comas

doi:10.3791/64320

Published: January 13, 2023 doi: 10.3791/64320

Rosa Gurí¹,⁴, Ignasi Arranz²,⁴, Marc Ordeix¹,⁴, Carmen García-Comas³,⁴

¹Center for the Study of Mediterranean Rivers (CERM), Universitat de Vic - Universitat Central de Catalunya, ²Laboratoire Evolution et Diversité Biologique (EDB), UMR5174, Université Toulouse 3 Paul Sabatier, Centre national de la recherche scientifique (CNRS), Institut de Recherche pour le Développement (IRD), ³Department of Marine Biology and Oceanography, Institut de Ciències del Mar, Consejo Superior de Investigaciones Científicas (CSIC), ⁴Aquatic Ecology Group, Universitat de Vic - Universitat Central de Catalunya

Summary

The article is based on the creation of an adapted protocol to scan, detect, sort, and identify digitized objects corresponding to benthic river macroinvertebrates using a semi-automatic imaging procedure. This procedure allows the acquisition of the individual size distributions and size metrics of a macroinvertebrate community in about 1 h.

Abstract

Body size is an important functional trait that can be used as a bioindicator to assess the impacts of perturbations in natural communities. Community size structure responds to biotic and abiotic gradients, including anthropogenic perturbations across taxa and ecosystems. However, the manual measurement of small-bodied organisms such as benthic macroinvertebrates (e.g., >500 µm to a few centimeters long) is time-consuming. To expedite the estimation of community size structure, here, we developed a protocol to semi-automatically measure the individual body size of preserved river macroinvertebrates, which are one of the most commonly used bioindicators for assessing the ecological status of freshwater ecosystems. This protocol is adapted from an existing methodology developed to scan marine mesozooplankton with a scanning system designed for water samples. The protocol consists of three main steps: (1) scanning subsamples (fine and coarse sample size fractions) of river macroinvertebrates and processing the digitized images to individualize each detected object in each image; (2) creating, evaluating, and validating a learning set through artificial intelligence to semi-automatically separate the individual images of macroinvertebrates from detritus and artifacts in the scanned samples; and (3) depicting the size structure of the macroinvertebrate communities. In addition to the protocol, this work includes the calibration results and enumerates several challenges and recommendations to adapt the procedure to macroinvertebrate samples and to consider for further improvements. Overall, the results support the use of the presented scanning system for the automatic body size measurement of river macroinvertebrates and suggest that the depiction of their size spectrum is a valuable tool for the rapid bioassessment of freshwater ecosystems.

Introduction

Benthic macroinvertebrates are broadly used as bioindicators to determine the ecological status of water bodies¹. Most indices to describe macroinvertebrate communities focus on taxonomic metrics. However, new bioassessment tools that integrate body size are encouraged to provide an alternative or complementary perspective to taxonomic approaches²,³.

Body size is considered a metatrait that is related to other vital traits such as metabolism, growth, respiration, and movement⁴. Furthermore, body size can determine trophic position and interactions⁵. The relationship between individual body size and the normalized biomass (or abundance) by size class in a community is defined as the size spectrum⁶ and follows the general pattern of a linear decrease in normalized biomass as individual size increases on a logarithmic scale⁷. The slope of this linear relationship has been extensively studied theoretically, and empirical studies across ecosystems have used it as an ecological indicator of the community size structure⁴. Another synthetic indicator of community size structure that has been successfully used in biodiversity-ecosystem functioning studies is community size diversity, which is represented as the Shannon index of the size classes of the size spectrum or its analog, which is calculated based on the individual size distributions⁸.

In freshwater ecosystems, the size structure of different faunal groups is used as an ataxic (taxonomy-independent) indicator to assess the response of biotic communities to environmental gradients⁹,¹⁰,¹¹ and to anthropogenic perturbations¹²,¹³,¹⁴,¹⁵,¹⁶. Macroinvertebrates are no exception: their size structure also responds to environmental changes¹⁷,¹⁸ and to anthropogenic perturbations such as mining¹⁹, land use²⁰, or nitrogen (N) and phosphorus (P) enrichment²⁰,²¹,²². However, measuring hundreds of individuals to describe the community size structure is a tedious and time-consuming task, so laboratories often leave it out of their routine measurements. Several semi-automatic or automatic imaging methods to classify and measure specimens have therefore been developed²³,²⁴,²⁵,²⁶. Most of these methods, however, focus on taxonomic classification rather than on the individual size of the organisms and are not ready to use for all kinds of macroinvertebrates. In marine plankton ecology, a scanning image analysis system has been extensively used to determine the size and taxonomic composition of zooplankton communities²⁷,²⁸,²⁹,³⁰,³¹. This instrument can be found in several marine institutes worldwide, and it is used to scan preserved zooplankton samples to obtain high-resolution digital images of the entire sample. The present protocol adapts the use of this instrument to estimate the macroinvertebrate community size spectrum in rivers rapidly and automatically, without the need to invest in developing a new device.

The protocol consists of scanning a sample and processing the whole image to automatically obtain single images (i.e., vignettes) of the objects in the sample. Several measures of shape, size, and grey-level features characterize each object and allow for the automatic classification of the objects into categories, which are then validated by an expert. The individual size of each organism is calculated using the ellipsoidal biovolume (mm³), which is derived from the area of the organism measured in pixels. This allows for obtaining the size spectrum of the sample in a rapid manner. To the best of our knowledge, this scanning imaging system has only been used to process mesozooplankton samples, but the device may potentially allow for working with freshwater benthic macroinvertebrates.

The overall goal of this study is, therefore, to introduce a method to rapidly obtain the individual size of preserved river macroinvertebrates by adapting an existing protocol previously used with marine mesozooplankton²⁷,³²,³³. The procedure is semi-automatic: it uses a scanning device designed for water samples and three open software applications to process the scanned images. An adapted protocol to scan, detect, and identify digitized river macroinvertebrates to automatically acquire the community size structure and related size metrics is herein presented. An assessment of the procedure and guidelines to enhance its efficiency are also presented, based on 42 scanned images of riverine macroinvertebrate samples collected from three basins in the North-Eastern (NE) Iberian Peninsula (Ter, Segre-Ebre, and Besòs).

The samples were collected along 100 m river stretches following the protocol for field sampling and laboratory analysis of benthic river macroinvertebrates in fordable rivers from the Spanish Government³⁴. The samples were collected with a Surber sampler (frame: 0.3 m × 0.3 m, mesh: 250 µm) following a multi-habitat survey. In the laboratory, the samples were cleaned and sieved through a 5 mm and a 500 µm mesh to obtain two subsamples: a coarse subsample (5 mm mesh) and a fine subsample (500 µm mesh), which were stored in separate vials and preserved in 70% ethanol. Separating the sample into two size fractions allows for a better estimation of the community size structure, since large organisms are far less abundant than small ones. Without this separation, the large size fraction is poorly represented in the scanned sample, which biases the estimated size structure.

Protocol

NOTE: The protocol described here is based on the system developed by Gorsky et al.²⁷ for marine mesozooplankton. A specific description of the scanner (ZooSCAN), scanning software (VueScan 9x64 [9.5.09]), image processing software (Zooprocess, ImageJ), and automatic identification software (Plankton Identifier) steps can be found in previous references³²,³³. To adapt the size limits to benthic macroinvertebrates, which are larger than mesozooplankton, once the project is created following the original protocol³²,³³, change the parameter of minimum size (minsizeesd_mm) to 0.3 mm and the parameter of maximum size (maxsizeesd_mm) to 100 mm in the configuration file. The protocol is summarized in a work chart (Figure 1). The created project is stored on the computer's C: drive and is organized in the following folders: PID_process, Zooscan_back, Zooscan_check, Zooscan_config, Zooscan_meta, Zooscan_results, and Zooscan_scan. Each folder contains several subfolders that the different software applications use in the following steps of the protocol.

1. Acquisition of digital images for macroinvertebrate samples

Scanning and processing the blank
NOTE: Create two blank images daily before scanning to extract the background scans while processing the scanned images on the same day.
1. Turn on the scanner and switch on the light in the dual position to project white light from the top and from the bottom.
  NOTE: When scanning mesozooplankton samples, the upward light direction is used, but because macroinvertebrates are more opaque, it is recommended to switch the light to a dual position.
2. Clean and rinse the scan tray with tap water.
3. Pour 110 mL of tap water stored at room temperature (RT) into the scan tray until the glass is covered. Place the large frame (24.5 cm x 15.8 cm) on the scan tray in the correct position (with the corner at the top-left part of the scan tray), and fill it with tap water until the step of the frame is covered to avoid a meniscus effect, which would alter the scanned image. Close the scanner lid.
  NOTE: Use water at RT to avoid condensation and bubble formation. Make sure the frame is clean, with no marks or droplets, to avoid light reflection.
4. Go to the image processing software, select the working project, and click on Scan (Convert) Background Image.
5. Go to the scanning software and click on Preview. Inspect the preview of the scanned image, check that there are no lines or spots, and wait for at least 30 s before starting another scan. Click on Scan and press OK in the instructions window before the second scan to send the data from the scanning software to the image processing software.
  NOTE: Scan twice to obtain the two background scans that will comprise the blank. This step is done once every day before starting the sample processing, and the images are stored in the Zooscan_back folder.
6. Close the scanning software after finishing the scan.
Sample preparation and scanning
CAUTION: Ethanol is a flammable liquid and could cause serious eye damage/irritation.
1. Fill in the sample metadata. Go to the image processing software and select Fill in Sample Metadata. Enter the sample identity, click on OK, and fill in the metadata.
  NOTE: The metafile is specifically created for mesozooplankton samples, so it does not fit the benthic macroinvertebrate sampling methodology, yet all fields of the file need to be filled in before the scan, or an error flag will pop up.
2. Pour 110 mL of 70% ethanol into the scan tray until the glass is covered and place the large frame (24.5 cm x 15.8 cm) with the corner at the top-left part of the scan tray.
  NOTE: Work with ethanol instead of water, as the macroinvertebrates are preserved in ethanol. In water, they float and drift in the scan tray, preventing a sharp image and, thus, reliable size measurements. Keep the ethanol at RT to avoid condensation and bubble formation.
3. Pour the macroinvertebrates sample into the scan tray edged by the frame, and cover the frame step with more ethanol if needed.
  NOTE: Refrain from adding too much ethanol to avoid the organisms floating and drifting.
4. Homogenize the sample throughout the frame area, placing the largest individuals in the center of the tray for proper image processing, and sink the floating organisms using a wooden needle.
  NOTE: If a subsample numerically contains more than 1,000 individuals, divide the subsample into two or more fractions to minimize touching organisms in the scanned image, and scan the fractions separately.
5. Separate the touching organisms and the organisms touching the frame edges using the wooden needle.
  NOTE: This step requires 5-20 min. Touching organisms are considered a single object by the software; thus, in those cases, the calculated individual sizes do not correspond to actual single organisms and can bias the estimate of the community size structure. There is the possibility of editing the image with the image processing software to separate them, but this additional step involves at least 1.5 h of reprocessing; thus, manual separation is highly recommended.
6. To scan the sample, close the scanner lid, go to the image processing software, select the working project, and click on SCAN Sample with Zooscan (For Archive, No Process).
7. Select the sample and follow the instructions.
8. Go to the scanning software and click on Preview. Inspect the preview of the scanned image, check that there are no lines or spots, and wait at least 30 s before starting another scan.
9. After at least 30 s, click on the Scan button in the scanning software.
  NOTE: Press OK in the image processing software after pressing Scan in the scanning software. Do not press any key on the computer keyboard, and avoid vibrations of the scanner during the scan. Three files are generated in the Zooscan_scan > _raw folder: (i) a 16-bit image in tagged image file format (.tif); (ii) a standard text document named LOG (.txt) that records information on the scanning parameters; and (iii) a standard text document named META (.txt) with information on the sampling methods.
10. Verify that the raw scan is correct.
  NOTE: If the scan has light stripes or other visible issues, consider repeating the scan to avoid problems in the following steps.
Sample recovery
1. Remove the frame and rinse it above the scan tray using a squeeze bottle filled with 70% ethanol to recover any attached macroinvertebrates.
2. Lift the upper part of the scanner to retrieve all the organisms and ethanol from the tray through the scan retrieval funnel into a beaker. With the upper part of the scanner still lifted, rinse the tray with the squeeze bottle to sweep along any remaining organisms.
3. Pass the specimens and ethanol from the beaker through a 500 µm mesh to retain the invertebrates in the mesh, and store them back in a vial with 70% ethanol.
4. Once all the specimens are recovered in the vial, clean the tray with tap water.
  NOTE: Wash the tray with tap water between samples to minimize ethanol precipitation, which alters the image processing. Rinse the frame with tap water to avoid potential damage related to ethanol use. At the end of the day, clean the tray using tap water and dry it gently with paper to avoid scratches.
Image processing
1. Go to the image processing software and select CONVERT & PROCESS Images and Organisms in Batch Mode and then Convert AND Process Image AND Particles (Image in RAW Folder). Keep the default settings and click on OK. NORMAL END will appear at the end of the process.
  NOTE: A PID file and the vignettes corresponding to all the detected objects in the scanned image (as Joint Photographic Experts Group files [.jpg]) will be created in the Zooscan_scan > _work folder. A PID file is a single file that stores all the metadata (metafile), the technical data associated with the log file, and a table with 36 measured variables of all the objects detected in the image. The measured variables correspond to different estimates of grey level, fractal dimension, shape, and size. The variables that can be used for size estimation are the area and the major and minor axes of an ellipse with an equal area to the object (see section 3 of the protocol). The processing time depends on the image density and the computer characteristics. Processing can be launched between samples, while recovering one sample and preparing the next. Otherwise, it is recommended to launch the processing of the samples scanned each day in batch mode during the night and check for proper image processing the next morning.
2. Check if the background in the processed image is appropriately subtracted from the sample image using the image processing software or by checking the mask images (terminated in msk1.gif) located in Zooscan_scan > _work. If the background contains saturated areas or many dots, consider repeating the scan to ensure high-quality images.
  NOTE: To avoid saturated areas in the background, the scan tray should be rinsed with tap water after every scan with ethanol. It is also important to (1) reduce the number of scanned individuals (by fractionating the sample and scanning the fractions separately); (2) ensure that big organisms are placed in the center of the scan tray; (3) use clean, filtered ethanol; (4) reduce the amount of dirt in the samples; (5) ensure that the volume of ethanol for the scanning is adequate; and (6) ensure that the delay between the preview of the sample and the scan is at least 30 s.
Separation of touching organisms
NOTE: When there are several vignettes with touching organisms, it is necessary to separate the images of the touching organisms from other organisms and/or from fibers/debris to ensure a proper estimate of the community size structure.
1. Go to the image processing software to detect the vignettes with multiple objects. Select SEPARATION Using Vignettes and press on OK. In the configuration selection window, keep the default settings and click on OK.
2. In the SEPARATION from VIGNETTES window, keep the default settings, additionally select ADD Outlines on Vignettes, and then select the sample to edit.
3. Separate the touching organisms in each vignette that pops up by drawing a line with the mouse (press the scroll wheel button to draw). Once the separation in a vignette is complete, click the X button in the upper-right corner of the window, and press YES to process the next one. Press NO to end and save the changes. At the end of the process, NORMAL END will appear if everything is correct.
4. After separation, reprocess the image to obtain the updated object data. Go to the image processing software, click on PROCESS (Converted) Image (Process One), and select Process Again Particles from Processed Images in WORK Sub-Folders. Select the sample, and in the Single Image Process window, keep the default settings, check Work with Separation Mask (CREATE-MODIFY-INCLUDE), and then click on OK. At the end of the process, NORMAL END will appear if everything is correct.
5. In the Separation Control window, press OK to save the image with the contours before the processing; if a previous image exists, it will be replaced.
6. In the Separation Control Mask window, if needed, select EDIT to add separation lines to the mask using the mouse to separate touching organisms that have not appeared before in the separation using vignettes step. When finished, end the process, and in the Separation Mask Control window, select YES to accept the mask. At the end of the process, NORMAL END will appear if everything is correct.
  NOTE: Reprocessing a sample with a separation mask is time-consuming (this could take more than 1.5 h per sample). It is preferable to dedicate the required time in step 1.2.5 to avoid this additional step.

2. Automatic recognition of the objects

NOTE: Create a learning set to automatically predict the identity of the detected objects, thus separating the organisms from the debris in the sample.

Learning set creation
1. Copy the images and the .pid files associated with the images that will be used to create the categories of the learning set from Zooscan_scan > _work to PID_process > Unsorted_vignettes_pid.
  NOTE: Select a subset of samples with high taxa diversity and different sampling sites and/or sampling seasons to ensure maximum representativeness of organisms in the samples.
2. In the PID_process > Learning set folder, create a subfolder with the name of the new learning set (e.g., yyyymmdd_raw_LS), and inside it, create the subfolders that will correspond to each category of the learning set (e.g., macroinvertebrates, debris, other invertebrates).
  NOTE: To efficiently obtain the community size structure of river macroinvertebrate samples, it is recommended to use a learning set based on just three categories: macroinvertebrates, other invertebrates, and debris. This learning set basically separates the vignettes of objects corresponding to organisms from those corresponding to debris (e.g., fibers, particles, or filamentous algae).
3. Go to the image processing software (Advanced mode only) and choose EXTRACT Vignettes for PLANKTON IDENTIFIER (unsorted vignettes for training). Keep the default options and check the Add Outlines box.
4. Go to the automatic identification software, click on Learning, select from PID_process > Learning_set the created subfolder for the new learning set (step 2.1.2), and press OK.
5. In the left section (Unsorted Thumbs) of the open window, select the folder Unsorted_vignettes_pid. Select the vignettes and drag them with the mouse from the unsorted thumbs to the folder of their corresponding category in the right section (Sorted Thumbs) to classify each object into the defined categories. The moved vignettes will be marked with a red X.
  NOTE: Define the categories manually by creating subfolders in the sorted thumbs folder or create them by clicking on the folders icon in the software. Do not move more than 50 vignettes at the same time.
6. Once all the categories are completed with the selected objects (about 300 objects per category), click on Create Learning File and save it with the desired name.
  NOTE: The learning set will be saved as a .pid file in the PID_process > Learning set folder of the project. It is recommended to create and test several learning sets with different levels of categories (from coarse to fine forms) and with a different balance of the number of objects within each category. Start with a coarse learning set with a low number of categories and at least 50 objects per category, and then increase the number of objects in each category and/or create finer learning sets. A category should be representative of its variability in the set of samples.
Evaluation of the learning set
NOTE: Perform cross-validation with two folds and five trials using the Random Forest method with the automatic identification software to obtain a confusion matrix of the resulting classification of the objects.
1. Go to the automatic identification software and click on Data Analysis.
2. In Select learning file, select the created learning set file from PID_process > Learning set.
3. In Select a method, choose the Cross-Validation Random Forest method. In Original Variables, untick the position variables (X, Y, XM, YM, BX, BY, and Height). In Customized Variables, tick only ESD.
  NOTE: This method uses one random part of the learning set to recognize the other part (two folds), and this is repeated five times to ensure it is statistically robust.
4. Click on Start Analysis, and save the results as Analysis_name.txt in the PID_process > Prediction folder. When the analysis has been successfully completed, quit the data analysis.
5. Go to the PID_process > Prediction folder and click on the cross-validation file. A window will pop up with the confusion matrix of the true classification (rows) versus the automatic classification (columns).
  NOTE: The recall is the percentage of objects truly belonging to a group that the algorithm correctly assigned to that group. The contamination (1-precision) is the percentage of objects assigned by the algorithm to a group that do not actually belong to it. The recall should be above 70%, and the contamination (1-precision) should be lower than 20%.
6. Repeat steps 2.2.1-2.2.5 if several learning sets were created and the recall and 1-precision of each one need to be obtained.
  NOTE: If several learning sets have been created, choose the one with the greatest recall (good recognition) and precision (low contamination) of the group of interest (i.e., macroinvertebrates) to test the automatic prediction of a set of samples in the next step.
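The recall and contamination criteria above can be checked directly from a confusion matrix. The following is a minimal Python sketch, assuming the matrix has been typed in by hand with true categories in rows and automatic categories in columns, as described in step 5. The counts and the function name are hypothetical and are not output of the automatic identification software.

```python
def recall_and_contamination(matrix, labels):
    """Return {label: (recall, contamination)} from a confusion matrix.

    matrix[i][j] is the number of objects whose true category is labels[i]
    and whose automatic category is labels[j].
    """
    stats = {}
    for k, label in enumerate(labels):
        true_total = sum(matrix[k])                    # all objects truly in the group
        predicted_total = sum(row[k] for row in matrix)  # all objects assigned to the group
        hits = matrix[k][k]
        recall = hits / true_total if true_total else float("nan")
        contamination = 1 - hits / predicted_total if predicted_total else float("nan")
        stats[label] = (recall, contamination)
    return stats


# Hypothetical counts for a three-category learning set (300 objects each).
labels = ["macroinvertebrates", "other_invertebrates", "debris"]
matrix = [
    [270, 10, 20],
    [15, 240, 45],
    [30, 20, 250],
]

stats = recall_and_contamination(matrix, labels)
for label, (recall, contamination) in stats.items():
    # Thresholds from the protocol: recall above 70%, contamination below 20%.
    verdict = "acceptable" if recall > 0.70 and contamination < 0.20 else "review"
    print(f"{label}: recall={recall:.2f}, contamination={contamination:.2f} ({verdict})")
```

Running the same function on the matrix of each candidate learning set gives a simple way to compare them on the group of interest.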
Prediction of the identification of macroinvertebrates
NOTE: Use the selected learning set to predict the identity of all the objects in a subset of samples using the automatic identification software with a random forest algorithm.
1. Go to the automatic identification software and click on Data Analysis.
2. In Select Learning file, select the learning set file from PID_process > Learning set that must be used for the prediction.
3. In Select Sample File(s), select from the PID_results folder the samples (PID files) that are going to be predicted.
  NOTE: Process a maximum of 20 .pid files at the same time to avoid errors related to memory problems. If too many .pid files are processed at the same time, the process will show a correct end but may not be processed well, and an error may occur in the next steps when processing with the image processing software.
4. In Select a Method, choose the Random Forest method. Tick Save Detailed Results for Each Sample. In Original Variables, untick the position variables (X, Y, XM, YM, BX, BY and Height). In Customized Variables, tick only ESD.
5. Click on Start Analysis, and save the results as Analysis_name.txt in the PID_process > Prediction folder.
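When many samples need predicting, the 20-file limit mentioned in step 3 can be respected by planning the runs in advance. This is a small Python sketch that only groups file names into batches. The file names are hypothetical, and the helper functions are illustrative rather than part of any of the software applications used in the protocol.

```python
from pathlib import Path

MAX_FILES_PER_RUN = 20  # limit recommended in the protocol to avoid memory errors


def pid_files(folder):
    """Return the .pid files found in a folder, sorted by name."""
    return sorted(Path(folder).glob("*.pid"))


def batches(items, size=MAX_FILES_PER_RUN):
    """Split a sequence into consecutive batches of at most `size` items."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical file names standing in for 42 scanned subsamples.
names = [f"sample_{i:02d}_dat1.pid" for i in range(1, 43)]
for number, batch in enumerate(batches(names), start=1):
    print(f"run {number}: {len(batch)} files, {batch[0]} to {batch[-1]}")
```

In practice, `pid_files` could be pointed at the folder holding the PID files to build the list, with each batch then selected by hand in the Select Sample File(s) dialog.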
Manual validation
NOTE: An expert manually validates the prediction from the previous step to reclassify misclassified objects into the correct category.
1. Copy the Analysis_sample_dat1.txt files to be validated from the PID_process > Prediction folder to the PID_process > Pid_results folder.
2. Go to the image processing software and select EXTRACT Vignettes in Folders According to PREDICTION or VALIDATION. Then, select Use PREDICTED Files From "pid_results" Folder. Keep the default settings and press OK.
3. The software creates a folder called sample_yyyymmdd_hhmm_to_validate with the predicted objects in the PID_process > Sorted vignettes folder.
4. Go to the PID_process > Sorted vignettes folder, and copy the folder sample_yyyymmdd_hhmm_to_validate. Rename the copy by replacing the suffix _to_validate with _validated.
5. To manually validate the automatic classification, open the folder sample_yyyymmdd_hhmm_validated, and review all the vignettes in each subfolder (category) to identify any misclassified objects. When an object is misclassified, drag its vignette with the mouse to the correct folder (category).
6. Go to the image processing software and select LOAD Identifications from Sorted Vignettes. Keep the default settings and select yyyymmdd_hhmm_name_validated to be processed.
7. Go to PID_process > Pid_results > Dat1_validated, where a file named Id_from_sorted_vignettes_yyyymmdd_hhmm.txt and one .txt file for each one of the validated samples (sample_tot_1_dat1.txt) have been created.
  NOTE: These .txt files contain a new column that presents the prediction, called pred_valid_Id_yyyymmdd_hhmm, which specifies the expert classification of each object (i.e., the validated classification). New categories (e.g., finer taxonomic categories) could be created at this point, during validation. However, keep the name of the original category in the new name (e.g., macroinvertebrate_chironomidae). This allows for retracing the original category when calculating the recall and precision and for easily grouping all the macroinvertebrates to calculate the community size structure parameters (i.e., the size spectrum and size diversity). The text file provides the data associated with each object, including the minor and major axes that are used to obtain the ellipsoidal volume of each organism as a measure of individual body size. Moreover, the last two columns of the table contain the predicted and validated categories of each object (row), which allow for calculating, by category, the recall and precision of the learning set on the subset of samples.
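The naming rule described above (keeping the original category as a prefix of any finer category) makes it straightforward to retrace categories when computing recall and precision. The sketch below, in Python, assumes the predicted and validated labels of each object have already been read from the last two columns of the text file into pairs. The labels, category names, and function names are hypothetical, and the parsing of the file itself is deliberately left out because its exact layout is not described here.

```python
def original_category(name, originals):
    """Map a (possibly finer) category name back to its original category.

    Relies on the convention that finer names start with the original name,
    e.g. 'macroinvertebrate_chironomidae' -> 'macroinvertebrate'.
    """
    matches = [o for o in originals if name.startswith(o)]
    return max(matches, key=len) if matches else name


def recall_precision(pairs, originals):
    """Return {category: (recall, precision)} from (predicted, validated) pairs."""
    counts = {o: {"hits": 0, "predicted": 0, "true": 0} for o in originals}
    for predicted, validated in pairs:
        p = original_category(predicted, originals)
        v = original_category(validated, originals)
        if p in counts:
            counts[p]["predicted"] += 1
        if v in counts:
            counts[v]["true"] += 1
        if p == v and p in counts:
            counts[p]["hits"] += 1
    result = {}
    for category, c in counts.items():
        recall = c["hits"] / c["true"] if c["true"] else float("nan")
        precision = c["hits"] / c["predicted"] if c["predicted"] else float("nan")
        result[category] = (recall, precision)
    return result


# Hypothetical (predicted, validated) labels for six objects.
originals = ["macroinvertebrate", "other_invertebrate", "debris"]
pairs = [
    ("macroinvertebrate", "macroinvertebrate_chironomidae"),
    ("macroinvertebrate", "macroinvertebrate_baetidae"),
    ("macroinvertebrate", "debris"),
    ("debris", "macroinvertebrate_chironomidae"),
    ("debris", "debris"),
    ("other_invertebrate", "other_invertebrate"),
]

scores = recall_precision(pairs, originals)
for category, (recall, precision) in scores.items():
    print(f"{category}: recall={recall:.2f}, precision={precision:.2f}")
```

The same prefix mapping can be used to pool all macroinvertebrate objects before computing the size spectrum and size diversity.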

Figure 1: Work chart representing section 1 and section 2 of the protocol. The times are illustrative and could change depending on the computer, the abundance of vignettes to process, and the number of categories of the learning set. This case corresponds to the validation of a learning set of three categories on a set of 42 subsamples (in total, 47,473 vignettes).

3. Calculating the individual size distribution, size spectra, and size metrics

NOTE: The calculations mentioned in this section were performed using Matlab (see script as Supplementary File 1).

Individual size distribution
1. The last column of the Id_from_sorted_vignettes_YYYYMMDD_HHMM.txt file contains the validated classification of the objects. Select only the objects classified as macroinvertebrates to depict their individual size distribution in the sample.
  NOTE: Individual body size corresponds to the ellipsoidal volume of the macroinvertebrate organisms. The system provides measurements in pixels.
2. Concatenate the vectors with the size measurements from both scans (coarse and fine fractions). Because each fraction may have been subsampled to a different degree, first correct for the fractionation by replicating each size vector as many times as the corresponding subsample was fractionated, and then concatenate.
  NOTE: This step is needed if a scan corresponds to a fraction of a sample (i.e., coarse or fine).
3. Calculate the ellipsoidal volume from the major (M) and minor (m) axes of prolate ellipsoids with the same pixel areas as the organisms. Before computing the ellipsoidal volume, convert the major (M) and minor (m) axes from pixels to millimeters (mm) with the following conversion factor (cf):
  Scan resolution = 2,400 dpi (pixels per inch)
  1 in = 25.4 mm
  cf = 25.4/2400 mm per pixel
  The ellipsoidal volume (ellipVol with units in mm³), with M and m expressed in mm, corresponds to:
  $$\mathrm{ellipVol} = \frac{4}{3}\,\pi\,\frac{M}{2}\left(\frac{m}{2}\right)^{2}$$
4. Depict the probability density function of the individual size distribution on the log₂ scale.
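The steps of this subsection can be sketched in a few lines of Python (the calculations in this work were done in Matlab; this sketch is an independent illustration, not the Supplementary File script). It assumes that the axes are full axis lengths in pixels, that the volume is that of a prolate ellipsoid, (4/3)·π·(M/2)·(m/2)², and that the density is approximated with a simple histogram on the log₂ scale. The axis values, replication factors, and function names are hypothetical.

```python
import math

SCAN_DPI = 2400                   # scan resolution stated in the protocol
MM_PER_PIXEL = 25.4 / SCAN_DPI    # conversion factor cf (mm per pixel)


def ellipsoidal_volume_mm3(major_px, minor_px):
    """Volume (mm^3) of a prolate ellipsoid from its two axes given in pixels."""
    semi_major = major_px * MM_PER_PIXEL / 2
    semi_minor = minor_px * MM_PER_PIXEL / 2
    return 4 / 3 * math.pi * semi_major * semi_minor ** 2


def pool_fractions(coarse, fine, coarse_factor=1, fine_factor=1):
    """Replicate each size vector by its fractionation factor, then concatenate."""
    return list(coarse) * coarse_factor + list(fine) * fine_factor


def log2_density(volumes_mm3, bin_width=1.0):
    """Histogram density of log2(volume); returns (bin start, density) pairs."""
    logs = [math.log2(v) for v in volumes_mm3]
    origin = math.floor(min(logs) / bin_width) * bin_width
    counts = {}
    for x in logs:
        k = int((x - origin) // bin_width)
        counts[k] = counts.get(k, 0) + 1
    n = len(logs)
    return [(origin + k * bin_width, c / (n * bin_width))
            for k, c in sorted(counts.items())]


# Hypothetical (major, minor) axes in pixels for the two scans of one sample.
coarse_axes = [(1200, 300), (900, 250)]
fine_axes = [(200, 60), (150, 50), (120, 40)]
coarse = [ellipsoidal_volume_mm3(M, m) for M, m in coarse_axes]
fine = [ellipsoidal_volume_mm3(M, m) for M, m in fine_axes]

# Assumption for illustration: only half of the fine subsample was scanned.
sizes = pool_fractions(coarse, fine, coarse_factor=1, fine_factor=2)
for start, density in log2_density(sizes):
    print(f"log2 size in [{start:.0f}, {start + 1:.0f}): density {density:.3f}")
```

A kernel density estimate would give a smoother curve than the histogram used here, at the cost of choosing a bandwidth.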
Size diversity
1. Calculate the size diversity (Sd) following Quintana et al. (2008)⁸, as in García-Comas et al. (2016)³⁵:
  $$S_d = -\int p_x(x)\,\log_2 p_x(x)\,dx$$
  where p_x(x) is the probability density function of size x, and x represents log₂(ellipVol). This measure is, therefore, the Shannon diversity index adapted to a continuous measure, such as the individual size distribution in a community.
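As an illustration of how a continuous Shannon index of this kind can be evaluated numerically, the Python sketch below estimates p_x(x) with a Gaussian kernel and integrates −p·log₂(p) on a regular grid. The kernel, the normal-reference bandwidth rule, the grid size, and the function names are assumptions made for this example and may differ from the estimator used in the cited references or in the Supplementary File script.

```python
import math


def gaussian_kde(sample, bandwidth=None):
    """Return (pdf, bandwidth) for a Gaussian kernel density estimate."""
    n = len(sample)
    if bandwidth is None:
        mean = sum(sample) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)  # normal-reference rule of thumb
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive (sizes must not all be equal)")
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return pdf, bandwidth


def size_diversity(volumes_mm3, grid_points=512):
    """Approximate -integral of p(x) * log2 p(x) dx, with x = log2(volume)."""
    x = [math.log2(v) for v in volumes_mm3]
    pdf, h = gaussian_kde(x)
    lower, upper = min(x) - 4 * h, max(x) + 4 * h
    step = (upper - lower) / (grid_points - 1)
    total = 0.0
    for i in range(grid_points):
        p = pdf(lower + i * step)
        if p > 0:
            weight = 0.5 if i in (0, grid_points - 1) else 1.0  # trapezoidal rule
            total -= weight * p * math.log2(p) * step
    return total


# Hypothetical ellipsoidal volumes (mm^3) of eight organisms.
volumes = [0.2, 0.3, 0.35, 0.5, 0.9, 1.4, 3.0, 12.0]
print(f"size diversity: {size_diversity(volumes):.2f}")
```

Because the index is computed on a continuous density, it can take negative values when sizes are tightly clustered; larger values indicate a wider spread of sizes.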
Normalized biovolume size spectrum (NBSS)
1. Define the size classes of the NBSS, establishing the lower bound of the spectrum as the 0.01 quantile of the macroinvertebrate size distribution in the samples and creating size classes by a geometrical scale of base 2 until the largest organism in the samples is encompassed.
  NOTE: The size class width increases with size to account for the greater variability associated with greater sizes. The NBSS of the macroinvertebrate communities analyzed here had 14 size classes (Table 1).
2. Obtain the normalized biovolume by dividing the total biovolume in each size class by the size class width.
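The two NBSS steps can be sketched as follows. This is a minimal illustration with hypothetical function names: the quantile helper uses linear interpolation, which is only one of several quantile conventions, and the lower bound of 0.1236 mm³ is taken from Table 1, whereas the largest size of 2,000 mm³ is an assumed value.

```python
import math


def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    low = int(math.floor(position))
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] * (1.0 - fraction) + sorted_values[high] * fraction


def size_class_edges(lower_bound, largest_size):
    """Double the class limit until the largest organism is encompassed."""
    edges = [lower_bound]
    while edges[-1] <= largest_size:
        edges.append(edges[-1] * 2.0)
    return edges


def normalized_biovolume(volumes, edges):
    """Total biovolume per size class divided by the class width."""
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        biovolume = sum(v for v in volumes if low <= v < high)
        rows.append({"low": low, "high": high,
                     "mid": (low + high) / 2.0,
                     "nb": biovolume / (high - low)})
    return rows


# In practice the lower bound would be quantile(sorted(volumes), 0.01).
edges = size_class_edges(0.1236, 2000.0)   # 15 limits, hence 14 classes

# Tiny hypothetical example with hand-checkable numbers.
rows = normalized_biovolume([1.0, 1.0, 3.0], [1.0, 2.0, 4.0])
```

Organisms smaller than the lower bound fall outside every class in this sketch, which mirrors the choice of the 0.01 quantile as the start of the spectrum.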
Size spectrum slope
1. Calculate the linear slope of the NBSS.
  NOTE: The slope (µ) is calculated based on the relationship between log₂(size class mid-point) and log₂(normalized biovolume) in the size classes from the modal size class upward, ignoring any empty ones (in this study, the size classes from 3 to 14).
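A least-squares fit on the log₂ scale is one straightforward way to obtain this slope. The sketch below is illustrative only: the function name and the example values are hypothetical, and it assumes that the modal size class is the one with the largest normalized biovolume.

```python
import math


def nbss_slope(midpoints, normalized_biovolume):
    """Least-squares slope of log2(normalized biovolume) on log2(mid-point).

    Only classes from the modal class upward are used, and empty classes are skipped.
    """
    # Assumption: the mode is the class with the largest normalized biovolume.
    mode = max(range(len(normalized_biovolume)), key=lambda i: normalized_biovolume[i])
    points = [(math.log2(midpoints[i]), math.log2(normalized_biovolume[i]))
              for i in range(mode, len(midpoints))
              if normalized_biovolume[i] > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical spectrum: the mode is the second class and the fifth class is empty.
mids = [1.5, 3.0, 6.0, 12.0, 24.0, 48.0]
nb = [0.5, 4.0, 2.0, 1.0, 0.0, 0.25]
slope = nbss_slope(mids, nb)
```

In this constructed example the normalized biovolume halves each time the mid-point doubles, so the fitted slope is −1.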

Size class limits (mm³)	Size class mid-point (mm³)
0.1236	0.1855
0.2473	0.3709
0.4946	0.7418
0.9891	1.4837
1.9783	2.9674
3.9560	5.9348
7.9131	11.8696
15.8261	23.7392
31.6522	47.4783
63.3044	94.9567
126.6089	189.9133
253.2178	379.8267
506.4300	759.6500
1012.9000	1519.3000
2025.7000	

Table 1: Size classes of the normalized biovolume size spectrum (NBSS). The table shows the 15 limits that delimit the 14 size classes and the mid-point of each size class.


Representative Results

Acquisition of digital images of macroinvertebrate samples
Scanning nuances: Ethanol deposition in the scan tray
While testing the system for macroinvertebrates, several scans were of poor quality. A dark saturated area in the background prevented the normal processing of the image and the measurement of the individual sizes of the macroinvertebrates (Figure 2). Several reasons have been given for the appearance of saturated areas in the background or highly pixelated images: (1) the presence of too many organisms on the scan tray; (2) the presence of dirt in the samples; (3) an insufficient delay between the preview of the sample and its scan; or (4) the use of a poor-quality background image in the image processing because of condensation, dirt, or poor water quality³³. In macroinvertebrate community samples, the use of ethanol instead of water causes precipitation on the tray, which forms a dark shadow if the tray is not properly rinsed with water between scans. Rinsing is therefore vital to obtain sharp images and to minimize any related corrosion of the scan tray glass.

Scanning nuances: Debris concentration
From the analysis of a subset of 47,473 vignettes, a high percentage (86.1%) corresponded to debris, including detritus, fibers, or body parts (such as legs or gills), or scanning artifacts (Figure 3A-E). Invertebrate organisms corresponded to the remaining 13.9% of the detected objects (Figure 3F-L). Thus, despite the previous meticulous separation of organisms from organic matter in the laboratory, plenty of small debris still remained in the vial.

Scanning nuances: Touching objects
The abundant debris increases the contact between organisms and, therefore, the number of vignettes containing aggregates, that is, multiple touching organisms or organisms attached to particles or fibers (Figure 4A-C). These vignettes are a source of bias in determining the shape of the individual size structure. In a set of five samples (11 subsamples), 10% of all the vignettes containing macroinvertebrates corresponded to groups of touching organisms or organisms touching particles or fibers. Those vignettes were edited with the image processing program in order to separate the touching organisms and the organisms with particles attached. Reprocessing the samples with the separation mask created new vignettes with the newly separated objects, which were validated to ensure their proper classification.

Automatic recognition of the objects
Learning set results
A learning set is a set of vignettes of objects classified into different categories by an expert and used in a supervised learning model, and this can also be called a training set²⁷. It is possible to work with an existing learning set, update the existing learning set with new vignettes and/or categories, or create a new learning set for a specific project.

To determine the best learning set to rapidly obtain the macroinvertebrate size structure, several learning sets were created and tested through cross-validation with the Random Forest algorithm. The resulting confusion matrix shows the true classification (rows) versus the automatic classification (columns). The recall is the percentage of objects belonging to a category that were correctly classified by the algorithm, whereas the 1-precision is the percentage of objects assigned to a category by the algorithm that do not belong to it (contamination in a category)³³. As a rule of thumb, the recall should be above 70%, and the contamination (1-precision) should be lower than 20% to keep a category in the learning set. The learning set with the greatest recall and precision for macroinvertebrates is then further validated with a subset of samples to determine its real accuracy in macroinvertebrate identification.
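Both metrics follow directly from the confusion matrix. The sketch below uses a hypothetical three-category matrix (the counts are invented for illustration and are not results of this study) with rows as the true classification and columns as the automatic classification, and it applies the rule of thumb stated above.

```python
def recall_and_contamination(matrix, labels):
    """Per-category recall and contamination (1-precision).

    matrix[i][j] = number of objects of true category i assigned to category j.
    """
    result = {}
    for k, label in enumerate(labels):
        true_total = sum(matrix[k])                        # row sum: objects truly in category k
        predicted_total = sum(row[k] for row in matrix)    # column sum: objects assigned to k
        correct = matrix[k][k]
        recall = correct / true_total if true_total else float("nan")
        contamination = (1.0 - correct / predicted_total
                         if predicted_total else float("nan"))
        result[label] = (recall, contamination)
    return result


# Hypothetical confusion matrix (rows: true, columns: automatic).
labels = ["macroinvertebrates", "other invertebrates", "debris"]
matrix = [
    [90, 2, 8],
    [5, 40, 5],
    [10, 3, 287],
]
metrics = recall_and_contamination(matrix, labels)

# Rule of thumb from the text: keep a category if recall > 70% and contamination < 20%.
kept = [name for name, (rec, cont) in metrics.items() if rec > 0.70 and cont < 0.20]
```

In this invented example, 90 of the 100 true macroinvertebrates are recovered (recall 0.90), and 15 of the 105 objects assigned to that category are contaminants (1-precision of about 0.14), so the category would be kept.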

Three types of non-taxonomic learning sets (raw, intermediate, and fine), with categories based on the morphological features of the objects, were tested. The raw learning set included three categories: macroinvertebrates, other invertebrates (microcrustaceans), and debris (fibers, particles, and artifacts like glass stains). The intermediate learning set included 16 categories: 5 for macroinvertebrates, 3 for other invertebrates, and 8 for debris. The fine learning set included 4 more categories of macroinvertebrates, with a total of 20 categories (Table 2).

In addition to defining the categories, the effect of the number of vignettes per category was also tested. Each learning set was tested separately using 50 vignettes, 100 vignettes, and 300 vignettes in each category (and 500 vignettes for the raw learning set with three categories). All the categories were balanced in number except for "Ostracoda", "long-round macroinvertebrates", and "round shell macroinvertebrates", which included fewer individuals in the 100 vignette and 300 vignette learning sets because not enough organisms of these categories were detected in the scanned images.

The recall and precision for macroinvertebrates (all the macroinvertebrate categories together) and organisms (the macroinvertebrate and other invertebrate categories together) were considered to select the best learning set by cross-validation (see the tables in Supplementary File 2). The best learning set was the raw learning set with three categories (macroinvertebrates, other invertebrates, and debris), with 300 objects in each category (Table 2). The raw learning set was subsequently used to validate the automatic classification of the objects in the subset of scanned samples.

Learning set	Number of categories	Vignettes per category	Recall (organisms)	Recall (macroinvertebrates)	1-precision (organisms)	1-precision (macroinvertebrates)
Raw	3	50	0.97	0.84	0.12	0.24
		100	0.96	0.87	0.06	0.17
		300	0.95	0.91	0.09	0.15
		500	0.93	0.88	0.13	0.20
Intermediate	16	50	0.83	0.77	0.17	0.24
		100	0.84	0.79	0.15	0.21
		300	0.87	0.84	0.14	0.18
Fine	20	50	0.89	0.86	0.14	0.18
		100	0.90	0.87	0.11	0.14
		300	0.90	0.86	0.13	0.14

Table 2: Learning sets created and tested (raw, intermediate, and fine), with the number of categories in each one, the number of objects per category, and the resulting recall and 1-precision. Categories of the raw learning set: Macroinvertebrates (1), Other invertebrates (2), Debris (3). Categories of the intermediate learning set: Long macroinvertebrates (1), Long smooth macroinvertebrates (2), Long spiky macroinvertebrates (3), Round macroinvertebrates (4), Round shell macroinvertebrates (5), Cladocera (6), Copepoda (7), Ostracoda (8), Aggregates (9), Fibres (10), Heads (11), Legs (12), Stains (13), Dark stains (14), Light grey stains (15), Round stains (16). Categories of the fine learning set: Long macroinvertebrates (1), Long smooth macroinvertebrates (2), Long smooth dark macroinvertebrates (3), Long-round macroinvertebrates (4), Long spiky macroinvertebrates (5), Round macroinvertebrates (6), Round shell macroinvertebrates (7), Round dark macroinvertebrates (8), Round shell macroinvertebrates (9), Cladocera (10), Copepoda (11), Ostracoda (12), Aggregates (13), Fibres (14), Heads (15), Legs (16), Stains (17), Dark stains (18), Light grey stains (19), Round stains (20).

Validation of automatic recognition with the best learning set
The objects in a subset of 42 fine and coarse subsamples were automatically classified by the selected learning set with the Random Forest algorithm. After manual validation, the recall for all the categories was high (on average, 0.94 for macroinvertebrates, 0.95 for other invertebrates, and 0.92 for debris), while the contamination (1-precision) was rather low, except for other invertebrates (0.25 for macroinvertebrates, 0.84 for other invertebrates, and 0.01 for debris) (Figure 5). Other invertebrates (microcrustaceans) were rare in the samples (present in 17 out of 42 subsamples); thus, the comparison was not robust. Moreover, this category was highly affected by contamination because of its similarity in shape and grey levels to other objects.

The automatic and validated macroinvertebrate abundances were highly correlated (Pearson's r = 0.92, p-value < 0.0001, n = 24 for coarse subsamples; Pearson's r = 0.98, p-value < 0.0001, n = 18 for fine subsamples), with a slight overestimation by the automatic classification due to contamination from debris (slopes < 1) (Figure 6). Regarding the comparison of the mean ellipsoidal volume, the correlation was also high (Pearson's r = 0.96, p-value < 0.0001, n = 24 for coarse samples; Pearson's r = 0.99, p-value < 0.0001, n = 18 for fine samples), and the slope of the relationship was close to 1 (Figure 6). The difference in the slopes between the fine and coarse fractions reflects the greater effect of misclassification in the large size fractions, which is related to their low organism counts.
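The agreement statistics reported here can be computed with a few lines of code. The sketch below is illustrative only: the abundance values are invented, and it assumes that the validated estimate is regressed on the automatic estimate, so that a slope below 1 indicates overestimation by the automatic classification.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical abundances per subsample: the automatic counts include 10% debris.
automatic = [120, 250, 400, 610]
validated = [108, 225, 360, 549]

r = pearson_r(automatic, validated)
slope = ols_slope(automatic, validated)   # assumed orientation: validated on automatic
```

A high correlation combined with a slope below 1, as in this invented example, is the pattern described for the abundance comparison: the two estimates track each other closely, but the automatic counts are consistently higher.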

The probability density functions of the individual size distributions of the automatic prediction strongly concurred with the validated predictions for the fine subsamples, as well as for the coarse subsamples. However, there were some exceptions for the coarse subsamples related to the low number of organisms and, thus, the greater effect of misclassification in those cases, as highlighted before (Figure 7).

Effect of touching organisms on the individual size distributions, size spectra, and size metrics
A comparison of the size distributions obtained before and after the separation of the touching organisms and before the validation in a subset of five selected samples was performed to assess the effect of touching objects. To compare the size distributions, the coarse and fine subsamples were combined, according to their fractionation, to reconstruct a sample representing the macroinvertebrate community. In three samples, the abundance after validation increased (>500 individuals) (Figure 8A). Despite this increase, the mean ellipsoidal volume fit very closely to the one calculated in the validated samples (Figure 8B).

The size distributions of the corrected samples (after the separation of touching organisms) differed slightly from the validated ones. Thus, the presence of multiple objects had a small influence on the size distributions in those samples (Figure 9A-E). Accordingly, the size diversity calculated based on the corrected samples correlated strongly with the size diversity of the validated ones (Pearson's r = 0.94, p-value = 0.017, n = 5) (Figure 9F).

Theoretically, the normalized biovolume size spectrum (NBSS) of a community with several trophic levels has a size spectrum slope in the log₂ scale approaching −1 in steady state conditions⁴. The NBSS in natural communities often has a bump rather than a linear distribution, and this is mostly attributed to the sampling bias of the smallest size classes³⁶. In the present study, the third size class was the most common in the NBSS.

The NBSSs were quite similar between the steps of the protocol (Figure 10A-C), except for a few size classes in a couple of spectra (Figure 10D-E). Accordingly, the size spectrum slope calculated based on the corrected samples correlated strongly with the slope based on the validated ones (Pearson's r = 0.99, p-value ≤ 0.0001, n = 5) (Figure 10F).

Figure 2: Examples of scanned images with different qualities before and after being processed. (A,B) Raw image (left) and processed image (right) of a fine subsample with good scan quality; (C,D) Raw image (left) and processed image (right) of a fine subsample with bad scan quality (dark background and cut image on the left edge); (E,F) raw image (left) and processed image (right) of a fine subsample with bad scan quality (very pixelated dark background).

Figure 3: Contour vignettes representing different objects present in the samples. (A-E) Debris (fiber, round stain, macroinvertebrate leg, stains, and organic debris); (F-I) macroinvertebrates (Coleoptera, Diptera, Plecoptera, and Trichoptera); and (J-L) other invertebrates (Cladocera, Copepoda, and Ostracoda). Scale bars indicate 1 mm; gamma = 1.1.

Figure 4: Examples of vignettes containing multiple objects. (A) A macroinvertebrate (Hydracarina) attached to a fiber; (B) multiple organisms (Caenidae) aggregated by a fiber; and (C) two touching macroinvertebrates (Chironomidae and Caenidae). Scale bars indicate 1 mm; gamma = 1.1.

Figure 5: Boxplots of recall and contamination (1-precision). The boxplots for the three categories of macroinvertebrates, other invertebrates, and debris (300 vignettes per category) of the selected learning set validated on a subset of samples (n = 42).

Figure 6: Comparison between the abundance and mean ellipsoidal volume estimates in automatic versus validated classification. (A) Abundance estimates in the subsamples (n = 42) and (B) mean ellipsoidal volume estimates in the subsamples (n = 42). The dark dots correspond to the coarse subsamples (>0.5 cm mesh); the grey dots correspond to the fine subsamples (>500 µm mesh). The dashed line represents the 1:1 relationship.

Figure 7: Probability density functions representing the relative contribution (y-axis) of the individual size in the log-scale (x-axis) for comparison between automatic estimates and between validated estimates. (A,B) Automatic and validated estimates for coarse subsamples (n = 18), (C,D) Automatic and validated estimates for fine subsamples (n = 24). (A,C) Comparison between automatic estimates and (B,D) comparison between validated estimates. Colors represent each subsample to help discern the spectra.

Figure 8: Comparison between the abundance and mean ellipsoidal volume estimates in validated subsamples versus subsamples validated after the separation of touching objects from selected natural samples (fine and coarse subsamples together). (A) Abundance estimates by sampling frame (n = 5) and (B) mean ellipsoidal volume estimates (n = 5). The dashed line represents the 1:1 relationship.

Figure 9: Probability density functions representing the relative contribution (y-axis) of the individual size in the log₂-scale (x-axis) for the automatic prediction, validated prediction, and corrected prediction with their respective size diversity values (Sd). (A-E) Probability density functions for selected natural samples (fine and coarse subsamples together) (n = 5); the red line corresponds to the automatic prediction, the blue line corresponds to the validated prediction, and the green line corresponds to the corrected samples (validated after the separation of touching objects). (F) Comparison of validated versus corrected size diversity estimates; the dashed line corresponds to the 1:1 relationship.

Figure 10: Normalized biovolume size spectra (NBSS) and comparison of NBSS slopes (µ) between treatments. (A-E) NBSS representing the relationship between the mid-point value of each size class in the log-scale (x-axis) versus the normalized biovolume per scanning frame (y-axis) of the five selected samples for the automatic (red crosses), validated (blue triangles), and corrected (green circles) predictions with their respective size spectrum slopes (µ) calculated in the size classes from the modal size class upward (the third size class is indicated by the vertical dashed line). (F) Comparison of the slopes calculated on the validated samples versus the corrected ones (after the separation of touching objects). The dashed line corresponds to the 1:1 relationship, r².

Supplementary File 1: Matlab script to perform the calculations.

Supplementary File 2: Cross-validation, recall, and 1-precision of the created learning sets. (A) Raw learning set with 3 categories and 50 vignettes per category; (B) raw learning set with 3 categories and 100 vignettes per category; (C) raw learning set with 3 categories and 300 vignettes per category; (D) raw learning set with 3 categories and 500 vignettes per category; (E) raw learning set with 5 categories and 50 vignettes per category; (F) raw learning set with 5 categories and 100 vignettes per category; (G) raw learning set with 5 categories and 300 vignettes per category; (H) intermediate learning set with 16 categories and 50 vignettes per category; (I) intermediate learning set with 16 categories and 100 vignettes per category; (J) intermediate learning set with 16 categories and 300 vignettes per category; (K) fine learning set with 20 categories and 50 vignettes per category; (L) fine learning set with 20 categories and 100 vignettes per category; and (M) fine learning set with 20 categories and 300 vignettes per category.


Discussion

The adaptation of the methodology described by Gorsky et al. 2010 for riverine macroinvertebrates allows for high classification accuracy in estimating the community size structure in freshwater macroinvertebrates. The results suggest that the protocol can reduce the time for estimating the individual size structure in a sample to about 1 hour. Thus, the proposed protocol is intended to promote the routine use of macroinvertebrate size spectra as a fast and integrative bioindicator to assess the impact of perturbations in freshwater ecosystems. The macroinvertebrate size spectrum has already been used as a successful index to evaluate the ecological status of coastal lagoons²². With the development of the protocol, intensive surveys on invertebrates can be carried out to enable field monitoring campaigns that cover large spatial and temporal scales.

As the aim of this protocol is to obtain the individual size distribution of the sampled community quickly, disregarding taxonomy, it is recommended to create a simple learning set like the one proposed here. Tests of finer learning sets, with a higher number of categories, give lower recall and precision for macroinvertebrates as a whole (Table 2), and the validation step is more time-consuming.

The automatic prediction strongly concurred with the validated prediction of 42 natural subsamples from different sampling sites, suggesting that the method in automatic mode is suitable for counting and measuring the macroinvertebrates in natural samples (Figure 6). Moreover, the similarity in the NBSSs between the automatic and validated predictions and the high fit to the linear theoretical model suggest that the automatic mode is a promising method for pursuing theoretical ecological studies (Figure 10).

During the adaptation of this protocol, several issues were encountered, and they were solved or minimized in different ways. An issue to take into consideration when scanning macroinvertebrate samples is the appearance of dark saturated areas. Thus, it is important to check the processed, scanned images as soon as possible to detect this problem and to repeat the scan if necessary. This problem has also been found when scanning plankton³³, but it is increased by the use of ethanol instead of tap water. It is not recommended to use tap water, as the organisms preserved in 70% ethanol will drift on the surface. Even though the device is designed to resist diluted ethanol (5%), the invertebrate samples are preserved with 70% ethanol. Operating with lower concentrations of ethanol is not recommended either, as the organisms could be damaged through rehydration and dehydration processes³⁷. The proposed solution, which is highly recommended, is to rinse the scan tray with fresh water several times after every scan performed with ethanol. This avoids the accumulation of precipitates that may alter the image background and protects the glass of the scan tray from corrosion.

Another detected issue is the presence of vignettes with multiple organisms, which can alter the size spectrum because of the underestimation of individuals of certain sizes. When the number of vignettes with multiple objects is low (<10%), as in this study, the presence of multiple objects has a small influence on the size distributions and NBSSs in those samples (Figure 9 and Figure 10). This indicates that, to obtain a representative size structure of the macroinvertebrate community, it is not necessary to invest time in step 1.5 of the protocol (the separation of touching organisms), for which the image reprocessing lasts about 1.5 h. Instead, it is highly recommended to take time in step 2.5 of the protocol (separating touching organisms or aggregates using a wooden needle), which is much less time-consuming (maximum 30 min) and ensures a proper estimate of the size distributions in automatic mode³⁰. An option to reduce the number of touching organisms is to work with fewer organisms per scan, but the time commitment invested in scanning one sample in a high number of fractions and the possibility of aggregation of organisms should be taken into consideration. Another solution would be to preserve only a subsample that would allow for calculating a representative size spectrum when sorting the organisms in the laboratory instead of preserving all the sampled organisms, as done in this work. The reduction in the number of organisms per sample would reduce the probability of touching organisms. Moreover, when fewer individuals are stored, the sample contains less debris, which facilitates the separation, especially if fibers can be avoided.

The observed limitation of the automatic classification method is related to the low presence of microcrustaceans (category: other invertebrates) in the samples used. The lack of representation of microcrustaceans can affect their correct classification and limit the precision of the automatic prediction for this category. Nevertheless, the other categories, debris and macroinvertebrates, which are the main objective of this work, present high recall and precision. Alternatives to using this scanner device would be to adapt a common scanner to hold water frames, promote open-source codes for sample processing and machine learning like the one provided here, and write codes for measuring organisms under the microscope with a camera or through flux with a set of cameras. This has been done on several occasions²³,²⁴,²⁵,²⁶,³⁸,³⁹,⁴⁰, but the method that we propose regulates the scanning parameterization in order to obtain comparable size estimates, which is difficult to control for with the other systems. Furthermore, the proposed protocol and scanning device are ready to use, open-source, and already established in the marine mesozooplankton community. Overall, the adaptation of this protocol demonstrates a promising avenue for using this automatic imaging method to obtain the size structure of freshwater macroinvertebrates efficiently and to test the potential of size metrics for freshwater bioassessment.


Disclosures

The authors declare no potential competing interests.

Acknowledgments

This work was supported by the Spanish Ministry of Science, Innovation and Universities (grant number RTI2018-095363-B-I00). We thank the CERM-UVic-UCC members Èlia Bretxa, Anna Costarrosa, Laia Jiménez, María Isabel González, Marta Jutglar, Francesc Llach, and Núria Sellarès for their work in macroinvertebrate field sampling and laboratory sorting and David Albesa for collaborating in the sample scanning. We finally thank Josep Maria Gili and the Institut de Ciències del Mar (ICM-CSIC) for the use of the laboratory facilities and scanner device.

Materials

Name	Company	Catalog Number	Comments
Beaker	Labbox		Other containers could be used
Deionized water	Icopresa	8420239600123	To dilute the ethanol
Funnel	Vitlab	41094	
Glass vials 8 mL	Labbox	SVSN-C10-195	1 vial/subsample
ImageJ Software	Free access		Version 1.41o/Image processing software
Large frame	Hydroptic	Provided by ZooScan	24.5 cm x 15.8 cm
Monalcol 96 (Ethanol 96)	Montplet	1050JE001	
Plankton Identifier Software	Free access		Version 1.2.6/Automatic identification software
Sieve	Cisa	26852.2	Nominal aperture 500 µm and nominal aperture 0.5 cm
Tweezers	Bondline	B5SA	Stainless, anti-magnetic, anti-acid
VueScan 9 x 64 (9.5.09) Software	Hydroptic		Version 9.0.51/Scan software
Wooden needle			Any plastic or wood needle can be used
Zooprocess Software	Free access		Version 7.14/Image processing software
ZooScan	Hydroptic	54	Version III/Scanner