WHOI-Plankton Data Integration • iRfcb

Introduction

The WHOI-Plankton dataset (Sosik et al. 2015) contains millions of images of microscopic marine plankton, captured by the Imaging FlowCytobot (IFCB) and manually classified into more than 100 categories by researchers at the Woods Hole Oceanographic Institution (WHOI). The dataset is part of a larger collection of over 700 million images gathered since 2006 at the Martha’s Vineyard Coastal Observatory (MVCO), with real-time and archived data accessible via the IFCB Data Dashboard. Annotated images are stored in annual zip files, organized into class-specific subdirectories. Example images for each class are available on the WHOI-Plankton GitHub page.

iRfcb provides functions to interact with the WHOI-Plankton data and the IFCB Dashboard, making it easier to integrate annotated images into custom training datasets. It supports incorporation of these images using MATLAB code from the ifcb-analysis repository (Sosik and Olson 2007), as demonstrated in this tutorial.

Getting Started

Installation

You can install the package from CRAN using:

install.packages("iRfcb")

Some functions from the iRfcb package used in this tutorial require Python to be installed. You can download Python from the official website: python.org/downloads.

The iRfcb package can be configured to automatically activate an installed Python virtual environment (venv) upon loading by setting an environment variable. For more details, please refer to the package README.

Load the iRfcb library:

library(iRfcb)

Download Sample Data

To get started, download sample data from the SMHI IFCB Plankton Image Reference Library (Torstensson et al. 2024). This dataset will serve as the primary dataset in this example, which we will expand by incorporating training images from the WHOI-Plankton dataset.

# Define data directory
data_dir <- "data"

# Download and extract test data in the data folder
ifcb_download_test_data(
  dest_dir = data_dir,
  max_retries = 10,
  sleep_time = 30,
  verbose = FALSE
)

Download WHOI-Plankton Data

This section demonstrates how to download the WHOI-Plankton dataset using iRfcb.

Define Download Paths

First, we define the download directories:

# Define paths to download destinations
png_folder_whoi <- file.path(
  data_dir, 
  "whoi_plankton", 
  "png_images", 
  "extracted_images"
)
raw_folder_whoi <- file.path(
  data_dir, 
  "whoi_plankton", 
  "data"
)
manual_folder_whoi <- file.path(
  data_dir, 
  "whoi_plankton", 
  "manual"
)
class2use_file_whoi <- file.path(
  data_dir,
  "whoi_plankton", 
  "config", 
  "class2use_whoi.mat"
)
blobs_folder_whoi <- file.path(
  data_dir, 
  "whoi_plankton", 
  "blobs"
)

Download and Prepare WHOI-Plankton Data

help("ifcb_prepare_whoi_plankton")

The ifcb_prepare_whoi_plankton() function enables downloading zipped .png images from selected years of the WHOI-Plankton dataset, with an option to extract them using the extract_images argument. It also retrieves raw data (.roi, .hdr and .adc files), along with blobs and features (when available), from the MVCO IFCB Dashboard. Interactions with WHOI-Plankton are managed through the ifcb_download_whoi_plankton() function, while the IFCB Dashboard is accessed via the ifcb_download_dashboard_data() function. Finally, the function generates manual .mat files for storing class information, ensuring compatibility with the MATLAB code in the ifcb-analysis repository (Sosik and Olson 2007).

In this example, we use data from the years 2013 and 2014.

# Path to the Python virtual environment (or your preferred venv path)
env_path <- "~/.virtualenvs/iRfcb"

# Set up the Python environment if it is not already initialized
ifcb_py_install(envname = env_path)

# Download and prepare the WHOI-Plankton dataset
ifcb_prepare_whoi_plankton(
  years = 2013:2014,
  png_folder = png_folder_whoi, 
  raw_folder = raw_folder_whoi, 
  manual_folder = manual_folder_whoi, 
  class2use_file = class2use_file_whoi, 
  extract_images = FALSE,
  skip_classes = NULL,
  download_blobs = TRUE, # Optionally download blobs
  blobs_folder = blobs_folder_whoi,
  download_features = FALSE,
  quiet = TRUE
)

To exclude certain images from the training dataset, either exclude an entire class with the skip_classes argument, or set extract_images = TRUE, manually delete the unwanted .png files from png_folder, and then rerun ifcb_prepare_whoi_plankton().
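As a rough sketch of the manual deletion step, the base R code below removes selected .png files from one class subfolder. It builds a throwaway directory tree so that it runs on its own. The class name example_class, the file names, and the helper remove_class_images() are hypothetical and not part of iRfcb; in practice the helper would be pointed at the extracted images in png_folder_whoi.

```r
# Hypothetical helper (not part of iRfcb): delete .png files whose names
# match `pattern` inside one class subfolder, and return the removed names.
remove_class_images <- function(png_folder, class_name, pattern) {
  class_dir <- file.path(png_folder, class_name)
  targets <- list.files(class_dir, pattern = pattern, full.names = TRUE)
  targets <- targets[grepl("\\.png$", targets)]  # only touch .png files
  removed <- file.remove(targets)
  basename(targets[removed])
}

# Build a small demo folder with made-up file names
demo_folder <- file.path(tempdir(), "png_demo")
unlink(demo_folder, recursive = TRUE)
dir.create(file.path(demo_folder, "example_class"), recursive = TRUE)
demo_files <- file.path(
  demo_folder,
  "example_class",
  c("D20130101T000000_IFCB010_00001.png",
    "D20130101T000000_IFCB010_00002.png")
)
invisible(file.create(demo_files))

# Remove one image and list what is left in the class subfolder
removed <- remove_class_images(demo_folder, "example_class", "_00002")
remaining <- list.files(file.path(demo_folder, "example_class"))
```

After deleting files this way, rerunning the preparation step lets the manual .mat files reflect the reduced image set.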

The ifcb_prepare_whoi_plankton() function converts the filename format of older IFCB models (IFCB1-6) to match the newer IFCB format (DYYYYMMDDTHHMMSS_IFCBXXX). This conversion is performed by default and is controlled by the convert_filenames and convert_adc arguments. The step ensures compatibility with the ifcb-analysis MATLAB code. For further details, please refer to the help pages of the respective functions.
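The renaming idea can be illustrated with a short base R sketch. It assumes that older sample names follow the pattern IFCB<n>_YYYY_DDD_HHMMSS, where DDD is the day of year; this pattern is an assumption and is not stated in this tutorial. The helper convert_old_ifcb_name() is hypothetical and is not the package's implementation.

```r
# Hypothetical helper (not part of iRfcb): convert an assumed old-style
# sample name (IFCB<n>_YYYY_DDD_HHMMSS) to DYYYYMMDDTHHMMSS_IFCBXXX.
convert_old_ifcb_name <- function(x) {
  pattern <- "^IFCB([0-9]+)_([0-9]{4})_([0-9]{3})_([0-9]{6})$"
  parts <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(parts) == 0) {
    stop("Name does not match the assumed old IFCB format: ", x)
  }
  # parts: full match, instrument number, year, day of year, time
  date <- as.Date(paste(parts[3], parts[4]), format = "%Y %j")
  sprintf(
    "D%sT%s_IFCB%03d",
    format(date, "%Y%m%d"),
    parts[5],
    as.integer(parts[2])
  )
}

# Day 158 of 2006 is 7 June 2006
new_name <- convert_old_ifcb_name("IFCB1_2006_158_000036")
```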

Merge Data with Existing Dataset

The ifcb_merge_manual() function allows you to integrate data from the WHOI-Plankton dataset with your existing training dataset created using the ifcb-analysis code (Sosik and Olson 2007). The merged dataset and class2use file will be saved in the locations specified by manual_folder_output and class2use_file_output.

In this example, we merge the WHOI-Plankton dataset with the iRfcb sample dataset from the SMHI IFCB Plankton Image Reference Library (Torstensson et al. 2024), downloaded earlier in this tutorial. Please note that class names in the class2use files may need to be standardized if you intend to merge images into the same class (e.g. “Coscinodiscus” and “Coscinodiscus_spp”).
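Before merging, it can help to compare the two class lists and decide which names should be treated as the same class. The sketch below uses plain character vectors with illustrative class names (they are not read from the actual class2use files), and the helper standardize_classes() is hypothetical and not part of iRfcb.

```r
# Illustrative class lists; in practice these would come from the
# class2use files of the base and additional datasets.
base_classes <- c("unclassified", "Coscinodiscus_spp", "Dinophysis_acuminata")
whoi_classes <- c("Coscinodiscus", "Ciliate_mix", "Dinophysis_acuminata")

# Classes shared by both lists, and classes that only occur in the additions
shared_classes <- intersect(base_classes, whoi_classes)
only_in_whoi <- setdiff(whoi_classes, base_classes)

# Hypothetical helper: rename classes according to a named lookup vector
# (names = old class names, values = standardized class names).
standardize_classes <- function(classes, rename_map) {
  hit <- classes %in% names(rename_map)
  classes[hit] <- unname(rename_map[classes[hit]])
  classes
}

rename_map <- c(Coscinodiscus = "Coscinodiscus_spp")
whoi_standardized <- standardize_classes(whoi_classes, rename_map)
```

This only shows the name mapping itself. Writing standardized names back to a class2use .mat file is not covered here.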

# Define paths to existing manual dataset
class2use_file_smhi <- file.path(
  data_dir, 
  "config", 
  "class2use.mat"
)
manual_folder_smhi <- file.path(
  data_dir, 
  "manual"
)

# Define paths to final merged dataset
class2use_file_merged <- file.path(
  data_dir, 
  "merged_data", 
  "config", 
  "class2use.mat"
)
manual_folder_merged <- file.path(
  data_dir, 
  "merged_data", 
  "manual"
)

# Merge WHOI-Plankton with existing manual dataset
ifcb_merge_manual(
  class2use_file_base = class2use_file_smhi,
  class2use_file_additions = class2use_file_whoi,
  class2use_file_output = class2use_file_merged,
  manual_folder_base = manual_folder_smhi,
  manual_folder_additions = manual_folder_whoi,
  manual_folder_output = manual_folder_merged,
  quiet = TRUE
)
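A quick sanity check after merging is to confirm that the manual files from each source folder also appear in the output folder. The base R sketch below assumes that merged files keep their original file names, which is not confirmed by this tutorial. The helper missing_from_merged() is hypothetical, and the demo uses temporary folders with made-up file names rather than real data.

```r
# Hypothetical helper (not part of iRfcb): list .mat files that exist in
# a source folder but are absent from the merged folder.
missing_from_merged <- function(source_folder, merged_folder) {
  source_files <- list.files(source_folder, pattern = "\\.mat$")
  merged_files <- list.files(merged_folder, pattern = "\\.mat$")
  setdiff(source_files, merged_files)
}

# Demo with temporary folders and made-up file names
demo_root <- file.path(tempdir(), "merge_check_demo")
unlink(demo_root, recursive = TRUE)
demo_source <- file.path(demo_root, "manual")
demo_merged <- file.path(demo_root, "merged")
dir.create(demo_source, recursive = TRUE)
dir.create(demo_merged, recursive = TRUE)
invisible(file.create(file.path(demo_source, c("sample_a.mat", "sample_b.mat"))))
invisible(file.create(file.path(demo_merged, "sample_a.mat")))

# Files present in the source folder but not in the merged folder
not_merged <- missing_from_merged(demo_source, demo_merged)
```

With real data, the same check could be run once with manual_folder_smhi and once with manual_folder_whoi against manual_folder_merged.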

The newly merged dataset is now ready for use in training and testing machine learning models. To use it with the ifcb-analysis MATLAB code, blobs and features must first be extracted for the merged dataset (unless they were already downloaded from the IFCB Dashboard). Random forest models can then be built using the ifcb-analysis code.

This concludes this tutorial for the iRfcb package. For more detailed information, refer to the package documentation or the other tutorials. See how data pipelines can be constructed using iRfcb in the following Example Project. Happy analyzing!

Citation

## To cite package 'iRfcb' in publications use:
## 
##   Anders Torstensson (2025). iRfcb: Tools for Managing Imaging
##   FlowCytobot (IFCB) Data. R package version 0.5.1.
##   https://CRAN.R-project.org/package=iRfcb
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {iRfcb: Tools for Managing Imaging FlowCytobot (IFCB) Data},
##     author = {Anders Torstensson},
##     year = {2025},
##     note = {R package version 0.5.1},
##     url = {https://CRAN.R-project.org/package=iRfcb},
##   }

References

Sosik, H. M. and Olson, R. J. (2007). Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry. Limnol. Oceanogr.: Methods 5, 204–216.
Sosik, H. M., Peacock, E. E. and Brownlee, E. F. (2015). Annotated Plankton Images - Data Set for Developing and Evaluating Classification Methods. https://doi.org/10.1575/1912/7341
Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3