
WHOI-Plankton Data Integration
Source:vignettes/whoi-plankton-data-integration.Rmd
whoi-plankton-data-integration.RmdIntroduction
The WHOI-Plankton dataset (Sosik et al. 2015) contains millions of microscopic marine plankton images captured by the IFCB, and manually classified into > 100 categories by researchers at the Woods Hole Oceanographic Institution (WHOI). The dataset is part of a larger collection of over 700 million images gathered since 2006 at the Martha’s Vineyard Coastal Observatory (MVCO), with real-time and archived data accessible via the IFCB Data Dashboard. Annotated images are stored in annual zip files, organized into class-specific subdirectories. Example images for each class are available on the WHOI-Plankton GitHub page.
iRfcb provides functions to interact with the
WHOI-Plankton data and the IFCB Dashboard, making it easier to integrate
annotated images into custom training datasets. It supports
incorporation of these images using MATLAB code from the ifcb-analysis
repository (Sosik and Olson 2007), as demonstrated in this tutorial.
Getting Started
Installation
You can install the package from CRAN using:
install.packages("iRfcb")Some functions from the iRfcb package used in this
tutorial require Python to be installed. You can download
Python from the official website: python.org/downloads.
The iRfcb package can be configured to automatically
activate an installed Python virtual environment (venv) upon loading by
setting an environment variable. For more details, please refer to the
package README.
Load the iRfcb library:
Download Sample Data
To get started, download sample data from the SMHI IFCB Plankton Image Reference Library (Torstensson et al. 2024). This dataset will serve as the primary dataset in this example, which we will expand by incorporating training images from the WHOI-Plankton dataset.
# Define data directory
data_dir <- "data"
# Download and extract test data in the data folder
ifcb_download_test_data(
dest_dir = data_dir,
max_retries = 10,
sleep_time = 30,
verbose = FALSE
)Download WHOI-Plankton Data
This section demonstrates how to download the WHOI-Plankton dataset
iRfcb.
Define Download Paths
First we define the download directories:
# Define paths to download destinations
png_folder_whoi <- file.path(
data_dir,
"whoi_plankton",
"png_images",
"extracted_images"
)
raw_folder_whoi <- file.path(
data_dir,
"whoi_plankton",
"data"
)
manual_folder_whoi <- file.path(
data_dir,
"whoi_plankton",
"manual"
)
class2use_file_whoi <- file.path(
data_dir,
"whoi_plankton",
"config",
"class2use_whoi.mat"
)
blobs_folder_whoi <- file.path(
data_dir,
"whoi_plankton",
"blobs"
)Download and Prepare WHOI-Plankton Data
help("ifcb_prepare_whoi_plankton")
The ifcb_prepare_whoi_plankton() function enables
downloading zipped .png images from selected years of the
WHOI-Plankton dataset,
with an option to extract them using the extract_images
argument. It also retrieves raw data (.roi,
.hdr and .adc files), along with blobs and
features (when available), from the MVCO IFCB Dashboard.
Interactions with WHOI-Plankton are managed through the
ifcb_download_whoi_plankton() function, while the IFCB
Dashboard is accessed via the
ifcb_download_dashboard_data() function. You can also
interact with the IFCB Dashboard using the functions
ifcb_download_dashboard_metadata() and
ifcb_list_dashboard_bins(). Finally, the function generates
manual .mat files for storing class information, ensuring
compatibility with the MATLAB code in the ifcb-analysis
repository (Sosik and Olson 2007).
In this example, we use data from the years 2013 and 2014.
# Initialize the python session if not already set up
env_path <- "~/.virtualenvs/iRfcb" # Or your preferred venv path
# Initialize python environment
ifcb_py_install(envname = env_path)
# Download and prepare the WHOI-Plankton dataset
ifcb_prepare_whoi_plankton(
years = 2013:2014,
png_folder = png_folder_whoi,
raw_folder = raw_folder_whoi,
manual_folder = manual_folder_whoi,
class2use_file = class2use_file_whoi,
extract_images = FALSE,
skip_classes = NULL,
download_blobs = TRUE, # Optionally download blobs
blobs_folder = blobs_folder_whoi,
download_features = FALSE,
quiet = TRUE
)To exclude certain images from the training dataset, either exclude
the class completely with the skip_classes argument, or set
extract_images = TRUE and manually delete specific
.png files from the png_folder and rerun
ifcb_prepare_whoi_plankton.
The function ifcb_prepare_whoi_plankton() converts the
filename format of older IFCB models (IFCB1-6) to match the newer IFCB
format (DYYYYMMDDTHHMMSS_IFCBXXX). By default, this
conversion is performed using the convert_filenames and
convert_adc arguments. This step ensures compatibility with
the ifcb-analysis
MATLAB code. For further details, please refer to the help pages of the
respective functions.
Merge Data with Existing Dataset
The ifcb_merge_manual() function allows you to integrate
data from the WHOI-Plankton dataset with your existing training dataset
created using the ifcb-analysis
code (Sosik and Olson 2007). The merged dataset and class2use file will
be saved in the locations specified by manual_folder_output
and class2use_file_output.
In this example, we merge the WHOI-Plankton dataset with the
iRfcb sample dataset from the SMHI IFCB Plankton
Image Reference Library (Torstensson et al. 2024), downloaded
earlier in this tutorial. Please note that class names in the
class2use files may need to be standardized if you intend
to merge images into the same class (e.g. “Coscinodiscus” and
“Coscinodiscus_spp”).
# Define paths to existing manual dataset
class2use_file_smhi <- file.path(
data_dir,
"config",
"class2use.mat"
)
manual_folder_smhi <- file.path(
data_dir,
"manual"
)
# Define paths to final merged dataset
class2use_file_merged <- file.path(
data_dir,
"merged_data",
"config",
"class2use.mat"
)
manual_folder_merged <- file.path(
data_dir,
"merged_data",
"manual"
)
# Merge WHOI-Plankton with existing manual dataset
ifcb_merge_manual(
class2use_file_base = class2use_file_smhi,
class2use_file_additions = class2use_file_whoi,
class2use_file_output = class2use_file_merged,
manual_folder_base = manual_folder_smhi,
manual_folder_additions = manual_folder_whoi,
manual_folder_output = manual_folder_merged,
quiet = TRUE
)The newly merged dataset is now ready for use in training and testing
machine learning models. To integrate it with the ifcb-analysis
MATLAB code, blobs and features are extracted for the merged dataset (if
not already downloaded through the IFCB Dashboard). Random forest models
can then be built using the ifcb-analysis
code.
This concludes this tutorial for the iRfcb package. For
more detailed information, refer to the package documentation or the
other tutorials.
See how data pipelines can be constructed using iRfcb in
the following Example
Project. Happy analyzing!
Citation
## To cite package 'iRfcb' in publications use:
##
## Anders Torstensson (2025). iRfcb: Tools for Managing Imaging
## FlowCytobot (IFCB) Data. R package version 0.6.0.
## https://CRAN.R-project.org/package=iRfcb
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {iRfcb: Tools for Managing Imaging FlowCytobot (IFCB) Data},
## author = {Anders Torstensson},
## year = {2025},
## note = {R package version 0.6.0},
## url = {https://CRAN.R-project.org/package=iRfcb},
## }
References
- Sosik, H. M. and Olson, R. J. (2007) Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry. Limnol. Oceanogr: Methods 5, 204–216.
- Sosik, H. M., Peacock, E. E. and Brownlee E. F. (2015), Annotated Plankton Images - Data Set for Developing and Evaluating Classification Methods. https://doi.org/10.1575/1912/7341
- Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3