Skip to contents

Introduction

This vignette provides an overview of quality control (QC) methods for Imaging FlowCytobot (IFCB) data using the iRfcb package. The package offers tools to analyze Particle Size Distribution (PSD) follwing Hayashi et al. in prep, verify geographical positions, and integrate contextual data from sources like ferrybox systems. These QC workflows ensure high-quality datasets for phytoplankton and microzooplankton monitoring in marine ecosystems.

You’ll learn how to:

  • Set up the iRfcb package and Python environment.
  • Analyze particle size distributions for data quality.
  • Check spatial metadata for proximity to land, basin classification, and missing positions.

Follow this tutorial to streamline the QC process and ensure reliable IFCB data.

Getting Started

Installation

You can install the package from GitHub using the devtools package:

# install.packages("devtools")
devtools::install_github("EuropeanIFCBGroup/iRfcb",
                         dependencies = TRUE)

Some functions from the iRfcb package used in this tutorial require Python to be installed. You can download Python from the official website: python.org/downloads.

Load the iRfcb and ggplot2 libraries:

Download Sample Data

To get started, download sample data from the SMHI IFCB Plankton Image Reference Library (Torstensson et al. 2024) with the following function:

# Define data directory
data_dir <- "data"

# Download and extract test data in the data folder
ifcb_download_test_data(dest_dir = data_dir,
                        max_retries = 10,
                        sleep_time = 30,
                        verbose = FALSE)

Particle Size Distribution

IFCB data can be quality controlled by analyzing the particle size distribution (PSD) (Hayashi et al. in prep). iRfcb uses the code available at https://github.com/kudelalab/PSD, which is efficient in detecting samples with bubbles, beads, incomplete runs etc. Before running the PSD quality check, ensure the necessary Python environment is set up and activated:

# Define path to virtual environment
env_path <- "~/.virtualenvs/iRfcb" # Or your preferred venv path

# Install python virtual environment
ifcb_py_install(envname = env_path)

# Run PSD quality control
psd <- ifcb_psd(feature_folder = "data/features/2023",
                hdr_folder = "data/data/2023",
                save_data = FALSE,
                output_file = NULL,
                plot_folder = NULL,
                use_marker = FALSE,
                start_fit = 10,
                r_sqr = 0.5,
                beads = 10 ** 12,
                bubbles = 150,
                incomplete = c(1500, 3),
                missing_cells = 0.7,
                biomass = 1000,
                bloom = 5,
                humidity = 70)

The results can be printed and visualized through plots:

# Print output from PSD
head(psd$fits)
## # A tibble: 5 × 8
##   sample            a     k   R.2 max_ESD_diff capture_percent bead_run humidity
##   <chr>         <dbl> <dbl> <dbl>        <int>           <dbl> <lgl>       <dbl>
## 1 D20230314T… 5.90e 5 -1.88 0.713            3           0.955 FALSE        16.0
## 2 D20230314T… 2.51e 5 -1.60 0.702            3           0.944 FALSE        16.0
## 3 D20230810T… 3.36e 7 -2.73 0.955            4           0.919 FALSE        65.4
## 4 D20230915T… 1.32e10 -5.54 0.989            2           0.967 FALSE        71.5
## 5 D20230915T… 4.39e10 -6.03 0.981            3           0.961 FALSE        71.5
head(psd$flags)
## # A tibble: 2 × 2
##   sample           flag         
##   <chr>            <chr>        
## 1 D20230915T091133 High Humidity
## 2 D20230915T093804 High Humidity
# Plot PSD of the first sample
plot <- ifcb_psd_plot(sample_name = psd$data$sample[1],
                      data = psd$data,
                      fits = psd$fits,
                      start_fit = 10) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA), 
    plot.background = element_rect(fill = "white", color = NA)
  )

# Print the plot
print(plot)

Geographical QC/QA

Check If IFCB Is Near Land

To determine if the IFCB is near land (i.e. ship in harbor), examine the position data in the .hdr files (or from vectors of latitudes and longitudes):

# Read HDR data and extract GPS position (when available)
gps_data <- ifcb_read_hdr_data("data/data/",
                               gps_only = TRUE,
                               verbose = FALSE) # Do not print progress bar

# Create new column with the results
gps_data$near_land <- ifcb_is_near_land(gps_data$gpsLatitude,
                                        gps_data$gpsLongitude,
                                        distance = 100, # 100 meters from shore
                                        shape = NULL) # Using the default NE 1:50m Land Polygon

# Print output
head(gps_data)
##                     sample gpsLatitude gpsLongitude           timestamp
## 1 D20220522T000439_IFCB134          NA           NA 2022-05-22 00:04:39
## 2 D20220522T003051_IFCB134          NA           NA 2022-05-22 00:30:51
## 3 D20220712T210855_IFCB134          NA           NA 2022-07-12 21:08:55
## 4 D20220712T222710_IFCB134          NA           NA 2022-07-12 22:27:10
## 5 D20230314T001205_IFCB134    56.66883     12.11303 2023-03-14 00:12:05
## 6 D20230314T003836_IFCB134    56.66884     12.11302 2023-03-14 00:38:36
##         date year month day     time ifcb_number near_land
## 1 2022-05-22 2022     5  22 00:04:39     IFCB134        NA
## 2 2022-05-22 2022     5  22 00:30:51     IFCB134        NA
## 3 2022-07-12 2022     7  12 21:08:55     IFCB134        NA
## 4 2022-07-12 2022     7  12 22:27:10     IFCB134        NA
## 5 2023-03-14 2023     3  14 00:12:05     IFCB134     FALSE
## 6 2023-03-14 2023     3  14 00:38:36     IFCB134     FALSE

For more accurate determination, a detailed coastline .shp file may be required (e.g. the EEA Coastline Polygon). Refer to the help pages of ifcb_is_near_land for further information.

Check Which Sub-Basin an IFCB Sample Is From

To identify the specific sub-basin of the Baltic Sea (or using a custom shape-file) from which an IFCB sample was collected, analyze the position data:

# Define example latitude and longitude vectors
latitudes <- c(55.337, 54.729, 56.311, 57.975)
longitudes <- c(12.674, 14.643, 12.237, 10.637)

# Check in which Baltic sea basin the points are in
points_in_the_baltic <- ifcb_which_basin(latitudes, 
                                         longitudes, 
                                         shape_file = NULL)
# Print output
print(points_in_the_baltic)
## [1] "13 - Arkona Basin"   "12 - Bornholm Basin" "16 - Kattegat"      
## [4] "17 - Skagerrak"
# Plot the points and the basins
ifcb_which_basin(latitudes, 
                 longitudes, 
                 plot = TRUE, 
                 shape_file = NULL) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA), 
    plot.background = element_rect(fill = "white", color = NA)
  )

This function reads a pre-packaged shapefile of the Baltic Sea, Kattegat, and Skagerrak basins from the iRfcb package by default, or a user-supplied shapefile if provided. The shapefiles provided in iRfcb originate from SHARK.

Check If Positions Are Within the Baltic Sea or Elsewhere

This check is useful if only you want to apply a classifier specifically to phytoplankton from the Baltic Sea.

# Define example latitude and longitude vectors
latitudes <- c(55.337, 54.729, 56.311, 57.975)
longitudes <- c(12.674, 14.643, 12.237, 10.637)

# Check if the points are in the Baltic Sea Basin
points_in_the_baltic <- ifcb_is_in_basin(latitudes, longitudes)

# Print results
print(points_in_the_baltic)
## [1]  TRUE  TRUE FALSE FALSE
# Plot the points and the basin
ifcb_is_in_basin(latitudes, longitudes, plot = TRUE) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA), 
    plot.background = element_rect(fill = "white", color = NA)
  )

This function reads a land-buffered shapefile of the Baltic Sea Basin from the iRfcb package by default, or a user-supplied shapefile if provided.

Find Missing Positions from RV Svea Ferrybox

This function is used by SMHI to collect and match stored ferrybox positions when they are not available in the .hdr files. An example ferrybox data file is provided in iRfcb with data matching sample D20220522T000439_IFCB134.

# Print available coordinates from .hdr files
head(gps_data, 4)
##                     sample gpsLatitude gpsLongitude           timestamp
## 1 D20220522T000439_IFCB134          NA           NA 2022-05-22 00:04:39
## 2 D20220522T003051_IFCB134          NA           NA 2022-05-22 00:30:51
## 3 D20220712T210855_IFCB134          NA           NA 2022-07-12 21:08:55
## 4 D20220712T222710_IFCB134          NA           NA 2022-07-12 22:27:10
##         date year month day     time ifcb_number near_land
## 1 2022-05-22 2022     5  22 00:04:39     IFCB134        NA
## 2 2022-05-22 2022     5  22 00:30:51     IFCB134        NA
## 3 2022-07-12 2022     7  12 21:08:55     IFCB134        NA
## 4 2022-07-12 2022     7  12 22:27:10     IFCB134        NA
# Define path where ferrybox data are located
ferrybox_folder <- "data/ferrybox_data"

# Get GPS position from ferrybox data
positions <- ifcb_get_ferrybox_data(gps_data$timestamp, 
                                    ferrybox_folder)

# Print result
head(positions)
## # A tibble: 6 × 3
##   timestamp           gpsLatitude gpsLongitude
##   <dttm>                    <dbl>        <dbl>
## 1 2022-05-22 00:04:39        55.0         13.6
## 2 2022-05-22 00:30:51        NA           NA  
## 3 2022-07-12 21:08:55        NA           NA  
## 4 2022-07-12 22:27:10        NA           NA  
## 5 2023-03-14 00:12:05        NA           NA  
## 6 2023-03-14 00:38:36        NA           NA

Find Contextual Ferrybox Data from RV Svea

The ifcb_get_ferrybox_data function can also be used to extract additional ferrybox parameters, such as temperature (parameter number 8180) and salinity (parameter number 8181).

# Get salinity and temperature from ferrybox data
ferrybox_data <- ifcb_get_ferrybox_data(gps_data$timestamp, 
                                        ferrybox_folder, 
                                        parameters = c("8180", "8181"))

# Print result
head(ferrybox_data)
## # A tibble: 6 × 3
##   timestamp           `8180` `8181`
##   <dttm>               <dbl>  <dbl>
## 1 2022-05-22 00:04:39   11.4   7.86
## 2 2022-05-22 00:30:51   NA    NA   
## 3 2022-07-12 21:08:55   NA    NA   
## 4 2022-07-12 22:27:10   NA    NA   
## 5 2023-03-14 00:12:05   NA    NA   
## 6 2023-03-14 00:38:36   NA    NA

This concludes this tutorial for the iRfcb package. For more detailed information, refer to the package documentation or the other tutorials. See how data pipelines can be constructed using iRfcb in the following Example Project. Happy analyzing!

Citation

## To cite package 'iRfcb' in publications use:
## 
##   Anders Torstensson (2025). I 'R' FlowCytobot (iRfcb): Tools for
##   Analyzing and Processing Data from the IFCB. R package version 0.4.0.
##   https://doi.org/10.5281/zenodo.12533225
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {I 'R' FlowCytobot (iRfcb): Tools for Analyzing and Processing Data from the IFCB},
##     author = {Anders Torstensson},
##     year = {2025},
##     note = {R package version 0.4.0},
##     url = {https://doi.org/10.5281/zenodo.12533225},
##   }

References

  • Hayashi, K., Walton, J., Lie, A., Smith, J. and Kudela M. Using particle size distribution (PSD) to automate imaging flow cytobot (IFCB) data quality in coastal California, USA. In prep.
  • Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3