Introduction
This tutorial demonstrates how to create a Darwin Core Archive (DwC-A) from Imaging FlowCytobot (IFCB) results processed using MATLAB code from the ifcb-analysis repository (Sosik and Olson 2007). However, the code can be adapted to process classifications from other machine learning algorithms as well. The example below is based on a subset of manually annotated image data from the SMHI IFCB Plankton Image Reference Library (version 3) (Torstensson et al. 2024), and aligns with the best practices outlined by Martin-Cabrera et al. (2022).
The DwC-A is a widely accepted standard for sharing biodiversity data. It organizes data into structured tables, such as sampling events, occurrences, and measurements or facts (MoF), which can be linked through unique identifiers. This standardized format facilitates data sharing, integration, and reuse across platforms, enabling interoperability with global biodiversity databases like the Global Biodiversity Information Facility (GBIF), the Ocean Biodiversity Information System (OBIS) and the European Marine Observation and Data Network Biology (EMODnet Biology).
By using the iRfcb package in combination with the LivingNorwayR package, this tutorial guides you through creating a sampling event-based DwC-A. The archive includes occurrence and MoF tables, ensuring the IFCB results meet the requirements of major biodiversity repositories.
With DwC-A, your data can become part of a global ecosystem of interoperable datasets, contributing to biodiversity research and monitoring on an international scale. Standardized datasets like these enable diverse applications, such as the development of digital twins—virtual models of ecosystems used to simulate and predict environmental changes. This tutorial provides a reproducible workflow to help you prepare your IFCB data for submission to these large databases while adhering to international data standards, broadening its potential for innovative uses in biodiversity science.
Getting Started
Installation
You can install the iRfcb and LivingNorwayR packages from GitHub using the devtools package:
# install.packages("devtools")
devtools::install_github(c("EuropeanIFCBGroup/iRfcb",
                           "LivingNorway/LivingNorwayR"),
                         dependencies = TRUE)
Load the required libraries:
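The set below covers the functions used in the remainder of this tutorial (the uuid package is called via its namespace later on and does not need to be attached):

library(iRfcb)          # IFCB-specific helpers (ifcb_* functions)
library(LivingNorwayR)  # initializeGBIF*() and DwC-A functions
library(dplyr)          # Data wrangling (mutate, group_by, joins, ...)
library(tidyr)          # pivot_longer() and drop_na()
library(tibble)         # Printing the final tables as tibbles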
Download Sample Data
To get started, download sample data from the SMHI IFCB Plankton Image Reference Library (Torstensson et al. 2024) with the following function:
# Define data directory
data_dir <- "data"

# Download and extract test data in the data folder
ifcb_download_test_data(dest_dir = data_dir,
                        max_retries = 10,
                        sleep_time = 30,
                        verbose = FALSE)
Extract Data
Extract Positions and Timestamps
In this example, most of the coordinates are stored within the .hdr files and are extracted along with the corresponding timestamps in the following step:
# Read HDR data and extract GPS position (when available) and timestamps
gps_data <- ifcb_read_hdr_data("data/data/",
                               gps_only = TRUE,
                               verbose = FALSE)
Summarize Counts, Biovolumes and Carbon Content from Manually Annotated IFCB Data
You can also apply this process to automatically classified data by setting the mat_folder parameter to point to your class folder, and setting class2use_file to NULL.
# Summarize biovolume data using IFCB data from manual data folder
manual_biovolume_data <- ifcb_summarize_biovolumes(feature_folder = "data/features",
                                                   mat_folder = "data/manual",
                                                   class2use_file = "data/config/class2use.mat",
                                                   hdr_folder = "data/data",
                                                   verbose = FALSE)
The coordinates and biovolume data are now combined into a single unified dataframe.
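For example, the position data can be joined to the biovolume summary on the shared sample identifier (a minimal sketch; adjust the join key to match your data):

# Combine biovolume and position data into a single data frame
# (assumes both tables contain a shared "sample" column)
data_manual <- manual_biovolume_data %>%
  left_join(gps_data, by = "sample")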
Event Core
Parent Events
Each event can belong to a higher-level event, referred to as a Parent Event. In this example, the Parent Event is the dataset itself, and the main events are the samples. However, in more extensive datasets, Parent Events could represent broader categories, such as cruises, instrument numbers, specific years, or other hierarchical groupings. Later we can add MoF data to each event level.
Each Parent Event must have a unique, persistent identifier, referred to as parentEventID. We generate these identifiers using the uuid package (Urbanek and Ts’o 2021), ensuring they are both globally unique and consistent.
Additionally, each Parent Event is associated with specific date ranges, which must be captured as eventDate to reflect the temporal span of the observations or data collection. This helps provide clear temporal context for the data. Other project-specific terms can be defined here as well, such as datasetName.
# Add a single parentEventID for all samples in the dataset
data_event <- data_manual %>%
  mutate(parentEventID = uuid::UUIDgenerate(use.time = FALSE))

# Event Date and info for parentEvents
data_parent_event <- data_event %>%
  group_by(parentEventID) %>%
  summarise(min = min(date), max = max(date),
            ifcb_number = unique(ifcb_number)) %>%
  mutate(eventID = parentEventID,
         parentEventID = NA,
         eventType = "Project",
         datasetName = "iRfcb-DwC-A",
         eventDate = paste0(min, "/", max)) %>%
  select(-min, -max)
Sample Events
To organize our data effectively, we start by defining key event terms. These include general terms like eventType and ownerInstitutionCode. The institutionID for many European institutions can be retrieved from various registries, such as the European Directory of Marine Organisations (EDMO). Each term represents specific metadata associated with the collected data. Below is the annotated R code that processes and structures the data into an event-focused format.
# Add metadata columns to the data
data_event <- data_event %>%
  mutate(
    # Defining the institution that owns and is responsible for the data
    ownerInstitutionCode = "SMHI",
    institutionID = "https://edmo.seadatanet.org/report/545",
    institutionCode = "SMHI",
    # Defining CC-BY data licence
    license = "http://creativecommons.org/licenses/by/4.0/legalcode",
    # Specifying the type of record, which is Sample in this case
    eventType = "Sample",
    # Mapping existing date and time fields to standard terms
    eventDate = date,
    eventTime = time,
    # Adding geographical information
    decimalLatitude = gpsLatitude,
    decimalLongitude = gpsLongitude,
    locality = NA, # The specific description of the place, if available
    verbatimLocality = NA, # The original textual description of the place
    geodeticDatum = "EPSG:4326",
    countryCode = "SE",
    country = "Sweden",
    # Specifying the size and unit of the sample analyzed
    sampleSizeValue = ml_analyzed,
    sampleSizeUnit = "Millilitres",
    # Indicating the depth at which samples were taken
    minimumDepthInMeters = 4,
    maximumDepthInMeters = 4,
    # Describing the sampling protocol
    samplingProtocol = "Imaging FlowCytobot integrated into the Ferrybox system aboard the R/V Svea, continuously capturing plankton images from a depth of 4 meters"
  ) %>%
  # Grouping by sample to assign unique event IDs
  group_by(sample) %>%
  mutate(eventID = uuid::UUIDgenerate(use.time = FALSE)) %>%
  ungroup()
If the exact sample position is unknown, the coordinates can be estimated and paired with a coordinateUncertaintyInMeters value. In this case, the samples originate from the Swedish west coast, and we assign coordinates within the Skagerrak and Kattegat region and specify an uncertainty of 150 km, which encompasses most of these areas.
# Add estimated coordinates and uncertainty for events with missing positions
data_event <- data_event %>%
  mutate(
    coordinateUncertaintyInMeters = if_else(is.na(decimalLongitude) & is.na(decimalLatitude),
                                            150000, NA),
    decimalLongitude = if_else(is.na(decimalLongitude),
                               11.3, decimalLongitude),
    decimalLatitude = if_else(is.na(decimalLatitude),
                              57.4, decimalLatitude)
  )
Next, we extract the relevant columns for the Event tables and combine the Event and Parent Event tables into a single data frame.
# Create a clean data frame with selected columns
event_df <- data_event %>%
  select(eventType, ownerInstitutionCode, institutionCode, institutionID,
         parentEventID, eventID, license, samplingProtocol, sampleSizeValue,
         sampleSizeUnit, eventDate, eventTime, year, month, day, country,
         countryCode, decimalLatitude, decimalLongitude, geodeticDatum,
         coordinateUncertaintyInMeters, locality, verbatimLocality,
         minimumDepthInMeters, maximumDepthInMeters) %>%
  mutate(eventDate = as.character(eventDate)) %>%
  # Ensure rows are unique
  distinct()

# Create a clean data frame with selected columns
parent_event_df <- data_parent_event %>%
  select(-ifcb_number)

# Adjust eventDate to character format and append additional parent event data
event_df <- parent_event_df %>%
  mutate(eventDate = as.character(eventDate)) %>%
  bind_rows(event_df)

# Print the final table as tibble
tibble(event_df)
## # A tibble: 10 × 26
## parentEventID eventID eventType datasetName eventDate ownerInstitutionCode
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA 3ca939… Project iRfcb-DwC-A 2022-05-… NA
## 2 3ca93920-a1fd-4… 6f535a… Sample NA 2022-05-… SMHI
## 3 3ca93920-a1fd-4… 68cf0f… Sample NA 2022-05-… SMHI
## 4 3ca93920-a1fd-4… e6938c… Sample NA 2022-07-… SMHI
## 5 3ca93920-a1fd-4… 72f817… Sample NA 2022-07-… SMHI
## 6 3ca93920-a1fd-4… 68cf47… Sample NA 2023-03-… SMHI
## 7 3ca93920-a1fd-4… 1d6cda… Sample NA 2023-03-… SMHI
## 8 3ca93920-a1fd-4… 36fc01… Sample NA 2023-08-… SMHI
## 9 3ca93920-a1fd-4… 9efef6… Sample NA 2023-09-… SMHI
## 10 3ca93920-a1fd-4… 8086c6… Sample NA 2023-09-… SMHI
## # ℹ 20 more variables: institutionCode <chr>, institutionID <chr>,
## # license <chr>, samplingProtocol <chr>, sampleSizeValue <dbl>,
## # sampleSizeUnit <chr>, eventTime <time>, year <dbl>, month <dbl>, day <int>,
## # country <chr>, countryCode <chr>, decimalLatitude <dbl>,
## # decimalLongitude <dbl>, geodeticDatum <chr>,
## # coordinateUncertaintyInMeters <dbl>, locality <lgl>,
## # verbatimLocality <lgl>, minimumDepthInMeters <dbl>, …
The final stage is to initialise an event object with the LivingNorwayR package; this will be used later to build the DwC-compliant data package.
GBIF_Event <- initializeGBIFEvent(event_df,
                                  idColumnInfo = "eventID",
                                  nameAutoMap = TRUE)
Occurrence Extension
The Occurrence table captures information about individual organisms or observations, linking them to a specific event. For IFCB data, the basisOfRecord indicates how the observation was made. IFCB data are defined as MachineObservation, and since this example uses manually annotated images, the identificationVerificationStatus is set to ValidatedByHuman. For best practices in plankton imaging data management, see Martin-Cabrera et al. (2022).
Annotated Code: Define Occurrence Data
Each class observation is considered an occurrence in the IFCB data. The following code transforms the data into an occurrence table, adding essential fields such as type, collectionCode, occurrenceID, basisOfRecord, identificationVerificationStatus, identificationReferences, identifiedBy, and associatedMedia. These fields provide context and provenance information for each occurrence.
Links to raw images can be included in associatedMedia. These links may point to resources such as the IFCB Dashboard, Ecotaxa, or other image archives. Learn how to prepare images for Ecotaxa using iRfcb in this tutorial, or how to export images to an image repository.
# Create an occurrence table by transforming event data and adding fields
data_occurrences <- data_event %>%
  rowwise() %>%
  mutate(
    type = "StillImage", # Specifies the record type as an image
    collectionCode = "iRfcb", # Provides a collection identifier
    occurrenceID = uuid::UUIDgenerate(use.time = FALSE), # Generates a unique identifier for each occurrence
    basisOfRecord = "MachineObservation", # Indicates the data was recorded by a machine
    identificationVerificationStatus = "ValidatedByHuman", # Indicates that the images have been validated
    identificationReferences = "https://github.com/hsosik/ifcb-analysis/wiki/Instructions-for-manual-annotation-of-images",
    identifiedBy = "John Doe", # Indicates who validated the image
    associatedMedia = "https://ecotaxa.obs-vlfr.fr/prj/14392" # Link to images (if available)
  )
Taxonomic Data Cleaning and Retrieval
Class names often include excess or inconsistent information, such as underscores or morphological descriptors, which can complicate the assignment of proper taxonomical names needed for the occurrence table. These names need to be cleaned before mapping them to higher taxonomic levels using external sources like WoRMS, as demonstrated below.
# Get taxa names
taxa_names <- unique(data_occurrences$class)
# Clean taxa_names by substituting specific patterns with spaces or empty strings
taxa_names_clean <- gsub("_", " ", taxa_names)
taxa_names_clean <- gsub(" single cell", "", taxa_names_clean)
taxa_names_clean <- gsub(" chain", "", taxa_names_clean)
taxa_names_clean <- gsub(" group", "", taxa_names_clean)
taxa_names_clean <- gsub("-like", "", taxa_names_clean)
taxa_names_clean <- gsub(" larger than 30unidentified", "", taxa_names_clean)
taxa_names_clean <- gsub(" smaller than 30unidentified", "", taxa_names_clean)
# Remove species flags from class names
taxa_names_clean <- gsub("\\<spp\\>", "", taxa_names_clean)
taxa_names_clean <- gsub("  ", " ", taxa_names_clean) # Collapse double spaces left after removal
# Turn f to f. for forma
taxa_names_clean <- gsub("\\bf\\b", "f.", taxa_names_clean)
# Add "/" for multiple names with capital letters
# e.g. Heterocapsa_Azadinium to Heterocapsa/Azadinium
taxa_names_clean <- gsub(" ([A-Z])", "/\\1", taxa_names_clean)
taxa_names_clean <- gsub(" ([A-Z])", "/\\1", taxa_names_clean)
# Remove any whitespace
taxa_names_clean <- trimws(taxa_names_clean)
# Correct misspellings
taxa_names_clean <- gsub("Amphidnium", "Amphidinium", taxa_names_clean)
taxa_names_clean <- gsub("Enisiculifera", "Ensiculifera", taxa_names_clean)
# Standardize ambiguous class names by renaming them to their closest taxonomic relatives
taxa_names_clean <- gsub("Dinoflagellate", "Dinophyceae", taxa_names_clean)
taxa_names_clean <- gsub("Leptocylindrus danicus minimus", "Leptocylindrus", taxa_names_clean)
taxa_names_clean <- gsub("Heterocapsa/Azadinium", "Peridiniphycidae", taxa_names_clean)
taxa_names_clean <- gsub("Cylindrotheca/Nitzschia longissima", "Bacillariaceae", taxa_names_clean)
# Retrieve worms records
worms_records <- ifcb_match_taxa_names(taxa_names_clean,
                                       marine_only = FALSE,
                                       fuzzy = FALSE,
                                       verbose = FALSE)

# Create data frame with taxa information and class names
class_names <- worms_records %>%
  mutate(class_name = taxa_names, class_clean = taxa_names_clean)
The scientificName and verbatimIdentification fields are populated using the cleaned taxonomic names and the original class names, respectively.
data_occurrences <- data_occurrences %>%
  rename(class_name = class) %>%
  left_join(class_names, by = "class_name") %>%
  mutate(scientificName = name,
         scientificNameAuthorship = authority,
         verbatimIdentification = class_name,
         scientificNameID = lsid,
         taxonRank = rank,
         occurrenceStatus = "present")
The final Occurrence table includes all relevant fields for DwC-A formatting.
# Select relevant fields
occurrence_df <- data_occurrences %>%
  select(occurrenceID, eventID, eventDate, occurrenceStatus, collectionCode,
         type, basisOfRecord, identificationVerificationStatus,
         identificationReferences, identifiedBy, associatedMedia,
         scientificName, scientificNameAuthorship, scientificNameID,
         taxonRank, kingdom, phylum, class, order, family, genus,
         verbatimIdentification)

# Print the final table as tibble
tibble(occurrence_df)
## # A tibble: 101 × 22
## occurrenceID eventID eventDate occurrenceStatus collectionCode type
## <chr> <chr> <date> <chr> <chr> <chr>
## 1 cf3f8fb8-7684-48f8-… 6f535a… 2022-05-22 present iRfcb Stil…
## 2 696f5410-e600-4f23-… 6f535a… 2022-05-22 present iRfcb Stil…
## 3 8e73fad8-0853-4d8b-… 6f535a… 2022-05-22 present iRfcb Stil…
## 4 6ff04ba4-e596-4a63-… 6f535a… 2022-05-22 present iRfcb Stil…
## 5 5cea123e-a3a5-4f51-… 68cf0f… 2022-05-22 present iRfcb Stil…
## 6 3d7dcf6a-397e-439e-… e6938c… 2022-07-12 present iRfcb Stil…
## 7 f954114e-4860-4214-… e6938c… 2022-07-12 present iRfcb Stil…
## 8 a73d4a79-19c5-4e5f-… e6938c… 2022-07-12 present iRfcb Stil…
## 9 d5a6e26e-05a9-4924-… 72f817… 2022-07-12 present iRfcb Stil…
## 10 0b65e4d7-ddff-4681-… 72f817… 2022-07-12 present iRfcb Stil…
## # ℹ 91 more rows
## # ℹ 16 more variables: basisOfRecord <chr>,
## # identificationVerificationStatus <chr>, identificationReferences <chr>,
## # identifiedBy <chr>, associatedMedia <chr>, scientificName <chr>,
## # scientificNameAuthorship <chr>, scientificNameID <chr>, taxonRank <chr>,
## # kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## # genus <chr>, verbatimIdentification <chr>
The occurrence data is initialized for GBIF submission using the initializeGBIFOccurrence function, which maps fields automatically based on the specified column.
GBIF_Occurrence <- initializeGBIFOccurrence(occurrence_df,
                                            idColumnInfo = "occurrenceID",
                                            nameAutoMap = TRUE)
MoF Extension
The MoF table captures additional measurements and facts, such as biological or environmental measurements, associated with the events (samples) or the occurrences. For IFCB data, we can include information such as counts, abundance, biovolume concentration, and carbon content. These measurements provide essential context for understanding the ecological significance of the observations.
# Create a dataset for occurrences (no modifications made here)
data_occurrence_mof <- data_occurrences

# Add placeholder for occurrence IDs in the event dataset
data_event_mof <- data_event %>%
  mutate(occurrenceID = NA)

# Add placeholders for occurrence IDs and specify instrument type in the parent event dataset
data_parent_mof <- data_parent_event %>%
  mutate(occurrenceID = NA,
         instrument = "IFCB")
Next, we extract the necessary columns from the dataset that will be used in the MoF table. This includes the measurement ID, associated event and occurrence IDs, and key IFCB-derived measurements such as counts, abundance, biovolume, and carbon concentration.
# Convert biovolume units and select the relevant columns for occurrence MoF
data_occurrence_mof <- data_occurrence_mof %>%
  mutate(biovolume_um3_per_liter = biovolume_mm3_per_liter * 10^9) %>%
  select(eventID, parentEventID, occurrenceID, counts,
         counts_per_liter, biovolume_um3_per_liter, carbon_ug_per_liter)

# Select the relevant columns for parentEvent MoF
data_parent_mof <- data_parent_mof %>%
  select(eventID, parentEventID, occurrenceID, instrument, ifcb_number)

# Select the relevant columns for event MoF
data_event_mof <- data_event_mof %>%
  select(eventID, parentEventID, occurrenceID, ml_analyzed) %>%
  distinct()
The table needs to be transformed into a “long format,” where all measurements are placed into a single column called measurementType, with their corresponding values in measurementValue. This is done using the pivot_longer function.
We also standardize the measurement types to align with controlled vocabularies (e.g., Abundance, Biovolume concentration) for better compatibility with global biodiversity standards.
# Pivot and standardize occurrence measurements
data_occurrence_mof <- data_occurrence_mof %>%
  pivot_longer(cols = c(counts, counts_per_liter, biovolume_um3_per_liter, carbon_ug_per_liter),
               names_to = "measurementType",
               values_to = "measurementValue") %>%
  drop_na(measurementValue) %>%
  mutate(measurementType = gsub("counts_per_liter", "Abundance", measurementType)) %>%
  mutate(measurementType = gsub("biovolume_um3_per_liter", "Biovolume concentration", measurementType)) %>%
  mutate(measurementType = gsub("carbon_ug_per_liter", "Carbon content", measurementType)) %>%
  mutate(measurementType = gsub("counts", "Count", measurementType)) %>%
  mutate(measurementValue = as.character(measurementValue))

# Pivot and standardize event measurements
data_event_mof <- data_event_mof %>%
  mutate(ml_analyzed = as.character(ml_analyzed)) %>%
  pivot_longer(cols = c(ml_analyzed),
               names_to = "measurementType",
               values_to = "measurementValue") %>%
  mutate(measurementType = gsub("ml_analyzed", "Sample volume", measurementType))

# Pivot and standardize parent measurements
data_parent_mof <- data_parent_mof %>%
  pivot_longer(cols = c(instrument, ifcb_number),
               names_to = "measurementType",
               values_to = "measurementValue") %>%
  mutate(measurementType = gsub("instrument", "Imaging instrument name", measurementType)) %>%
  mutate(measurementType = gsub("ifcb_number", "Instrument identification number", measurementType))

# Combine all standardized measurements into a single dataset and add unique measurementIDs for each measurement
data_mof <- bind_rows(data_parent_mof, data_event_mof, data_occurrence_mof) %>%
  rowwise() %>%
  mutate(measurementID = uuid::UUIDgenerate(use.time = FALSE))
To ensure data interoperability, we match each measurement type with its corresponding controlled vocabulary terms from the NERC Vocabulary Server. This includes assigning measurementTypeID, measurementUnit, and measurementUnitID for each measurementType.
# Create a lookup table for NERC vocabularies
nerc_vocab <- data.frame(
  measurementValueID = c(rep(NA, 6),
                         "http://vocab.nerc.ac.uk/collection/L22/current/TOOL1588/"),
  measurementType = c("Count",
                      "Abundance",
                      "Biovolume concentration",
                      "Carbon content",
                      "Sample volume",
                      "Instrument identification number",
                      "Imaging instrument name"),
  measurementTypeID = c("http://vocab.nerc.ac.uk/collection/P01/current/OCOUNT01/",
                        "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01",
                        "http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/",
                        "http://vocab.nerc.ac.uk/collection/P01/current/MDMAP010/",
                        "http://vocab.nerc.ac.uk/collection/P01/current/VOLXXXXX/",
                        "http://vocab.nerc.ac.uk/collection/P01/current/SERNUMZZ/",
                        "http://vocab.nerc.ac.uk/collection/P01/current/NMSPINST/"),
  measurementUnit = c("Dimensionless",
                      "Individual per litre",
                      "Cubic micrometres per litre",
                      "Micrograms per litre",
                      "Millilitres",
                      "Not applicable",
                      "Not applicable"),
  measurementUnitID = c("http://vocab.nerc.ac.uk/collection/P06/current/UUUU/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/UCPL/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/CUPL/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/UGPL/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/VVML/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/XXXX/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/XXXX/"))

# Merge the data with NERC vocabularies and relocate columns
mof_df <- data_mof %>%
  left_join(nerc_vocab, by = "measurementType") %>%
  relocate(measurementTypeID, .after = measurementType) %>%
  relocate(measurementID, .before = 1)

# Print the final table as tibble
tibble(mof_df)
## # A tibble: 415 × 10
## measurementID eventID parentEventID occurrenceID measurementType
## <chr> <chr> <chr> <chr> <chr>
## 1 29669812-9c76-4c2d-8ecb-8… 3ca939… NA NA Imaging instru…
## 2 28b5cd12-9b4e-43de-9537-7… 3ca939… NA NA Instrument ide…
## 3 aa286129-6341-4fb0-aaef-2… 6f535a… 3ca93920-a1f… NA Sample volume
## 4 3e6c9c32-fd1f-4423-9403-f… 68cf0f… 3ca93920-a1f… NA Sample volume
## 5 31a6f38a-f22d-40a1-9f68-c… e6938c… 3ca93920-a1f… NA Sample volume
## 6 fcfb3c8d-e0b0-4b72-a52c-4… 72f817… 3ca93920-a1f… NA Sample volume
## 7 9fa61716-50bf-4ce8-8b93-6… 68cf47… 3ca93920-a1f… NA Sample volume
## 8 66edc03b-0dac-40b6-913f-8… 1d6cda… 3ca93920-a1f… NA Sample volume
## 9 2f7fe684-0805-4228-b79b-b… 36fc01… 3ca93920-a1f… NA Sample volume
## 10 98a7ea91-701a-464d-b953-a… 9efef6… 3ca93920-a1f… NA Sample volume
## # ℹ 405 more rows
## # ℹ 5 more variables: measurementTypeID <chr>, measurementValue <chr>,
## # measurementValueID <chr>, measurementUnit <chr>, measurementUnitID <chr>
Finally, the extended measurement or fact table is prepared for GBIF by initializing the dataset in a compatible format using the initializeGBIFMeasurementOrFact function. The measurement IDs are used as unique identifiers, and columns are mapped automatically.
GBIF_MoF <- initializeGBIFMeasurementOrFact(mof_df,
                                            idColumnInfo = "measurementID",
                                            nameAutoMap = TRUE)
Metadata
Metadata is an integral part of biodiversity datasets, as it ensures data discoverability, transparency, and reusability. The Ecological Metadata Language (EML) is a widely accepted XML-based standard used to describe datasets, including those formatted as DwC-A. For datasets intended for submission to platforms like GBIF, EML provides a robust structure to communicate essential information about the dataset’s scope, methods, and contributors.
Creating Metadata for IFCB Data
To create metadata for the IFCB dataset in a standardized format, we use the initializeDwCMetadata function. This function generates a template that follows DwC guidelines, which can then be customized with specific details about the dataset.
Here, we create the metadata starting with an R Markdown file (metadata-template.rmd), which will be rendered into an EML-compliant XML file for GBIF submission:
# Initialize DwC metadata using an R Markdown template
GBIF_Metadata <- initializeDwCMetadata("metadata-template.rmd",
                                       fileType = "rmarkdown")
DwC-A Creation
In this step, the finalized tables (GBIF_Event, GBIF_Occurrence, and GBIF_MoF) and the metadata (GBIF_Metadata) are bundled together into a DwC-A .zip file that is ready for submission to platforms like GBIF using an Integrated Publishing Toolkit (IPT), e.g. the EurOBIS IPT.
Here is the code for initializing and exporting the DwC-A:
# Initialize the DwC-A
dwca_archive <- initializeDwCArchive(GBIF_Event,
                                     list(GBIF_Occurrence, GBIF_MoF),
                                     GBIF_Metadata)

# Export the archive as a zip file
dwca_archive$exportAsDwCArchive(file.path(data_dir, "iRfcb-DwC-A.zip"))
This concludes the tutorial for the iRfcb package. For more detailed information, refer to the package documentation, the other tutorials, and the LivingNorwayR documentation. See how more complex data pipelines can be constructed using iRfcb in the following Example Project. Happy analyzing!
Citation
## To cite package 'iRfcb' in publications use:
##
## Anders Torstensson (2025). I 'R' FlowCytobot (iRfcb): Tools for
## Analyzing and Processing Data from the IFCB. R package version 0.4.0.
## https://doi.org/10.5281/zenodo.12533225
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {I 'R' FlowCytobot (iRfcb): Tools for Analyzing and Processing Data from the IFCB},
## author = {Anders Torstensson},
## year = {2025},
## note = {R package version 0.4.0},
## url = {https://doi.org/10.5281/zenodo.12533225},
## }
References
- Martin-Cabrera, P., Perez Perez, R., Irisson, J.-O., Lombard, F., Möller, K.O., Rühl, S., Creach, V., Lindh, M., Stemmann, L., Schepers, L. (2022). Best practices and recommendations for plankton imaging data management: ensuring effective data flow towards international data infrastructures. Version 1. Flanders Marine Institute: Ostend. 31 pp. https://dx.doi.org/10.25607/OBP-1742.
- Sosik, H. M. and Olson, R. J. (2007). Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry. Limnol. Oceanogr.: Methods 5, 204–216.
- Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3
- Urbanek, Simon, and Theodore Ts’o. 2021. Uuid: Tools for Generating and Handling of UUIDs. https://CRAN.R-project.org/package=uuid.