1.2.1. Reading BIDS datasets
ClinicaDL reads neuroimaging datasets organised in the BIDS format, as well as BIDS derivatives, and CAPS directories produced by Clinica.
Reading a BIDS dataset involves three objects:
- a Bids, which knows how to navigate a BIDS-like directory;
- a BidsFileType, which describes which files you want;
- a BidsDataset, which ties the two together and loads the selected files into Samples.
1.2.1.2. Describing the files to load: BidsFileType
A BidsFileType defines the NIfTI files you are interested in. It is expressed in the vocabulary of the BIDS specification: a suffix, a data type (the folder, e.g. "anat"), a file extension, and the entities the files must (or must not) contain. Regular expressions are accepted everywhere.
For example, to select all the isotropic T1-weighted images registered to the MNI space:
from clinicadl.io.bids import BidsFileType

file_type = BidsFileType(
    data_type="anat",
    suffix="T1w",
    extension=".nii.gz",
    with_entities={"space": r"MNI152.*", "res": "1x1x1"},
)
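To make these matching rules concrete, here is a minimal sketch of how a BIDS file name can be split into entities, suffix and extension, and then tested against regular expressions. It uses only the standard library. `parse_bids_name` and `matches` are hypothetical helpers written for illustration, not part of ClinicaDL, and matching each pattern against the whole value (with `re.fullmatch`) is an assumption about the intended behaviour.

```python
import re


def parse_bids_name(filename: str) -> dict:
    """Split a BIDS file name into entities, suffix and extension (hypothetical helper)."""
    # Everything after the first dot is the extension, e.g. ".nii.gz".
    stem, dot, ext = filename.partition(".")
    # The last underscore-separated chunk is the suffix; the others are key-value entities.
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return {"entities": entities, "suffix": suffix, "extension": dot + ext}


def matches(filename: str, suffix: str, extension: str, with_entities: dict) -> bool:
    """Return True if the file name satisfies every pattern (assumed full-match semantics)."""
    parsed = parse_bids_name(filename)
    if not re.fullmatch(suffix, parsed["suffix"]):
        return False
    if not re.fullmatch(extension, parsed["extension"]):
        return False
    for key, pattern in with_entities.items():
        value = parsed["entities"].get(key)
        if value is None or not re.fullmatch(pattern, value):
            return False
    return True


wanted = {"space": r"MNI152.*", "res": "1x1x1"}
registered = "sub-001_ses-M000_space-MNI152NLin2009cSym_res-1x1x1_T1w.nii.gz"
raw = "sub-001_ses-M000_T1w.nii.gz"

print(matches(registered, "T1w", r"\.nii\.gz", wanted))
print(matches(raw, "T1w", r"\.nii\.gz", wanted))
```

The raw image is rejected because it carries neither a `space` nor a `res` entity, which is the kind of filtering the `with_entities` argument above expresses.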
Clinica preprocessing pipelines
If your data has been preprocessed with Clinica, you do not have to write
the file type by hand: ClinicaDL ships ready-made BidsFileType subclasses that
match the output of the most common Clinica pipelines.
| File type | Selects |
|---|---|
| T1Linear | T1-weighted images from t1-linear |
| | FLAIR images from flair-linear |
| PetLinear | PET images from pet-linear |
| | DTI-based measures from dwi-dti |
from clinicadl.io.bids import T1Linear, PetLinear
t1 = T1Linear()
pet = PetLinear(tracer="18FFDG", suvr_reference_region="pons")
1.2.1.3. The BidsDataset
A BidsDataset is the object you will use most.
It reads a BIDS directory, loads the files described by a BidsFileType, and returns
one Sample per element of the data.
Consider a dataset whose metadata are stored in a TSV file:
bids_directory
├── dataset_description.json
├── metadata.tsv
├── sub-001
│   ├── ses-M000
│   │   └── anat
│   │       └── sub-001_ses-M000_T1w.nii.gz
│   ...
...
# metadata.tsv
participant_id session_id age diagnosis
sub-001 ses-M000 55.0 control
sub-002 ses-M000 62.0 patient
...
The simplest dataset selects the T1 images of the participants listed in the TSV file:
from clinicadl.data.datasets import BidsDataset
from clinicadl.io.bids import BidsFileType

dataset = BidsDataset(
    bids="bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
)
>>> len(dataset)
50 # one sample per line of metadata.tsv
>>> dataset[0]
Sample(Keys: ('file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 1)
>>> dataset[0].participant_id, dataset[0].session_id
('sub-001', 'ses-M000')
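As a rough illustration of what "one sample per line of metadata.tsv" means, the sketch below reads the TSV with the standard library and builds the path where each T1w image would be expected under the layout shown above. `expected_t1w_paths` is a hypothetical helper, not a ClinicaDL function, and the real lookup is done by the library rather than by string formatting.

```python
import csv
import io
from pathlib import PurePosixPath

# Inline copy of the first rows of metadata.tsv shown above (tab-separated).
METADATA_TSV = (
    "participant_id\tsession_id\tage\tdiagnosis\n"
    "sub-001\tses-M000\t55.0\tcontrol\n"
    "sub-002\tses-M000\t62.0\tpatient\n"
)


def expected_t1w_paths(tsv_text: str, root: str = "bids_directory") -> list:
    """Return one expected T1w path per (participant_id, session_id) row (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    paths = []
    for row in reader:
        sub, ses = row["participant_id"], row["session_id"]
        # BIDS layout: <root>/<sub>/<ses>/<data type>/<sub>_<ses>_<suffix><extension>
        paths.append(PurePosixPath(root, sub, ses, "anat", f"{sub}_{ses}_T1w.nii.gz"))
    return paths


for path in expected_t1w_paths(METADATA_TSV):
    print(path)
```

Each row of the TSV yields exactly one path, which mirrors the one-to-one relation between metadata lines and samples in the simplest dataset.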
A few arguments shape what a BidsDataset contains and will output:
data
A pandas.DataFrame (or a path to a TSV file) listing the (participant_id, session_id) pairs to keep, plus any extra columns. If omitted, all the pairs that have the requested file_type are used.

columns
The columns of data to carry into each Sample. You can pass a list of column names, or a dictionary mapping a column name to a function applied to the whole column, for instance to encode string labels as integers:

import pandas as pd

def encode_diagnosis(column: pd.Series) -> pd.Series:
    return column.map({"control": 0, "patient": 1})

dataset = BidsDataset(
    bids="bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
    columns={"age": None, "diagnosis": encode_diagnosis},
)

>>> dataset[0]["diagnosis"]
0

masks
Masks to load alongside each image, passed as a dictionary. The keys become the mask names in the Sample and the values describe where to find each mask: a single shared NIfTI file, a BidsFileType (for a participant- and session-specific mask in the same BIDS), or, as in the example below, a (directory, BidsFileType) pair for a participant- and session-specific mask stored in another BIDS.

bids_directory
├── dataset_description.json
├── metadata.tsv
├── sub-001
...
└── derivatives
    ├── registration
    │   ├── space-MNI152NLin2009cSym_mask.nii.gz
    │   ...
    └── masks
        ├── dataset_description.json
        ├── sub-001
        │   ├── ses-M000
        │   │   └── anat
        │   │       └── sub-001_ses-M000_label-brain_mask.nii.gz
        │   ...
        ...

dataset = BidsDataset(
    bids="bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    masks={
        "brain": (  # subject- and session-specific mask that is in another BIDS
            "bids_directory/derivatives/masks",
            BidsFileType(
                data_type="anat", suffix="mask", with_entities={"label": "brain"}
            ),
        ),
        # same mask for all (subject, session)
        "mni": "bids_directory/derivatives/registration/space-MNI152NLin2009cSym_mask.nii.gz",
    },
)

transforms
A TransformsHandler describing the transforms to apply and whether you work on entire images, patches or slices, as we shall see in the next section:

from clinicadl.transforms import TransformsHandler, extraction

dataset = BidsDataset(
    bids="bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
    transforms=TransformsHandler(extraction=extraction.Patch(patch_size=64)),
)

>>> dataset[0].spatial_shape
(64, 64, 64)  # a patch, not the full image
>>> len(dataset)
1800
Note
The length of a BidsDataset is the number of images times the number
of samples extracted per image. With 50 images and 36 patches each, the dataset
has \(50\times36 = 1800\) samples.
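The arithmetic in this note can be sketched as a mapping from a flat dataset index to an (image, sample) pair. The helper `locate` below is hypothetical, and the image-major ordering it assumes (all samples of image 0 first, then image 1, and so on) is an assumption made for illustration; the source does not state how ClinicaDL orders samples internally.

```python
def locate(index: int, n_images: int, samples_per_image: int) -> tuple:
    """Map a flat index to (image index, sample index within that image).

    Hypothetical helper assuming image-major ordering.
    """
    total = n_images * samples_per_image  # e.g. 50 * 36 = 1800
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} samples")
    # Quotient picks the image, remainder picks the sample inside it.
    return divmod(index, samples_per_image)


print(50 * 36)               # 1800
print(locate(0, 50, 36))     # (0, 0)
print(locate(36, 50, 36))    # (1, 0)
print(locate(1799, 50, 36))  # (49, 35)
```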
A BidsDataset exposes its metadata as a DataFrame through df, and can be restricted to a subset of (participant, session) pairs with subset().
The next section presents some advanced tips for neuroimaging data manipulation, like speeding up data loading, joining multiple datasets (e.g. coming from different cohorts), or reading non-BIDS-compliant datasets.