clinicadl.data.datasets.BidsDataset

class clinicadl.data.datasets.BidsDataset(bids: Path | str | Bids, file_type: BidsFileType, data: Path | str | DataFrame | None = None, transforms: TransformsHandler = TransformsHandler(), columns: Sequence[str] | dict[str, Callable[[Series], Series] | None] | None = None, masks: dict[str, Path | str | BidsFileType | tuple[Path | str | Bids, BidsFileType]] | None = None)

A Dataset working with neuroimaging data organized in BIDS (or derivative) format.

The user specifies the path to the BIDS directory via bids, the type of data to load via file_type, and the (participant, session) pairs to work on via data.

BidsDataset loads the image and any associated masks (see the masks argument), and puts them in a Sample. The user can add further data to this Sample via the columns argument.

Transformations (e.g., preprocessing or data augmentation) can be applied to the loaded data (see transforms argument).

With BidsDataset, it is possible to work on the whole images, or on patches or slices extracted from the images. This is also specified via the transforms argument (e.g., transforms=TransformsHandler(extraction=Slice())).

Note

The size of the BidsDataset depends on the type of data you are working on. For example, if you have 10 images with 100 slices each, and you want to work on slices, the length of your dataset will be \(10\times100=1,000\).
To avoid confusion, we will use the term “sample” to refer to the actual element of the images we are working on (patch, slice or the whole image).
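
The arithmetic in this note can be sketched in plain Python. The helper below is hypothetical (it is not part of ClinicaDL) and only illustrates that the length is the total number of samples, not the number of images.

```python
def dataset_length(samples_per_image: list[int]) -> int:
    """Total number of samples, given the number of samples each image yields.

    Hypothetical helper for illustration only; BidsDataset computes its own length.
    """
    return sum(samples_per_image)


# 10 images with 100 slices each, working on slices
slices_per_image = [100] * 10
print(dataset_length(slices_per_image))  # 1000

# the same 10 images, working on whole images (one sample per image)
print(dataset_length([1] * 10))  # 10
```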

Finally, you may be interested in to_tensors(), which converts your NIfTI images to tensors (saved in .pt files). Since opening a .pt file is much faster than opening a NIfTI file, this may speed up data loading.

Parameters:

bids (PathType | Bids) – The BIDS (or derivative) directory where the data will be loaded from. Can be passed as a path or directly as Bids.
file_type (BidsFileType) – Defines the files to load in the BIDS directory. The BidsFileType must contain the requirements necessary to select only the relevant files.
data (Optional[DataFrameType], default=None) –
A pandas.DataFrame (or a path to a TSV file containing the DataFrame) with the list of (participant, session) pairs to consider, as well as any other relevant information (e.g. the age of the participants). Only (participant, session) pairs mentioned in this TSV file will be in the BidsDataset.

If None, all (participant, session) pairs in bids that have the right file_type will be considered.

Warning

Be careful if you pass a DataFrame with a column named "n_samples". BidsDataset will understand it as the number of samples for each (participant, session) pair.
transforms (TransformsHandler, default=TransformsHandler()) – Transformation pipeline to apply to the data after loading. The user also specifies here whether to work on images, patches, or slices. See clinicadl.transforms.TransformsHandler.
columns (Optional[ColumnsType], default=None) –
Columns to get in the DataFrame data and to put in the output Sample.

Can be passed via:
- a list of strings (e.g. ["age", "sex"]), corresponding to the names of the columns;
- or a dictionary (e.g. {"age": <function>, "sex": None}), where the keys are the names of the columns, and the values are functions to apply to the columns. If the value is None, no function is applied to the column.

Note

A function passed in columns is applied to the whole column at once. It must take a pandas.Series as input and return a pandas.Series. This is useful, for example, to convert string labels to integer labels for classification.
masks (Optional[MasksType], default=None) –
Masks to be loaded along with the images.

The masks are passed via a dictionary, whose keys will be the names given to the masks in the output Sample, and whose values can be:
- a path (str or pathlib.Path) to a NIfTI image: the same mask is used for all the (participant, session) pairs.
- a BidsFileType: the mask is participant- and session-specific and the pattern to find the mask in the bids is given via the BidsFileType.
- a tuple (PathType | Bids, BidsFileType): the mask is participant- and session-specific but is not in the same BIDS dataset as the image. So, here the BIDS where to look for the mask must be passed in the first element of the tuple.
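
As a sketch of the kind of function the columns argument expects (a pandas.Series in, a pandas.Series out, applied to the whole column), here are two examples. The function names sex_to_number and standardize are hypothetical and not part of ClinicaDL; only pandas is used.

```python
import pandas as pd


def sex_to_number(column: pd.Series) -> pd.Series:
    """Encode string labels as integers (unknown labels become NaN with map)."""
    encoding = {"F": 0, "M": 1}
    return column.map(encoding)


def standardize(column: pd.Series) -> pd.Series:
    """Center and scale a numeric column using statistics of the whole column."""
    return (column - column.mean()) / column.std()


sex = pd.Series(["M", "F", "F"])
age = pd.Series([55.0, 57.0, 62.0])

print(sex_to_number(sex).tolist())  # [1, 0, 0]
print(standardize(age))
```

Under these assumptions, the functions could be passed as columns={"sex": sex_to_number, "age": standardize}. Because standardize relies on the mean and standard deviation of the column it receives, its output depends on which (participant, session) pairs are in data.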

Raises:

DataFrameError – If the DataFrame in data is empty.
DataFrameError – If the DataFrame in data does not contain the columns "participant_id" and "session_id".
DataFrameError – If the DataFrame in data contains duplicated (participant_id, session_id) pairs.
RuntimeError – If for some (participant, session) pairs, an image corresponding to file_type cannot be found in bids.
ValueError – If the same key is used in both columns and masks.
ValueError – If a key in columns or masks is equal to the name of one of the attributes of Sample.
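
The three DataFrameError conditions above can be checked on a TSV file before it is passed to BidsDataset. The following is a minimal sketch using only the standard library; check_metadata is a hypothetical helper, not a ClinicaDL function, and it does not reproduce ClinicaDL's own validation.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ("participant_id", "session_id")


def check_metadata(tsv_text: str) -> list[str]:
    """Return a list of problems found in a TSV; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    fieldnames = reader.fieldnames or []

    missing = [name for name in REQUIRED_COLUMNS if name not in fieldnames]
    if missing:
        # without these columns the other checks cannot be run
        return [f"missing columns: {missing}"]

    problems = []
    if not rows:
        problems.append("no rows")

    # count each (participant, session) pair to detect duplicates
    counts = Counter((row["participant_id"], row["session_id"]) for row in rows)
    duplicates = sorted(pair for pair, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicated pairs: {duplicates}")
    return problems


tsv = (
    "participant_id\tsession_id\tage\n"
    "sub-001\tses-M000\t55.0\n"
    "sub-001\tses-M000\t55.0\n"
)
print(check_metadata(tsv))
```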

Examples

bids
├── dataset_description.json
├── metadata.tsv
├── sub-001
│   ├── ses-M000
│   │   └── anat
│   │       ├── sub-001_ses-M000_T1w.nii.gz
│   │       └── sub-001_ses-M000_label-head_mask.nii.gz
│   ...
...
└── derivatives
    ├── registration
    │   ├── space-MNI152NLin2009cSym_mask.nii.gz
    │   ...
    └── masks
        ├── dataset_description.json
        ├── sub-001
        │   ├── ses-M000
        │   │   └── anat
        │   │       └── sub-001_ses-M000_label-brain_mask.nii.gz
        │   ...
        ...

The "metadata.tsv" file looks like:

participant_id  session_id   age   sex   diagnosis
sub-001         ses-M000     55.0  M     control
sub-001         ses-M024     57.0  M     control
sub-002         ses-M000     62.0  F     control
sub-002         ses-M024     64.0  F     patient
sub-003         ses-M000     67.0  F     patient
...

from clinicadl.data.datasets import BidsDataset
from clinicadl.io.bids import BidsFileType
from clinicadl.transforms import TransformsHandler, extraction
from clinicadl.transforms.config import (
    ZNormalizationConfig,
    ResampleConfig,
    RandomFlipConfig,
)
import pandas as pd

# to convert diagnosis to numeric values
def diagnosis_to_number(column: pd.Series) -> pd.Series:
    encoding = {"control": 0, "patient": 1}
    return column.apply(lambda x: encoding[x])

>>> dataset = BidsDataset(
        bids="bids",
        file_type=BidsFileType(
            data_type="anat",
            suffix="T1w",
        ),
        data="bids/metadata.tsv",
        columns=["age"],
    )
>>> dataset[0]
Sample(Keys: ('age', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 1)
>>> dataset[0].spatial_shape
(169, 208, 179)    # full image
>>> len(dataset)
50    # 50 lines in the metadata.tsv
>>> dataset[0].participant_id, dataset[0].session_id, dataset[0].age
'sub-001', 'ses-M000', 55.0

>>> dataset = BidsDataset(
        bids="bids",
        file_type=BidsFileType(
            data_type="anat",
            suffix="T1w",
        ),
        data="bids/metadata.tsv",
        columns={"age": None, "diagnosis": diagnosis_to_number},
        transforms=TransformsHandler(
            extraction=extraction.Patch(patch_size=64),
        ),
    )
>>> dataset[0]["diagnosis"]
0    # diagnosis is now encoded
>>> dataset[0].spatial_shape
(64, 64, 64)    # patches
>>> len(dataset)
1800    # 36 patches per image

>>> dataset = BidsDataset(
        bids="bids",
        file_type=BidsFileType(
            data_type="anat",
            suffix="T1w",
        ),
        transforms=TransformsHandler(
            image_transforms=[
                ResampleConfig(target="mni"),    # masks can be used in transforms
                ZNormalizationConfig(masking_method="head"),
            ],
            augmentations=[RandomFlipConfig(flip_probability=0.5)],
        ),
        masks={
            "head": BidsFileType(
                data_type="anat", suffix="mask", with_entities={"label": "head"}
            ),    # participant- and session-specific mask that is in the same BIDS
            "brain": (
                "bids/derivatives/masks",
                BidsFileType(
                    data_type="anat", suffix="mask", with_entities={"label": "brain"}
                ),    # participant- and session-specific mask that is in another BIDS
            ),
            "mni": "bids/derivatives/registration/space-MNI152NLin2009cSym_mask.nii.gz",    # same mask for all (participant, session)
        },
    )
>>> dataset[0]
Sample(Keys: ('head', 'brain', 'mni', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 4)
>>> len(dataset)
60    # all the (participant, session) that have T1w images. Not only the ones in metadata.tsv

to_tensors(conversion_name: str | None = None, spatial_checks: Iterable[str | SpatialCheck] | None = ('affine', 'shape', 'global_spacing'), save_transforms: bool = False, description: str | None = None, overwrite: bool = False, check_transforms: bool = True, n_proc: int = 1) → TensorDataset

Converts NIfTI files in the current BidsDataset to tensors (in PyTorch’s .pt format).

Conversion to tensors may significantly speed up data loading.

The tensors are saved in a BIDS derivative named tensors. The location of this folder relative to the original BIDS depends on the type of BIDS (see clinicadl.io.bids.Bids). A .json file describing the conversion is also saved at the root of tensors, as well as other metadata files (see examples).

Masks will be converted and saved in the same file as the image they are associated with.

The user can choose to save transformed images, i.e. images on which image transforms have already been applied (see image_transforms in clinicadl.transforms.TransformsHandler). This speeds up data loading during training or inference, as the images do not have to be transformed each time they are loaded.

Note

Images are converted to the same coordinate system (RAS+).

Parameters:

conversion_name (Optional[str], default=None) –
The name of the tensor conversion. Must be alphanumerical.

The output tensors and the output .json file describing the conversion will have "conv-<conversion_name>" in their filenames.

If a conversion with this name already exists:
- if overwrite=True, the old tensors will be overwritten;
- else, to_tensors will try to append the new conversion to the pre-existing tensor conversion if they concern the same type of data (same modality, same transforms applied, etc.), otherwise an error will be raised.
If None, the conversion name will be "raw".

conversion_name cannot be None if save_transforms=True.
spatial_checks (Optional[Iterable[str | SpatialCheck]], default=("affine", "shape", "global_spacing")) –
Potential spatial checks to perform on the images while converting them:
- "spacing": checks intra-sample voxel spacing consistency, i.e. that all the images and masks in the Sample output by the current dataset have the same voxel spacing.
- "affine": checks intra-sample affine matrix consistency (so it includes "spacing").
- "shape": checks intra-sample spatial shape consistency.
- "global_spacing": checks inter-sample voxel spacing consistency, i.e. that all the Samples in the dataset have the same voxel spacing (so it includes "spacing").
- "global_shape": checks inter-sample spatial shape consistency (so it includes "shape").
If None, no spatial check is performed.
save_transforms (bool, default=False) – Whether to save raw images without transforms as tensors (False), or images with the applied transforms (True).
description (Optional[str], default=None) – A potential description of the tensor conversion that will be saved in the description .json file.
overwrite (bool, default=False) – Whether to overwrite a pre-existing tensor conversion that has the same conversion_name. If a conversion named conversion_name already exists and overwrite=False, to_tensors will try to append the current tensor conversion to the pre-existing one.
check_transforms (bool, default=True) –
check_transforms determines whether transforms will be checked when appending to a pre-existing conversion. If True, to_tensors will check that the current transforms match the transforms applied during the pre-existing conversion.

check_transforms=False is useful when you use custom transforms (i.e. transforms not in ClinicaDL), which cannot be checked.

Note

If save_transforms=False, no such check will be performed.

Warning

Use with care. You must be sure that the transforms match before setting check_transforms=False.
n_proc (int, default=1) – Number of cores to use to parallelize the conversion.

Returns:

TensorDataset – A TensorDataset containing the converted tensors.

Raises:

ValueError – If the user passed "raw" as a conversion_name.
ValueError – If conversion_name is None and save_transforms=True.
TensorConversionError – If a conversion named conversion_name already exists and the new conversion cannot be appended to the pre-existing one.
RuntimeError – If some checks in spatial_checks fail.
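
The constraints on conversion_name stated above can be summarized in a small sketch. check_conversion_name is a hypothetical helper, not ClinicaDL's implementation. Two points are assumptions: that "alphanumerical" corresponds to Python's str.isalnum(), and that a non-alphanumerical name raises a ValueError (the documentation does not say which error is raised in that case).

```python
from typing import Optional


def check_conversion_name(conversion_name: Optional[str], save_transforms: bool) -> str:
    """Return the effective conversion name, or raise if the documented rules are broken."""
    if conversion_name is None:
        if save_transforms:
            raise ValueError("conversion_name cannot be None if save_transforms=True")
        return "raw"  # documented default name
    if conversion_name == "raw":
        raise ValueError('"raw" is reserved and cannot be passed as conversion_name')
    if not conversion_name.isalnum():  # assumption: isalnum() matches "alphanumerical"
        raise ValueError("conversion_name must be alphanumerical")
    return conversion_name


print(check_conversion_name(None, save_transforms=False))          # raw
print(check_conversion_name("T1WithMasks", save_transforms=True))  # T1WithMasks
```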

Examples

bids
├── dataset_description.json
├── metadata.tsv
├── sub-001
│   ├── ses-M000
│   │   └── anat
│   │       ├── sub-001_ses-M000_T1w.nii.gz
│   │       └── sub-001_ses-M000_label-head_mask.nii.gz
│   ...
...
└── derivatives
    └── registration
        ├── space-MNI152NLin2009cSym_mask.nii.gz
        ...

from clinicadl.data.datasets import BidsDataset
from clinicadl.io.bids import BidsFileType
from clinicadl.transforms import TransformsHandler, extraction
from clinicadl.transforms.config import ResampleConfig

dataset = BidsDataset(
    bids="bids",
    file_type=BidsFileType(
        data_type="anat",
        suffix="T1w",
    ),
    transforms=TransformsHandler(
        image_transforms=[
            ResampleConfig(target="mni"),
        ],
    ),
    masks={
        "head": BidsFileType(
            data_type="anat", suffix="mask", with_entities={"label": "head"}
        ),
        "mni": "bids/derivatives/registration/space-MNI152NLin2009cSym_mask.nii.gz",
    },
)

tensor_dataset = dataset.to_tensors(
    conversion_name="T1WithMasks",
    save_transforms=True,
)

Data are now as follows:

bids
├── dataset_description.json
├── metadata.tsv
├── sub-001
│   ├── ses-M000
│   │   └── anat
│   │       ├── sub-001_ses-M000_T1w.nii.gz
│   │       └── sub-001_ses-M000_label-head_mask.nii.gz
│   ...
...
└── derivatives
    ├── registration
    │   ├── space-MNI152NLin2009cSym_mask.nii.gz
    │   ...
    └── tensors
        ├── dataset_description.json
        ├── conversions.tsv                                                        <- contains the list of all the conversions
        ├── src-T1w_conv-T1WithMasks_description.json                              <- contains a description of the conversion
        ├── src-T1w_conv-T1WithMasks_participantsXsessions.tsv                     <- contains the list of the (participant, session) pairs converted
        ├── sub-001
        │   ├── ses-M000
        │   │   └── anat
        │   │       ├── sub-001_ses-M000_src-T1w_conv-T1WithMasks_tensors.json     <- contains path to the source files
        │   │       └── sub-001_ses-M000_src-T1w_conv-T1WithMasks_tensors.pt       <- contains the tensors (the transformed image and masks)
        │   ...
        ...

>>> tensor_dataset[0]
Sample(Keys: ('head', 'mni', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 3)
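
Judging from the tree above, the tensor files appear to follow the pattern <participant>_<session>_src-<suffix>_conv-<conversion_name>_tensors.pt. The sketch below rebuilds that relative path with pathlib. tensor_path is a hypothetical helper inferred from this single example, not a ClinicaDL function, and the location of the tensors folder itself depends on the type of BIDS.

```python
from pathlib import PurePosixPath


def tensor_path(
    participant: str,
    session: str,
    source_suffix: str,
    conversion_name: str,
    data_type: str = "anat",
) -> PurePosixPath:
    """Path of a .pt file relative to the tensors folder (pattern inferred from the example)."""
    stem = f"{participant}_{session}_src-{source_suffix}_conv-{conversion_name}_tensors"
    return PurePosixPath(participant) / session / data_type / f"{stem}.pt"


print(tensor_path("sub-001", "ses-M000", "T1w", "T1WithMasks"))
# sub-001/ses-M000/anat/sub-001_ses-M000_src-T1w_conv-T1WithMasks_tensors.pt
```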

__getitem__(idx: int) → Sample

Retrieves the sample at a given index.

Parameters: idx (int) – Index of the sample in the dataset.
Returns: Sample – A Sample containing the processed data and metadata.

__len__() → int

Computes the total number of samples in the dataset.

Returns: int – Total number of samples in the dataset, i.e. the number of images times the number of samples per image.

describe() → dict[str, Any]

Returns a description of the dataset.

Returns: dict[str, Any] – A dictionary describing the dataset.

property df: The DataFrame passed in data, with its columns processed by the functions passed in columns.

eval() → None

Sets the dataset to evaluation mode.

For example, this disables data augmentation in the transformation pipeline.

get_participant_session_couples() → set[tuple[str, str]]

Retrieves all (participant, session) pairs in the dataset.

Returns: set[tuple[str, str]] – The set of (participant, session) pairs.
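
Since the return value is a plain set of tuples, it can be filtered with ordinary Python, for instance to build the argument of subset(). The sketch below uses a hand-written set standing in for the return value; the call to subset() is left as a comment because it needs a real dataset.

```python
# stand-in for dataset.get_participant_session_couples()
couples = {
    ("sub-001", "ses-M000"),
    ("sub-001", "ses-M024"),
    ("sub-002", "ses-M000"),
}

# keep only the baseline sessions; sorted() gives a reproducible order
baseline = sorted(pair for pair in couples if pair[1] == "ses-M000")
print(baseline)  # [('sub-001', 'ses-M000'), ('sub-002', 'ses-M000')]

# with a real dataset, the list could then be passed on:
# baseline_dataset = dataset.subset(baseline)
```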

get_sample_info(idx: int, column: str) → Any

Retrieves information on a given sample in the metadata DataFrame. The information corresponds to the information on the image the sample was extracted from.

Parameters:

idx (int) – The index of the sample in the dataset.
column (str) – The information to look for, i.e. a column of df.

Returns:

Any – The value of the column for this sample.

sanity_check(spatial_checks: Iterable[str | SpatialCheck] | None = ('affine', 'shape', 'global_spacing')) → None

Performs a sanity check on the current dataset.

It will iterate over the whole dataset to check if images are loaded and transformed correctly, and potentially perform spatial checks on the loaded images.

Parameters:

spatial_checks (Optional[Iterable[str | SpatialCheck]], default=("affine", "shape", "global_spacing")) –

Spatial checks to perform on the images:

- "spacing": checks intra-sample voxel spacing consistency, i.e. that all the images and masks in the Sample output by the current dataset have the same voxel spacing.
- "affine": checks intra-sample affine matrix consistency (so it includes "spacing").
- "shape": checks intra-sample spatial shape consistency.
- "global_spacing": checks inter-sample voxel spacing consistency, i.e. that all the Samples in the dataset have the same voxel spacing (so it includes "spacing").
- "global_shape": checks inter-sample spatial shape consistency (so it includes "shape").

If None, no spatial check is performed.
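
The "so it includes" relations between the checks can be made explicit with a small sketch. expand_checks is a hypothetical helper that only handles the string names listed above (not SpatialCheck objects), and it is not how ClinicaDL implements the checks.

```python
from typing import Iterable, Optional

# relations stated in the list above: each check implies the ones in its set
IMPLIED = {
    "affine": {"spacing"},
    "global_spacing": {"spacing"},
    "global_shape": {"shape"},
}


def expand_checks(spatial_checks: Optional[Iterable[str]]) -> set[str]:
    """Return the requested checks together with the checks they include."""
    if spatial_checks is None:
        return set()  # no spatial check is performed
    requested = set(spatial_checks)
    expanded = set(requested)
    for check in requested:
        expanded |= IMPLIED.get(check, set())
    return expanded


default = ("affine", "shape", "global_spacing")
print(sorted(expand_checks(default)))
# ['affine', 'global_spacing', 'shape', 'spacing']
```

Under these relations, the default value does not cover "global_shape", so images of different shapes across samples would not be flagged unless that check is requested explicitly.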

sort() → None

Sorts the dataset by (participant, session) pairs (alphabetic order).

Examples

>>> dataset[0].participant_id
'sub-001'
>>> dataset[1].participant_id
'sub-000'
>>> dataset.sort()
>>> dataset[0].participant_id
'sub-000'
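
As a sketch of what alphabetic ordering of (participant, session) pairs means, the snippet below sorts tuples of strings with Python's built-in sorted. That sort() compares the participant first and the session second is an assumption here; the documentation only states that the order is alphabetic.

```python
pairs = [
    ("sub-001", "ses-M024"),
    ("sub-001", "ses-M000"),
    ("sub-000", "ses-M000"),
]

# tuples of strings are compared element by element, in lexicographic order
sorted_pairs = sorted(pairs)
print(sorted_pairs)
# [('sub-000', 'ses-M000'), ('sub-001', 'ses-M000'), ('sub-001', 'ses-M024')]

# lexicographic order is not numeric order: without zero-padding, "sub-10" precedes "sub-2"
print("sub-10" < "sub-2")  # True
```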

subset(participants_sessions: Path | str | DataFrame | Iterable[tuple[str, str]]) → Self

Returns a subset of the dataset, restricted to a list of (participant, session) pairs.

Parameters:

participants_sessions (Union[DataFrameType, Sequence[tuple[str, str]]]) –

Can be either:

- a sequence of (participant, session) pairs;
- a pandas.DataFrame (or a path to a TSV file containing the DataFrame) with the list of (participant, session) pairs to extract. This list must be passed via two columns named "participant_id" and "session_id" (other columns won’t be considered).

Returns:

Self – A subset of the original dataset, restricted to the (participant, session) pairs mentioned in participants_sessions.

train() → None

Sets the dataset to training mode.

For example, this enables data augmentation in the transformation pipeline.