clinicadl.data.datasets.BidsDataset
- class clinicadl.data.datasets.BidsDataset(bids: Path | str | Bids, file_type: BidsFileType, data: Path | str | DataFrame | None = None, transforms: TransformsHandler = TransformsHandler(), columns: Sequence[str] | dict[str, Callable[[Series], Series] | None] | None = None, masks: dict[str, Path | str | BidsFileType | tuple[Path | str | Bids, BidsFileType]] | None = None)
A Dataset working with neuroimaging data organized in BIDS (or derivative) format.
The user specifies the path to the BIDS directory via bids, the type of data to load via file_type, and the (participant, session) pairs to work on via data. BidsDataset loads the image and any masks (see the masks argument), and puts them in a Sample. The user can add further data to this Sample via the columns argument.
Transformations (e.g., preprocessing or data augmentation) can be applied to the loaded data (see the transforms argument).
With BidsDataset, it is possible to work on whole images, or on patches or slices extracted from the images. This is also specified via the transforms argument (e.g., transforms=TransformsHandler(extraction=Slice())).
Note
The size of the BidsDataset depends on the type of data you are working on. For example, if you have 10 images with 100 slices each, and you want to work on slices, the length of your dataset will be \(10\times100=1,000\).
To avoid confusion, we will use the term “sample” to refer to the actual element of the images we are working on (patch, slice or the whole image).
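The length rule stated in the note can be sketched in plain Python. The helper dataset_length below is hypothetical and not part of ClinicaDL; it only illustrates the arithmetic, assuming each image contributes a known number of samples.

```python
def dataset_length(samples_per_image: list[int]) -> int:
    """Total number of samples: the sum of the samples extracted from each image."""
    return sum(samples_per_image)


# 10 images with 100 slices each, working on slices
n_slices = dataset_length([100] * 10)
# the same 10 images, working on whole images (one sample per image)
n_images = dataset_length([1] * 10)
print(n_slices, n_images)
```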
Finally, you may be interested in to_tensors(), which converts your NIfTI images to tensors (saved in .pt files). Since opening a .pt file is much faster than opening a NIfTI file, this may speed up data loading.
- Parameters:
bids (PathType | Bids) – The BIDS (or derivative) directory where the data will be loaded from. Can be passed as a path or directly as Bids.
file_type (BidsFileType) – Defines the files to load in the BIDS directory. The BidsFileType must contain the requirements necessary to select only the relevant files.
data (Optional[DataFrameType], default=None) –
A pandas.DataFrame (or a path to a TSV file containing the DataFrame) with the list of (participant, session) pairs to consider, as well as any other relevant information (e.g. the age of the participants). Only (participant, session) pairs listed in this DataFrame will be in the BidsDataset.
If None, all (participant, session) pairs in bids that have the right file_type will be considered.
Warning
Be careful if you pass a DataFrame with a column named "n_samples". BidsDataset will interpret it as the number of samples for each (participant, session) pair.
transforms (TransformsHandler, default=TransformsHandler()) – Transformation pipeline to apply to the data after loading. The user also specifies here whether to work on images, patches, or slices. See clinicadl.transforms.TransformsHandler.
columns (Optional[ColumnsType], default=None) –
Columns to get from the DataFrame data and to put in the output Sample.
Can be passed via:
  - a list of strings (e.g. ["age", "sex"]), corresponding to the names of the columns;
  - or a dictionary (e.g. {"age": <function>, "sex": None}), where the keys are the names of the columns, and the values are functions to apply to the columns. If the function is None, no function will be applied to the column.
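As an illustration of the dictionary form described above, the sketch below applies per-column functions to a small DataFrame using pandas only. The helper process_columns is hypothetical: it mimics the documented behaviour (a function maps a whole pandas.Series to a pandas.Series, and None leaves the column untouched) and is not ClinicaDL's implementation.

```python
from typing import Callable, Optional

import pandas as pd

ColumnFn = Optional[Callable[[pd.Series], pd.Series]]


def diagnosis_to_number(column: pd.Series) -> pd.Series:
    """Encode string labels as integers for classification."""
    return column.map({"control": 0, "patient": 1})


def process_columns(df: pd.DataFrame, columns: dict[str, ColumnFn]) -> pd.DataFrame:
    """Hypothetical helper: keep the requested columns, applying each function to its whole column."""
    out = pd.DataFrame(index=df.index)
    for name, fn in columns.items():
        out[name] = df[name] if fn is None else fn(df[name])
    return out


df = pd.DataFrame(
    {
        "participant_id": ["sub-001", "sub-002", "sub-003"],
        "session_id": ["ses-M000", "ses-M024", "ses-M000"],
        "age": [55.0, 64.0, 67.0],
        "diagnosis": ["control", "patient", "patient"],
    }
)
processed = process_columns(df, {"age": None, "diagnosis": diagnosis_to_number})
print(processed)
```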
Note
Any function given for a column is applied to the whole column. It must take a pandas.Series as input and return a pandas.Series. For example, this is useful to convert string labels to integer labels for classification.
masks (Optional[MasksType], default=None) –
Masks to be loaded along with the images.
The masks are passed via a dictionary, whose keys will be the names given to the masks in the output Sample, and whose values can be:
  - a path (str or pathlib.Path) to a NIfTI image: the same mask is used for all the (participant, session) pairs.
  - a BidsFileType: the mask is participant- and session-specific, and the pattern to find the mask in bids is given via the BidsFileType.
  - a tuple (PathType | Bids, BidsFileType): the mask is participant- and session-specific but is not in the same BIDS dataset as the image. In this case, the BIDS directory in which to look for the mask must be passed as the first element of the tuple.
- Raises:
DataFrameError – If the DataFrame in data is empty.
DataFrameError – If the DataFrame in data does not contain the columns "participant_id" and "session_id".
DataFrameError – If the DataFrame in data contains duplicated (participant_id, session_id) pairs.
RuntimeError – If, for some (participant, session) pairs, an image corresponding to file_type cannot be found in bids.
ValueError – If a key is used in both columns and masks.
ValueError – If a key in columns or masks is equal to the name of one of the attributes of Sample.
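The three DataFrameError conditions listed above can be reproduced beforehand with plain pandas checks. The function below is a sketch, not ClinicaDL's code: check_participants_sessions is a hypothetical name, and it raises ValueError rather than DataFrameError because the import path of DataFrameError is not given here.

```python
import pandas as pd

REQUIRED = ["participant_id", "session_id"]


def check_participants_sessions(df: pd.DataFrame) -> None:
    """Hypothetical pre-check mirroring the documented conditions on `data`."""
    if df.empty:
        raise ValueError("the DataFrame is empty")
    missing = [col for col in REQUIRED if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    duplicated = df.duplicated(subset=REQUIRED)
    if duplicated.any():
        pairs = df.loc[duplicated, REQUIRED].values.tolist()
        raise ValueError(f"duplicated (participant_id, session_id) pairs: {pairs}")


good = pd.DataFrame(
    {"participant_id": ["sub-001", "sub-001"], "session_id": ["ses-M000", "ses-M024"]}
)
check_participants_sessions(good)  # passes silently

# append a copy of the first row to create a duplicated pair
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)
try:
    check_participants_sessions(bad)
except ValueError as err:
    message = str(err)
```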
Examples
    bids
    ├── dataset_description.json
    ├── metadata.tsv
    ├── sub-001
    │   ├── ses-M000
    │   │   └── anat
    │   │       ├── sub-001_ses-M000_T1w.nii.gz
    │   │       └── sub-001_ses-M000_label-head_mask.nii.gz
    │   ...
    ...
    └── derivatives
        ├── registration
        │   ├── space-MNI152NLin2009cSym_mask.nii.gz
        │   ...
        └── masks
            ├── dataset_description.json
            ├── sub-001
            │   ├── ses-M000
            │   │   └── anat
            │   │       └── sub-001_ses-M000_label-brain_mask.nii.gz
            │   ...
            ...

The "metadata.tsv" file looks like:

    participant_id  session_id  age   sex  diagnosis
    sub-001         ses-M000    55.0  M    control
    sub-001         ses-M024    57.0  M    control
    sub-002         ses-M000    62.0  F    control
    sub-002         ses-M024    64.0  F    patient
    sub-003         ses-M000    67.0  F    patient
    ...

    from clinicadl.data.datasets import BidsDataset
    from clinicadl.io.bids import BidsFileType
    from clinicadl.transforms import TransformsHandler, extraction
    from clinicadl.transforms.config import (
        ZNormalizationConfig,
        ResampleConfig,
        RandomFlipConfig,
    )
    import pandas as pd

    # to convert diagnosis to numeric values
    def diagnosis_to_number(column: pd.Series) -> pd.Series:
        encoding = {"control": 0, "patient": 1}
        return column.apply(lambda x: encoding[x])
    >>> dataset = BidsDataset(
    ...     bids="bids",
    ...     file_type=BidsFileType(
    ...         data_type="anat",
    ...         suffix="T1w",
    ...     ),
    ...     data="bids/metadata.tsv",
    ...     columns=["age"],
    ... )
    >>> dataset[0]
    Sample(Keys: ('age', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 1)
    >>> dataset[0].spatial_shape
    (169, 208, 179)  # full image
    >>> len(dataset)
    50  # 50 lines in the metadata.tsv
    >>> dataset[0].participant_id, dataset[0].session_id, dataset[0].age
    ('sub-001', 'ses-M000', 55.0)
    >>> dataset = BidsDataset(
    ...     bids="bids",
    ...     file_type=BidsFileType(
    ...         data_type="anat",
    ...         suffix="T1w",
    ...     ),
    ...     data="bids/metadata.tsv",
    ...     columns={"age": None, "diagnosis": diagnosis_to_number},
    ...     transforms=TransformsHandler(
    ...         extraction=extraction.Patch(patch_size=64),
    ...     ),
    ... )
    >>> dataset[0]["diagnosis"]
    0  # diagnosis is now encoded
    >>> dataset[0].spatial_shape
    (64, 64, 64)  # patches
    >>> len(dataset)
    1800  # 36 patches per image
    >>> dataset = BidsDataset(
    ...     bids="bids",
    ...     file_type=BidsFileType(
    ...         data_type="anat",
    ...         suffix="T1w",
    ...     ),
    ...     transforms=TransformsHandler(
    ...         image_transforms=[
    ...             ResampleConfig(target="mni"),
    ...             # masks can be used in transforms
    ...             ZNormalizationConfig(masking_method="head"),
    ...         ],
    ...         augmentations=[RandomFlipConfig(flip_probability=0.5)],
    ...     ),
    ...     masks={
    ...         "head": BidsFileType(
    ...             data_type="anat", suffix="mask", with_entities={"label": "head"}
    ...         ),  # participant- and session-specific mask that is in the same BIDS
    ...         "brain": (
    ...             "bids/derivatives/masks",
    ...             BidsFileType(
    ...                 data_type="anat", suffix="mask", with_entities={"label": "brain"}
    ...             ),  # participant- and session-specific mask that is in another BIDS
    ...         ),
    ...         "mni": "bids/derivatives/registration/space-MNI152NLin2009cSym_mask.nii.gz",  # same mask for all (participant, session)
    ...     },
    ... )
    >>> dataset[0]
    Sample(Keys: ('head', 'brain', 'mni', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 4)
    >>> len(dataset)
    60  # all the (participant, session) pairs that have T1w images, not only the ones in metadata.tsv
See also
TensorDataset, ConcatDataset, PairedDataset, UnpairedDataset
- to_tensors(conversion_name: str | None = None, spatial_checks: Iterable[str | SpatialCheck] | None = ('affine', 'shape', 'global_spacing'), save_transforms: bool = False, description: str | None = None, overwrite: bool = False, check_transforms: bool = True, n_proc: int = 1) -> TensorDataset
Converts NIfTI files in the current BidsDataset to tensors (in PyTorch’s .pt format).
Conversion to tensors may significantly speed up data loading.
The tensors are saved in a BIDS derivative named tensors. The location of this folder relative to the original BIDS depends on the type of BIDS (see clinicadl.io.bids.Bids). A .json file describing the conversion is also saved at the root of tensors, as well as other metadata files (see examples).
Masks will be converted and saved in the same file as the image they are associated with.
The user can choose to save transformed images, i.e. images on which image transforms have already been applied (see image_transforms in clinicadl.transforms.TransformsHandler). This speeds up data loading during training or inference, as the images will not have to be transformed each time they are loaded.
Note
Images are converted to the same coordinate system (RAS+).
- Parameters:
conversion_name (Optional[str], default=None) –
The name of the tensor conversion. Must be alphanumerical.
The output tensors and the output
.jsonfile describing the conversion will have"conv-<conversion_name>"in their filenames.If a conversion with this name already exists:
if
overwrite=True, the old tensors will be overwritten;else,
to_tensorswill try to append the new conversion to the pre-existing tensor conversion if they concern the same type of data (same modality, same transforms applied, etc.), otherwise an error will be raised.
If
None, the conversion name will be"raw".conversion_namecannot beNoneifsave_transforms=True.spatial_checks (Optional[Iterable[str | SpatialCheck]], default=("affine", "shape", "global_spacing")) –
Spatial checks that can be performed on the images while converting them:
  - "spacing": checks intra-sample voxel spacing consistency, i.e. that all the images and masks in the Sample output by the current dataset have the same voxel spacing.
  - "affine": checks intra-sample affine matrix consistency (so it includes "spacing").
  - "shape": checks intra-sample spatial shape consistency.
  - "global_spacing": checks inter-sample voxel spacing consistency, i.e. that all the Samples in the dataset have the same voxel spacing (so it includes "spacing").
  - "global_shape": checks inter-sample spatial shape consistency (so it includes "shape").
If None, no spatial check is performed.
save_transforms (bool, default=False) – Whether to save raw images without transforms as tensors (False), or images with the applied transforms (True).
description (Optional[str], default=None) – An optional description of the tensor conversion that will be saved in the description .json file.
overwrite (bool, default=False) – Whether to overwrite a pre-existing tensor conversion that has the same conversion_name. If a conversion named conversion_name already exists and overwrite=False, to_tensors will try to append the current tensor conversion to the pre-existing one.
check_transforms (bool, default=True) –
Determines whether transforms will be checked when appending to a pre-existing conversion. If True, to_tensors will check that the current transforms match the transforms applied during the pre-existing conversion. check_transforms=False is useful when you use custom transforms (i.e. transforms not in ClinicaDL), which cannot be checked.
Note
If save_transforms=False, no such check will be performed.
Warning
Use with care. You must be sure that the transforms match before setting check_transforms=False.
n_proc (int, default=1) – Number of cores to use to parallelize the conversion.
- Returns:
TensorDataset – A TensorDataset containing the converted tensors.
- Raises:
ValueError – If the user passed "raw" as conversion_name.
ValueError – If conversion_name is None and save_transforms=True.
TensorConversionError – If a conversion named conversion_name already exists and the new conversion cannot be appended to the pre-existing one.
RuntimeError – If some checks in spatial_checks fail.
Examples
    bids
    ├── dataset_description.json
    ├── metadata.tsv
    ├── sub-001
    │   ├── ses-M000
    │   │   └── anat
    │   │       ├── sub-001_ses-M000_T1w.nii.gz
    │   │       └── sub-001_ses-M000_label-head_mask.nii.gz
    │   ...
    ...
    └── derivatives
        └── registration
            ├── space-MNI152NLin2009cSym_mask.nii.gz
            ...

    from clinicadl.data.datasets import BidsDataset
    from clinicadl.io.bids import BidsFileType
    from clinicadl.transforms import TransformsHandler, extraction
    from clinicadl.transforms.config import ResampleConfig

    dataset = BidsDataset(
        bids="bids",
        file_type=BidsFileType(
            data_type="anat",
            suffix="T1w",
        ),
        transforms=TransformsHandler(
            image_transforms=[
                ResampleConfig(target="mni"),
            ],
        ),
        masks={
            "head": BidsFileType(
                data_type="anat", suffix="mask", with_entities={"label": "head"}
            ),
            "mni": "bids/derivatives/registration/space-MNI152NLin2009cSym_mask.nii.gz",
        },
    )

    tensor_dataset = dataset.to_tensors(
        conversion_name="T1WithMasks",
        save_transforms=True,
    )
Data are now as follows:
    bids
    ├── dataset_description.json
    ├── metadata.tsv
    ├── sub-001
    │   ├── ses-M000
    │   │   └── anat
    │   │       ├── sub-001_ses-M000_T1w.nii.gz
    │   │       └── sub-001_ses-M000_label-head_mask.nii.gz
    │   ...
    ...
    └── derivatives
        ├── registration
        │   ├── space-MNI152NLin2009cSym_mask.nii.gz
        │   ...
        └── tensors
            ├── dataset_description.json
            ├── conversions.tsv    <- contains the list of all the conversions
            ├── src-T1w_conv-T1WithMasks_description.json    <- contains a description of the conversion
            ├── src-T1w_conv-T1WithMasks_participantsXsessions.tsv    <- contains the list of the (participant, session) pairs converted
            ├── sub-001
            │   ├── ses-M000
            │   │   └── anat
            │   │       ├── sub-001_ses-M000_src-T1w_conv-T1WithMasks_tensors.json    <- contains path to the source files
            │   │       └── sub-001_ses-M000_src-T1w_conv-T1WithMasks_tensors.pt    <- contains the tensors (the transformed image and masks)
            │   ...
            ...

    >>> tensor_dataset[0]
    Sample(Keys: ('head', 'mni', 'file_type', 'image_path', 'sample_type', 'sample_position', 'image', 'participant_id', 'session_id'); images: 3)
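The file names in the tree above follow a visible pattern, which the sketch below rebuilds together with the documented constraints on conversion_name. The helper tensor_filename is hypothetical and not part of ClinicaDL. That files of a default conversion carry "conv-raw" is an assumption inferred from the parameter description, and str.isalnum() is only an approximation of "alphanumerical".

```python
from typing import Optional


def tensor_filename(
    participant: str,
    session: str,
    suffix: str,
    conversion_name: Optional[str] = None,
) -> str:
    """Hypothetical helper rebuilding the .pt file name pattern shown in the tree."""
    if conversion_name == "raw":
        # "raw" is the name used when conversion_name is None, so it cannot be passed explicitly
        raise ValueError('"raw" cannot be used as conversion_name')
    if conversion_name is None:
        conversion_name = "raw"  # assumption: the default name appears as conv-raw
    elif not conversion_name.isalnum():
        raise ValueError("conversion_name must be alphanumerical")
    return f"{participant}_{session}_src-{suffix}_conv-{conversion_name}_tensors.pt"


name = tensor_filename("sub-001", "ses-M000", "T1w", "T1WithMasks")
print(name)
```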
- __len__() -> int
Computes the total number of samples in the dataset.
- Returns:
int – Total number of samples in the dataset, i.e. the number of images times the number of samples per image.
- describe() -> dict[str, Any]
Returns a description of the dataset.
- Returns:
dict[str, Any] – A dictionary describing the dataset.
- property df
The DataFrame passed in data, with its columns processed by the functions passed in columns.
- eval() -> None
Sets the dataset to evaluation mode.
For example, this disables data augmentation in the transformation pipeline.
- get_participant_session_couples() -> set[tuple[str, str]]
Retrieves all (participant, session) pairs in the dataset.
- Returns:
set[tuple[str, str]] – The set of (participant, session) pairs.
- get_sample_info(idx: int, column: str) -> Any
Retrieves information on a given sample in the metadata DataFrame. The information corresponds to the information on the image the sample was extracted from.
- sanity_check(spatial_checks: Iterable[str | SpatialCheck] | None = ('affine', 'shape', 'global_spacing')) -> None
Performs a sanity check on the current dataset.
It will iterate over the whole dataset to check if images are loaded and transformed correctly, and potentially perform spatial checks on the loaded images.
- Parameters:
spatial_checks (Optional[Iterable[str | SpatialCheck]], default=("affine", "shape", "global_spacing")) –
Spatial checks to perform on the images:
  - "spacing": checks intra-sample voxel spacing consistency, i.e. that all the images and masks in the Sample output by the current dataset have the same voxel spacing.
  - "affine": checks intra-sample affine matrix consistency (so it includes "spacing").
  - "shape": checks intra-sample spatial shape consistency.
  - "global_spacing": checks inter-sample voxel spacing consistency, i.e. that all the Samples in the dataset have the same voxel spacing (so it includes "spacing").
  - "global_shape": checks inter-sample spatial shape consistency (so it includes "shape").
If None, no spatial check is performed.
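To make the intra-sample versus inter-sample distinction concrete, here is a small sketch in plain Python. It is not ClinicaDL's implementation: a sample is modelled as a dictionary mapping a name to a (shape, spacing) tuple, and all helper functions are hypothetical.

```python
import math

# name -> (spatial shape, voxel spacing in mm); a toy stand-in for a Sample
ToySample = dict[str, tuple[tuple[int, ...], tuple[float, ...]]]


def same_spacing(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """Compare two voxel spacings with a small absolute tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=1e-6) for x, y in zip(a, b)
    )


def check_shape(sample: ToySample) -> bool:
    """Intra-sample check: the image and masks of one sample share the same shape."""
    shapes = {shape for shape, _ in sample.values()}
    return len(shapes) == 1


def check_global_spacing(samples: list[ToySample]) -> bool:
    """Inter-sample check: every image and mask in every sample has the same spacing."""
    spacings = [spacing for sample in samples for _, spacing in sample.values()]
    return all(same_spacing(spacings[0], s) for s in spacings)


sample_a = {
    "image": ((169, 208, 179), (1.0, 1.0, 1.0)),
    "head": ((169, 208, 179), (1.0, 1.0, 1.0)),
}
sample_b = {
    "image": ((169, 208, 179), (1.5, 1.5, 1.5)),
    "head": ((169, 208, 179), (1.5, 1.5, 1.5)),
}

# each sample is internally consistent, but the two samples differ in spacing
shape_ok = check_shape(sample_a) and check_shape(sample_b)
global_spacing_ok = check_global_spacing([sample_a, sample_b])
print(shape_ok, global_spacing_ok)
```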
- sort() -> None
Sorts the dataset by (participant, session) pairs (alphabetical order).
Examples
    >>> dataset[0].participant_id
    'sub-001'
    >>> dataset[1].participant_id
    'sub-000'
    >>> dataset.sort()
    >>> dataset[0].participant_id
    'sub-000'
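Alphabetical ordering of (participant, session) pairs matches Python's built-in tuple comparison, as the sketch below shows on a plain list. How sort() is implemented internally is not stated here; this only illustrates the resulting order.

```python
pairs = [
    ("sub-001", "ses-M024"),
    ("sub-001", "ses-M000"),
    ("sub-000", "ses-M000"),
]
# tuples compare element by element: participant first, then session
sorted_pairs = sorted(pairs)
print(sorted_pairs[0])
```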
- subset(participants_sessions: Path | str | DataFrame | Iterable[tuple[str, str]]) -> Self
Returns a subset of the dataset from a list of (participant, session) pairs.
- Parameters:
participants_sessions (Union[DataFrameType, Sequence[tuple[str, str]]]) –
Can be either:
  - a sequence of (participant, session) pairs;
  - a pandas.DataFrame (or a path to a TSV file containing the DataFrame) with the list of (participant, session) pairs to extract. This list must be passed via two columns named "participant_id" and "session_id" (other columns won’t be considered).
- Returns:
Self – A subset of the original dataset, restricted to the (participant, session) pairs mentioned in participants_sessions.
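The selection performed on the metadata can be pictured with pandas alone. The sketch below is an assumption about the effect of subsetting, not the method's actual code: it keeps only the rows whose (participant_id, session_id) pair appears in the requested list.

```python
import pandas as pd

metadata = pd.DataFrame(
    {
        "participant_id": ["sub-001", "sub-001", "sub-002"],
        "session_id": ["ses-M000", "ses-M024", "ses-M000"],
        "age": [55.0, 57.0, 62.0],
    }
)

# pairs to keep, as an iterable of (participant, session) tuples
wanted = [("sub-001", "ses-M024"), ("sub-002", "ses-M000")]

# an inner merge on the two identifying columns keeps only the wanted rows
wanted_df = pd.DataFrame(wanted, columns=["participant_id", "session_id"])
subset_df = metadata.merge(wanted_df, on=["participant_id", "session_id"], how="inner")
print(subset_df)
```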