clinicadl.data.datasets.UnpairedDataset

class clinicadl.data.datasets.UnpairedDataset(datasets: Iterable[Dataset], oversample: bool = False)

Stacks multiple Dataset objects (e.g. different modalities from different datasets). Here, “stacking” means randomly associating images across datasets.

UnpairedDataset thus differs from PairedDataset, which associates images across datasets via a unique mapping. Consequently, unlike with PairedDataset, the datasets forming an UnpairedDataset do not need to contain the same (participant, session) pairs.

The randomness of the mapping between datasets can be controlled via set_epoch(). This makes it possible to have different associations for each epoch.

The size of an UnpairedDataset depends on oversample. If oversample=True, it is the size of the biggest underlying dataset: samples of the smaller datasets are randomly replicated until they reach that size. If oversample=False, it is the size of the smallest underlying dataset: samples of the bigger datasets are randomly dropped until they reach that size. This randomness is also controlled via set_epoch().
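The sizing rule above can be illustrated with a minimal sketch in plain Python. This is not ClinicaDL's implementation: the helper build_mapping, its seed argument, and the way the epoch is folded into the seed are assumptions made purely for illustration.

```python
import random


def build_mapping(sizes, oversample=False, epoch=0, seed=0):
    """Hypothetical helper: one list of indices per dataset, all of equal length."""
    target = max(sizes) if oversample else min(sizes)
    # Assumption: the epoch is mixed into the seed so that each epoch reshuffles.
    rng = random.Random(seed * 100_003 + epoch)
    columns = []
    for size in sizes:
        indices = list(range(size))
        rng.shuffle(indices)
        if size >= target:
            # Bigger (or equal) dataset: keep a random subset of its samples.
            column = indices[:target]
        else:
            # Smaller dataset: keep every sample, then replicate random ones.
            column = indices + [rng.randrange(size) for _ in range(target - size)]
            rng.shuffle(column)
        columns.append(column)
    return columns


over = build_mapping([4, 2], oversample=True, epoch=0)
under = build_mapping([4, 2], oversample=False, epoch=0)
print(len(over[0]), len(over[1]))    # 4 4
print(len(under[0]), len(under[1]))  # 2 2
```

In this sketch, calling build_mapping twice with the same epoch gives the same association, while a different epoch will generally give a different one.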

Indexing an UnpairedDataset returns a tuple of Sample objects (one for each underlying dataset).

Parameters:
  • datasets (Iterable[Dataset]) – The Datasets to stack.

  • oversample (bool, default=False) –

    Strategy to adopt when the datasets have different sizes:

    • oversample=True: randomly replicate samples in smaller datasets so that they reach the size of the biggest dataset.

    • oversample=False: randomly drop samples in bigger datasets so that all datasets reach the size of the smallest dataset.

Examples

bids_t1
├── sub-001
│   └── ses-M000
│   │   └── anat
│   │       └── sub-001_ses-M000_T1w.nii.gz
    ...
...

bids_pet
├── sub-A
│   └── ses-M000
│   │   └── pet
│   │       └── sub-A_ses-M000_trc-18FAV45_pet.nii.gz
    ...
...

from clinicadl.data.datasets import BidsDataset, UnpairedDataset
from clinicadl.io.bids import BidsFileType

bids_t1 = BidsDataset("bids_t1", file_type=BidsFileType(data_type="anat", suffix="T1w"))
bids_pet = BidsDataset("bids_pet", file_type=BidsFileType(data_type="pet", suffix="pet"))

stacked = UnpairedDataset([bids_t1, bids_pet], oversample=True)

>>> len(bids_t1)
4
>>> len(bids_pet)
2
>>> len(stacked)
4   # length of the biggest dataset

We can access the random mapping made between the datasets via .mapping:

>>> stacked.mapping
dataset_id      0       1
       idx
        0       2       0
        1       3       0
        2       1       0
        3       0       1

idx is the index of the sample in the UnpairedDataset. In column 0, you have the associated sample in the first dataset (bids_t1), and in column 1, the associated sample in the second dataset (bids_pet).

>>> bids_t1[2].participant_id, bids_t1[2].session_id
('sub-002', 'ses-M000')

>>> bids_pet[0].participant_id, bids_pet[0].session_id
('sub-A', 'ses-M000')

>>> sample = stacked[0]
>>> len(sample)
2
>>> sample[0].participant_id, sample[0].session_id
('sub-002', 'ses-M000')
>>> sample[1].participant_id, sample[1].session_id
('sub-A', 'ses-M000')

Now we can change the random mapping with set_epoch():

>>> stacked.set_epoch(7)
>>> stacked.mapping
dataset_id      0       1
       idx
        0       2       1
        1       1       1
        2       0       0
        3       3       0

>>> sample = stacked[0]
>>> sample[1].participant_id, sample[1].session_id
('sub-B', 'ses-M000')

Finally, if oversample=False:

>>> stacked = UnpairedDataset([bids_t1, bids_pet], oversample=False)
>>> len(stacked)
2   # = length of the smallest dataset
>>> stacked.mapping
dataset_id      0       1
       idx
        0       2       0
        1       3       1

property df

The result of merging the metadata DataFrames of the underlying datasets.

property mapping: DataFrame

The random mapping between the samples of the underlying datasets.

get_sample_info(idx: int, column: str) → tuple[Any, ...]

Retrieves information on a given sample.

In an UnpairedDataset, a sample is a tuple of “sub-samples” from the underlying datasets. Therefore, get_sample_info will also return a tuple, containing the information on all the sub-samples forming the sample.

If the information cannot be found for a sub-sample (because the underlying datasets do not necessarily all contain the same information), get_sample_info will return None for this sub-sample.

See Dataset.get_sample_info for more details.

Parameters:
  • idx (int) – The index of the sample in the UnpairedDataset.

  • column (str) – The information to look for, i.e. a column present in the metadata DataFrame of at least one of the datasets forming the UnpairedDataset.

Returns:

tuple[Any, …] – The information (e.g. the age, the sex, etc.) found for each sub-sample.

Raises:

KeyError – If column is not in any DataFrame of the datasets forming the UnpairedDataset.
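The behaviour described above (one value per sub-sample, None where a dataset lacks the column, KeyError when no dataset has it) can be sketched with plain dictionaries. The function toy_get_sample_info and the toy metadata below are hypothetical stand-ins, not the library's code or data.

```python
# Toy metadata tables: one list of rows (dicts) per underlying dataset.
t1_rows = [
    {"participant_id": "sub-001", "age": 71},
    {"participant_id": "sub-002", "age": 65},
]
pet_rows = [{"participant_id": "sub-A", "tracer": "18FAV45"}]
tables = [t1_rows, pet_rows]


def toy_get_sample_info(tables, indices, column):
    """indices[i] is the row picked in tables[i] for the requested sample."""
    known = {key for rows in tables for row in rows for key in row}
    if column not in known:
        # No underlying table has this column at all.
        raise KeyError(column)
    # Tables that lack the column contribute None for their sub-sample.
    return tuple(rows[i].get(column) for rows, i in zip(tables, indices))


print(toy_get_sample_info(tables, (1, 0), "age"))     # (65, None)
print(toy_get_sample_info(tables, (1, 0), "tracer"))  # (None, '18FAV45')
```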

set_epoch(epoch: int) → None

Sets the epoch.

This ensures that the random mapping between the datasets is different for each epoch.

Parameters:

epoch (int) – Epoch number.
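A typical usage pattern is to call set_epoch() at the start of every epoch, before iterating over the dataset. The sketch below uses a hypothetical stand-in class, ToyUnpaired, so that it runs without ClinicaDL; its seed argument and the way the mapping is derived from the epoch are assumptions, and it only mimics the oversample=False case.

```python
import random


class ToyUnpaired:
    """Hypothetical stand-in exposing set_epoch(), __len__() and __getitem__()."""

    def __init__(self, datasets, seed=0):
        self.datasets = [list(d) for d in datasets]
        self.seed = seed
        self.set_epoch(0)

    def set_epoch(self, epoch):
        # Assumption: the mapping is a pure function of (seed, epoch).
        rng = random.Random(self.seed * 100_003 + epoch)
        size = min(len(d) for d in self.datasets)  # mimics oversample=False
        self.mapping = [rng.sample(range(len(d)), size) for d in self.datasets]

    def __len__(self):
        return len(self.mapping[0])

    def __getitem__(self, idx):
        # One element per underlying dataset, looked up through the mapping.
        return tuple(d[col[idx]] for d, col in zip(self.datasets, self.mapping))


stacked = ToyUnpaired([["t1-a", "t1-b", "t1-c", "t1-d"], ["pet-a", "pet-b"]])
for epoch in range(3):
    stacked.set_epoch(epoch)  # re-draw the associations before iterating
    pairs = [stacked[i] for i in range(len(stacked))]
    print(epoch, pairs)
```

Because the mapping in this sketch depends only on the seed and the epoch number, re-running an epoch reproduces the same associations.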

__len__() → int

The length of an UnpairedDataset is the length of its biggest underlying dataset if oversample=True, or of its smallest underlying dataset if oversample=False.

Returns:

int – The length of the dataset.

__getitem__(idx: int) → tuple[Sample, ...]

Retrieves the collection of samples at a given index.

The random mapping between datasets (in self.mapping) is used to determine which samples to retrieve for each underlying dataset.

Parameters:

idx (int) – Index of the samples in the dataset.

Returns:

tuple[Sample, …] – A structured output containing the processed data and metadata from each dataset of the UnpairedDataset, as a tuple of Sample.

eval() → None

Sets all the underlying datasets in evaluation mode.

get_participant_session_couples() → set[tuple[str, str]]

Retrieves all (participant, session) pairs in the dataset.

Returns:

set[tuple[str, str]] – The set of (participant, session) pairs.

subset(particpants_sessions: Path | str | DataFrame | Iterable[tuple[str, str]]) → Self

Returns a subset of the dataset from a list of (participant, session) pairs.

Parameters:

data (Union[DataFrameType, Sequence[tuple[str, str]]]) –

Can be either:

  • a sequence of (participant, session) pairs;

  • a pandas.DataFrame (or a path to a TSV file containing the dataframe) with the list of (participant, session) pairs to extract. This list must be passed via two columns named "participant_id" and "session_id" (other columns won’t be considered).

Returns:

Self – A subset of the original dataset, restricted to the (participant, session) pairs mentioned in data.
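The filtering idea can be sketched on a single toy metadata table. The function toy_subset and the rows below are hypothetical; how the real method applies the selection across the underlying datasets is not specified here.

```python
def toy_subset(rows, pairs):
    """Keep the rows whose (participant_id, session_id) is in `pairs`."""
    wanted = set(pairs)
    return [r for r in rows if (r["participant_id"], r["session_id"]) in wanted]


rows = [
    {"participant_id": "sub-001", "session_id": "ses-M000"},
    {"participant_id": "sub-001", "session_id": "ses-M006"},
    {"participant_id": "sub-002", "session_id": "ses-M000"},
]
kept = toy_subset(rows, [("sub-001", "ses-M000"), ("sub-002", "ses-M000")])
print([(r["participant_id"], r["session_id"]) for r in kept])
# [('sub-001', 'ses-M000'), ('sub-002', 'ses-M000')]
```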

train() → None

Sets all the underlying datasets in training mode.