1.4. Splitting data
To estimate how well a model generalises, you split your data into a training set, a validation set, and usually a held-out test set. In neuroimaging, this split deserves special care because datasets are often longitudinal: the same participant is scanned at several sessions.
Warning
If two sessions of the same participant end up on different sides of a split — one in training, one in validation or test — the model can recognise the participant rather than the condition you are studying. This is a classic case of data leakage, and it leads to over-optimistic performance estimates.
The rule is therefore to split by participant: all the sessions of a participant stay together. Every ClinicaDL splitting tool enforces this.
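To make the rule concrete, here is a minimal sketch of a participant-level split in plain Python. It does not use ClinicaDL: the function name `split_by_participant`, the toy participant IDs and the session labels are all hypothetical, chosen only to illustrate the idea.

```python
import random


def split_by_participant(rows, test_fraction=0.2, seed=0):
    """Split (participant_id, session_id) rows so that no participant
    appears on both sides of the split."""
    participants = sorted({pid for pid, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(participants)
    # Size of the held-out set, counted in participants (not sessions).
    n_test = max(1, round(len(participants) * test_fraction))
    test_ids = set(participants[:n_test])
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test


# Toy longitudinal dataset: 10 participants, 2 sessions each (hypothetical labels).
rows = [(f"sub-{i:03d}", ses) for i in range(10) for ses in ("ses-M000", "ses-M012")]
train_rows, test_rows = split_by_participant(rows, test_fraction=0.2, seed=42)

# Leakage check: no participant may appear in both sets.
overlap = {pid for pid, _ in train_rows} & {pid for pid, _ in test_rows}
assert not overlap
```

Because the shuffle operates on participant IDs rather than on rows, all sessions of a participant necessarily land on the same side.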
ClinicaDL separates making a split (computing it once and saving it to disk) from reading it (applying it to a dataset). Saving the split to disk means the exact same partition can be reused and shared, which is essential for reproducible benchmarks.
1.4.1. Making a split
Two functions create a split from a DataFrame (or a TSV file) listing your
(participant_id, session_id) pairs. Both split participants and write the
resulting TSV files to a directory, which they return.
make_split()
    Creates a single train/test (or train/validation) split. n_test sets the size of the held-out set, as a proportion of participants or as an absolute number.
make_kfold()
    Creates a \(K\)-fold partition: n_splits folds, each used in turn as the validation set.
from clinicadl.split import make_split, make_kfold
# a single 80/20 split
split_dir = make_split("bids_directory/metadata.tsv", n_test=0.2)
# a 5-fold partition
kfold_dir = make_kfold(split_dir / "train.tsv", n_splits=5)
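The idea behind a K-fold partition over participants can be sketched in plain Python as follows. This illustrates the general technique and is not ClinicaDL's implementation; `kfold_participants` and the toy IDs are hypothetical.

```python
import random


def kfold_participants(participants, n_splits=5, seed=0):
    """Yield (train_ids, val_ids) pairs; each participant is in the
    validation set of exactly one fold."""
    ids = sorted(set(participants))
    if not 2 <= n_splits <= len(ids):
        raise ValueError("n_splits must be between 2 and the number of participants")
    random.Random(seed).shuffle(ids)
    # Deal the shuffled participants into n_splits groups of near-equal size.
    groups = [ids[k::n_splits] for k in range(n_splits)]
    for k, val_ids in enumerate(groups):
        train_ids = [pid for j, group in enumerate(groups) if j != k for pid in group]
        yield train_ids, val_ids


participants = [f"sub-{i:03d}" for i in range(10)]
folds = list(kfold_participants(participants, n_splits=5, seed=42))
for train_ids, val_ids in folds:
    # Training and validation participants never overlap within a fold.
    assert not set(train_ids) & set(val_ids)
```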
Both functions support two important options:
longitudinal
    By default (longitudinal=False), only the baseline session of each participant is kept in the held-out set, while the training set always keeps all sessions. Set longitudinal=True to keep every session of the held-out participants as well.
stratification
    Ensures the resulting sets have similar distributions for one or several variables (e.g. age, sex). make_split checks the balance with statistical tests and retries until it finds a satisfactory split; make_kfold stratifies on a single categorical variable. A seed makes the split reproducible.
split_dir = make_split(
    "bids_directory/metadata.tsv",
    n_test=0.2,
    stratification=["age", "sex"],
    longitudinal=True,
    seed=42,
)
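The "check the balance and retry" idea can be illustrated with a small plain-Python sketch. It makes strong simplifying assumptions: a fixed tolerance on the difference in mean age stands in for the statistical tests mentioned above, and the names `balanced_split`, `tolerance` and `max_tries` are hypothetical. It is not how make_split is implemented.

```python
import random
from statistics import mean


def balanced_split(ages, n_test, tolerance=2.0, max_tries=1000, seed=0):
    """Draw random participant splits until the mean ages of the two sets
    differ by at most `tolerance` years.

    `ages` maps participant_id -> age. The tolerance check is a simple
    stand-in for a proper statistical test.
    """
    rng = random.Random(seed)
    ids = sorted(ages)
    for attempt in range(1, max_tries + 1):
        candidate = ids[:]
        rng.shuffle(candidate)
        test_ids, train_ids = candidate[:n_test], candidate[n_test:]
        gap = abs(mean(ages[p] for p in test_ids) - mean(ages[p] for p in train_ids))
        if gap <= tolerance:
            return sorted(train_ids), sorted(test_ids), attempt
    raise RuntimeError("no sufficiently balanced split found")


# Toy cohort: 20 participants aged 60 to 79 (hypothetical values).
ages = {f"sub-{i:03d}": 60 + i for i in range(20)}
train_ids, test_ids, n_attempts = balanced_split(ages, n_test=4, seed=42)
age_gap = abs(mean(ages[p] for p in test_ids) - mean(ages[p] for p in train_ids))
```

Fixing the seed makes the whole sequence of attempts, and therefore the accepted split, reproducible.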
The output directory contains the train/test TSV files and, when stratification is used, statistics describing how well the variables are balanced:
splits/split
├── single_split_config.json
├── split_categorical_stats.tsv
├── split_continuous_stats.tsv
├── test_baseline.tsv
├── train.tsv
└── train_baseline.tsv
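The `*_baseline.tsv` files relate to the longitudinal option: they hold a single session per participant. A minimal sketch of such a filter is given below, under the assumption that the baseline is the session whose label sorts first (which works for zero-padded labels such as `ses-M000`). How ClinicaDL actually identifies the baseline session is not stated here, and `keep_baseline` is a hypothetical helper.

```python
def keep_baseline(rows):
    """Keep, for each participant, only the session whose label sorts first.

    Assumption: session labels are zero-padded so that lexicographic order
    matches chronological order (e.g. ses-M000 < ses-M012 < ses-M024).
    """
    first_session = {}
    for participant_id, session_id in rows:
        current = first_session.get(participant_id)
        if current is None or session_id < current:
            first_session[participant_id] = session_id
    return sorted(first_session.items())


rows = [
    ("sub-001", "ses-M012"),
    ("sub-001", "ses-M000"),
    ("sub-002", "ses-M000"),
    ("sub-002", "ses-M024"),
]
baseline_rows = keep_baseline(rows)
```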
1.4.2. Reading a split
Once a split exists on disk, you apply it to a Dataset
with one of two reader objects, depending on how the split was made.
SingleSplit
    Reads a directory produced by make_split(). Its get_split() method returns a single Split.

from clinicadl.split import SingleSplit
from clinicadl.data.datasets import BidsDataset
from clinicadl.io.bids import BidsFileType

dataset = BidsDataset(
    "bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
)
split = SingleSplit(split_dir).get_split(dataset)
KFold
    Reads a directory produced by make_kfold(). Its get_splits() method yields one Split per fold.

from clinicadl.split import KFold

splitter = KFold(kfold_dir)
for split in splitter.get_splits(dataset):
    ...  # train and evaluate on this fold
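What happens inside a per-fold loop is typically one training run followed by one validation score, with the scores summarised at the end. The sketch below shows that pattern in plain Python with toy data; `cross_validate`, `fake_train_and_evaluate` and the scores are hypothetical placeholders, not part of ClinicaDL.

```python
from statistics import mean, stdev


def cross_validate(folds, train_and_evaluate):
    """Run one training per fold and summarise the validation scores."""
    scores = [train_and_evaluate(train, val) for train, val in folds]
    return scores, mean(scores), stdev(scores)


# Toy folds: (training participants, validation participants).
toy_folds = [
    (["sub-002", "sub-003"], ["sub-001"]),
    (["sub-001", "sub-003"], ["sub-002"]),
    (["sub-001", "sub-002"], ["sub-003"]),
]

# Stand-in for real training: looks up a preset score for the validation participant.
fake_scores = {"sub-001": 0.8, "sub-002": 0.9, "sub-003": 0.7}


def fake_train_and_evaluate(train, val):
    return fake_scores[val[0]]


scores, avg_score, score_spread = cross_validate(toy_folds, fake_train_and_evaluate)
print(f"validation score: {avg_score:.2f} +/- {score_spread:.2f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how sensitive the result is to the particular partition.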
A Split simply bundles the two resulting datasets:
>>> split.train_dataset
<clinicadl.data.datasets.BidsDataset object ...>
>>> split.val_dataset
<clinicadl.data.datasets.BidsDataset object ...>
Note
The Split object also knows how to build the
DataLoader of each set. We look at
dataloaders in the next section.
Evaluating on a different view of the data
Both SingleSplit.get_split and KFold.get_splits
accept an eval_dataset argument. This is handy when you want to train on patches but evaluate on whole images:
pass the patch dataset as the main argument and the image dataset as eval_dataset.
from clinicadl.transforms import TransformsHandler, extraction
train_dataset = BidsDataset(
    "bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
    transforms=TransformsHandler(extraction=extraction.Patch(patch_size=64)),
)
eval_dataset = BidsDataset(
    "bids_directory",
    file_type=BidsFileType(data_type="anat", suffix="T1w"),
    data="bids_directory/metadata.tsv",
)
split = SingleSplit(split_dir).get_split(train_dataset, eval_dataset=eval_dataset)
>>> split.train_dataset[0].spatial_shape
(64, 64, 64) # patches for training
>>> split.val_dataset[0].spatial_shape
(181, 217, 181) # entire images for evaluation
Once your data is divided into training and validation sets, the final step before starting training is to specify how it will be batched.