2.3. Evaluating

Training tells you whether your model learns; evaluation tells you how well it performs. ClinicaDL evaluates a model by computing metrics on a dataset, either during training (on the validation set) or afterwards (on validation or held-out test data).

2.3.1. Metrics

A metric is represented by a Metric object, and the set of metrics computed during an evaluation phase is gathered in a MetricsHandler. In practice you rarely manipulate these directly: you declare the metrics you want when building the Trainer, passing them as a dictionary that maps a name of your choice to a metric configuration object:

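from clinicadl.train import Trainer
from clinicadl.metrics.config import (
    AveragePrecisionMetricConfig,
    ConfusionMatrixMetricConfig,
    LossMetricConfig,
)
# ModelCheckpointCallback must be imported as well (see Callbacks).

# `model` is assumed to have been built beforehand (see Defining a model).
trainer = Trainer(
    maps="maps_directory",
    model=model,
    # Keys are the names used later to refer to each metric.
    metrics={
        "loss": LossMetricConfig(loss_name="loss"),
        "ap": AveragePrecisionMetricConfig(),
        "f1": ConfusionMatrixMetricConfig(metric_name="f1 score"),
    },
    # Keep the checkpoint that scores best on the metric named "f1".
    callbacks=[ModelCheckpointCallback(metric="f1")],
)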

Note

Here, ModelCheckpointCallback saves the best model obtained with respect to the F1-score. See Callbacks for more details.

ClinicaDL provides metrics for classification (confusion-matrix metrics, ROC AUC, average precision), regression (MSE, MAE, RMSE), reconstruction (PSNR, SSIM) and segmentation (Dice, IoU, Hausdorff distance, etc.) — see clinicadl.metrics.config for the full list. The metrics defined here are the ones the Trainer can compute; which ones are actually computed is chosen at each evaluation call.
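To make the confusion-matrix family of metrics concrete, here is a minimal, library-independent sketch of how recall and F1-score derive from the four confusion-matrix counts for binary labels. The function names are illustrative only and are not part of ClinicaDL.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written in terms of counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(confusion_counts(y_true, y_pred))
print(recall(y_true, y_pred), f1_score(y_true, y_pred))
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; a library may handle that case differently.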

2.3.2. Evaluating during and after training

During training, the metrics are computed on the validation set at the interval set by the OptimizationConfig (see Training). To restrict which metrics are computed, pass their names through the metrics argument of train(); by default, all the defined metrics are computed:

trainer.train(split, metrics=["loss", "f1"])     # metric 'ap' will not be computed

Important

The metrics mentioned in train() must have been defined first, when instantiating the Trainer or via add_metrics().
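The relationship between defined and requested metrics can be pictured with a small sketch. It is not ClinicaDL's implementation: select_metrics is a hypothetical helper, and the exception ClinicaDL actually raises for an undefined metric may differ.

```python
def select_metrics(defined, requested=None):
    """Return the subset of `defined` that should be computed.

    defined:   dict mapping a metric name to its configuration (any object).
    requested: list of metric names, or None to compute every defined metric.
    """
    if requested is None:
        return dict(defined)
    unknown = [name for name in requested if name not in defined]
    if unknown:
        # A metric can only be requested once it has been defined.
        raise KeyError(f"Metrics not defined: {unknown}")
    return {name: defined[name] for name in requested}


# Placeholder strings stand in for metric configuration objects.
defined = {"loss": "loss config", "ap": "ap config", "f1": "f1 config"}
print(list(select_metrics(defined, ["loss", "f1"])))
print(list(select_metrics(defined)))
```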

After training, two methods let you evaluate saved checkpoints. Both identify a checkpoint by an explicit name.

validate()

Computes new metrics on the validation data of a split — useful when you realise after training that you want an additional metric.

trainer.add_metrics(recall=ConfusionMatrixMetricConfig(metric_name="recall"))
trainer.validate(split_idx=0, metrics=["recall"], model_checkpoint="best-f1")

Note

Here we can ask for the checkpoint "best-f1" because we saved the best model with respect to F1-score via ModelCheckpointCallback(metric="f1").

test()

Evaluates a checkpoint on a held-out test set, identified by a group_name. ClinicaDL checks for data leakage between the test data and the training/validation data.

from clinicadl.data.dataloader import DataLoader

test_dataset = dataset.subset(split_dir / "test_baseline.tsv")
test_loader = DataLoader(test_dataset)

trainer.test(
    model_checkpoint="split-0_final",   # the final model obtained when training on split #0
    group_name="test",
    dataloader=test_loader,
)

All the results are written to the MAPS (see Chapter 3).
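The idea behind the data-leakage check can be sketched in a few lines. This sketch assumes that leakage means a participant appearing in both the training/validation data and the test data; the identifiers below are made up, find_leakage is a hypothetical helper, and the exact criteria ClinicaDL applies may be different.

```python
def find_leakage(train_val_ids, test_ids):
    """Return the sorted participant IDs present in both collections."""
    return sorted(set(train_val_ids) & set(test_ids))


train_val_ids = ["sub-01", "sub-02", "sub-03"]
test_ids = ["sub-03", "sub-04"]

leaked = find_leakage(train_val_ids, test_ids)
if leaked:
    print(f"Data leakage detected for participants: {leaked}")
```

An empty result means the two sets of participants are disjoint, which is the condition a held-out test set should satisfy.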

2.3.3. Customising inference

How an image is fed to the network at evaluation time is governed by an Inferer, attached to the model through its inferer argument (see Defining a model). ClinicaDL provides:

  • SimpleInferer — the default; the whole image is passed through the network, with optional post-processing of the output;

  • PatchesToImageInferer — splits a volume into 3D patches, runs the network on each, and merges the outputs back into a full volume;

  • SlicesToImageInferer — does the same with 2D slices and a 2D network.

from clinicadl.models import SupervisedModel
from clinicadl.infer import PatchesToImageInferer

model = SupervisedModel(
    network=...,
    loss=...,
    optimizer=...,
    inferer=PatchesToImageInferer(patch_size=64, overlap=0.25),
    label_key="mask",
)
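To illustrate what patch-based inference involves, the following library-independent sketch applies a function to overlapping windows of a 1D signal and averages the outputs where windows overlap. It is a simplification under stated assumptions: real volumes are 3D, sliding_window_inference is a hypothetical name, and averaging is only one possible merging strategy, not necessarily the one PatchesToImageInferer uses.

```python
def sliding_window_inference(signal, patch_size, overlap, network):
    """Run `network` on overlapping patches of a 1D signal and merge the outputs.

    signal:     list of floats
    patch_size: number of samples per patch
    overlap:    fraction of a patch shared with the next one, in [0, 1)
    network:    callable mapping a patch to an output of the same length
    """
    n = len(signal)
    if patch_size > n:
        raise ValueError("patch_size must not exceed the signal length")
    step = max(1, int(patch_size * (1 - overlap)))
    starts = list(range(0, n - patch_size + 1, step))
    if starts[-1] != n - patch_size:
        starts.append(n - patch_size)  # add a last patch so the tail is covered

    summed = [0.0] * n
    counts = [0] * n
    for start in starts:
        output = network(signal[start:start + patch_size])
        for offset, value in enumerate(output):
            summed[start + offset] += value
            counts[start + offset] += 1
    # Average the contributions of all patches covering each position.
    return [s / c for s, c in zip(summed, counts)]


signal = [float(i) for i in range(10)]
doubled = sliding_window_inference(
    signal, patch_size=4, overlap=0.25, network=lambda patch: [2 * v for v in patch]
)
print(doubled)
```

Because the toy network acts on each sample independently, the merged output matches what the network would produce on the whole signal; with a real network, overlap mainly serves to smooth patch-border effects.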

Inferers also handle post-processing (e.g. activations, thresholding) applied to the network’s output before metrics are computed.

Note

Metric configuration objects also accept a postprocessing argument, for cases where a metric requires its own specific post-processing.
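As an illustration of the kind of post-processing mentioned above, this library-independent sketch turns raw network outputs (logits) into binary predictions by applying a sigmoid activation followed by a threshold. The function names and the 0.5 default threshold are assumptions made for the example, not ClinicaDL defaults.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def postprocess(logits, threshold=0.5):
    """Activation followed by thresholding: logits -> binary predictions."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


print(postprocess([-2.0, 0.0, 3.0]))
```

Metrics such as the F1-score expect hard labels, whereas others such as average precision work on scores, which is why a metric may need its own post-processing.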


Now that you understand how to train and evaluate a model, the next section shows how to customise your Trainer to tailor your training pipeline.