clinicadl train resume - Resume a prematurely stopped job¶
This functionality resumes a prematurely stopped job that was launched with
the clinicadl train or clinicadl random-search generate tasks.
The files that are used by this function are the following:
commandline.json describes the training parameters used to create the model,
checkpoint.pth.tar contains the last version of the weights of the network,
optimizer.pth.tar contains the last version of the parameters of the optimizer,
training.tsv contains the successive values of the metrics during training.
These files are organized in model_path as follows:
<model_path>
├── commandline.json
└── fold-<i>
    ├── models
    │   ├── best_balanced_accuracy
    │   │   └── model_best.pth.tar
    │   ├── best_loss
    │   │   └── model_best.pth.tar
    │   ├── checkpoint.pth.tar
    │   └── optimizer.pth.tar
    ├── tensorboard_logs
    │   ├── train
    │   │   └── events.out.tfevents.1616090758.r7i7n7
    │   └── validation
    │       └── events.out.tfevents.1616090758.r7i7n7
    └── training.tsv
You should also ensure that the data at tsv_path and caps_dir in commandline.json
are still present and correspond to those used during training.
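The presence check described above can be sketched in a few lines of Python. This is only an illustration, not part of clinicadl: the function name missing_resume_inputs is hypothetical, and the sketch assumes that commandline.json is a JSON object with tsv_path and caps_dir as top-level keys. It can only confirm that the locations exist, not that their contents match those used during training.

```python
import json
from pathlib import Path


def missing_resume_inputs(model_path):
    """Return a list of problems that would prevent resuming from model_path.

    Hypothetical helper: assumes commandline.json stores tsv_path and
    caps_dir as top-level keys.
    """
    model_path = Path(model_path)
    json_path = model_path / "commandline.json"
    if not json_path.is_file():
        return [f"missing file: {json_path}"]

    with json_path.open() as f:
        params = json.load(f)

    problems = []
    # tsv_path and caps_dir are the two entries named in the text above.
    for key in ("tsv_path", "caps_dir"):
        value = params.get(key)
        if value is None:
            problems.append(f"'{key}' not found in commandline.json")
        elif not Path(value).exists():
            problems.append(f"'{key}' points to a missing location: {value}")
    return problems
```

An empty list means that both locations were found; any other result lists what should be restored before resuming.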
Prerequisites¶
Please check the commandline.json file in the results folder to find out which
preprocessing is required. If it has not been performed, execute the
preprocessing pipeline as well as clinicadl extract to obtain the tensor
versions of the images.
Running the task¶
This task can be run with the following command line:
clinicadl train resume <model_path>
model_path (str) is a path to the folder where the model and the json file
are stored.
By default, the arguments corresponding to computational resources are the same
as the ones defined in commandline.json. However, it is possible to change them
by using the following options:
--nproc (int) changes the number of workers used by the DataLoader.
--use_cpu (bool) forces the use of the CPU.
--use_gpu (bool) forces the use of the GPU.
--batch_size (int) changes the size of the batch used in the DataLoader.
--evaluation_steps (int) changes the number of iterations to perform before computing an evaluation.
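When the task is launched from a script, the options above can be assembled programmatically. The following is a minimal sketch rather than an official interface: build_resume_command and its device argument are hypothetical names, and the sketch assumes that --use_cpu and --use_gpu are switches that take no value, which should be checked against the installed version of clinicadl.

```python
def build_resume_command(model_path, nproc=None, batch_size=None,
                         evaluation_steps=None, device=None):
    """Assemble the argument list for `clinicadl train resume`.

    Hypothetical helper: options left to None are omitted, so the values
    stored in commandline.json are kept for them.
    """
    cmd = ["clinicadl", "train", "resume", str(model_path)]

    # Integer options are only added when a value is given.
    int_options = (
        ("--nproc", nproc),
        ("--batch_size", batch_size),
        ("--evaluation_steps", evaluation_steps),
    )
    for flag, value in int_options:
        if value is not None:
            cmd += [flag, str(int(value))]

    # Assumption: the device options are valueless switches.
    if device == "cpu":
        cmd.append("--use_cpu")
    elif device == "gpu":
        cmd.append("--use_gpu")
    elif device is not None:
        raise ValueError("device must be 'cpu', 'gpu' or None")
    return cmd
```

The returned list can be passed directly to subprocess.run, which avoids quoting problems when model_path contains spaces.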
Outputs¶
The outputs correspond to the ones obtained using clinicadl train.
Note
The files checkpoint.pth.tar and optimizer.pth.tar are automatically removed once
the stopping criterion is reached and the
performance of the models has been evaluated on the training and validation datasets.
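Because these two files disappear once a fold has completed, their presence gives a rough indication of which folds can still be resumed. The sketch below is a heuristic and not a clinicadl feature: the function name fold_states and the three labels are hypothetical, and a fold that never started would also be reported as "finished" since it has no checkpoint files either.

```python
from pathlib import Path

# The two files that the note above says are removed at the end of training.
CHECKPOINT_FILES = ("checkpoint.pth.tar", "optimizer.pth.tar")


def fold_states(model_path):
    """Map each fold-<i> directory name to a rough training state.

    Hypothetical helper based only on the layout shown in this page:
    both files present -> "resumable", both absent -> "finished",
    only one present -> "incomplete".
    """
    states = {}
    for fold_dir in sorted(Path(model_path).glob("fold-*")):
        if not fold_dir.is_dir():
            continue
        present = [
            (fold_dir / "models" / name).is_file() for name in CHECKPOINT_FILES
        ]
        if all(present):
            states[fold_dir.name] = "resumable"
        elif not any(present):
            states[fold_dir.name] = "finished"
        else:
            states[fold_dir.name] = "incomplete"
    return states
```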