clinicadl.networks.nn.ViTB16

class clinicadl.networks.nn.ViTB16(num_outputs: int | None, output_act: ActFunction | tuple[ActFunction, dict[str, Any]] | None = None, pretrained: bool = False) → None

ViT-B/16, from "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

Only the last fully connected layer will be changed to match num_outputs.

Pretrained weights from torchvision can be loaded by setting pretrained=True. Note that the last fully connected layer never uses pretrained weights, as it is task-specific.

Warning

Only works with 2D images of size (224, 224), with 3 channels.
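
The size restriction in the warning above is consistent with how ViT-B/16 builds its input sequence: the image is cut into 16x16 patches, and the number of patches fixes the sequence length. The following is a minimal sketch of that arithmetic together with a shape check, in plain Python. `num_tokens` and `check_vit_input_shape` are hypothetical helpers written for illustration; they are not part of clinicadl.

```python
PATCH_SIZE = 16    # ViT-B/16 uses 16x16 patches
IMAGE_SIZE = 224   # spatial size stated in the warning
NUM_CHANNELS = 3   # number of channels stated in the warning


def num_tokens(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Sequence length seen by the transformer: one token per patch plus the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


def check_vit_input_shape(shape: tuple) -> None:
    """Hypothetical helper: raise ValueError unless shape is (batch, 3, 224, 224)."""
    expected = (NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
    if len(shape) != 4 or tuple(shape[1:]) != expected:
        raise ValueError(f"expected (batch, 3, 224, 224), got {tuple(shape)}")


print(num_tokens())  # 14 * 14 patches + 1 class token = 197
check_vit_input_shape((8, 3, 224, 224))  # passes silently
```

Running such a check on a batch's shape before the forward pass gives a readable error for grayscale or non-square inputs, which would otherwise need to be resized or replicated across 3 channels first.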

Parameters:

num_outputs (Optional[int]) – Number of output variables after the last linear layer. If None, the feature map before the last fully connected layer will be returned.
output_act (Optional[ActivationParameters], default=None) – An optional activation layer applied to the output of the network, together with its arguments if any. Must be passed as activation_name or as (activation_name, arguments), where arguments is a dictionary. If None, no activation is applied.

activation_name can be any value in {"celu", "elu", "gelu", "leakyrelu", "logsoftmax", "mish", "prelu", "relu", "relu6", "selu", "sigmoid", "softmax", "tanh"}. Refer to the PyTorch documentation on activation functions for the arguments each of them accepts.
pretrained (bool, default=False) – Whether to use pretrained weights. The pretrained weights used are the default ones from torchvision.models.vit_b_16().
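
The two accepted forms of output_act (a bare name, or a name paired with an arguments dictionary) can be illustrated with a small normalization routine. This is only a sketch of the convention described above, assuming plain strings as activation names; `normalize_output_act` is a hypothetical helper, and how clinicadl processes the argument internally is not shown here.

```python
from typing import Any, Optional, Union

# Activation names listed in the parameter description above.
ALLOWED_ACTIVATIONS = {
    "celu", "elu", "gelu", "leakyrelu", "logsoftmax", "mish", "prelu",
    "relu", "relu6", "selu", "sigmoid", "softmax", "tanh",
}

ActSpec = Union[str, tuple]


def normalize_output_act(output_act: Optional[ActSpec]) -> Optional[tuple]:
    """Hypothetical helper: return (activation_name, arguments), or None if no activation."""
    if output_act is None:
        return None
    if isinstance(output_act, str):
        # Bare name: no extra arguments.
        name, arguments = output_act, {}
    else:
        # (activation_name, arguments) pair.
        name, arguments = output_act
    if name not in ALLOWED_ACTIVATIONS:
        raise ValueError(f"unknown activation: {name!r}")
    if not isinstance(arguments, dict):
        raise TypeError("arguments must be a dictionary")
    return name, dict(arguments)


print(normalize_output_act("relu"))
print(normalize_output_act(("leakyrelu", {"negative_slope": 0.2})))
print(normalize_output_act(None))
```

The keys of the arguments dictionary are expected to match the keyword arguments of the corresponding PyTorch activation, for instance negative_slope for LeakyReLU or dim for Softmax.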