tsvtool
- Prepare your metadata¶
This collection of tools aims at handling metadata of BIDS-formatted datasets. These tools perform three main tasks:
- Get the labels used in the classification task (
restrict
+getlabels
), - Split data to define test, validation and train cohorts (
split
+kfold
), - Analyze populations of interest (
analysis
).
Tip
Classical ratios in the scientific literature are 80%-20% (or 70%-30%) for train/validation. These values can be modified according to the size of the dataset, and the number of hyperparameters that are tuned. More information on the subject can be found online.
restrict
- Reproduce restrictions on specific datasets.¶
Description¶
In (Wen et al., 2020) whose source code is the former version of ClinicaDL, the specific restrictions were applied to datasets used for testing:
- in OASIS, cognitively normal subjects who were younger than the youngest demented patient (62 years old) were removed,
- in AIBL, subjects whose age could not be retrieved (because it is missing for all their sessions) were removed.
Running the task¶
clinicadl tsvtool restrict [OPTIONS] DATASET MERGED_TSV RESULTS_TSV
where:
DATASET
(str) is the name of the dataset. Choices areOASIS
orAIBL
.MERGED_TSV
(Path) is the output file of theclinica iotools merge-tsv
command.RESULTS_TSV
(Path) is the path to the output TSV file (filename included). This TSV file comprises the same columns asmerged_tsv
.
Tip
Add your custom restrictions in clinicadl/tools/tsv/restriction.py
to make
your own experiments reproducible.
getlabels
- Extract labels specific to Alzheimer's disease¶
Description¶
This tool writes a TSV file for each label asked by the user. The labels correspond to the following description:
- CN (cognitively normal): sessions of subjects who were diagnosed as cognitively normal during all their follow-up;
- AD (Alzheimer's disease): sessions of subjects who were diagnosed as demented during all their follow-up;
- MCI (mild cognitive impairment): sessions of subjects who were diagnosed as prodromal (i.e. MCI) at baseline, who did not encounter multiple reversions and conversions and who did not convert back to the cognitively normal status;
- pMCI (progressive MCI): sessions of subjects who were diagnosed as prodromal at baseline,
and progressed to dementia during the
time_horizon
period following the current visit; - sMCI (stable MCI): sessions of subjects who were diagnosed as prodromal at baseline,
remained stable during the 36 months
time_horizon
period following the current visit and never progressed to dementia nor converted back to the cognitively normal status.
These labels are specific to the Alzheimer's disease context and can only be extracted from cohorts used in (Wen et al., 2020).
However, users can define other label lists manually and give them in inputs of other functions of
tsvtool
. These TSV files will have to include the following columns: participant_id
,
session_id
, and the name of the value used for the label (for example diagnosis
).
Other functions of tsvtool
may also try to have similar distributions according to the age and the sex
of the participants. To benefit from this feature, the user must also include these two columns in
their TSV files.
Running the task¶
clinicadl tsvtool getlabels [OPTIONS] MERGED_TSV MISSING_MODS_DIRECTORY RESULTS_DIRECTORY
MERGED_TSV
(Path) is the output file of theclinica iotools merge-tsv
orclinicadl tsvtool restrict
commands.MISSING_MODS_DIRECTORY
(Path) is the folder containing the outputs of theclinica iotools check-missing-modalities
command.RESULTS_DIRECTORY
(Path) is the path to the folder where output TSV files will be written.
Options:
--modality
(str) Modality for which the sessions are selected. Sessions which do not include the modality will be excluded from the outputs. The name of the modality must correspond to a column of the TSV files inmissing_mods
. Default value:t1w
.--diagnoses
(List[str]) is the list of the labels that will be extracted. These labels must be chosen from {AD,CN,MCI,sMCI,pMCI}. Default will only process AD and CN labels.--time_horizon
(int) is the time horizon in months that is used to assess the stability of the MCI subjects. Default value:36
.--restriction_tsv
(Path) is a path to a TSV file containing the list of sessions that should be used. This argument is useful to integrate the result of a quality check procedure. Default will not perform any restriction.--variables_of_interest
(List[str]) is a list of columns present inMERGED_TSV
that will be included in the outputs.--keep_smc
(bool) if given the SMC participants are kept in theCN.tsv
file. Default setting remove these participants.
Output tree¶
The command will output one TSV file per label:
└── <results_path> ├── AD.tsv ├── CN.tsv ├── MCI.tsv ├── sMCI.tsv └── pMCI.tsv
Each TSV file comprises the participant_id
and session_id
values of all the sessions that correspond to the label.
The values of the column diagnosis
are equal to the label name.
The age and sex are also included in the TSV files. The names of these columns depend on the
columns of MERGED_TSV
.
split
- Single split observing similar age and sex distributions¶
Description¶
This tool independently splits each label in order to have the same sex and age distributions in both sets produced. The similarity of the age and sex distributions is assessed by a T-test and chi-square test, respectively.
By default, there is a special treatment of the MCI set and its subsets (sMCI and pMCI) to avoid data leakage. However, if there are too few patients, this can prevent finding a split with similar demographics for these labels.
Running the task¶
clinicadl tsvtool split [OPTIONS] FORMATTED_DATA_DIRECTORY
FORMATTED_DATA_DIRECTORY
(Path) is the folder containing a TSV file per label which is going to be split (output ofclinicadl tsvtool getlabels|split|kfold
).
Options:
--subset_name
(str) is the name of the subset that is complementary to train. Default value:validation
.-
--n_test
(float) gives the number of subjects that will be put in the set complementary to train:- If > 1, corresponds to the number of subjects to put in set with name
subset_name
. - If < 1, proportion of subjects to put in set with name
subset_name
. - If = 0, no training set is created and the whole dataset is considered as one set
with name
subset_name
.
Default value:
100
. - If > 1, corresponds to the number of subjects to put in set with name
-
--no_mci_sub_categories
(bool) is a flag that disables the special treatment of the MCI set and its subsets. This will allow sets with more similar age and sex distributions, but it will cause data leakage for transfer learning tasks involving these sets. Default value:True
. --p_age_threshold
(float) is the threshold on the p-value used for the T-test on age distributions. Default value:0.80
.--p_sex_threshold
(float) is the threshold on the p-value used for the chi2 test on sex distributions. Default value:0.80
.
Output tree¶
The command will generate the following output tree:
└── formatted_data_path ├── label-1.tsv ├── ... ├── label-n.tsv ├── train | ├── label-1.tsv | ├── label-1_baseline.tsv | ├── ... | ├── label-n.tsv | └── label-n_baseline.tsv └── test ├── label-1_baseline.tsv ├── ... └── label-n_baseline.tsv
The columns of the produced TSV files are participant_id
, session_id
and diagnosis
.
TSV files ending with _baseline.tsv
only include the baseline session of each subject (or
the session closest to baseline if the latter does not exist).
kfold
- K-fold split¶
Description¶
This tool independently splits each label to perform a k-fold cross-validation.
Running the task¶
clinicadl tsvtool kfold [OPTIONS] FORMATTED_DATA_DIRECTORY
FORMATTED_DATA_DIRECTORY
(str) is the folder containing a TSV file per label which is going to be split
(output of clinicadl tsvtool getlabels|split|kfold
).
Options:
--subset_name
(str) is the name of the subset that is complementary to train. Default value:validation
.--n_splits
(int) is the value of k. If 0 is given, all subjects are considered as test subjects. Default value:5
.--no-mci_sub_categories
(bool) is a flag that disables the special treatment of the MCI set and its subsets. This will cause data leakage for transfer learning tasks involving these sets. Default value:False
.stratification
(str) is the name of the variable used to stratify the k-fold split. By default, the value isNone
which means there is no stratification.
Output tree¶
The command will generate the following output tree:
└── formatted_data_path ├── label-1.tsv ├── ... ├── label-n.tsv ├── train_splits-<<n_splits> | ├── split-0 | ├── ... | └── split-<n_splits-1> | ├── label-1.tsv | ├── label-1_baseline.tsv | ├── ... | ├── label-n.tsv | └── label-n_baseline.tsv └── <subset_name>_splits-<n_splits> ├── split-0 ├── ... └── split-<n_splits-1> ├── label-1.tsv ├── label-1_baseline.tsv ├── ... ├── label-n.tsv └── label-n_baseline.tsv
The columns of the produced TSV files are participant_id
, session_id
and diagnosis
.
TSV files ending with _baseline.tsv
only include the baseline session of each subject (or
the session closest to baseline if the latter does not exist).
analysis
¶
Description¶
This tool writes a TSV file that summarizes the demographics and clinical distributions of the asked labels. Continuous variables are described with statistics (mean, standard deviation, minimum and maximum), whereas categorical values are grouped by categories. The variables of interest are: age, sex, mini-mental state examination (MMSE) and global clinical dementia rating (CDR).
Running the task¶
clinicadl tsvtool analysis [OPTIONS] MERGED_TSV FORMATTED_DATA_DIRECTORY RESULTS_DIRECTORY
MERGED_TSV
(Path) is the output file of theclinica iotools merge-tsv
orclinicadl tsvtool restrict
commands.FORMATTED_DATA_DIRECTORY
(Path) is a folder containing one TSV file per label (output ofclinicadl tsvtool getlabels|split|kfold
).RESULTS_DIRECTORY
(Path) is the path to the TSV file that will be written (filename included).
Options:
--diagnoses
(List[str]) is the list of the labels that will be extracted. These labels must be chosen from {AD,CN,MCI,sMCI,pMCI}. Default will only process AD and CN labels.