tsvtools
- Prepare your metadata¶
This collection of tools aims at handling metadata of BIDS-formatted datasets. These tools perform three main tasks:
- Get the labels used in the classification task (
get-labels
), - Get the progression of Alzheimer's disease (
get-progression
), - Split data to define test, validation and train cohorts (
split
+kfold
orprepare-experiment
), - Get additional data (
get-metadata
), - Analyze populations of interest (
analysis
).
Tip
Classical ratios in the scientific literature are 80%-20% (or 70%-30%) for train/validation. These values can be modified according to the size of the dataset, and the number of hyperparameters that are tuned. More information on the subject can be found online.
get-labels
- Extract labels¶
Description¶
This tool writes a unique TSV file containing the labels asked by the user. The labels correspond to the following description and are stored in the column named diagnosis:
- CN (cognitively normal): sessions of subjects who were diagnosed as cognitively normal during all their follow-up;
- AD (Alzheimer's disease): sessions of subjects who were diagnosed as demented during all their follow-up;
- MCI (mild cognitive impairment): sessions of subjects who were diagnosed as prodromal (i.e. MCI) at baseline,
These labels are specific to the Alzheimer's disease context and can only be extracted from cohorts used in (Wen et al., 2020).
However, users can define other label lists manually and give them in inputs of other functions of tsvtools
. These TSV files will have to include the following columns: participant_id
, session_id
, and the name of the value used for the label (for example diagnosis
).
Other functions of tsvtools
may also try to have similar distributions according to the age and the sex of the participants. To benefit from this feature, the user must also include these two columns in their TSV files.
Example of TSV produced by get-labels
:¶
participant_id | session_id | diagnosis | age | sex | ... |
---|---|---|---|---|---|
sub-CLNC0001 | ses-M000 | MCI | 72 | M | ... |
sub-CLNC0001 | ses-M012 | MCI | 65 | F | ... |
sub-CLNC0001 | ses-M036 | AD | 66 | F | ... |
sub-CLNC0002 | ses-M000 | CN | 89 | F | ... |
Running the task¶
clinicadl tsvtools get-labels [OPTIONS] BIDS_DIRECTORY OUTPUT_DIRECTORY
BIDS_DIRECTORY
(Path) is the input folder containing the dataset in a BIDS hierarchy.OUTPUT_DIRECTORY
(Path) is the path to the directory containing all outputs generated by the command.
Options:
--modality
(str) is the modality for which the sessions are selected. Sessions which do not include the modality will be excluded from the outputs. The name of the modality must correspond to a column of the TSV files inmissing_mods
. Default value:t1w
.--diagnoses
(List[str]) is the list of the labels that will be extracted. Labels can be either a group or a subgroup. These labels must be chosen from the combination of group and subgroup seen above. All the sessions for which the diagnosis is not in the given list will be remove in the output TSV file. Default will be focus on the Alzheimer's disease and only process AD and CN labels.--restriction_tsv
(Path) is a path to a TSV file containing the list of sessions that should be used. This argument is useful to integrate the result of a quality check procedure. Default will not perform any restriction.--variables_of_interest
(List[str]) is a list of columns present inMERGED_TSV
that will be included in the outputs.--keep_smc
(bool) if given the SMC participants are kept in theCN.tsv
file. Default setting removes these participants.--caps_directory
(Path) is a folder containing a CAPS compliant dataset--merged_tsv
(Path) is a path to a TSV file containing the results ofclinica iotools merge-tsv
command.- If not run before, this command will be run in the task and the output will be save in the
RESULTS_DIRECTORY
given with the namemerged.tsv
. - If run before, the output has to be store with the name
merged.tsv
in theRESULTS_DIRECTORY
or the path has to be given with this option. It avoids to re-run themerge-tsv
command which can be very long (more than 30min).
- If not run before, this command will be run in the task and the output will be save in the
--missing_mods
(Path) is a path to a TSV file containing the results ofclinica iotools missing-modalities
command.- If not run before, this command will be run in the task and the output directory will be save in the
RESULTS_DIRECTORY
given with the namemissing_mods
. - If run before, the output directory has to be store with the name
missing_mods
in theRESULTS_DIRECTORY
or the path has to be given with this option. It avoids to re-run theclinica iotools check-missing-modalities
command which can be very long (more than 30min, depending on the hardware).
- If not run before, this command will be run in the task and the output directory will be save in the
--remove-unique-session
(bool) if given, participants having only one session will be removed. Default setting keeps these participants.
Example of how to run the task :¶
clinicadl tsvtools get-labels Data/BIDS Data --merged-tsv merged.tsv --diagnosis AD --diagnosis MCI
This commandline will create a new TSV file (Data/labels.tsv
) with a list of AD
and MCI
participants from Data/Bids
. This command will run clinica iotools check-missing-modalities
but used the given merged.tsv
file.
Output tree¶
The command will output a TSV file and a JSON file and will store intermediate results from clinica iotools merge-tsv
or clinicadl tsvtool missing_mods
commands:
└── <OUTPUT_DIRECTORY> ├── labels.tsv ├── labels.json ├── merge.tsv └── missing_mods └── ...
split
- Single split observing similar age and sex distributions¶
Description¶
This tool splits all the data in order to have the same sex, age and diagnosis distributions in both sets produced. The similarity of the age and sex distributions is assessed by a T-test and chi-square test, respectively.
Running the task¶
clinicadl tsvtools split [OPTIONS] DATA_TSV
DATA_TSV
(Path) is the TSV file with the data that are going to be split (output ofclinicadl tsvtools get-labels|split|kfold
).
Options:
--subset_name
(str) is the name of the subset that is complementary to train. Default value:validation
.-
--n_test
(float) gives the number of subjects that will be put in the set complementary to train. Default value:100
.- If > 1, corresponds to the number of subjects to put in set with name
subset_name
. - If < 1, proportion of subjects to put in set with name
subset_name
. - If = 0, no training set is created and the whole dataset is considered as one set
with name
subset_name
.
- If > 1, corresponds to the number of subjects to put in set with name
-
--p_age_threshold
(float) is the threshold on the p-value used for the T-test on age distributions. Default value:0.80
. --p_sex_threshold
(float) is the threshold on the p-value used for the chi2 test on sex distributions. Default value:0.80
.--ignore_demographics
(bool) is a flag that disable the use of age, sex and group to balance the split. Default value:False
--categorical_split_variable
(str) is the name of a categorical variable used for a stratified shuffle split (in addition to age and sex selection). Default value:None
Tip
Even if the split is performed on age and sex, these columns do not need to be part of the columns. If they are not present, the pipeline will try to go back to the labels.tsv to find them.
Output tree¶
The command will generate new tsv files, one containing the keys included in the subset_name
set and one containing the ones included in the train
set. These files will be stored in the same directory as the FORMATED_DATA_TSV
.
└── <split> ├── train.tsv ├── train_baseline.tsv └── subset_name_baseline.tsv
The columns of the produced TSV files are only the keys :participants_id
, session_id
.
TSV files ending with _baseline.tsv
only include the baseline session of each subject (or
the session closest to baseline if the latter does not exist).
kfold
- K-fold split¶
Description¶
This tool splits data to perform a k-fold cross-validation.
Running the task¶
clinicadl tsvtools kfold [OPTIONS] DATA_TSV
DATA_TSV
(str) is the TSV file containing the data that are going to be split
(output of clinicadl tsvtool getlabels|split|kfold
).
Options:
--subset_name
(str) is the name of the subset that is complementary to train. Default value:validation
.--n_splits
(int) is the value of k. If 0 is given, all subjects are considered as test subjects. Default value:5
.--stratification
(str) is the name of the variable used to stratify the k-fold split. By default, the value isNone
which means there is no stratification.
Output tree¶
The command will generate a new folder k-fold
stored in the same directory as the DATA_TSV
file and containing the different split folders. Each of these files contains the keys (participants_id
, session_id
).
For each key, it is expliciting which set it belongs to for each split according to the following structure (example for a 2-fold validation):
└── 2_fold
├── split_0
│ ├── train_baseline.tsv
│ ├── train.tsv
│ └── validation_baseline.tsv
└── split_1
├── train_baseline.tsv
├── train.tsv
└── validation_baseline.tsv
prepare-experiment
¶
Description¶
This tool performs a single split to prepare testing data and then can perform either k-fold or single split to prepare validation data. It is an easy way to quickly prepare data with basic options.
Running the task¶
clinicadl tsvtools prepare-experiment [OPTIONS] DATA_TSV
DATA_TSV
(Path) is a TSV file output ofclinicadl tsvtool get-labels|split|kfold
.
Options:
-
--n_test
(float) gives the number of subjects that will be put in the test set:- If > 1, corresponds to the number of subjects to put in set with name
subset_name
. - If < 1, proportion of subjects to put in set with name
subset_name
. - If = 0, no training set is created and the whole dataset is considered as one set
with name
subset_name
.
Default value:
100
. ---validation_type
(str) is the name of the split wanted for the validation. It can only besplit
orkfold
. Default value:split
. ---n_validation
(float) gives the number of subjects that will be put in the validation set: - Forsplit
, it is the number of subjects to put in validation set if it is a SingleSplit. - Forkfold
, it is the number of folds in the k-folds split. - If = 0, there is no training set and the whole dataset is considered as a validation set. - If > 1, corresponds to the number of subjects to put in set with name
Tip
Even if the split is performed on age and sex, these columns do not need to be part of the columns. If they are not present, the pipeline will go back to the labels.tsv to find them.
get-progression
¶
Description¶
This tool adds a new column progression
to the TSV file given. It corresponds to the progression of the Alzheimer's disease. The diagnosis must be one of those : CN
(cognitively norma), MCI
(mild cognitive impairment) or AD
(alzheimer's disease).
The progression label corresponds to the following description:
- s (stable): diagnosis remains identical during the
time_horizon
period following the current visit, - p (progressive): diagnosis progresses to the following state during the
time_horizon
period following the current visit (eg. MCI --> AD), - r (regressive): diagnosis regresses to the previous state during the
time_horizon
period following the current visit (eg. MCI --> CN), - uk (unknown): there are not enough sessions to assess the reliability of the label but no changes were spotted,
- us (unstable): otherwise (multiple conversions / regressions).
This pipeline provides the progression of Alzheimer's disease based on the list of sessions provided. If any of the sessions are missing from the input file, the computed disease progression may be inaccurate.
Warning
For example, consider the following list:
participant_id | session_id | diagnosis | ... |
---|---|---|---|
sub-ADNI135S6586 | ses-M000 | MCI | ... |
sub-ADNI135S6586 | ses-M012 | CN | ... |
sub-ADNI135S6586 | ses-M036 | MCI | ... |
sub-ADNI135S6586 | ses-M048 | MCI | ... |
sub-ADNI135S6586 | ses-M060 | CN | ... |
sub-ADNI135S6586 | ses-M072 | AD | ... |
The participant is unstable, but if, for some reasons, you dropped some sessions, you can end up with the following list:
participant_id | session_id | diagnosis | ... |
---|---|---|---|
sub-ADNI135S6586 | ses-M000 | MCI | ... |
sub-ADNI135S6586 | ses-M036 | MCI | ... |
sub-ADNI135S6586 | ses-M048 | MCI | ... |
sub-ADNI135S6586 | ses-M072 | AD | ... |
Then, if you try to get the progression, the command will return the following list:
participant_id | session_id | diagnosis | progression |
---|---|---|---|
sub-ADNI135S6586 | ses-M000 | MCI | p |
sub-ADNI135S6586 | ses-M036 | MCI | p |
sub-ADNI135S6586 | ses-M048 | MCI | p |
sub-ADNI135S6586 | ses-M072 | AD | s |
What is wrong because the participant is unstable.
Running the task¶
clinicadl tsvtools get-progression [OPTIONS] DATA_TSV
DATA_TSV
(str) is the TSV file containing the data
(output of clinicadl tsvtool getlabels|split|kfold
).
Options:
- --time_horizon
(int) is the time horizon in months that is used to assess the stability of the MCI subjects.
Default value: 36
.
Tip
The diagnosis column do not need to be part of the columns, the pipeline will go back to the labels.tsv to calculate the progression.
get-metadata
¶
Description¶
This tool add extra columns to the tsv file given.
For example, the output of clinicadl tsvtools split/kfold
returns a TSV file with only two columns participant_id
and session_id
and with this command, it is possible to retrieve the age, the diagnosis, the sex or other variables of interest in other tsv files.
Running the task¶
clinicadl tsvtools get-metadata [OPTIONS] DATA_TSV MERGED_TSV
DATA_TSV
(Path) is a folder containing one TSV file per label (output ofclinicadl tsvtool get-labels|split|kfold
).MERGED_TSV
(Path) is a path to a TSV file containing the results ofclinica iotools merge-tsv
command.
Options:
--variables_of_interest
(List[str]) is a list of columns present inMERGED_TSV
that will be included in the outputs.
If no variables_of_interest
is given, all columns present in MERGED_TSV
will be included.
analysis
¶
Description¶
This tool writes a TSV file that summarizes the demographics and clinical distributions of the asked labels. Continuous variables are described with statistics (mean, standard deviation, minimum and maximum), whereas categorical values are grouped by categories. The variables of interest are: age, sex, mini-mental state examination (MMSE) and global clinical dementia rating (CDR).
Running the task¶
clinicadl tsvtools analysis [OPTIONS] MERGED_TSV DATA_TSV OUTPUT_TSV
MERGED_TSV
(Path) is the output file of theclinica iotools merge-tsv
commands. If thclinicadl tsvtools getlabels
command was run before, this file already exists and is stored in the output folder of this command.DATA_TSV
(Path) is a folder containing one TSV file per label (output ofclinicadl tsvtool getlabels|split|kfold
).OUTPUT_TSV
(Path) is the path to the TSV file that will be written (filename included).
Options:
--diagnoses
(List[str]) is the list of the labels that will be extracted. These labels must be chosen from {AD,CN,MCI}. Default will only process AD and CN labels.
Summary¶
This summary is a little help to better understand how to pre-process data before training a network, using Clinica and ClinicaDL.