SAmple Data Extraction Tool

About

Samples from various projects and/or with various purposes might be sequenced and analyzed together, while requiring separate handling at a later point. The purpose of this tool is to extract (from LocalApp or TSOPPI output directories) all relevant data for a user-defined set of samples while ignoring potentially sensitive data of other (non-specified) samples.

Functionality overview

The tool separates all files within a specified LocalApp/TSOPPI output directory into two groups, based on an input list of IDs that define export-eligible samples:

  • files to be exported: files related to individual export-eligible samples, as well as general files that lack sensitive information;

  • files to be skipped: any remaining files (files related to individual samples not eligible for export and general files containing sensitive information).
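
The two-way split described above can be pictured with the minimal Python sketch below. It is not SADET's actual code: the `FileRecord` type, its `sample_id` and `sensitive` fields, the `partition_files` function and the file names in the usage note are hypothetical, and how the tool really associates a file with a sample is not described in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class FileRecord:
    """Hypothetical description of one file found in the input directory."""
    path: str                   # path relative to the input directory
    sample_id: Optional[str]    # None for general (not sample-specific) files
    sensitive: bool = False     # only consulted for general files


def partition_files(records: Iterable[FileRecord],
                    eligible_prefixes: List[str]) -> Tuple[List[str], List[str]]:
    """Split file paths into (files_to_export, files_to_skip)."""
    to_export: List[str] = []
    to_skip: List[str] = []
    for rec in records:
        if rec.sample_id is None:
            # General file: exported only if it lacks sensitive information.
            exportable = not rec.sensitive
        else:
            # Sample-specific file: exported only for export-eligible samples.
            exportable = any(rec.sample_id.startswith(p) for p in eligible_prefixes)
        (to_export if exportable else to_skip).append(rec.path)
    return to_export, to_skip
```

For instance, with the prefix list `["PAT"]`, a record for sample `PAT_01` and a non-sensitive general file would end up in the export list, while a record for sample `OTHER` and a sensitive general file would end up in the skip list.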

For each of the two groups, a list with relevant file paths is created. The export-eligible files can then be packaged into an encrypted archive and have their md5 checksums generated - a bash script for this purpose is created (and optionally executed) by the tool.
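
As an illustration of the per-file checksum step, the sketch below computes md5 digests for a list of relative paths using Python's standard `hashlib` module. The function names and the `md5sum`-style line layout (`<digest>  <path>`) are assumptions made for this example; the exact commands in the generated bash script and the layout of SADET's .md5 file are not specified here.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until an empty bytes object signals EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(base_dir: str, relative_paths: Iterable[str]) -> List[str]:
    """Build one '<digest>  <relative path>' line per file (assumed layout)."""
    base = Path(base_dir)
    return [f"{md5_of_file(base / rel)}  {rel}" for rel in relative_paths]
```

Reading in chunks keeps memory use flat even for large sequencing files, which is the main reason not to call `read()` on the whole file.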

The tool also checks whether the specified LocalApp/TSOPPI directory contains the expected files and issues a WARNING each time a file is considered missing.

Input files

  • A LocalApp output directory OR a directory with output (i.e., patient-wise sub-directories) generated by the TSOPPI sample data post-processing tool;

  • An ID file: A text file specifying which samples are eligible for export (please see the notes below for details);

  • A password file: A text file specifying the password which should be used during gpg archive encryption.

Output files

  • [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI]_files_to_export.txt - a list of paths to export-eligible files; the paths are relative to the processed LocalApp/TSOPPI results directory;

  • [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI]_files_to_skip.txt - a list of paths to files that should not be exported;

  • [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI].log - a copy of the log messages printed to stdout during the tool’s runtime;

  • * [optional] [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI].tar.gpg - an encrypted archive with export-eligible files; generated by default, but its creation during runtime can be disabled with the “--generate_export_script_only” option;

  • * [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI]_[container|host_system]_export.sh - a bash script for creating the encrypted archive and md5 checksums (the “container” version with container file paths will be created by default, the “host_system” version with host system file paths can be created using the “--generate_export_script_only” option);

  • * [output_directory]/[output_file_prefix]_[LocalApp|TSOPPI][_individual_files|.tar.gpg].md5 - a file with md5 checksums; the checksums are by default created on individual export-eligible files, but can be created on the tar.gpg archive instead using the “--archive_level_md5sum” option.

(* These files will not be created if no samples qualify for extraction.)
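
The naming scheme above can be summarized in a short Python sketch that lists the files one would expect right after a run. The function name `expected_output_names` and its parameters are hypothetical; the sketch assumes that "alphanumeric" in the prefix rule means ASCII letters and digits, that at least one sample qualifies for extraction, and that the archive and md5 file appear only after the export script has been executed.

```python
import re
from pathlib import PurePosixPath
from typing import List


def expected_output_names(output_directory: str, prefix: str, input_type: str,
                          script_only: bool = False,
                          archive_level_md5: bool = False) -> List[str]:
    """List output paths expected after a run (assumes some sample qualifies)."""
    if input_type not in ("LocalApp", "TSOPPI"):
        raise ValueError("input_type must be 'LocalApp' or 'TSOPPI'")
    # Assumption: only ASCII letters, digits and underscores are accepted.
    if not re.fullmatch(r"[A-Za-z0-9_]+", prefix):
        raise ValueError("prefix may contain only alphanumerics and underscores")

    stem = f"{prefix}_{input_type}"
    script_flavour = "host_system" if script_only else "container"
    names = [
        f"{stem}_files_to_export.txt",
        f"{stem}_files_to_skip.txt",
        f"{stem}.log",
        f"{stem}_{script_flavour}_export.sh",
    ]
    if not script_only:
        # The archive and checksums exist only once the script has been run.
        names.append(f"{stem}.tar.gpg")
        md5_suffix = ".tar.gpg.md5" if archive_level_md5 else "_individual_files.md5"
        names.append(f"{stem}{md5_suffix}")

    out = PurePosixPath(output_directory)
    return [str(out / name) for name in names]
```

A list like this can be handy for a post-run sanity check, e.g. confirming that every expected file exists before transferring the export.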

Additional notes

  • It is strongly recommended to use the sequencing run ID as the value of the “--output_file_prefix” option. As the LocalApp and TSOPPI pipelines largely operate at the sequencing run level, this should be easy to implement. The default output file prefix values (based on time and date) are not very informative.

  • It is also recommended not to use the “--archive_level_md5sum” option, as archives contain only copies of the original files. By default, the tool will create md5 checksums on the original copies of the export-eligible files.

  • By default, the tool will not rewrite (overwrite) already existing output files. This behavior can be changed by enabling the “--rewrite_output” option.

  • By default, the tool will create the encrypted archive and md5 checksums during its runtime. With the “--generate_export_script_only” option enabled, the tool will instead create a bash script that can be used to create the archive and md5 checksums on the host system at a later point. The host system route may be more resource-efficient and less time-consuming.

  • The file specified by the “--sample_ID_list” option should be a tab-separated text file with an initial header line and at least two columns. The header line should name the columns “matching_method” and “target_ID” (in any order; any other columns will be ignored). The “target_ID” column should contain non-empty ID strings that identify export-eligible samples (lines with a target_ID value of “.” will be ignored). The “matching_method” column should specify how the ID string in the associated “target_ID” column is matched to the sample IDs encountered in the processed data. Currently, the only implemented ID matching method is “prefix”. E.g., the input ID string “PAT” (using the “prefix” matching method) would make samples with IDs “PATIENT_01”, “PATOLOGY-74” and “PAT” all eligible for export, while samples with IDs such as “PAR” or “IPAT” would not be eligible.

  • The “--require_inpred_nomenclature” option can be used to further constrain which samples should be eligible for export (the filtering is based on InPreD nomenclature v.3 rules). E.g., any sample with more or fewer than 19 characters in its ID will automatically become ineligible regardless of whether a suitable prefix could be found for it among the supplied IDs.

  • TSOPPI output directories are processed one patient sub-directory at a time. Export for a given sub-directory is initiated only if all samples listed in the contained “sample_list.tsv” file pass the imposed ID checks.
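
The ID-matching rules from the notes above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, raising an error on an empty ID or an unknown matching method is a guess at reasonable behavior, an empty sample list is treated as not exportable, and only the 19-character length check is modelled (the remaining InPreD nomenclature v.3 rules are not reproduced here).

```python
import csv
from typing import Iterable, List, TextIO


def read_target_prefixes(handle: TextIO) -> List[str]:
    """Collect 'prefix' target IDs from a tab-separated ID file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    if "matching_method" not in fields or "target_ID" not in fields:
        raise ValueError("header must name 'matching_method' and 'target_ID'")
    prefixes = []
    for row in reader:
        target = (row["target_ID"] or "").strip()
        method = (row["matching_method"] or "").strip()
        if target == ".":
            continue  # lines with a target_ID of "." are ignored
        if not target:
            raise ValueError("empty target_ID")  # assumed handling
        if method != "prefix":
            raise ValueError(f"unsupported matching method: {method!r}")  # assumed
        prefixes.append(target)
    return prefixes


def is_eligible(sample_id: str, prefixes: Iterable[str],
                require_19_chars: bool = False) -> bool:
    """Prefix match, optionally combined with the 19-character length check."""
    if require_19_chars and len(sample_id) != 19:
        return False
    return any(sample_id.startswith(p) for p in prefixes)


def patient_dir_is_exportable(sample_ids: List[str], prefixes: List[str],
                              require_19_chars: bool = False) -> bool:
    """TSOPPI rule: export a patient sub-directory only if every sample passes."""
    return bool(sample_ids) and all(
        is_eligible(s, prefixes, require_19_chars) for s in sample_ids
    )
```

With the single prefix “PAT”, `is_eligible` accepts “PATIENT_01”, “PATOLOGY-74” and “PAT” and rejects “PAR” and “IPAT”, matching the example in the notes; a patient sub-directory whose sample list mixes eligible and ineligible IDs is rejected as a whole.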

Running the tool

Command line options:

usage: SADET.py [-h] [--version] --input_data_directory INPUT_DATA_DIRECTORY --gpg_password_file GPG_PASSWORD_FILE --sample_ID_list SAMPLE_ID_LIST --output_directory OUTPUT_DIRECTORY
              --input_type {LocalApp,TSOPPI} --host_system_mounting_directory HOST_SYSTEM_MOUNTING_DIRECTORY [--output_file_prefix OUTPUT_FILE_PREFIX] [--generate_export_script_only]
              [--parallel_export_and_md5sum] [--require_inpred_nomenclature] [--archive_level_md5sum] [--rewrite_output] [--container_mounting_directory CONTAINER_MOUNTING_DIRECTORY]

Extract data of specified patients (from LocalApp or TSOPPI output).

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --input_data_directory INPUT_DATA_DIRECTORY
                        Absolute path to a LocalApp or TSOPPI output directory (from which data should be extracted).
  --gpg_password_file GPG_PASSWORD_FILE
                        Absolute path to a text file specifying a password that should be utilized for encryption of the extracted data. The file should not contain anything except for the password on the
                        first line. At least 16 characters (including a number, a small letter, a capital letter and an underscore) are recommended. Whitespace characters are not allowed.
  --sample_ID_list SAMPLE_ID_LIST
                        Absolute path to a text file specifying the IDs of samples whose data should be extracted. A header-enabled tab-separated file with at least two columns is expected on input (the
                        column order does not matter). A column titled 'target_ID' should specify the ID strings. A column titled 'matching_method' should specify an ID-matching method to be used with the
                        corresponding ID (e.g., 'prefix').
  --output_directory OUTPUT_DIRECTORY
                        Absolute path to the directory in which all of the output files should be stored. If the directory does not exist, it will be created.
  --input_type {LocalApp,TSOPPI}
                        Type of TSO500 solid results that should serve as input for data extraction. (default value: LocalApp)
  --host_system_mounting_directory HOST_SYSTEM_MOUNTING_DIRECTORY
                        Absolute path to the host system mounting directory. The specified directory should include all input and output file paths in its directory tree.
  --output_file_prefix OUTPUT_FILE_PREFIX
                        Prefix used for all output files. If not set, a time-stamp based prefix will be generated. A prefix based on sequencing run ID is recommended. Note: Only alphanumeric characters and
                        underscores are allowed.
  --generate_export_script_only
                        Only generate a script for the required data export (encryption and packaging), do not run the script. (disabled by default)
  --parallel_export_and_md5sum
                        Run gpg/tar and md5sum in parallel. (disabled by default)
  --require_inpred_nomenclature
                        Require that all input IDs are compatible with the InPreD sample nomenclature. (disabled by default)
  --archive_level_md5sum
                        Whether the md5sum should be created on the final tar.gpg archive instead of being created on individual files. (disabled by default)
  --rewrite_output      Allow rewriting already existing output files. (disabled by default)
  --container_mounting_directory CONTAINER_MOUNTING_DIRECTORY
                        Container's inner mounting point. The host system mounting directory path/prefix will be replaced by the container mounting directory path in all input and output file paths (this
                        parameter shouldn't be changed during regular use). (default value: /inpred/data)
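
The guidance given for “--gpg_password_file” can be checked before a run with a small helper such as the Python sketch below. The function name `check_password_text` and the split into hard errors (whitespace, missing password, extra content) versus soft warnings (the recommended length and character classes) are assumptions of this example; the checks SADET itself performs may differ. A single trailing newline after the password is assumed to be acceptable.

```python
import re
from typing import List, Tuple


def check_password_text(text: str) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for the contents of a password file."""
    errors: List[str] = []
    warnings: List[str] = []

    lines = text.splitlines()
    password = lines[0] if lines else ""

    if not password:
        errors.append("no password on the first line")
    if any(line.strip() for line in lines[1:]):
        errors.append("content found after the first line")
    if re.search(r"\s", password):
        errors.append("password contains whitespace")

    # The items below are recommendations in the help text, hence warnings.
    if len(password) < 16:
        warnings.append("fewer than 16 characters")
    recommended = {
        "a number": r"[0-9]",
        "a small letter": r"[a-z]",
        "a capital letter": r"[A-Z]",
        "an underscore": r"_",
    }
    for label, pattern in recommended.items():
        if not re.search(pattern, password):
            warnings.append(f"missing {label}")
    return errors, warnings
```

The helper takes the file contents as a string, so it can be called as `check_password_text(open(path).read())` without the password ever being printed.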

Example invocation using the Docker image:

$ [sudo] docker run \
  --rm \
  -it \
  -v /hs_prefix_path:/inpred/data \
  inpred/sadet_main:0.1.0 python3 /inpred/SADET.py \
  --host_system_mounting_directory /hs_prefix_path \
  --input_data_directory /hs_prefix_path/.../240512_A09999_0001_BBBBBBBBB_LocalApp_output \
  --output_directory /hs_prefix_path/.../SADET_output_dir \
  --gpg_password_file /hs_prefix_path/.../gpg_secret.txt \
  --sample_ID_list /hs_prefix_path/.../project_X_sample_ID_prefixes.tsv \
  --input_type LocalApp \
  --require_inpred_nomenclature \
  --output_file_prefix 240512_A09999_0001_BBBBBBBBB
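
In the invocation above, all paths are given as host system paths, and the tool replaces the host mounting prefix with the container mounting directory. The Python sketch below illustrates that kind of prefix replacement; `to_container_path` is a hypothetical helper, and the `runs` path segment in the usage note is made up for the illustration.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, host_mount: str,
                      container_mount: str = "/inpred/data") -> str:
    """Map a host path under host_mount to the matching container path."""
    # relative_to raises ValueError if host_path lies outside host_mount,
    # mirroring the requirement that every input and output path sits inside
    # the host system mounting directory.
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_mount))
    return str(PurePosixPath(container_mount) / relative)
```

For example, `to_container_path("/hs_prefix_path/runs/SADET_output_dir", "/hs_prefix_path")` returns `"/inpred/data/runs/SADET_output_dir"`, while a path outside `/hs_prefix_path` raises a `ValueError`.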

(last updated: 2025-03-11)