Find all the CPTAC subjects
tables()
['file', 'mutation', 'observation', 'project', 'subject', 'treatment', 'upstream_identifiers']
I'm a researcher, and I want to reuse data from the Clinical Proteomic Tumor Analysis Consortium, but it's been stored across multiple data centers. I just want an easy way to track it all down.
First, decide what column to search. I'm looking for columns that have to do with project:
columns(column=["*project*"])
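The `*project*` search above uses asterisks as wildcards. As a rough local analogue, here is a minimal sketch that assumes the matching behaves like a case-insensitive glob. That assumption is mine, not something this page confirms, and `wildcard_match` is a hypothetical helper, not part of the library:

```python
from fnmatch import fnmatchcase


def wildcard_match(value: str, pattern: str) -> bool:
    """Case-insensitive glob match (an assumed stand-in for the wildcard search)."""
    return fnmatchcase(value.lower(), pattern.lower())


# Column names that appear elsewhere on this page, used here only for illustration.
names = ["project_name", "anatomic_site", "data_source", "drs_uri"]

# Keep only the names that contain "project" anywhere.
project_like = [name for name in names if wildcard_match(name, "*project*")]
print(project_like)
```

Under the same assumption, `*cptac*` would match any value that contains "cptac" in any letter case.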
project_name has the definition I'm looking for, so I'm going to search that for cptac. I want information about people, not files, so I am using get_subject_data and summarize_subjects. The first gives me back a big table of all the subject metadata, which I'm saving directly to a file for later; the second shows me some helpful summary statistics:
get_subject_data( match_all = "project_name = *cptac*", return_data_as='tsv', output_file= "cptac_subjects.tsv" )
summarize_subjects( match_all = "project_name = *cptac*" )
╔═══════════════════════════════╗
║   number_of_matching_subjects ║
╠═══════════════════════════════╣
║                          3843 ║
╚═══════════════════════════════╝

╔════════════════════════════════════════════════╗
║   number_of_files_related_to_matching_subjects ║
╠════════════════════════════════════════════════╣
║                                         305217 ║
╚════════════════════════════════════════════════╝

╔════════════╦══════════════════════╗
║   subjects ║ data_source          ║
╠════════════╬══════════════════════╣
║       1118 ║ IDC + GC + GDC + PDC ║
║        890 ║ PDC only             ║
║        753 ║ IDC + GDC + PDC      ║
║        491 ║ IDC only             ║
║        272 ║ GDC + PDC            ║
║        105 ║ GDC only             ║
║        101 ║ IDC + GDC            ║
║         45 ║ GC + GDC + PDC       ║
║         30 ║ IDC + PDC            ║
║         16 ║ IDC + GC             ║
║         11 ║ GC + PDC             ║
║         11 ║ IDC + GC + PDC       ║
╚════════════╩══════════════════════╝

╔════════════════╦═══════════════════════╗
║   count_result ║ species               ║
╠════════════════╬═══════════════════════╣
║           3207 ║ human                 ║
║            595 ║ <NA>                  ║
║             41 ║ human/mouse xenograft ║
╚════════════════╩═══════════════════════╝

╔════════════════╦══════════════════════════════════╗
║   count_result ║ race                             ║
╠════════════════╬══════════════════════════════════╣
║           2109 ║ White                            ║
║           1295 ║ <NA>                             ║
║            325 ║ Asian                            ║
║            107 ║ Black or African American        ║
║              7 ║ American Indian or Alaska Native ║
╚════════════════╩══════════════════════════════════╝

╔════════════════╦══════════════════════════╗
║   count_result ║ cause_of_death           ║
╠════════════════╬══════════════════════════╣
║           3315 ║ <NA>                     ║
║            446 ║ Cancer-Related Death     ║
║             46 ║ Non-Cancer Related Death ║
║             15 ║ Infection                ║
║             13 ║ Cardiovascular Disorder  ║
║              8 ║ Surgical Complication    ║
╚════════════════╩══════════════════════════╝

╔════════════════╦════════════════════╗
║   count_result ║ ethnicity          ║
╠════════════════╬════════════════════╣
║           2642 ║ <NA>               ║
║           1113 ║ Non-Hispanic       ║
║             88 ║ Hispanic or Latino ║
╚════════════════╩════════════════════╝

╔════════════════╦═════════════════╗
║                ║   year_of_birth ║
╠════════════════╬═════════════════╣
║ mean           ║            1953 ║
║ min            ║            1908 ║
║ lower quartile ║            1945 ║
║ median         ║            1952 ║
║ upper quartile ║            1960 ║
║ max            ║            2002 ║
╚════════════════╩═════════════════╝

╔════════════════╦═════════════════╗
║                ║   year_of_death ║
╠════════════════╬═════════════════╣
║ mean           ║            2016 ║
║ min            ║            1992 ║
║ lower quartile ║            2017 ║
║ median         ║            2018 ║
║ upper quartile ║            2020 ║
║ max            ║            2023 ║
╚════════════════╩═════════════════╝
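Because the subject table was saved as a TSV, it can be re-examined later without querying again. Here is a minimal sketch using only the standard library. `count_column` is a hypothetical helper, and the column names in the made-up sample (`subject_id`, `species`) are assumptions about what the saved file contains:

```python
import csv
import io
from collections import Counter


def count_column(tsv_file, column):
    """Count how often each value appears in one column of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # Treat empty cells as missing, mirroring the <NA> label in the summaries.
    return Counter(row[column] or "<NA>" for row in reader)


# Tiny made-up sample standing in for cptac_subjects.tsv.
sample = io.StringIO(
    "subject_id\tspecies\n"
    "s1\thuman\n"
    "s2\thuman\n"
    "s3\t\n"
)
counts = count_column(sample, "species")
print(counts.most_common())
```

With the real file, the same function could be called as `count_column(fh, "species")` inside `with open("cptac_subjects.tsv", newline="") as fh:`, provided that column exists in the export.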
I've found 3,843 subjects who participated in CPTAC studies. If I'm going to retrieve all of these data, I need to know where the files are. It's easiest to start by summarizing this info:
summarize_files( match_all = "project_name = *cptac*")
╔════════════════════════════╗
║   number_of_matching_files ║
╠════════════════════════════╣
║                     270438 ║
╚════════════════════════════╝

╔════════════════════════════════════════════════╗
║   number_of_subjects_related_to_matching_files ║
╠════════════════════════════════════════════════╣
║                                           3843 ║
╚════════════════════════════════════════════════╝

╔═════════╦═══════════════╗
║   files ║ data_source   ║
╠═════════╬═══════════════╣
║  158951 ║ GDC only      ║
║   94648 ║ PDC only      ║
║   14340 ║ IDC only      ║
║    2499 ║ GC only       ║
╚═════════╩═══════════════╝

╔════════════════╦═══════════╗
║   count_result ║ format    ║
╠════════════════╬═══════════╣
║          40520 ║ TSV       ║
║          32828 ║ VCF       ║
║          32058 ║ TBI       ║
║          27727 ║ <NA>      ║
║          22859 ║ mzML      ║
║          21367 ║ mzIdentML ║
║          20789 ║ BAM       ║
║          17634 ║ BAI       ║
║          14978 ║ MAF       ║
║          14340 ║ DICOM     ║
║          10419 ║ BEDPE     ║
║           7201 ║ TXT       ║
║           4768 ║ IDAT      ║
║           1787 ║ TAR       ║
║            261 ║ FASTQ     ║
║            171 ║ CDF       ║
║            146 ║ XLSX      ║
║            111 ║ ZIP       ║
║             83 ║ HTML      ║
║             74 ║ MEX       ║
║             69 ║ GCT       ║
║             60 ║ PDF       ║
║             52 ║ DOCX      ║
║             46 ║ TAR.GZ    ║
║             36 ║ HDF5      ║
║             27 ║ FASTA     ║
║             17 ║ CSV       ║
║              7 ║ idpDB     ║
║              1 ║ SKYLINE   ║
║              1 ║ SQLITE3   ║
║              1 ║ XLS       ║
╚════════════════╩═══════════╝

╔════════════════╦════════════════════════════════════════════╗
║   count_result ║ file_type                                  ║
╠════════════════╬════════════════════════════════════════════╣
║          45297 ║ Open Standard                              ║
║          32058 ║ Somatic Mutation Index                     ║
║          27734 ║ Proprietary                                ║
║          21092 ║ Text                                       ║
║          20789 ║ Aligned Reads                              ║
║          20469 ║ Annotated Somatic Mutation                 ║
║          17634 ║ Aligned Reads Index                        ║
║          16495 ║ Raw Simple Somatic Mutation                ║
║          12252 ║ Transcript Fusion                          ║
║           8586 ║ Structural Rearrangement                   ║
║           7377 ║ VL Whole Slide Microscopy Image Storage    ║
║           4768 ║ Masked Intensities                         ║
║           3137 ║ Gene Expression Quantification             ║
║           3099 ║ CT Image Storage                           ║
║           3063 ║ Splice Junction Quantification             ║
║           2718 ║ Isoform Expression Quantification          ║
║           2718 ║ miRNA Expression Quantification            ║
║           2499 ║ <NA>                                       ║
║           2384 ║ Methylation Beta Value                     ║
║           2355 ║ Aggregated Somatic Mutation                ║
║           2355 ║ Masked Somatic Mutation                    ║
║           1877 ║ MR Image Storage                           ║
║           1789 ║ RT Structure Set Storage                   ║
║           1787 ║ Copy Number Segment                        ║
║           1787 ║ Intermediate Analysis Archive              ║
║           1744 ║ Allele-specific Copy Number Segment        ║
║           1744 ║ Gene Level Copy Number                     ║
║            277 ║ Document                                   ║
║            157 ║ Archive                                    ║
║             89 ║ Web                                        ║
║             80 ║ Secondary Capture Image Storage            ║
║             75 ║ Positron Emission Tomography Image Storage ║
║             72 ║ Single Cell Analysis                       ║
║             37 ║ Segmentation Storage                       ║
║             36 ║ Differential Gene Expression               ║
║              6 ║ Ultrasound Image Storage                   ║
║              1 ║ Database                                   ║
║              1 ║ Skyline Document                           ║
╚════════════════╩════════════════════════════════════════════╝

╔════════════════╦════════════╗
║   count_result ║ access     ║
╠════════════════╬════════════╣
║         132381 ║ open       ║
║          88365 ║ controlled ║
║          49692 ║ <NA>       ║
╚════════════════╩════════════╝

╔════════════════╦════════════════════════════════════╗
║   count_result ║ category                           ║
╠════════════════╬════════════════════════════════════╣
║          69439 ║ Simple Nucleotide Variation        ║
║          42734 ║ Peptide Spectral Matches           ║
║          38317 ║ Sequencing Reads                   ║
║          27770 ║ Raw Mass Spectra                   ║
║          23030 ║ Processed Mass Spectra             ║
║          13014 ║ Structural Variation               ║
║          12117 ║ Somatic Structural Variation       ║
║          11744 ║ Transcriptome Profiling            ║
║           7377 ║ Slide Microscopy                   ║
║           7152 ║ DNA Methylation                    ║
║           7062 ║ Copy Number Variation              ║
║           3143 ║ Computed Tomography                ║
║           1907 ║ Magnetic Resonance                 ║
║           1789 ║ RT Structure Set                   ║
║           1103 ║ WXS                                ║
║            790 ║ WGS                                ║
║            524 ║ Protein Assembly                   ║
║            252 ║ Other Metadata                     ║
║            220 ║ <NA>                               ║
║            166 ║ Quality Metrics                    ║
║            156 ║ snATAC-Seq                         ║
║            110 ║ RNA-Seq                            ║
║            106 ║ Raw Sequencing Data                ║
║            105 ║ ATAC-Seq                           ║
║             80 ║ Positron emission tomography       ║
║             77 ║ Publication Supplementary Material ║
║             46 ║ Spectral Library                   ║
║             37 ║ Segmentation                       ║
║             31 ║ Alternate Processing Pipeline      ║
║             17 ║ Supplementary Data                 ║
║             15 ║ miRNA-Seq                          ║
║              7 ║ Ultrasound                         ║
║              1 ║ Skyline document                   ║
╚════════════════╩════════════════════════════════════╝

╔════════════════╦═════════════════════════════╗
║   count_result ║ anatomic_site               ║
╠════════════════╬═════════════════════════════╣
║          78448 ║ blood                       ║
║          75916 ║ <NA>                        ║
║          44545 ║ lung                        ║
║          36412 ║ kidney                      ║
║          27819 ║ uterus                      ║
║          22402 ║ pancreas                    ║
║          13089 ║ brain                       ║
║           9789 ║ stomach                     ║
║           6923 ║ bone marrow                 ║
║           6637 ║ pyloric antrum              ║
║           6590 ║ frontal lobe                ║
║           4176 ║ tongue                      ║
║           4050 ║ larynx                      ║
║           3016 ║ abdomen                     ║
║           3016 ║ pylorus                     ║
║           2392 ║ fundus of stomach           ║
║           2027 ║ occipital cortex            ║
║           1468 ║ mouth floor                 ║
║           1154 ║ pelvic region of trunk      ║
║           1065 ║ telencephalon               ║
║            763 ║ temporal cortex             ║
║            690 ║ chest                       ║
║            653 ║ breast                      ║
║            597 ║ skin of body                ║
║            444 ║ mouth                       ║
║            394 ║ craniocervical region       ║
║            372 ║ colon                       ║
║            356 ║ oropharynx                  ║
║            304 ║ lip                         ║
║            287 ║ cardia of stomach           ║
║            275 ║ alveolar ridge              ║
║            224 ║ right kidney                ║
║            222 ║ ovary                       ║
║            190 ║ trunk                       ║
║            188 ║ buccal mucosa               ║
║            169 ║ hematopoietic system        ║
║            157 ║ liver                       ║
║            148 ║ left kidney                 ║
║             96 ║ abdominopelvic cavity       ║
║             87 ║ supraglottic part of larynx ║
║             76 ║ left adrenal gland          ║
║             68 ║ tonsil                      ║
║             58 ║ retroperitoneal lymph node  ║
║             44 ║ right adrenal gland         ║
║             40 ║ pancreatic duct             ║
║             32 ║ parietal pelvic lymph node  ║
║             28 ║ left pelvic girdle region   ║
║             27 ║ body proper                 ║
║             24 ║ right lung                  ║
║             20 ║ mediastinal lymph node      ║
║             20 ║ paraaortic lymph node       ║
║             18 ║ pelvic lymph node           ║
║             16 ║ inguinal lymph node         ║
║             16 ║ left lung                   ║
║             13 ║ appendage                   ║
║             12 ║ hepatic lymph node          ║
║             12 ║ right pelvic girdle region  ║
║             10 ║ inferior vena cava          ║
║             10 ║ left ovary                  ║
║             10 ║ neck                        ║
║             10 ║ right ovary                 ║
║              8 ║ lymph node                  ║
║              7 ║ head                        ║
║              6 ║ adrenal gland               ║
║              6 ║ mesenteric lymph node       ║
║              6 ║ perigastric lymph node      ║
║              6 ║ postcranial axial skeleton  ║
║              6 ║ retroperitoneal space       ║
║              6 ║ urinary bladder             ║
║              4 ║ abdominal wall              ║
║              4 ║ cervical lymph node         ║
║              4 ║ humerus                     ║
║              4 ║ left renal vein             ║
║              4 ║ paratracheal lymph node     ║
║              4 ║ thyroid gland               ║
║              2 ║ abdominal lymph node        ║
║              2 ║ axillary lymph node         ║
║              2 ║ lumbar lymph node           ║
║              2 ║ spleen                      ║
║              2 ║ vertebra                    ║
╚════════════════╩═════════════════════════════╝

╔════════════════╦═══════════════════╗
║   count_result ║ tumor_vs_normal   ║
╠════════════════╬═══════════════════╣
║         219361 ║ tumor             ║
║         158438 ║ normal            ║
║          22334 ║ <NA>              ║
╚════════════════╩═══════════════════╝

╔════════════════╦══════════════╗
║                ║         size ║
╠════════════════╬══════════════╣
║ mean           ║   2859094700 ║
║ min            ║           20 ║
║ lower quartile ║       128218 ║
║ median         ║      4232597 ║
║ upper quartile ║    135653662 ║
║ max            ║ 660097945420 ║
╚════════════════╩══════════════╝
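The size statistics above are easier to judge in human-readable units. Here is a small sketch that assumes `size` is reported in bytes (this page does not state the unit) and that the mean covers every matching file, so the total is only a rough estimate. `human_size` is a hypothetical helper:

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB")
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if num_bytes < 1024 or unit == units[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


# Values copied from the summary above; the byte unit is an assumption.
mean_size = 2_859_094_700
n_files = 270_438

# Rough estimate only: mean size times file count.
estimated_total = mean_size * n_files

print("mean file size:", human_size(mean_size))
print("estimated total:", human_size(estimated_total))
```

An estimate like this is a reason to start with a subset rather than retrieving everything at once.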
There are a lot of files, so to start I'm just going to get a subset. Today I'm interested in kidney and bladder patients:
summarize_files( match_all = ["project_name = *cptac*"], match_any= [ 'anatomic_site = *kidney*', 'anatomic_site = *bladder*'])
╔════════════════════════════╗
║   number_of_matching_files ║
╠════════════════════════════╣
║                      36418 ║
╚════════════════════════════╝

╔════════════════════════════════════════════════╗
║   number_of_subjects_related_to_matching_files ║
╠════════════════════════════════════════════════╣
║                                            280 ║
╚════════════════════════════════════════════════╝

╔═════════╦═══════════════╗
║   files ║ data_source   ║
╠═════════╬═══════════════╣
║   26997 ║ GDC only      ║
║    6989 ║ PDC only      ║
║    2327 ║ IDC only      ║
║     105 ║ GC only       ║
╚═════════╩═══════════════╝

╔════════════════╦══════════════════════════════╗
║   count_result ║ category                     ║
╠════════════════╬══════════════════════════════╣
║          11492 ║ Simple Nucleotide Variation  ║
║           5652 ║ Sequencing Reads             ║
║           3207 ║ Raw Mass Spectra             ║
║           2454 ║ Somatic Structural Variation ║
║           2266 ║ Peptide Spectral Matches     ║
║           2252 ║ Structural Variation         ║
║           2176 ║ Transcriptome Profiling      ║
║           1536 ║ DNA Methylation              ║
║           1516 ║ Processed Mass Spectra       ║
║           1412 ║ Copy Number Variation        ║
║            782 ║ Slide Microscopy             ║
║            642 ║ RT Structure Set             ║
║            612 ║ Computed Tomography          ║
║            254 ║ Magnetic Resonance           ║
║            105 ║ ATAC-Seq                     ║
║             37 ║ Segmentation                 ║
║             23 ║ Raw Sequencing Data          ║
╚════════════════╩══════════════════════════════╝

╔════════════════╦═════════════════════════════════════════╗
║   count_result ║ file_type                               ║
╠════════════════╬═════════════════════════════════════════╣
║           5561 ║ Somatic Mutation Index                  ║
║           3271 ║ Annotated Somatic Mutation              ║
║           3207 ║ Proprietary                             ║
║           3098 ║ Aligned Reads                           ║
║           2828 ║ Raw Simple Somatic Mutation             ║
║           2649 ║ Open Standard                           ║
║           2577 ║ Aligned Reads Index                     ║
║           2084 ║ Transcript Fusion                       ║
║           1748 ║ Structural Rearrangement                ║
║           1133 ║ Text                                    ║
║           1024 ║ Masked Intensities                      ║
║            782 ║ VL Whole Slide Microscopy Image Storage ║
║            642 ║ RT Structure Set Storage                ║
║            611 ║ CT Image Storage                        ║
║            560 ║ Gene Expression Quantification          ║
║            521 ║ Splice Junction Quantification          ║
║            519 ║ Isoform Expression Quantification       ║
║            519 ║ miRNA Expression Quantification         ║
║            512 ║ Methylation Beta Value                  ║
║            353 ║ Aggregated Somatic Mutation             ║
║            353 ║ Allele-specific Copy Number Segment     ║
║            353 ║ Copy Number Segment                     ║
║            353 ║ Gene Level Copy Number                  ║
║            353 ║ Intermediate Analysis Archive           ║
║            353 ║ Masked Somatic Mutation                 ║
║            251 ║ MR Image Storage                        ║
║            105 ║ <NA>                                    ║
║             38 ║ Single Cell Analysis                    ║
║             37 ║ Segmentation Storage                    ║
║             19 ║ Differential Gene Expression            ║
║              4 ║ Secondary Capture Image Storage         ║
╚════════════════╩═════════════════════════════════════════╝

╔════════════════╦════════════╗
║   count_result ║ access     ║
╠════════════════╬════════════╣
║          14361 ║ controlled ║
║          13919 ║ open       ║
║           8138 ║ <NA>       ║
╚════════════════╩════════════╝

╔════════════════╦═══════════╗
║   count_result ║ format    ║
╠════════════════╬═══════════╣
║           5561 ║ TBI       ║
║           5561 ║ VCF       ║
║           4646 ║ TSV       ║
║           3207 ║ <NA>      ║
║           3098 ║ BAM       ║
║           2577 ║ BAI       ║
║           2327 ║ DICOM     ║
║           2118 ║ MAF       ║
║           1916 ║ BEDPE     ║
║           1516 ║ mzML      ║
║           1218 ║ TXT       ║
║           1133 ║ mzIdentML ║
║           1024 ║ IDAT      ║
║            353 ║ TAR       ║
║            105 ║ FASTQ     ║
║             39 ║ MEX       ║
║             19 ║ HDF5      ║
╚════════════════╩═══════════╝

╔════════════════╦════════════════════════════╗
║   count_result ║ anatomic_site              ║
╠════════════════╬════════════════════════════╣
║          36412 ║ kidney                     ║
║          15481 ║ blood                      ║
║            918 ║ abdomen                    ║
║            224 ║ right kidney               ║
║            148 ║ left kidney                ║
║             99 ║ chest                      ║
║             55 ║ liver                      ║
║             48 ║ left adrenal gland         ║
║             35 ║ abdominopelvic cavity      ║
║             30 ║ right adrenal gland        ║
║             24 ║ retroperitoneal lymph node ║
║             20 ║ right lung                 ║
║             19 ║ trunk                      ║
║             18 ║ mediastinal lymph node     ║
║             16 ║ paraaortic lymph node      ║
║             10 ║ inferior vena cava         ║
║              8 ║ left lung                  ║
║              6 ║ hepatic lymph node         ║
║              6 ║ lung                       ║
║              6 ║ pelvic region of trunk     ║
║              6 ║ urinary bladder            ║
║              4 ║ abdominal wall             ║
║              4 ║ adrenal gland              ║
║              4 ║ appendage                  ║
║              4 ║ craniocervical region      ║
║              4 ║ humerus                    ║
║              4 ║ left renal vein            ║
║              4 ║ pancreas                   ║
║              4 ║ paratracheal lymph node    ║
║              4 ║ thyroid gland              ║
║              2 ║ abdominal lymph node       ║
║              2 ║ axillary lymph node        ║
║              2 ║ spleen                     ║
║              2 ║ uterus                     ║
╚════════════════╩════════════════════════════╝

╔════════════════╦═══════════════════╗
║   count_result ║ tumor_vs_normal   ║
╠════════════════╬═══════════════════╣
║          30443 ║ tumor             ║
║          24898 ║ normal            ║
║           1545 ║ <NA>              ║
╚════════════════╩═══════════════════╝

╔════════════════╦══════════════╗
║                ║         size ║
╠════════════════╬══════════════╣
║ mean           ║   2209594871 ║
║ min            ║           20 ║
║ lower quartile ║        45699 ║
║ median         ║      2649734 ║
║ upper quartile ║     67007115 ║
║ max            ║ 256770133938 ║
╚════════════════╩══════════════╝
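The kidney and bladder query combines two kinds of filter. Here is a minimal local sketch of the logic I am assuming: a row must satisfy every `match_all` filter and at least one `match_any` filter, with case-insensitive wildcards. The helpers `matches` and `keep` and the sample rows are hypothetical, not the library's implementation:

```python
from fnmatch import fnmatchcase


def matches(row, column, pattern):
    """Case-insensitive glob match of one column value (assumed semantics)."""
    return fnmatchcase(str(row.get(column, "")).lower(), pattern.lower())


def keep(row, match_all, match_any):
    """Keep a row if it passes all match_all filters and any match_any filter."""
    all_ok = all(matches(row, col, pat) for col, pat in match_all)
    any_ok = any(matches(row, col, pat) for col, pat in match_any) if match_any else True
    return all_ok and any_ok


# Made-up rows for illustration only.
files = [
    {"project_name": "CPTAC", "anatomic_site": "kidney"},
    {"project_name": "CPTAC", "anatomic_site": "urinary bladder"},
    {"project_name": "CPTAC", "anatomic_site": "lung"},
    {"project_name": "other", "anatomic_site": "left kidney"},
]

selected = [
    f for f in files
    if keep(
        f,
        match_all=[("project_name", "*cptac*")],
        match_any=[("anatomic_site", "*kidney*"), ("anatomic_site", "*bladder*")],
    )
]
print([f["anatomic_site"] for f in selected])
```

In this sketch the lung row fails because it matches neither site pattern, and the last row fails because its project does not match, even though its site does.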
And since I will want to go get these files, I'm also going to run my query with get_file_data. This will get me one row per file and, importantly, the drs_uri for each file, which I need to build a manifest and/or to import these files into a cloud workspace like the Cancer Genomics Cloud or Terra:
get_file_data( match_all = "project_name = *cptac*", match_any= [ 'anatomic_site = *kidney*', 'anatomic_site = *bladder*' ] )
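Once the file rows are in hand, the drs_uri values can be collected into a manifest. Here is a minimal sketch that assumes a plain text manifest with one URI per line; each platform defines its own manifest format, so treat this layout as an assumption. `write_manifest`, the `file_id` column, and the example URIs are hypothetical:

```python
import io


def write_manifest(rows, out):
    """Write each distinct drs_uri on its own line; return how many were written."""
    seen = set()
    for row in rows:
        uri = row.get("drs_uri")
        # Skip rows with no URI and URIs already written.
        if uri and uri not in seen:
            seen.add(uri)
            out.write(uri + "\n")
    return len(seen)


# Made-up rows standing in for the get_file_data result.
rows = [
    {"file_id": "f1", "drs_uri": "drs://example.org/f1"},
    {"file_id": "f2", "drs_uri": "drs://example.org/f2"},
    {"file_id": "f2", "drs_uri": "drs://example.org/f2"},
    {"file_id": "f3", "drs_uri": None},
]

buffer = io.StringIO()
written = write_manifest(rows, buffer)
print(written, "URIs written")
```

Writing to a real file works the same way: pass a handle opened with `open("manifest.txt", "w")` instead of the in-memory buffer.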