Find subjects using a file as search input
I have been working with data from IDC for a cohort of 100 individuals, and I'd like to see whether any other data is available about my subjects, and where. I want to submit a file that has all of my individuals' IDs so I can search them all at once. My file looks like this:
So, I want to match the subject column in mydatafile.tsv to the CDA column subject_id. And since I want to see where data exists about my subjects, I'm using add_columns with upstream_identifiers.*. The * means "anything", so this adds all of the upstream identifier columns and shows me where each of my subjects has data:
get_subject_data(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'}, add_columns = 'upstream_identifiers.*')
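Before submitting a file like this, it can help to check locally that the column named in input_column really exists and to see how many distinct IDs it holds. The following is a minimal sketch using only the Python standard library. The helper name read_id_column and the EXAMPLE-... IDs are made up for illustration, and nothing here is part of the CDA API.

```python
import csv
import os
import tempfile


def read_id_column(path, column, delimiter="\t"):
    """Return the distinct, non-empty values of `column`, in file order.

    Hypothetical helper for a local sanity check; not a CDA function.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not found; columns are {reader.fieldnames}")
        seen = {}
        for row in reader:
            # Short rows yield None for missing cells, so guard before stripping.
            value = (row[column] or "").strip()
            if value:
                seen.setdefault(value, None)
    return list(seen)


# Demo on a throwaway file with made-up IDs (not real subject IDs).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="") as handle:
        handle.write(
            "subject\tnote\n"
            "EXAMPLE-001\ta\n"
            "EXAMPLE-002\tb\n"
            "EXAMPLE-001\tduplicate\n"
            "\tblank id\n"
        )
    demo_ids = read_id_column(demo_path, "subject")

print(len(demo_ids), "distinct IDs:", demo_ids)
```

For the file in this walkthrough the call would be read_id_column('mydatafile.tsv', 'subject'). Duplicates and blank cells are dropped here so the count reflects distinct subjects.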
92 of my subjects have data in CDA, and it looks like they have data spread across GC, GDC, PDC, and IDC. Let's see a summary of what files and subjects there are:
summarize_files(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'})
╔══════════════════════════╗
║ number_of_matching_files ║
╠══════════════════════════╣
║ 10603                    ║
╚══════════════════════════╝

╔══════════════════════════════════════════════╗
║ number_of_subjects_related_to_matching_files ║
╠══════════════════════════════════════════════╣
║ 92                                           ║
╚══════════════════════════════════════════════╝

╔═══════╦═════════════╗
║ files ║ data_source ║
╠═══════╬═════════════╣
║ 7085  ║ GDC only    ║
║ 2696  ║ PDC only    ║
║ 772   ║ IDC only    ║
║ 50    ║ GC only     ║
╚═══════╩═════════════╝

╔══════════════╦══════════════════════════════╗
║ count_result ║ category                     ║
╠══════════════╬══════════════════════════════╣
║ 2630         ║ Simple Nucleotide Variation  ║
║ 1348         ║ Peptide Spectral Matches     ║
║ 1330         ║ Sequencing Reads             ║
║ 1033         ║ Copy Number Variation        ║
║ 674          ║ Processed Mass Spectra       ║
║ 674          ║ Raw Mass Spectra             ║
║ 448          ║ Structural Variation         ║
║ 398          ║ Biospecimen                  ║
║ 386          ║ Transcriptome Profiling      ║
║ 328          ║ Somatic Structural Variation ║
║ 317          ║ Segmentation                 ║
║ 285          ║ DNA Methylation              ║
║ 258          ║ Slide Microscopy             ║
║ 203          ║ <NA>                         ║
║ 106          ║ Magnetic Resonance           ║
║ 80           ║ Annotation                   ║
║ 66           ║ Proteome Profiling           ║
║ 16           ║ WXS                          ║
║ 11           ║ Structured Report Document   ║
║ 9            ║ RNA-Seq                      ║
║ 2            ║ WGS                          ║
║ 1            ║ miRNA-Seq                    ║
╚══════════════╩══════════════════════════════╝

╔══════════════╦═════════════╗
║ count_result ║ format      ║
╠══════════════╬═════════════╣
║ 1567         ║ TSV         ║
║ 1108         ║ VCF         ║
║ 1103         ║ TBI         ║
║ 842          ║ TXT         ║
║ 772          ║ DICOM       ║
║ 715          ║ BAM         ║
║ 674          ║ <NA>        ║
║ 674          ║ mzIdentML   ║
║ 674          ║ mzML        ║
║ 615          ║ BAI         ║
║ 523          ║ MAF         ║
║ 324          ║ BEDPE       ║
║ 224          ║ SVS         ║
║ 190          ║ IDAT        ║
║ 166          ║ CEL         ║
║ 164          ║ BCR XML     ║
║ 83           ║ PDF         ║
║ 82           ║ BCR SSF XML ║
║ 68           ║ TAR         ║
║ 19           ║ BCR Biotab  ║
║ 8            ║ GCT         ║
║ 7            ║ BCR OMF XML ║
║ 1            ║ CSV         ║
╚══════════════╩═════════════╝

╔══════════════╦════════════╗
║ count_result ║ access     ║
╠══════════════╬════════════╣
║ 5562         ║ open       ║
║ 3323         ║ controlled ║
║ 1718         ║ <NA>       ║
╚══════════════╩════════════╝

╔══════════════╦════════════════════════════════════════════╗
║ count_result ║ file_type                                  ║
╠══════════════╬════════════════════════════════════════════╣
║ 1348         ║ Open Standard                              ║
║ 1103         ║ Somatic Mutation Index                     ║
║ 817          ║ Annotated Somatic Mutation                 ║
║ 715          ║ Aligned Reads                              ║
║ 674          ║ Proprietary                                ║
║ 674          ║ Text                                       ║
║ 615          ║ Aligned Reads Index                        ║
║ 518          ║ Raw Simple Somatic Mutation                ║
║ 392          ║ Transcript Fusion                          ║
║ 317          ║ Segmentation Storage                       ║
║ 258          ║ VL Whole Slide Microscopy Image Storage    ║
║ 256          ║ Structural Rearrangement                   ║
║ 242          ║ Gene Level Copy Number                     ║
║ 231          ║ Copy Number Segment                        ║
║ 224          ║ Slide Image                                ║
║ 190          ║ Masked Intensities                         ║
║ 174          ║ Biospecimen Supplement                     ║
║ 166          ║ Raw Intensities                            ║
║ 166          ║ Simple Germline Variation                  ║
║ 163          ║ Allele-specific Copy Number Segment        ║
║ 163          ║ Masked Copy Number Segment                 ║
║ 105          ║ MR Image Storage                           ║
║ 98           ║ Clinical Supplement                        ║
║ 98           ║ Gene Expression Quantification             ║
║ 98           ║ Splice Junction Quantification             ║
║ 95           ║ Isoform Expression Quantification          ║
║ 95           ║ Methylation Beta Value                     ║
║ 95           ║ miRNA Expression Quantification            ║
║ 83           ║ Pathology Report                           ║
║ 80           ║ Microscopy Bulk Simple Annotations Storage ║
║ 77           ║ Aggregated Somatic Mutation                ║
║ 77           ║ Masked Somatic Mutation                    ║
║ 68           ║ Intermediate Analysis Archive              ║
║ 66           ║ Protein Expression Quantification          ║
║ 50           ║ <NA>                                       ║
║ 11           ║ Comprehensive SR Storage                   ║
║ 1            ║ Secondary Capture Image Storage            ║
╚══════════════╩════════════════════════════════════════════╝

╔══════════════╦═══════════════╗
║ count_result ║ anatomic_site ║
╠══════════════╬═══════════════╣
║ 9831         ║ <NA>          ║
║ 772          ║ breast        ║
╚══════════════╩═══════════════╝

╔══════════════╦═════════════════╗
║ count_result ║ tumor_vs_normal ║
╠══════════════╬═════════════════╣
║ 9189         ║ tumor           ║
║ 4325         ║ normal          ║
║ 404          ║ <NA>            ║
╚══════════════╩═════════════════╝

╔════════════════╦══════════════╗
║                ║ size         ║
╠════════════════╬══════════════╣
║ mean           ║ 4253289943   ║
║ min            ║ 72           ║
║ lower quartile ║ 65307        ║
║ median         ║ 3432095      ║
║ upper quartile ║ 82141684     ║
║ max            ║ 480722740993 ║
╚════════════════╩══════════════╝
summarize_subjects(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'})
╔═════════════════════════════╗
║ number_of_matching_subjects ║
╠═════════════════════════════╣
║ 92                          ║
╚═════════════════════════════╝

╔══════════════════════════════════════════════╗
║ number_of_files_related_to_matching_subjects ║
╠══════════════════════════════════════════════╣
║ 10603                                        ║
╚══════════════════════════════════════════════╝

╔══════════╦══════════════════════╗
║ subjects ║ data_source          ║
╠══════════╬══════════════════════╣
║ 72       ║ GDC + IDC            ║
║ 9        ║ PDC + GC + GDC + IDC ║
║ 5        ║ PDC + GDC + IDC      ║
║ 5        ║ GC + GDC + IDC       ║
║ 1        ║ PDC + IDC            ║
╚══════════╩══════════════════════╝

╔══════════════╦════════════════════╗
║ count_result ║ ethnicity          ║
╠══════════════╬════════════════════╣
║ 75           ║ Non-Hispanic       ║
║ 12           ║ <NA>               ║
║ 5            ║ Hispanic or Latino ║
╚══════════════╩════════════════════╝

╔══════════════╦════════════════╗
║ count_result ║ cause_of_death ║
╠══════════════╬════════════════╣
║ 92           ║ <NA>           ║
╚══════════════╩════════════════╝

╔══════════════╦═══════════════════════════╗
║ count_result ║ race                      ║
╠══════════════╬═══════════════════════════╣
║ 67           ║ White                     ║
║ 11           ║ Black or African American ║
║ 8            ║ <NA>                      ║
║ 6            ║ Asian                     ║
╚══════════════╩═══════════════════════════╝

╔══════════════╦═════════╗
║ count_result ║ species ║
╠══════════════╬═════════╣
║ 92           ║ human   ║
╚══════════════╩═════════╝

╔════════════════╦═══════════════╗
║                ║ year_of_death ║
╠════════════════╬═══════════════╣
║ mean           ║ 2009          ║
║ min            ║ 2008          ║
║ lower quartile ║ 2008          ║
║ median         ║ 2008          ║
║ upper quartile ║ 2009          ║
║ max            ║ 2009          ║
╚════════════════╩═══════════════╝

╔════════════════╦═══════════════╗
║                ║ year_of_birth ║
╠════════════════╬═══════════════╣
║ mean           ║ 1950          ║
║ min            ║ 1938          ║
║ lower quartile ║ 1943          ║
║ median         ║ 1950          ║
║ upper quartile ║ 1958          ║
║ max            ║ 1963          ║
╚════════════════╩═══════════════╝
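The summaries report 92 matching subjects out of a cohort of 100, so 8 IDs in the input file found no match. One way to list them is a plain set comparison between the IDs in the file and the IDs that came back. This is a minimal sketch with made-up stand-in data. The helper name split_matched is hypothetical. It also assumes the matched IDs can be pulled out as a list, for example from a subject_id column of the get_subject_data result, which this walkthrough does not confirm.

```python
def split_matched(input_ids, matched_ids):
    """Split input IDs into those found in matched_ids and those not found.

    Hypothetical helper; input order is preserved in both result lists.
    """
    matched = set(matched_ids)
    found = [i for i in input_ids if i in matched]
    missing = [i for i in input_ids if i not in matched]
    return found, missing


# Stand-in data: 100 made-up cohort IDs, of which the first 92 "matched".
# With real results, `returned` would hold the subject IDs CDA sent back.
cohort = [f"EXAMPLE-{n:03d}" for n in range(1, 101)]
returned = cohort[:92]

found, missing = split_matched(cohort, returned)
print(len(found), "matched;", len(missing), "unmatched")
print("unmatched IDs:", missing)
```

Unmatched IDs are worth a manual look, since a stray space or a different ID format in the file can cause a miss just as easily as a subject having no other data.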