Find subjects using a file as search input

I have been working with data from IDC for a cohort of 100 individuals, and I'd like to see whether any other data is available about my subjects, and where. I want to submit a file containing all of my individuals' IDs so I can search for them all at once. My file looks like this:

So, I want to match the subject column in mydatafile.tsv to the CDA subject_id column. Since I want to see where data exists about my subjects, I'm using add_columns with upstream_identifiers.*. The * means "anything", so this adds all of the upstream identifier columns and shows where each of my subjects has data:

In [2]:

get_subject_data(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'}, add_columns = 'upstream_identifiers.*')
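If the call above matches fewer subjects than expected, the input file is a reasonable first thing to check. The following is a minimal sketch using only the Python standard library, assuming the file is tab-delimited with a header row. The helper name read_id_column is hypothetical and not part of the CDA library, and the IDs in the demo file are placeholders rather than real subjects.

```python
import csv
import os
import tempfile


def read_id_column(path, column):
    """Return the unique, non-empty values of `column` from a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"{column!r} not in header: {reader.fieldnames}")
        seen = {}
        for row in reader:
            # Short rows yield None for missing fields, so fall back to "".
            value = (row[column] or "").strip()
            if value:
                seen.setdefault(value, None)  # dict keeps first-seen order
    return list(seen)


# Demo on a throwaway file; the IDs below are placeholders, not real subjects.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("subject\tnote\nSUBJ-001\ta\nSUBJ-002\tb\nSUBJ-001\tdup\n\t\n")
    demo_ids = read_id_column(demo_path, "subject")

print(demo_ids)
```

Running the same helper on mydatafile.tsv with column 'subject' would show how many distinct IDs are actually being submitted, which is the number to compare against the count of matching subjects.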


92 of my 100 subjects have data in CDA, and it looks like their data is spread across GC, GDC, PDC, and IDC. Let's see a summary of the files and subjects:

In [3]:

summarize_files(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'})

╔════════════════════════════╗
║ number_of_matching_files   ║
╠════════════════════════════╣
║ 10603                      ║
╚════════════════════════════╝
╔════════════════════════════════════════════════╗
║ number_of_subjects_related_to_matching_files   ║
╠════════════════════════════════════════════════╣
║ 92                                             ║
╚════════════════════════════════════════════════╝
╔═════════╦═══════════════╗
║   files ║   data_source ║
╠═════════╬═══════════════╣
║    7085 ║      GDC only ║
║    2696 ║      PDC only ║
║     772 ║      IDC only ║
║      50 ║       GC only ║
╚═════════╩═══════════════╝
╔════════════════╦══════════════════════════════╗
║   count_result ║                     category ║
╠════════════════╬══════════════════════════════╣
║           2630 ║  Simple Nucleotide Variation ║
║           1348 ║     Peptide Spectral Matches ║
║           1330 ║             Sequencing Reads ║
║           1033 ║        Copy Number Variation ║
║            674 ║       Processed Mass Spectra ║
║            674 ║             Raw Mass Spectra ║
║            448 ║         Structural Variation ║
║            398 ║                  Biospecimen ║
║            386 ║      Transcriptome Profiling ║
║            328 ║ Somatic Structural Variation ║
║            317 ║                 Segmentation ║
║            285 ║              DNA Methylation ║
║            258 ║             Slide Microscopy ║
║            203 ║                         <NA> ║
║            106 ║           Magnetic Resonance ║
║             80 ║                   Annotation ║
║             66 ║           Proteome Profiling ║
║             16 ║                          WXS ║
║             11 ║   Structured Report Document ║
║              9 ║                      RNA-Seq ║
║              2 ║                          WGS ║
║              1 ║                    miRNA-Seq ║
╚════════════════╩══════════════════════════════╝
╔════════════════╦═════════════╗
║   count_result ║      format ║
╠════════════════╬═════════════╣
║           1567 ║         TSV ║
║           1108 ║         VCF ║
║           1103 ║         TBI ║
║            842 ║         TXT ║
║            772 ║       DICOM ║
║            715 ║         BAM ║
║            674 ║        <NA> ║
║            674 ║   mzIdentML ║
║            674 ║        mzML ║
║            615 ║         BAI ║
║            523 ║         MAF ║
║            324 ║       BEDPE ║
║            224 ║         SVS ║
║            190 ║        IDAT ║
║            166 ║         CEL ║
║            164 ║     BCR XML ║
║             83 ║         PDF ║
║             82 ║ BCR SSF XML ║
║             68 ║         TAR ║
║             19 ║  BCR Biotab ║
║              8 ║         GCT ║
║              7 ║ BCR OMF XML ║
║              1 ║         CSV ║
╚════════════════╩═════════════╝
╔════════════════╦════════════╗
║   count_result ║     access ║
╠════════════════╬════════════╣
║           5562 ║       open ║
║           3323 ║ controlled ║
║           1718 ║       <NA> ║
╚════════════════╩════════════╝
╔════════════════╦════════════════════════════════════════════╗
║   count_result ║                                  file_type ║
╠════════════════╬════════════════════════════════════════════╣
║           1348 ║                              Open Standard ║
║           1103 ║                     Somatic Mutation Index ║
║            817 ║                 Annotated Somatic Mutation ║
║            715 ║                              Aligned Reads ║
║            674 ║                                Proprietary ║
║            674 ║                                       Text ║
║            615 ║                        Aligned Reads Index ║
║            518 ║                Raw Simple Somatic Mutation ║
║            392 ║                          Transcript Fusion ║
║            317 ║                       Segmentation Storage ║
║            258 ║    VL Whole Slide Microscopy Image Storage ║
║            256 ║                   Structural Rearrangement ║
║            242 ║                     Gene Level Copy Number ║
║            231 ║                        Copy Number Segment ║
║            224 ║                                Slide Image ║
║            190 ║                         Masked Intensities ║
║            174 ║                     Biospecimen Supplement ║
║            166 ║                            Raw Intensities ║
║            166 ║                  Simple Germline Variation ║
║            163 ║        Allele-specific Copy Number Segment ║
║            163 ║                 Masked Copy Number Segment ║
║            105 ║                           MR Image Storage ║
║             98 ║                        Clinical Supplement ║
║             98 ║             Gene Expression Quantification ║
║             98 ║             Splice Junction Quantification ║
║             95 ║          Isoform Expression Quantification ║
║             95 ║                     Methylation Beta Value ║
║             95 ║            miRNA Expression Quantification ║
║             83 ║                           Pathology Report ║
║             80 ║ Microscopy Bulk Simple Annotations Storage ║
║             77 ║                Aggregated Somatic Mutation ║
║             77 ║                    Masked Somatic Mutation ║
║             68 ║              Intermediate Analysis Archive ║
║             66 ║          Protein Expression Quantification ║
║             50 ║                                       <NA> ║
║             11 ║                   Comprehensive SR Storage ║
║              1 ║            Secondary Capture Image Storage ║
╚════════════════╩════════════════════════════════════════════╝
╔════════════════╦═════════════════╗
║   count_result ║   anatomic_site ║
╠════════════════╬═════════════════╣
║           9831 ║            <NA> ║
║            772 ║          breast ║
╚════════════════╩═════════════════╝
╔════════════════╦═══════════════════╗
║   count_result ║   tumor_vs_normal ║
╠════════════════╬═══════════════════╣
║           9189 ║             tumor ║
║           4325 ║            normal ║
║            404 ║              <NA> ║
╚════════════════╩═══════════════════╝
╔════════════════╦══════════════╗
║                ║ size         ║
╠════════════════╬══════════════╣
║           mean ║ 4253289943   ║
║            min ║ 72           ║
║ lower quartile ║ 65307        ║
║         median ║ 3432095      ║
║ upper quartile ║ 82141684     ║
║            max ║ 480722740993 ║
╚════════════════╩══════════════╝
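The size table is easier to read in human-scale units, and the mean multiplied by the file count gives a rough idea of the total volume. The sketch below assumes the size values are in bytes and that the reported mean covers all 10603 files; neither is stated in the output, so treat the total as an estimate. The helper human_bytes is hypothetical and not part of the CDA library.

```python
def human_bytes(n):
    """Format a byte count using decimal (SI) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(n)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Values copied from the summarize_files output above.
files_by_source = {"GDC only": 7085, "PDC only": 2696, "IDC only": 772, "GC only": 50}
total_files = 10603
mean_size = 4253289943
median_size = 3432095
max_size = 480722740993

# The per-source counts add up to the total number of matching files.
assert sum(files_by_source.values()) == total_files

# Assumption: sizes are bytes and the (rounded) mean applies to every file.
approx_total = mean_size * total_files

print(human_bytes(median_size))   # 3.4 MB
print(human_bytes(max_size))      # 480.7 GB
print(human_bytes(approx_total))  # 45.1 TB
```

The gap between the median (a few megabytes) and the mean (a few gigabytes) indicates that a small number of very large files account for most of the volume.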

In [4]:

summarize_subjects(match_from_file = {'input_column': 'subject', 'input_file': 'mydatafile.tsv', 'cda_column_to_match':'subject_id'})

╔═══════════════════════════════╗
║ number_of_matching_subjects   ║
╠═══════════════════════════════╣
║ 92                            ║
╚═══════════════════════════════╝
╔════════════════════════════════════════════════╗
║ number_of_files_related_to_matching_subjects   ║
╠════════════════════════════════════════════════╣
║ 10603                                          ║
╚════════════════════════════════════════════════╝
╔════════════╦══════════════════════╗
║   subjects ║          data_source ║
╠════════════╬══════════════════════╣
║         72 ║            GDC + IDC ║
║          9 ║ PDC + GC + GDC + IDC ║
║          5 ║      PDC + GDC + IDC ║
║          5 ║       GC + GDC + IDC ║
║          1 ║            PDC + IDC ║
╚════════════╩══════════════════════╝
╔════════════════╦════════════════════╗
║   count_result ║          ethnicity ║
╠════════════════╬════════════════════╣
║             75 ║       Non-Hispanic ║
║             12 ║               <NA> ║
║              5 ║ Hispanic or Latino ║
╚════════════════╩════════════════════╝
╔════════════════╦══════════════════╗
║   count_result ║   cause_of_death ║
╠════════════════╬══════════════════╣
║             92 ║             <NA> ║
╚════════════════╩══════════════════╝
╔════════════════╦═══════════════════════════╗
║   count_result ║                      race ║
╠════════════════╬═══════════════════════════╣
║             67 ║                     White ║
║             11 ║ Black or African American ║
║              8 ║                      <NA> ║
║              6 ║                     Asian ║
╚════════════════╩═══════════════════════════╝
╔════════════════╦═══════════╗
║   count_result ║   species ║
╠════════════════╬═══════════╣
║             92 ║     human ║
╚════════════════╩═══════════╝
╔════════════════╦═════════════════╗
║                ║ year_of_death   ║
╠════════════════╬═════════════════╣
║           mean ║ 2009            ║
║            min ║ 2008            ║
║ lower quartile ║ 2008            ║
║         median ║ 2008            ║
║ upper quartile ║ 2009            ║
║            max ║ 2009            ║
╚════════════════╩═════════════════╝
╔════════════════╦═════════════════╗
║                ║ year_of_birth   ║
╠════════════════╬═════════════════╣
║           mean ║ 1950            ║
║            min ║ 1938            ║
║ lower quartile ║ 1943            ║
║         median ║ 1950            ║
║ upper quartile ║ 1958            ║
║            max ║ 1963            ║
╚════════════════╩═════════════════╝
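Both summaries report 92 matching subjects, so if the file holds 100 distinct IDs, 8 of them found no match in CDA. A simple way to list them is to compare the IDs in the file against the subject_id values returned by the search. The sketch below shows only the comparison step with placeholder IDs; unmatched_ids is a hypothetical helper, and how the matched IDs are pulled out of the get_subject_data result depends on its return type, which is not shown here.

```python
def unmatched_ids(input_ids, matched_ids):
    """Return the input IDs that are absent from matched_ids, in input order."""
    matched = set(matched_ids)
    return [subject for subject in input_ids if subject not in matched]


# Placeholder IDs standing in for the file's subject column and for the
# subject_id values returned by the search.
input_ids = ["SUBJ-001", "SUBJ-002", "SUBJ-003", "SUBJ-004"]
matched_ids = ["SUBJ-001", "SUBJ-003"]

missing = unmatched_ids(input_ids, matched_ids)
print(missing)  # ['SUBJ-002', 'SUBJ-004']
```

IDs that come back unmatched are worth checking for stray whitespace or formatting differences before concluding that CDA has no data for them.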
