Summarize the subject metadata available for bam files¶
I'm a developer, and I have written a new mutation calling pipeline. I've tested it on my own small dataset, but now I'm looking for a larger set of bam files that I can run through it.
First, decide what column to search. I'm looking for columns that are part of the file table:
columns(table="file")
Loading ITables v2.7.3 from the init_notebook_mode cell...
(need help?)
|
file_format is what I'm looking for. Now I want to see whether bam is a valid value in it:
column_values("format")
Loading ITables v2.7.3 from the init_notebook_mode cell...
(need help?)
|
The value in the database is BAM, so I'm going to use that. According to the column values, there's around 200 thousand files, but I'm interested in comparing variants for individuals, so I want a count of how many subjects have bam files:
get_subject_data( match_all=["format = bam"])
Loading ITables v2.7.3 from the init_notebook_mode cell...
(need help?)
|
I can also get a summary of the subject (or any other table) information for these files so I can decide what to filter next:
summarize_subjects( match_all=["format = bam"])
╔═══════════════════════════════╗ ║ number_of_matching_subjects ║ ╠═══════════════════════════════╣ ║ 38111 ║ ╚═══════════════════════════════╝ ╔════════════════════════════════════════════════╗ ║ number_of_files_related_to_matching_subjects ║ ╠════════════════════════════════════════════════╣ ║ 227615 ║ ╚════════════════════════════════════════════════╝ ╔════════════╦══════════════════════╗ ║ subjects ║ data_source ║ ╠════════════╬══════════════════════╣ ║ 14560 ║ GDC only ║ ║ 8957 ║ GDC + IDC ║ ║ 6889 ║ GC only ║ ║ 4633 ║ GC + GDC + IDC ║ ║ 1117 ║ GC + PDC + GDC + IDC ║ ║ 753 ║ PDC + GDC + IDC ║ ║ 435 ║ ICDC only ║ ║ 429 ║ PDC + GDC ║ ║ 168 ║ GC + IDC ║ ║ 80 ║ ICDC + IDC ║ ║ 45 ║ GC + GDC ║ ║ 45 ║ GC + PDC + GDC ║ ╚════════════╩══════════════════════╝ ╔════════════════╦═══════════╗ ║ count_result ║ species ║ ╠════════════════╬═══════════╣ ║ 33285 ║ human ║ ║ 4311 ║ <NA> ║ ║ 515 ║ dog ║ ╚════════════════╩═══════════╝ ╔════════════════╦══════════════════════════╗ ║ count_result ║ cause_of_death ║ ╠════════════════╬══════════════════════════╣ ║ 36613 ║ <NA> ║ ║ 1240 ║ Cancer-Related Death ║ ║ 198 ║ Non-Cancer Related Death ║ ║ 24 ║ Infection ║ ║ 15 ║ Cardiovascular Disorder ║ ║ 14 ║ Surgical Complication ║ ║ 7 ║ Toxicity ║ ╚════════════════╩══════════════════════════╝ ╔════════════════╦════════════════════╗ ║ count_result ║ ethnicity ║ ╠════════════════╬════════════════════╣ ║ 22928 ║ Non-Hispanic ║ ║ 12097 ║ <NA> ║ ║ 3086 ║ Hispanic or Latino ║ ╚════════════════╩════════════════════╝ ╔════════════════╦═══════════════════════════════════════════╗ ║ count_result ║ race ║ ╠════════════════╬═══════════════════════════════════════════╣ ║ 24085 ║ White ║ ║ 9082 ║ <NA> ║ ║ 3141 ║ Black or African American ║ ║ 1602 ║ Asian ║ ║ 109 ║ American Indian or Alaska Native ║ ║ 91 ║ Native Hawaiian or Other Pacific Islander ║ ║ 1 ║ More than one race ║ ╚════════════════╩═══════════════════════════════════════════╝ ╔════════════════╦══════════╗ ║ count_result ║ format ║ ╠════════════════╬══════════╣ ║ 227615 ║ BAM ║ ╚════════════════╩══════════╝ ╔════════════════╦═════════════════╗ ║ ║ year_of_death ║ ╠════════════════╬═════════════════╣ ║ mean ║ 2015 ║ ║ min ║ 1992 ║ ║ lower quartile ║ 2013 ║ ║ median ║ 2018 ║ ║ upper quartile ║ 2019 ║ ║ max ║ 2023 ║ ╚════════════════╩═════════════════╝ ╔════════════════╦═════════════════╗ ║ ║ year_of_birth ║ ╠════════════════╬═════════════════╣ ║ mean ║ 1973 ║ ║ min ║ 1908 ║ ║ lower quartile ║ 1950 ║ ║ median ║ 1965 ║ ║ upper quartile ║ 2003 ║ ║ max ║ 2022 ║ ╚════════════════╩═════════════════╝
For instance, I might look for all of the subjects who have both a tumor sample and a normal control associated with them, because I'm interested in de novo cancer mutations. This means that I want to search the same column twice, once for subjects who have normal tissue and once for subjects that have tumor tissue, and only keep the intersection of those results:
normal = get_subject_data( match_all=["tumor_vs_normal = normal", "format = bam"])
tumor = get_subject_data( match_all=["tumor_vs_normal = tumor", "format = bam"])
both = intersect_subject_results(normal, tumor)
both
Loading ITables v2.7.3 from the init_notebook_mode cell...
(need help?)
|
It looks like there are about 22 thousand bams that contain comparisons between normal and tumor tissue. Now I could save that output to a file to browse through
both.to_csv('bams_tumor_normal.csv')
Or I could keep filtering by other fields. For instance, I might want just one anatomicial location:
normal_lung = get_subject_data( match_all=["tumor_vs_normal = normal", "format = bam", "anatomic_site = lung"])
tumor_lung = get_subject_data( match_all=["tumor_vs_normal = tumor", "format = bam", "anatomic_site = lung"])
both_lung = intersect_subject_results(normal_lung, tumor_lung)
both_lung
Loading ITables v2.7.3 from the init_notebook_mode cell...
(need help?)
|