Summarize the subject metadata available for bam files
I'm a developer, and I have written a new mutation calling pipeline. I've tested it on my own small dataset, but now I'm looking for a larger set of bam files that I can run through it.
First, decide what column to search. I'm looking for columns that are part of the file table:
columns(table="file")
table | column | data_type | nullable | description |
---|---|---|---|---|
file_format is what I'm looking for. Now I want to see whether bam is a valid value in it:
column_values("file_format")
file_format | count |
---|---|
The value in the database is BAM, so I'm going to use that. According to the column values, there are around 200 thousand files, but I'm interested in comparing bams for individuals, so I want a count of how many subjects have bam files:
fetch_rows(table="subject", match_all=["file_format = bam"], count_only=True)
{'distinct_subject_rows': 35063, 'total_result_rows': 35063}
I can also get a summary of the subject (or any other table) information for these files so I can decide what to filter next:
summary_counts(table="subject", match_all=["file_format = bam"])
╔═════════════════════════╗
║   total_subject_matches ║
╠═════════════════════════╣
║                   35063 ║
╚═════════════════════════╝
╔═══════════════════════╗
║   total_related_files ║
╠═══════════════════════╣
║               3443178 ║
╚═══════════════════════╝
╔═════════╦═══════════════════════╗
║   count ║ subject_data_source   ║
╠═════════╬═══════════════════════╣
║   25270 ║ GDC                   ║
║   13812 ║ IDC                   ║
║   11746 ║ CDS                   ║
║    1996 ║ PDC                   ║
║     486 ║ ICDC                  ║
╚═════════╩═══════════════════════╝
╔═════════╦════════╗
║   count ║ sex    ║
╠═════════╬════════╣
║   17297 ║ female ║
║   16760 ║ male   ║
║    1006 ║ <NA>   ║
╚═════════╩════════╝
╔═════════╦═══════════════════════════════════════════╗
║   count ║ race                                      ║
╠═════════╬═══════════════════════════════════════════╣
║   22733 ║ White                                     ║
║    7630 ║ <NA>                                      ║
║    2971 ║ Black or African American                 ║
║    1521 ║ Asian                                     ║
║     103 ║ American Indian or Alaska Native          ║
║      71 ║ Native Hawaiian or Other Pacific Islander ║
║      34 ║ More than one race                        ║
╚═════════╩═══════════════════════════════════════════╝
╔═════════╦════════════════════╗
║   count ║ ethnicity          ║
╠═════════╬════════════════════╣
║   22242 ║ Non-Hispanic       ║
║    9979 ║ <NA>               ║
║    2842 ║ Hispanic or Latino ║
╚═════════╩════════════════════╝
╔═════════╦══════════════════════════╗
║   count ║ cause_of_death           ║
╠═════════╬══════════════════════════╣
║   33808 ║ <NA>                     ║
║    1013 ║ Cancer-Related Death     ║
║     191 ║ Non-Cancer Related Death ║
║      19 ║ Infection                ║
║      14 ║ Cardiovascular Disorder  ║
║      11 ║ Surgical Complication    ║
║       7 ║ Toxicity                 ║
╚═════════╩══════════════════════════╝
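Before choosing the next filter, it can help to see how complete each field is. The following is a minimal sketch that turns the counts above into percentages; the numbers are copied by hand from the output, and the variable names are my own for illustration, not part of the library.

```python
# Counts copied by hand from the summary_counts output above.
total_subjects = 35063

data_source_counts = {"GDC": 25270, "IDC": 13812, "CDS": 11746, "PDC": 1996, "ICDC": 486}
missing_counts = {"sex": 1006, "race": 7630, "ethnicity": 9979, "cause_of_death": 33808}

# Percentage of matched subjects with no recorded value (<NA>) in each field.
missing_pct = {
    field: round(100 * n / total_subjects, 1) for field, n in missing_counts.items()
}

# The per-source counts add up to more than the subject total, so at least
# some subjects must be listed under more than one data source.
source_sum = sum(data_source_counts.values())
extra_listings = source_sum - total_subjects

for field, pct in missing_pct.items():
    print(f"{field}: {pct}% missing")
print(f"per-source sum {source_sum} exceeds the subject total by {extra_listings}")
```

A field such as cause_of_death, which is missing for most subjects, is probably a poor choice for the next filter, while sex is recorded for nearly everyone.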
For instance, I might look for all of the subjects who have both a tumor sample and a normal control associated with them, because my pipeline requires both:
summary_counts(table="subject", match_all=["file_format = bam", "source_material_type = normal*", "source_material_type = tumor*"])
╔═════════════════════════╗
║   total_subject_matches ║
╠═════════════════════════╣
║                   24981 ║
╚═════════════════════════╝
╔═══════════════════════╗
║   total_related_files ║
╠═══════════════════════╣
║               3296499 ║
╚═══════════════════════╝
╔═════════╦═══════════════════════╗
║   count ║ subject_data_source   ║
╠═════════╬═══════════════════════╣
║   21305 ║ GDC                   ║
║   13629 ║ IDC                   ║
║    5981 ║ CDS                   ║
║    1975 ║ PDC                   ║
║     134 ║ ICDC                  ║
╚═════════╩═══════════════════════╝
╔═════════╦════════╗
║   count ║ sex    ║
╠═════════╬════════╣
║   12506 ║ male   ║
║   12422 ║ female ║
║      53 ║ <NA>   ║
╚═════════╩════════╝
╔═════════╦═══════════════════════════════════════════╗
║   count ║ race                                      ║
╠═════════╬═══════════════════════════════════════════╣
║   16897 ║ White                                     ║
║    4263 ║ <NA>                                      ║
║    2336 ║ Black or African American                 ║
║    1314 ║ Asian                                     ║
║      79 ║ American Indian or Alaska Native          ║
║      60 ║ Native Hawaiian or Other Pacific Islander ║
║      32 ║ More than one race                        ║
╚═════════╩═══════════════════════════════════════════╝
╔═════════╦════════════════════╗
║   count ║ ethnicity          ║
╠═════════╬════════════════════╣
║   16768 ║ Non-Hispanic       ║
║    6237 ║ <NA>               ║
║    1976 ║ Hispanic or Latino ║
╚═════════╩════════════════════╝
╔═════════╦══════════════════════════╗
║   count ║ cause_of_death           ║
╠═════════╬══════════════════════════╣
║   23846 ║ <NA>                     ║
║     912 ║ Cancer-Related Death     ║
║     179 ║ Non-Cancer Related Death ║
║      16 ║ Infection                ║
║      14 ║ Cardiovascular Disorder  ║
║      11 ║ Surgical Complication    ║
║       3 ║ Toxicity                 ║
╚═════════╩══════════════════════════╝
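To see what the tumor/normal requirement costs me, I can compare this summary with the BAM-only one. This is a rough sketch using counts copied by hand from the two outputs above; the dictionary names are hypothetical and not part of the library. Because the per-source counts in the first summary sum to more than the subject total, I read the per-source figures as listings per data source rather than unique subjects.

```python
# Counts copied by hand from the two summary_counts outputs above.
bam_only = {"total": 35063, "GDC": 25270, "IDC": 13812, "CDS": 11746, "PDC": 1996, "ICDC": 486}
tumor_normal = {"total": 24981, "GDC": 21305, "IDC": 13629, "CDS": 5981, "PDC": 1975, "ICDC": 134}

# Share of each BAM-only count that survives the tumor + normal filter.
retained_pct = {
    key: round(100 * tumor_normal[key] / bam_only[key], 1) for key in bam_only
}

# Print from the lowest retention to the highest.
for key, pct in sorted(retained_pct.items(), key=lambda item: item[1]):
    print(f"{key:>5}: {pct}% retained")
```

Sorting by retention puts the sources that lose the most subjects first, which tells me where paired tumor and normal data is scarcest.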