Summarize the subject metadata available for bam files
I'm a developer, and I have written a new mutation calling pipeline. I've tested it on my own small dataset, but now I'm looking for a larger set of bam files that I can run through it.
First, decide what column to search. I'm looking for columns that are part of the file table:
columns(table="file")
table | column | data_type | nullable | description |
---|---|---|---|---|
file_format is what I'm looking for. Now I want to see whether bam is a valid value in it:
column_values("file_format")
file_format | count |
---|---|
The value in the database is BAM, so I'm going to use that. According to the column values, there are around 150 thousand files, but I'm interested in comparing bams for individuals, so I want a count of how many subjects have bam files:
fetch_rows(table="subject", match_all=["file_format = bam"], count_only=True)
{'distinct_subject_rows': 30565, 'total_result_rows': 30565}
I can also get a summary of the subject (or any other table) information for these files so I can decide what to filter next:
summary_counts(table="subject", match_all=["file_format = bam"])
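index | total_subject_matches |
---|---|
0 | 30565 |

index | total_related_files |
---|---|
0 | 3055727 |

index | system | count |
---|---|---|
0 | GDC | 24183 |
1 | IDC | 12329 |
2 | CDS | 8796 |
3 | PDC | 1893 |

index | sex | count |
---|---|---|
0 | male | 11859 |
1 | female | 11765 |
2 | Female | 3220 |
3 | Male | 3100 |
4 | | 519 |
5 | F | 53 |
6 | M | 48 |
7 | unspecified | 1 |

index | race | count |
---|---|---|
0 | White | 20167 |
1 | | 6410 |
2 | Black or African American | 2512 |
3 | Asian | 1339 |
4 | American Indian or Alaska Native | 85 |
5 | Native Hawaiian or Other Pacific Islander | 52 |

index | ethnicity | count |
---|---|---|
0 | Not Hispanic or Latino | 19691 |
1 | | 10731 |
2 | White | 143 |

index | cause_of_death | count |
---|---|---|
0 | | 29712 |
1 | Cancer Related | 672 |
2 | Not Cancer Related | 142 |
3 | Infection | 19 |
4 | Cardiovascular Disorder | 12 |
5 | Surgical Complication | 8 |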
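The sex counts in this summary are split across several spellings (male, Male, M and so on), which makes them hard to read at a glance. Here is a minimal sketch of how I might merge them after the fact. The counts are copied from the output above; the CANONICAL mapping and the merge_sex_counts helper are my own hypothetical names for illustration, not something the database or its client library provides.

```python
from collections import Counter

# Sex counts copied from the summary output above.
# The empty string stands for the row whose label is blank.
sex_counts = {
    "male": 11859,
    "female": 11765,
    "Female": 3220,
    "Male": 3100,
    "": 519,
    "F": 53,
    "M": 48,
    "unspecified": 1,
}

# Hypothetical mapping (an assumption for this sketch): lower-cased raw
# label -> canonical label. Anything not listed is grouped as "unknown".
CANONICAL = {"m": "male", "male": "male", "f": "female", "female": "female"}


def merge_sex_counts(counts):
    """Sum counts whose labels differ only by case or abbreviation."""
    merged = Counter()
    for label, n in counts.items():
        key = CANONICAL.get(label.strip().lower(), "unknown")
        merged[key] += n
    return dict(merged)


merged = merge_sex_counts(sex_counts)
print(merged)
```

A quick sanity check is that the merged counts still add up to total_subject_matches (30565), which they do here.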
For instance, I might look for all of the subjects who have both a tumor sample and a normal control associated with them, because my pipeline requires both:
summary_counts(table="subject", match_all=["file_format = bam", "source_material_type = normal*", "source_material_type = tumor*"])
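index | total_subject_matches |
---|---|
0 | 23302 |

index | total_related_files |
---|---|
0 | 2966452 |

index | system | count |
---|---|---|
0 | GDC | 20336 |
1 | IDC | 12261 |
2 | CDS | 5380 |
3 | PDC | 1857 |

index | sex | count |
---|---|---|
0 | female | 10178 |
1 | male | 10052 |
2 | Male | 1606 |
3 | Female | 1360 |
4 | | 56 |
5 | F | 31 |
6 | M | 19 |

index | race | count |
---|---|---|
0 | White | 15645 |
1 | | 4135 |
2 | Black or African American | 2175 |
3 | Asian | 1235 |
4 | American Indian or Alaska Native | 67 |
5 | Native Hawaiian or Other Pacific Islander | 45 |

index | ethnicity | count |
---|---|---|
0 | Not Hispanic or Latino | 15559 |
1 | | 7640 |
2 | White | 103 |

index | cause_of_death | count |
---|---|---|
0 | | 22565 |
1 | Cancer Related | 571 |
2 | Not Cancer Related | 130 |
3 | Infection | 16 |
4 | Cardiovascular Disorder | 12 |
5 | Surgical Complication | 8 |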
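To judge how much the tumor-plus-normal requirement costs me, I can compare the totals from the two summaries. The sketch below hard-codes the numbers from the two outputs in this walkthrough; retained_fraction is a hypothetical helper name of my own, not part of any library.

```python
# Totals copied from the two summary outputs in this walkthrough.
bam_only = {"total_subject_matches": 30565, "total_related_files": 3055727}
tumor_and_normal = {"total_subject_matches": 23302, "total_related_files": 2966452}


def retained_fraction(before, after):
    """Return, for each total, the fraction that survives the extra filters."""
    return {key: after[key] / before[key] for key in before}


fractions = retained_fraction(bam_only, tumor_and_normal)
for key, frac in fractions.items():
    print(f"{key}: {frac:.1%} retained")
```

Roughly three quarters of the subjects survive the stricter filter, so it still leaves a large cohort to run the pipeline on.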