Summarize the subject metadata available for bam files

I'm a developer, and I have written a new mutation calling pipeline. I've tested it on my own small dataset, but now I'm looking for a larger set of bam files that I can run through it.

First, decide what column to search. I'm looking for columns that are part of the file table:

In [3]:

columns(table="file")

Out[3]:

table	column	data_type	nullable	description

file_format is what I'm looking for. Now I want to see whether bam is a valid value in it:

In [4]:

column_values("file_format")

Out[4]:

file_format	count

The value in the database is BAM, so I'm going to use that. According to the column values, there are around 150 thousand files, but I'm interested in comparing bams for individuals, so I want a count of how many subjects have bam files:

In [5]:

fetch_rows(table="subject", match_all=["file_format = bam"], count_only=True)

Out[5]:

{'distinct_subject_rows': 30565, 'total_result_rows': 30565}

I can also get a summary of the subject (or any other table) information for these files so I can decide what to filter next:

In [6]:

summary_counts(table="subject", match_all=["file_format = bam"])

Out[6]:

[   total_subject_matches
 0                  30565,
    total_related_files
 0              3055727,
   system  count
 0    GDC  24183
 1    IDC  12329
 2    CDS   8796
 3    PDC   1893,
            sex  count
 0         male  11859
 1       female  11765
 2       Female   3220
 3         Male   3100
 4                 519
 5            F     53
 6            M     48
 7  unspecified      1,
                                         race  count
 0                                      White  20167
 1                                              6410
 2                  Black or African American   2512
 3                                      Asian   1339
 4           American Indian or Alaska Native     85
 5  Native Hawaiian or Other Pacific Islander     52,
                 ethnicity  count
 0  Not Hispanic or Latino  19691
 1                          10731
 2                   White    143,
             cause_of_death  count
 0                           29712
 1           Cancer Related    672
 2       Not Cancer Related    142
 3                Infection     19
 4  Cardiovascular Disorder     12
 5    Surgical Complication      8]
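
One thing I notice in this summary is that the sex column mixes several spellings of the same value (male, Male, M, and so on). Here is a minimal sketch of how I might collapse them before comparing groups. The counts are copied by hand from Out[6]; the CANONICAL mapping and the normalize_sex helper are my own assumptions for illustration, not part of the library, and treating blank and "unspecified" as "unknown" is a judgment call.

```python
from collections import Counter

# Sex counts copied from Out[6]; the empty string is the blank label.
sex_counts = {
    "male": 11859,
    "female": 11765,
    "Female": 3220,
    "Male": 3100,
    "": 519,
    "F": 53,
    "M": 48,
    "unspecified": 1,
}

# Hypothetical mapping from a lowercased label to a canonical one (my assumption).
CANONICAL = {"male": "male", "m": "male", "female": "female", "f": "female"}


def normalize_sex(label):
    """Return 'male', 'female', or 'unknown' for a raw sex label."""
    return CANONICAL.get(label.strip().lower(), "unknown")


# Add up the counts of every raw label that maps to the same canonical label.
normalized = Counter()
for label, count in sex_counts.items():
    normalized[normalize_sex(label)] += count

print(dict(normalized))
```

Collapsing the labels like this keeps the total at 30565 subjects and makes it easier to see which columns are actually worth filtering on.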

For instance, I might look for all of the subjects who have both a tumor sample and a normal control associated with them, because my pipeline requires both:

In [7]:

summary_counts(table="subject", match_all=["file_format = bam", "source_material_type = normal*", "source_material_type = tumor*"])

Out[7]:

[   total_subject_matches
 0                  23302,
    total_related_files
 0              2966452,
   system  count
 0    GDC  20336
 1    IDC  12261
 2    CDS   5380
 3    PDC   1857,
       sex  count
 0  female  10178
 1    male  10052
 2    Male   1606
 3  Female   1360
 4             56
 5       F     31
 6       M     19,
                                         race  count
 0                                      White  15645
 1                                              4135
 2                  Black or African American   2175
 3                                      Asian   1235
 4           American Indian or Alaska Native     67
 5  Native Hawaiian or Other Pacific Islander     45,
                 ethnicity  count
 0  Not Hispanic or Latino  15559
 1                           7640
 2                   White    103,
             cause_of_death  count
 0                           22565
 1           Cancer Related    571
 2       Not Cancer Related    130
 3                Infection     16
 4  Cardiovascular Disorder     12
 5    Surgical Complication      8]
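
To see how much of the original set survives the tumor/normal filter, I can compare the two summaries. This is a small sketch using numbers copied by hand from Out[6] and Out[7]; the retention helper is a name I made up for illustration. Note that the per-system counts add up to more than the subject total, presumably because one subject can appear in more than one system, so I only compare each system with itself.

```python
# Counts copied from Out[6] (all subjects with bam files) and
# Out[7] (subjects that also match both normal* and tumor*).
total_before, total_after = 30565, 23302
system_before = {"GDC": 24183, "IDC": 12329, "CDS": 8796, "PDC": 1893}
system_after = {"GDC": 20336, "IDC": 12261, "CDS": 5380, "PDC": 1857}


def retention(before, after):
    """Fraction of the 'before' count that is still present in 'after'."""
    return after / before


overall = retention(total_before, total_after)
by_system = {
    name: retention(system_before[name], system_after[name])
    for name in system_before
}

print(f"overall: {overall:.1%}")
for name, frac in by_system.items():
    print(f"{name}: {frac:.1%}")
```

Roughly three quarters of the subjects with bam files remain after the filter, and CDS loses the largest share, so most of the paired tumor/normal data I can use comes from the other systems.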