Find subjects with data at multiple data centers
I work at GDC, and I want to see how many of our subjects have data at the other data centers so we can set up collaborations to better host shared data.
All I really care about to start is what is shared and with whom, so I'm only going to filter by data source:
In [3]:
summary_counts(table='subject', data_source = ["GDC", "PDC"])
Out[3]:
[   total_subject_matches
 0                   1894,
    total_related_files
 0               630454,
   system  count
 0    GDC   1894
 1    PDC   1894
 2    IDC   1687
 3    CDS   1124,
       sex  count
 0  female   1158
 1    male    724
 2       F     12,
                                race  count
 0                             White   1246
 1                                      332
 2                             Asian    239
 3         Black or African American     72
 4  American Indian or Alaska Native      5,
                 ethnicity  count
 0                           1011
 1  Not Hispanic or Latino    781
 2                   White    102,
             cause_of_death  count
 0                            1606
 1           Cancer Related    221
 2       Not Cancer Related     40
 3  Cardiovascular Disorder     10
 4                Infection     10
 5    Surgical Complication      7]
In [4]:
summary_counts(table='subject', data_source = ["GDC", "IDC"])
Out[4]:
[   total_subject_matches
 0                  12500,
    total_related_files
 0              2534647,
   system  count
 0    GDC  12500
 1    IDC  12500
 2    CDS   2369
 3    PDC   1687,
       sex  count
 0  female   6507
 1    male   5892
 2             51
 3       F     31
 4       M     19,
                                         race  count
 0                                      White   9194
 1                                              1311
 2                  Black or African American   1014
 3                                      Asian    938
 4           American Indian or Alaska Native     30
 5  Native Hawaiian or Other Pacific Islander     13,
                 ethnicity  count
 0  Not Hispanic or Latino   8793
 1                           3604
 2                   White    103,
             cause_of_death  count
 0                           12208
 1           Cancer Related    231
 2       Not Cancer Related     34
 3  Cardiovascular Disorder     11
 4                Infection      9
 5    Surgical Complication      7]
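To compare the two results side by side, it can help to turn the `system` counts into percentages. The sketch below is plain Python and does not call the API: the numbers are copied by hand from the two outputs above, and `shared_with_gdc` and `overlap_shares` are hypothetical names I made up for illustration.

```python
# Counts copied by hand from the two summary_counts outputs above.
# "total" is total_subject_matches; "also_in" holds the remaining
# rows of the system table (the systems we did not filter on).
shared_with_gdc = {
    "PDC": {"total": 1894, "also_in": {"IDC": 1687, "CDS": 1124}},
    "IDC": {"total": 12500, "also_in": {"CDS": 2369, "PDC": 1687}},
}


def overlap_shares(total, also_in):
    """Return the percentage of `total` subjects that also appear in each other system."""
    return {system: round(100 * count / total, 1) for system, count in also_in.items()}


for partner, counts in shared_with_gdc.items():
    shares = overlap_shares(counts["total"], counts["also_in"])
    print(f"GDC + {partner}: {counts['total']} shared subjects")
    for system, pct in shares.items():
        print(f"  {pct}% also in {system}")
```

Read this way, about 89.1% of the subjects shared between GDC and PDC also have data at IDC, while only about 13.5% of the subjects shared between GDC and IDC also have data at PDC. Both queries report the same 1687 subjects present at all three centers.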