Find subjects with data at multiple data centers
I work at GDC, and I want to see how many of our subjects also have data at other data centers so we can set up collaborations to better host shared data.
To start, all I care about is what is shared and with whom, so I'll filter only by data source:
In [3]:
summary_counts(table='subject', data_source = ["GDC"])
╔═════════════════════════╗
║  total_subject_matches  ║
╠═════════════════════════╣
║          45030          ║
╚═════════════════════════╝

╔═══════════════════════╗
║  total_related_files  ║
╠═══════════════════════╣
║        3400025        ║
╚═══════════════════════╝

╔═════════╦═══════════════════════╗
║  count  ║  subject_data_source  ║
╠═════════╬═══════════════════════╣
║   45030 ║ GDC                   ║
║   12570 ║ IDC                   ║
║    2440 ║ CDS                   ║
║    1997 ║ PDC                   ║
╚═════════╩═══════════════════════╝

╔═════════╦════════╗
║  count  ║  sex   ║
╠═════════╬════════╣
║   23190 ║ female ║
║   21173 ║ male   ║
║     667 ║ <NA>   ║
╚═════════╩════════╝

╔═════════╦═══════════════════════════════════════════╗
║  count  ║                   race                    ║
╠═════════╬═══════════════════════════════════════════╣
║   23909 ║ <NA>                                      ║
║   17258 ║ White                                     ║
║    2412 ║ Black or African American                 ║
║    1336 ║ Asian                                     ║
║      67 ║ American Indian or Alaska Native          ║
║      48 ║ Native Hawaiian or Other Pacific Islander ║
╚═════════╩═══════════════════════════════════════════╝

╔═════════╦════════════════════╗
║  count  ║     ethnicity      ║
╠═════════╬════════════════════╣
║   26014 ║ <NA>               ║
║   17029 ║ Non-Hispanic       ║
║    1987 ║ Hispanic or Latino ║
╚═════════╩════════════════════╝

╔═════════╦══════════════════════════╗
║  count  ║      cause_of_death      ║
╠═════════╬══════════════════════════╣
║   43696 ║ <NA>                     ║
║    1089 ║ Cancer-Related Death     ║
║     192 ║ Non-Cancer Related Death ║
║      19 ║ Infection                ║
║      14 ║ Cardiovascular Disorder  ║
║      13 ║ Surgical Complication    ║
║       7 ║ Toxicity                 ║
╚═════════╩══════════════════════════╝
It looks like at least 12,570 of my subjects also have data at another data center (that is the IDC overlap alone), with further overlap at CDS (2,440) and PDC (1,997). Time to start collaborating!
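To decide which collaboration to prioritize, it can help to turn those raw counts into shares of the GDC total. The snippet below is only a rough sketch: the numbers are copied by hand from the subject_data_source table in the output above, and `shared_fraction` is a hypothetical helper written for this example, not part of the library.

```python
# Counts copied by hand from the subject_data_source table above.
counts = {"GDC": 45030, "IDC": 12570, "CDS": 2440, "PDC": 1997}


def shared_fraction(counts, home):
    """Hypothetical helper: fraction of the home center's subjects
    that also have data at each of the other centers."""
    total = counts[home]
    return {center: n / total for center, n in counts.items() if center != home}


shares = shared_fraction(counts, "GDC")

# Print the other centers from largest to smallest overlap.
for center, frac in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{center}: {frac:.1%} of GDC subjects")
```

One caveat: the same subject may appear at more than one of the other centers, so these per-center counts should not be added together to estimate how many GDC subjects are shared overall.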