Find subjects with data at multiple data centers
I work at GDC, and I want to see how many of our subjects have data at the other data centers so we can set up collaborations to better host shared data.
To start, all I really care about is what is shared and with whom, so I'm only going to filter by the source data center (upstream of CDA, that is):
In [2]:
summarize_subjects(data_source='GDC')
╔═══════════════════════════════╗
║  number_of_matching_subjects  ║
╠═══════════════════════════════╣
║             50213             ║
╚═══════════════════════════════╝
╔════════════════════════════════════════════════╗
║  number_of_files_related_to_matching_subjects  ║
╠════════════════════════════════════════════════╣
║                    2043397                     ║
╚════════════════════════════════════════════════╝
╔════════════╦══════════════════════╗
║  subjects  ║ data_source          ║
╠════════════╬══════════════════════╣
║      33994 ║ GDC only             ║
║       9100 ║ GDC + IDC            ║
║       4728 ║ GC + GDC + IDC       ║
║       1118 ║ PDC + GC + GDC + IDC ║
║        753 ║ PDC + GDC + IDC      ║
║        429 ║ PDC + GDC            ║
║         46 ║ GC + GDC             ║
║         45 ║ PDC + GC + GDC       ║
╚════════════╩══════════════════════╝
╔════════════════╦════════════════════╗
║  count_result  ║ ethnicity          ║
╠════════════════╬════════════════════╣
║          28422 ║ <NA>               ║
║          19183 ║ Non-Hispanic       ║
║           2608 ║ Hispanic or Latino ║
╚════════════════╩════════════════════╝
╔════════════════╦═══════════════════════════════════════════╗
║  count_result  ║ race                                      ║
╠════════════════╬═══════════════════════════════════════════╣
║          25705 ║ <NA>                                      ║
║          20010 ║ White                                     ║
║           2768 ║ Black or African American                 ║
║           1534 ║ Asian                                     ║
║            102 ║ Native Hawaiian or Other Pacific Islander ║
║             94 ║ American Indian or Alaska Native          ║
╚════════════════╩═══════════════════════════════════════════╝
╔════════════════╦══════════════════════════╗
║  count_result  ║ cause_of_death           ║
╠════════════════╬══════════════════════════╣
║          48504 ║ <NA>                     ║
║           1411 ║ Cancer-Related Death     ║
║            238 ║ Non-Cancer Related Death ║
║             24 ║ Infection                ║
║             15 ║ Cardiovascular Disorder  ║
║             14 ║ Surgical Complication    ║
║              7 ║ Toxicity                 ║
╚════════════════╩══════════════════════════╝
╔════════════════╦═══════════╗
║  count_result  ║ species   ║
╠════════════════╬═══════════╣
║          50213 ║ human     ║
╚════════════════╩═══════════╝
╔════════════════╦═════════════════╗
║                ║  year_of_death  ║
╠════════════════╬═════════════════╣
║ mean           ║            2013 ║
║ min            ║            1992 ║
║ lower quartile ║            2008 ║
║ median         ║            2017 ║
║ upper quartile ║            2019 ║
║ max            ║            2023 ║
╚════════════════╩═════════════════╝
╔════════════════╦═════════════════╗
║                ║  year_of_birth  ║
╠════════════════╬═════════════════╣
║ mean           ║            1971 ║
║ min            ║            1908 ║
║ lower quartile ║            1949 ║
║ median         ║            1963 ║
║ upper quartile ║            2003 ║
║ max            ║            2018 ║
╚════════════════╩═════════════════╝
The subject counts for data_source are mutually exclusive (each subject appears in exactly one row), so 50213 - 33994 = 16219 subjects at GDC also have data in at least one other data center.
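To double-check that arithmetic, and to see which data centers the overlap is with, here is a minimal sketch in plain Python. It assumes the counts are copied by hand from the data_source table in the output above; it does not call CDA, and the variable names (`subjects_by_source`, `shared_with`) are made up for illustration.

```python
# Counts copied by hand from the data_source table above (not fetched from CDA).
subjects_by_source = {
    "GDC only": 33994,
    "GDC + IDC": 9100,
    "GC + GDC + IDC": 4728,
    "PDC + GC + GDC + IDC": 1118,
    "PDC + GDC + IDC": 753,
    "PDC + GDC": 429,
    "GC + GDC": 46,
    "PDC + GC + GDC": 45,
}

# The rows are mutually exclusive, so they add up to the total subject count.
total = sum(subjects_by_source.values())

# Everything that is not "GDC only" is shared with at least one other center.
shared = total - subjects_by_source["GDC only"]

# Tally how many GDC subjects also have data at each other center.
# A subject in "GC + GDC + IDC" counts once for GC and once for IDC.
shared_with = {}
for combo, count in subjects_by_source.items():
    for center in combo.split(" + "):
        if center not in ("GDC", "GDC only"):
            shared_with[center] = shared_with.get(center, 0) + count

print("total subjects:", total)
print("shared with at least one other center:", shared)
for center, count in shared_with.items():
    print(f"  also at {center}: {count}")
```

Because a subject can be at several centers at once, the per-center tallies overlap and will add up to more than the shared total.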