VCF and image files available for the same patients
I'm a cancer researcher, and I have a hypothesis that I can correlate a specific mutation with a specific kidney cancer morphology. I'm looking for subjects that have kidney cancer with both CT and sequencing data that I might be able to incorporate into my research.
First, decide what column to search. I'm looking for columns that have to do with file type:
columns(table="file")
column_values("file_type")
CT Image Storage and Annotated Somatic Mutation files should give me the data I want. Now to find a column to separate out the kidney cancer subjects:
columns()
column_values("anatomic_site", filters= "*kidney*" )
Looks like there are lots of kidney patients. However, kidney has been specified a few different ways, so I'll have to search with a wildcard. This makes my three search parameters:
'anatomic_site = *kidney' (the * means match anything that ends with kidney, no matter what words or letters come before it)
'file_type = CT Image Storage'
'file_type = Annotated Somatic Mutation'
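To see what a leading wildcard does compared with wildcards on both sides, here is a small sketch that uses Python's standard fnmatch module as a stand-in. The site values are made up for illustration, and the case-insensitive comparison is an assumption of this sketch, not a statement about how the search itself matches values.

```python
from fnmatch import fnmatchcase

# Hypothetical anatomic_site values, invented for this illustration.
sites = ["Kidney", "Left kidney", "Kidney, NOS", "Lung"]

def wildcard_match(value, pattern):
    # Lowercase both sides so the comparison ignores case.
    # (Assumption of this sketch; the real search may behave differently.)
    return fnmatchcase(value.lower(), pattern.lower())

# "*kidney" only matches values that END with "kidney".
ends_with_kidney = [s for s in sites if wildcard_match(s, "*kidney")]

# "*kidney*" matches values that CONTAIN "kidney" anywhere.
contains_kidney = [s for s in sites if wildcard_match(s, "*kidney*")]

print(ends_with_kidney)
print(contains_kidney)
```

In this toy list, a value with text after the word kidney is only caught by the pattern with a * on both sides, so it is worth checking the real column values before choosing a pattern.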
But if I run a search with file_type in it twice, it will only give me back files that are both a CT Image Storage AND an Annotated Somatic Mutation. No single file can be both of those things, so what I really want is to do two searches, one for kidney and CT:
get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = CT Image Storage'])
and one for kidney and mutation:
get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = Annotated Somatic Mutation'])
and then combine the results.
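The reasoning above can be sketched with a few toy file records. Everything here is hypothetical: the subject and file IDs are invented, and the list comprehensions only mimic the "every condition must hold for the same file" behavior described above, not the actual search code.

```python
# Hypothetical file records; subjects and file IDs are invented.
files = [
    {"subject": "S1", "file_id": "f1", "file_type": "CT Image Storage"},
    {"subject": "S1", "file_id": "f2", "file_type": "Annotated Somatic Mutation"},
    {"subject": "S2", "file_id": "f3", "file_type": "CT Image Storage"},
]

# One search with both file_type conditions: a single file would have to
# be both types at once, so nothing can ever match.
both_types = [
    f for f in files
    if f["file_type"] == "CT Image Storage"
    and f["file_type"] == "Annotated Somatic Mutation"
]

# Two separate searches: each one finds the subjects with that file type.
ct_subjects = {f["subject"] for f in files
               if f["file_type"] == "CT Image Storage"}
mutation_subjects = {f["subject"] for f in files
                     if f["file_type"] == "Annotated Somatic Mutation"}

print(both_types)
print(sorted(ct_subjects), sorted(mutation_subjects))
```

The single combined filter comes back empty, while the two separate filters each return subjects that can then be compared.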
I want subject data, so I'll search by subject, and I'll add the file data on so I know where to get the files. Again * means "anything" so doing add_columns = "file.*" will add all the file columns.
So, all together, I get the results for each search and save them in variables:
CT = get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = CT Image Storage'], add_columns="file.*")
mutation = get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = Annotated Somatic Mutation'], add_columns="file.*")
Then get the intersection of the results:
CTmutation = intersect_subject_results(CT, mutation)
CTmutation
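Conceptually, taking the intersection means keeping only the subjects that show up in both searches. Here is a minimal sketch of that idea with plain dictionaries. The helper intersect_by_subject and all the IDs are hypothetical, and I'm assuming the files from both searches get combined per subject; this is not how intersect_subject_results is actually implemented.

```python
# Hypothetical search results: subject ID -> list of file IDs.
ct_results = {"S1": ["ct_file_1", "ct_file_2"], "S2": ["ct_file_3"]}
mutation_results = {"S1": ["maf_file_1"], "S3": ["maf_file_2"]}

def intersect_by_subject(left, right):
    """Keep subjects present in both results and combine their file lists."""
    shared = sorted(left.keys() & right.keys())
    return {subject: left[subject] + right[subject] for subject in shared}

combined = intersect_by_subject(ct_results, mutation_results)
print(combined)
```

Only S1 has both kinds of file in this toy data, so it is the only subject that survives.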
Looking at the file_type column and file_id column, I can't actually tell which file has which type. To make it easier to see which bits of data go together, I'm going to re-run my search, but add collate_results=True. That will take all of the columns with multiple values and pair them up, so I can tell which files have which file types:
CT = get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = CT Image Storage'], add_columns="file.*", collate_results= True)
mutation = get_subject_data(match_all=['anatomic_site = *kidney', 'file_type = Annotated Somatic Mutation'], add_columns="file.*", collate_results= True)
CTmutation = intersect_subject_results(CT, mutation)
CTmutation
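To show why collating helps, here is a toy comparison of the two shapes. The record layout and values are assumptions made up for this sketch, not the exact structure the search returns.

```python
# Uncollated (hypothetical): each column keeps its own list of values,
# so there is no way to tell which file_id goes with which file_type.
uncollated = {
    "subject_id": "S1",
    "file_id": ["f1", "f2", "f3"],
    "file_type": ["Annotated Somatic Mutation", "CT Image Storage"],
}

# Collated (hypothetical): one record per file keeps related values together.
collated = {
    "subject_id": "S1",
    "file_data": [
        {"file_id": "f1", "file_type": "CT Image Storage"},
        {"file_id": "f2", "file_type": "CT Image Storage"},
        {"file_id": "f3", "file_type": "Annotated Somatic Mutation"},
    ],
}

# With collated records, looking up a file's type is unambiguous.
file_type_by_id = {f["file_id"]: f["file_type"] for f in collated["file_data"]}
print(file_type_by_id["f3"])
```

In the uncollated shape the two lists do not even have the same length, which is exactly why the pairing cannot be recovered from them.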
Now I have a new column called file_data, where each row has the data_source, file_id, and access collated by file. This is great if I'm putting it into another command, but it's kind of hard to read as a human, so I'm going to use expand_subject_results, which will take all the info in this subject search and expand it out into a new dataframe:
expand_subject_results(CTmutation, 'file_data')
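The expansion step can be pictured as turning one row per subject into one row per file. Below is a minimal sketch of that idea. The helper expand_rows and the records are hypothetical, and this is only meant to illustrate the shape change, not how expand_subject_results works internally.

```python
# Hypothetical collated results: one row per subject, files nested inside.
subjects = [
    {"subject_id": "S1", "file_data": [
        {"file_id": "f1", "file_type": "CT Image Storage"},
        {"file_id": "f2", "file_type": "Annotated Somatic Mutation"},
    ]},
    {"subject_id": "S2", "file_data": [
        {"file_id": "f3", "file_type": "CT Image Storage"},
    ]},
]

def expand_rows(rows, column):
    """Return one flat row per nested record found in `column`."""
    expanded = []
    for row in rows:
        # Values shared by every file of this subject.
        shared = {k: v for k, v in row.items() if k != column}
        for item in row[column]:
            expanded.append({**shared, **item})
    return expanded

expanded = expand_rows(subjects, "file_data")
for row in expanded:
    print(row)
```

The number of subjects stays the same, but the number of rows grows to one per file.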
Note that my search originally had 62 subjects, one per row, with all their information for each column squished into that row. In the expanded version, I still have 62 subjects, but now subjects can have lots of rows, and each column only has the values that go together. Scroll through the table, and see that for the Computed Tomography results, there are still lots of drs_uris in a single row. That's because CT scans themselves are usually composed of lots of layered images, so a given row is showing you the image set that makes up the CT, and the other variables that go with it. You'll have to scroll way to the right, but you can also see that CTs don't usually have tumor_vs_normal values.
If you go to the second page of results, there are some Annotated Somatic Mutation files. These only have one drs_uri each, because mutation files are usually not split up, and if you scroll way to the right, you'll see that they do have tumor_vs_normal values. That's because the files themselves are information about mutations, and to know if a base pair is a mutation, you need to compare it to something. In all of these MAFs and VCFs, the mutation is one found in a person's tumor when compared to their normal tissue, so all of these are built with both normal and tumor data.