Skip to content

fetch_rows()

Get CDA data records ('result rows') from table that match user-specified criteria.

fetch_rows(table=None, *, match_all=[], match_any=[], data_source=[], add_columns=[], link_to_table='', provenance=False, count_only=False, return_data_as='dataframe', output_file='', debug=False)

Arguments

table

( string; required ): The table whose rows are to be filtered and retrieved. (Run the tables() function to get a list.)

match_all

( string or list of strings; optional ): One or more conditions, expressed as filter strings (see below), ALL of which must be met by all result rows.

match_any

( string or list of strings; optional ): One or more conditions, expressed as filter strings (see below), AT LEAST ONE of which must be met by all result rows.

data_source

( string or list of strings; optional ): Restrict results to those deriving from the given upstream data source(s). Current valid values are 'GDC', 'IDC', 'PDC', 'CDS' and 'ICDC'. (Default: no filter.)

add_columns

( string or list of strings; optional ): One or more columns from a second table to add to result data from table. If multiple values from an added column are all associated with a single table row, that row will be repeated once for each distinct value, with the added data appended to each row

( string; optional ): A second table from which to fetch entire rows related to the row results from table that this function produces. link_to_table results will be appended to table rows to which they're related: any table row related to more than one link_to_table row will be repeated in the returned data, with one distinct link_to_table row appended to each repeated copy of its related table row. If link_to_table is specified, add_columns cannot be used.

provenance

( boolean; optional ): If True, fetch_rows() will attach cross-reference information to each row result describing the upstream data sources from which it was derived. Rows deriving from more than one upstream source will be repeated in the output, once per data source, as with link_to_table and add_columns (except with provenance metadata attached, instead of information from other CDA tables). If provenance is set to True, link_to_table and add_columns cannot be used.

return_data_as

( string; optional: 'dataframe' or 'tsv' ): Specify how fetch_rows() should return results: as a pandas DataFrame, or as output written to a TSV file named by the user. If this argument is omitted, fetch_rows() will default to returning results as a DataFrame.

output_file

( string; optional ): If return_data_as='tsv' is specified, output_file should contain a resolvable path to a file into which fetch_rows() will write tab-delimited results.

count_only

( boolean; optional ): If set to True, fetch_rows() will return two integers: the number of CDA table rows matching the specified filters, and the total number of rows that this function would return if count_only were not True. (These numbers will be identical if no data from outside table has been joined to result rows. If count_only is set to False (the default), fetch_rows() will return a pandas DataFrame containing all CDA table rows that match the given filters.

debug

( boolean; optional ): If set to True, internal process details will be printed to the standard error stream as fetch_rows() is running. If False (the default), ...they won't.

Filter strings

Filter strings are expressions of the form "COLUMN_NAME OP VALUE" (note in particular that the whitespace surrounding OP is required), where

  • COLUMN_NAME is a searchable CDA column (see the columns() function for details)

  • OP is one of: < <= > >= = !=

  • VALUE is a particular value of whatever data type is stored in COLUMN_NAME (see the columns() function for details), or the special keyword NULL, indicating the filter should match missing (null) values in COLUMN_NAME.

Operators = and != will work on numeric, boolean and string VALUEs.

Operators < <= > >= will only work on numeric VALUEs.

Users can require partial matches to string VALUEs by adding * to either or both ends. For example:

primary_disease_type = *duct*

sex = F*

String VALUEs need not be quoted inside of filter strings. For example, to include the filters specified just above in the match_all argument, when querying the subject table, we can write:

fetch_rows( table='subject', match_all=[ 'primary_disease_type = *duct*', 'sex = F*' ] )

NULL is a special VALUE which can be used to match missing data. For example, to get researchsubject rows where the primary_diagnosis_site field is missing data, we can write:

fetch_rows( table='researchsubject', match_all=[ 'primary_diagnosis_site = NULL' ] )

Returns

(Default) A pandas.DataFrame containing CDA table rows matching the user-specified filter criteria. The DataFrame's named columns will match columns in table, and each row in the DataFrame will contain one CDA table row (possibly with related data from a second table appended to it, according to user directives).

OR

two integers representing the total number of CDA table rows matching the given filters and the total number of result rows. These two counts will generally differ if extra data from non-table sources is joined to result rows using link_to_table or add_columns, because table rows will be repeated for any one-to-many associations that are returned; otherwise they will be the same.

OR

returns nothing, but writes results to a user-specified TSV file