summary_counts()
For a set of rows in a user-specified table that all match a user-specified set of filters, get a report showing counts of values present in that set of rows, profiled across a small set of pre-selected columns.
summary_counts(table='', *, match_all=[], match_some=[], data_source=[], debug=False)
Arguments¶
table¶
( string; required ): The table whose rows are to be filtered and counted. (Run the tables() function to get a list.)
return_data_as¶
( string; optional: 'dataframe_list' or 'dict' or 'json' ): Specify how summary_counts() should return results: as a list of pandas DataFrames, as a Python dictionary, or as output written to a JSON file named by the user. If this argument is omitted, summary_counts() will default to returning results as a list of DataFrames.
output_file¶
( string; optional ): If return_data_as='json' is specified, output_file should contain a resolvable path to a file into which summary_counts() will write JSON-formatted results.
match_all¶
( string or list of strings; optional ): One or more conditions, expressed as filter strings (see below), ALL of which must be met by all result rows.
match_any¶
( string or list of strings; optional ): One or more conditions, expressed as filter strings (see below), AT LEAST ONE of which must be met by all result rows.
data_source¶
( string or list of strings; optional ): Restrict results to those deriving from the given upstream data source(s), such as 'GDC', 'IDC', 'PDC', or 'CDS'. (Default: no filter.)
debug¶
( boolean; optional ): If set to True, internal process details will be printed to the standard error stream as summary_counts() is running. If False (the default), ...they won't.
Filter strings¶
Filter strings are expressions of the form "COLUMN_NAME OP VALUE" (note in particular that the whitespace surrounding OP is required), where
-
COLUMN_NAME is a searchable CDA column (see the columns() function for details)
-
OP is one of:
< <= > >= = !=
-
VALUE is a particular value of whatever data type is stored in COLUMN_NAME (see the columns() function for details), or the special keyword NULL, indicating the filter should match missing (null) values in COLUMN_NAME.
Operators =
and !=
will work on numeric, boolean and string VALUEs.
Operators < <= > >=
will only work on numeric VALUEs.
Users can require partial matches to string VALUEs by adding *
to either or
both ends. For example:
primary_disease_type = *duct*
sex = F*
String VALUEs need not be quoted inside of filter strings. For example, to include
the filters specified just above in the match_all
argument, we can write:
summary_counts( table='subject', match_all=[ 'primary_disease_type = *duct*', 'sex = F*' ] )
NULL is a special VALUE which can be used to match missing data. For
example, to get a count summary for rows where the sex
field is missing data,
we can write:
summary_counts( table='subject', match_all=[ 'sex = NULL' ] )
Returns¶
list of pandas DataFrames, with one DataFrame for each of a small set of
pre-selected columns, enumerating counts of all of that column's data values
appearing in any of the rows of the user-specified table
that match the
user-specified filter critera (the 'result rows'). One or two DataFrames
in this list -- titled 'total_table
_matches' and sometimes also
'total_related_files', where appropriate -- will contain integers representing
the number of result rows and the number of files related to those rows,
respectively. All other DataFrames in the list will each be titled with
a CDA column name and contain counts for all observed values from that
column in the result row set.
OR Python dictionary enumerating counts of all data values (from a small set of pre-selected columns)
appearing in any of the rows of the user-specified table
that match the user-specified filter criteria
(the 'result rows'). One or two summary keys in this dictionary -- 'total_table
_matches', and
sometimes 'total_related_files', where appropriate -- will point to integers representing
the number of result rows and the number of files related to those rows, respectively. All other keys
in the dictionary will each contain a CDA column name; each corresponding value will itself be a
dictionary enumerating all the specific values appearing in the result rows for the CDA column
named in the key. Each value in that (sub-)dictionary will represent the total number of times
that its corresponding key appears in the result rows.
OR JSON-formatted text representing the same structure as the return_data_as='dict'
option, written to output_file
.
And yes, we know how those first two paragraphs look. We apologize to the entire English language.