Skip to content

column_values()

Show all distinct values present in column, along with a count of occurrences for each value.

column_values
(column='', *, return_data_as='dataframe', output_file='', sort_by='', filters='', data_source='', force=False, debug=False)

Arguments

column

( string; required ): The column to fetch values from.

return_data_as

( string; optional:'dataframe'(default) or 'list' or 'tsv' ): Specify how column_values() should return results: as a pandas DataFrame, a Python list, or as output written to a TSV file named by the user.

output_file

( string; optional ): If return_data_as='tsv' is specified, output_file should contain a resolvable path to a file into which column_values will write tab-delimited results.

sort_by

( string; optional:'count'( default for return_data_as='dataframe' and return_data_as='tsv' ) or 'value'( default for return_data_as='list' ) or 'count:desc' or 'value:desc' or 'count:asc' or 'value:asc' ):

Specify the primary column to sort when preparing result data: on values, or on counts of values.

A column name with a suffix of ':desc' appended to it will be sorted in reverse order; adding ':asc' will ensure ascending sort order. Example: sort_by='value:desc'

Secondary sort order is automatic: if the results are to be primarily sorted by count, then the automatic behavior will be to also (alphabetically) sort by value within each group of values that all share the same count. If results are primarily sorted by value, then there is no secondary sort -- each value is unique by design, so results don't contain groups with the same value but different counts, so there's nothing to arrange once the primary sort has been applied.

filters

( string or list of strings; optional ):

Restrict returned values to those matching any of the given strings. A wildcard (asterisk) at either end (or both ends) of each string will allow partial matches. Case will be ignored. Use an empty filter string '' to match and count missing (null) values.

data_source

( string; optional ): Restrict returned values to the given upstream data source, such as 'GDC', 'IDC', 'PDC', or 'CDS'. Defaults to '' (no filter).

force

( boolean; optional ): Force execution of high-overhead queries on columns (like IDs) flagged as having large numbers of values. Defaults to False, in which case attempts to retrieve values for flagged columns will result in a warning.

debug

( boolean; optional ): If set to True, print internal process details to the standard error stream.

Returns

pandas.DataFrame

OR

list

OR

returns nothing, but writes retrieved data to a user-specified TSV file