arche.rules.category module

arche.rules.category.get_coverage_per_category(df: pandas.core.frame.DataFrame, category_names: List[T]) → arche.rules.result.Result

Get value counts per column, excluding nan.

Parameters
  • df – the source data to assess

  • category_names – list of columns whose value counts to compute

Returns

A result containing the number of categories per field and a value counts series for each field.
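
As a rough illustration of the counting described above, the following sketch uses plain pandas rather than arche itself. The sample data and column names are hypothetical, and the actual structure of the returned Result is not shown here.

```python
import pandas as pd

# Hypothetical sample data; the column names are for illustration only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", None],
    "size": ["S", "M", "M", "L"],
})
category_names = ["color", "size"]

# Series.value_counts() drops NaN by default, which matches "excluding nan".
value_counts = {name: df[name].value_counts() for name in category_names}

# Number of distinct (non-NaN) categories per field.
categories_per_field = {name: len(counts) for name, counts in value_counts.items()}
```

Here the missing value in "color" is not counted, so that field has two categories while "size" has three.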

arche.rules.category.get_difference(source_key: str, target_key: str, source_df: pandas.core.frame.DataFrame, target_df: pandas.core.frame.DataFrame, category_names: List[str]) → arche.rules.result.Result

Find and show differences in category coverage between two datasets, including nan values. Coverage means value counts divided by the total size.

Parameters
  • source_key – name of the source data to compare

  • target_key – name of the data to compare the source with

  • source_df – the source data to compare

  • target_df – the target data to compare against

  • category_names – list of columns whose values to compare

Returns

A result instance with messages describing the significant differences as defined by thresholds, a dataframe showing all normalized value counts in percent, and a series containing the significant differences.
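
The coverage comparison can be sketched in plain pandas as follows. This is an illustration under assumptions, not arche's implementation: the sample data, the "<missing>" placeholder for nan, the use of an absolute difference, and the threshold value of 30 are all hypothetical.

```python
import pandas as pd

# Hypothetical source and target values for a single category column.
source = pd.Series(["a", "a", "b", None])
target = pd.Series(["a", "b", "b", "b"])

# Assumption: label NaN explicitly so it is counted as its own category.
source_cov = source.fillna("<missing>").value_counts(normalize=True)
target_cov = target.fillna("<missing>").value_counts(normalize=True)

# Coverage in percent, side by side; a value absent from one side gets 0.
coverage = pd.DataFrame({"source": source_cov, "target": target_cov}).fillna(0) * 100

# Absolute difference in percentage points between the two coverages.
difference = (coverage["source"] - coverage["target"]).abs()

# Hypothetical threshold; the thresholds arche applies are not stated here.
threshold = 30
significant = difference[difference > threshold]
```

With this data, "b" covers 25% of the source and 75% of the target, so it is the only value whose difference exceeds the assumed threshold.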