arche.rules.coverage module

arche.rules.coverage.check_fields_coverage(df: pandas.core.frame.DataFrame) → arche.rules.result.Result

Get field coverage from df. Coverage is the percentage of real values (excluding nan) in each column.

Parameters

df – the data to calculate coverage for

Returns

A result with the coverage of every column in the provided df. A column that contains only nan is reported as an error.
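
As a rough illustration of the calculation described above, the sketch below computes per-column coverage with plain pandas. It is not arche's implementation: fields_coverage is a hypothetical helper, and it returns a pandas Series rather than a Result.

```python
import pandas as pd


def fields_coverage(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: percentage of non-NaN values per column."""
    if df.empty:
        # No rows (or no columns), so there is nothing to measure.
        return pd.Series(dtype=float)
    # DataFrame.count() returns the number of non-NA cells in each column.
    return df.count() / len(df) * 100


nan = float("nan")
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "price": [1.0, nan, 3.0, nan],
        "sku": [nan, nan, nan, nan],
    }
)
coverage = fields_coverage(df)

# Columns with zero coverage contain only NaN; the rule above reports them as errors.
empty_columns = coverage[coverage == 0].index.tolist()
```

Here "name" is fully covered, "price" is half covered, and "sku" ends up in empty_columns because every value is nan.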

arche.rules.coverage.compare_scraped_fields(source_df: pandas.core.frame.DataFrame, target_df: pandas.core.frame.DataFrame) → arche.rules.result.Result

Find columns that are new or missing between source_df and target_df.
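
The comparison amounts to a set difference over column names. The sketch below shows the idea on plain lists of names (for a DataFrame, df.columns could be passed in). column_differences is a hypothetical helper, and the direction of "new" versus "missing" is an assumption: here "new" means present only in the source, "missing" means present only in the target.

```python
from typing import Iterable, List, Tuple


def column_differences(
    source_columns: Iterable[str], target_columns: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (new, missing) column names.

    Assumption: "new" columns exist only in the source and
    "missing" columns exist only in the target.
    """
    source, target = set(source_columns), set(target_columns)
    new = sorted(source - target)
    missing = sorted(target - source)
    return new, missing


new, missing = column_differences(
    ["name", "price", "rating"],  # columns of the source data
    ["name", "price", "sku"],     # columns of the target data
)
```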

arche.rules.coverage.get_difference(source_job: scrapinghub.client.jobs.Job, target_job: scrapinghub.client.jobs.Job) → arche.rules.result.Result

Get the difference between the coverages of two jobs. Coverage is a job's field counts divided by the job size.

Parameters
  • source_job – the base job; the difference is calculated relative to it

  • target_job – the job to compare with the base job

Returns

A Result instance that reports large differences, with stats containing the field count coverage and the differences
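
The arithmetic behind this rule can be sketched without a Scrapinghub connection. The example below works on plain dicts of field counts; job_coverage and coverage_difference are hypothetical helpers, the sign convention (source minus target) is an assumption, and the 0.1 threshold for a "large" difference is illustrative only, not a value taken from arche.

```python
from typing import Dict


def job_coverage(field_counts: Dict[str, int], job_size: int) -> Dict[str, float]:
    """Hypothetical helper: field counts divided by the job size."""
    if job_size == 0:
        return {}
    return {field: count / job_size for field, count in field_counts.items()}


def coverage_difference(
    source: Dict[str, float], target: Dict[str, float]
) -> Dict[str, float]:
    """Hypothetical helper: per-field coverage difference.

    Assumption: the difference is source minus target, and a field
    absent from one job counts as zero coverage there.
    """
    fields = sorted(set(source) | set(target))
    return {f: source.get(f, 0.0) - target.get(f, 0.0) for f in fields}


source_cov = job_coverage({"name": 100, "price": 50}, job_size=100)
target_cov = job_coverage({"name": 100, "price": 100, "sku": 25}, job_size=100)
difference = coverage_difference(source_cov, target_cov)

# Illustrative threshold only; the real rule may use different limits.
THRESHOLD = 0.1
large = [field for field, dif in difference.items() if abs(dif) > THRESHOLD]
```

With these inputs the "price" and "sku" fields exceed the threshold, while "name" has identical coverage in both jobs.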