Quickstart

Basic Usage

  • make sure required environment variables are set

  • create a schema with arche.basic_json_schema(job_key)

  • create an Arche instance g = Arche(source=job_key, schema=temp_schema)

  • run and report all tests with g.report_all()

  • run DQR with g.data_quality_report()

job_key can be either a usual job key, e.g. 000001/1/1, or a collection key - 00001/collections/s/reviews

schema argument accepts either a dict or a s3 bucket link to a schema.

Schema Validation

Arche allows to use custom rules, available with tag keyword. The value contains tag names, and could be either a string or a list or strings.

For a complete list of tags, check the code https://github.com/scrapinghub/arche/blob/master/src/arche/schema_tools.py#L18

For example:

 "name": {
   "type": "string",
   "tag": ["unique"]
 },

Environment variables

Next env variables are required:

  • SH_APIKEY - This key should have read permissions for the project you want to get items from.

If you also wish access your schemas from S3, set AWS credentials

  • AWS_ACCESS_KEY_ID

  • AWS_SECRET_ACCESS_KEY