Basics

[13]:
from arche import *

Access to any project requires authentication, which is done with an API key set in the SH_APIKEY environment variable.
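As a minimal sketch, the key can be checked from Python before the first `Arche` call, assuming `SH_APIKEY` is read from the process environment. The `require_api_key` helper and the placeholder values are hypothetical and not part of arche.

```python
import os


def require_api_key(env=os.environ):
    """Return the key stored in SH_APIKEY or raise a readable error.

    Hypothetical helper for illustration only; arche does not provide it.
    """
    key = env.get("SH_APIKEY", "").strip()
    if not key:
        raise RuntimeError("SH_APIKEY is not set; export it before using Arche")
    return key


# To set the key for the current process, uncomment and fill in your own value:
# os.environ["SH_APIKEY"] = "<your-api-key>"

# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"SH_APIKEY": "demo-key"}))
```

Failing early with a clear message is usually easier to debug than an authentication error raised deep inside a request.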

[5]:
a = Arche("235801/1/15")
[6]:
a.report_all()



Job Outcome:
        Finished

Job Errors:
        No errors

Responses Per Item Ratio:
        Number of responses / Number of scraped items - 1.05

Fields Coverage:
        0 totally empty field(s)




RULE: Fields Coverage
(1 message(s))

             Values Count  Percent
Field
description           998       99
_key                 1000      100
_type                1000      100
category             1000      100
price                1000      100
title                1000      100

So far only a minimal set of rules has run. Validation can be improved by adding a JSON schema, so let’s infer one from the data we already have.

JSON schema

[7]:
basic_json_schema("235801/1/15")




{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": false,
    "type": "object",
    "properties": {
        "category": {
            "type": "string"
        },
        "price": {
            "type": "string"
        },
        "_type": {
            "type": "string"
        },
        "description": {
            "type": [
                "null",
                "string"
            ]
        },
        "title": {
            "type": "string"
        },
        "_key": {
            "type": "string"
        }
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
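To give a rough idea of what inferring a schema from data involves, here is a simplified sketch that collects the observed types per field. It is not arche's implementation: `infer_type`, `infer_basic_schema`, and the sample items are made up for illustration.

```python
def infer_type(value):
    """Map a Python value to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_basic_schema(items):
    """Collect observed types per field; fields present in every item are required."""
    types, counts = {}, {}
    for item in items:
        for field, value in item.items():
            types.setdefault(field, set()).add(infer_type(value))
            counts[field] = counts.get(field, 0) + 1
    properties = {}
    for field, seen in types.items():
        names = sorted(seen)
        properties[field] = {"type": names[0] if len(names) == 1 else names}
    return {
        "type": "object",
        "additionalProperties": False,
        "properties": properties,
        "required": sorted(f for f, n in counts.items() if n == len(items)),
    }


# Illustrative sample items, not the real job data.
sample = [
    {"title": "Set Me Free", "price": "£17.46", "description": None},
    {"title": "Shakespeare's Sonnets", "price": "£20.66", "description": "..."},
]
schema = infer_basic_schema(sample)
print(schema["properties"]["description"])  # {'type': ['null', 'string']}
```

A field that is sometimes empty ends up with a list of types, which is why `description` allows both `null` and `string` in the inferred schema.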

By itself the inferred schema is of limited use, since it checks little beyond field names and types, but you can extend it.

[8]:
a.source_items.df.head()
[8]:
Field  _key                                              _type  category     description                                        price   title
0      https://app.scrapinghub.com/p/235801/1/15/item/0  dict   Politics     Libertarianism isn't about winning elections; ...  £51.33  Libertarianism for Beginners
1      https://app.scrapinghub.com/p/235801/1/15/item/1  dict   Poetry       This book is an important and complete collect...  £20.66  Shakespeare's Sonnets
2      https://app.scrapinghub.com/p/235801/1/15/item/2  dict   Young Adult  Aaron Ledbetter’s future had been planned out ...  £17.46  Set Me Free
3      https://app.scrapinghub.com/p/235801/1/15/item/3  dict   Default      Since her assault, Miss Annette Chetwynd has b...  £13.99  Starving Hearts (Triangular Trade Trilogy, #1)
4      https://app.scrapinghub.com/p/235801/1/15/item/4  dict   Music        This is the never-before-told story of the mus...  £57.25  Our Band Could Be Your Life: Scenes from the A...

It looks like price can be checked with a regex. Let’s also add a category tag to category, which shows the distribution of categorical data, and a unique tag to title to ensure there are no duplicates.
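Before putting a pattern into the schema, it can help to try it on a few values from the preview. The sketch below uses only the standard `re` module. The pattern escapes the dot so that only a literal decimal point matches, and the values in `bad` are made-up counterexamples.

```python
import re

# Pound sign, two digits, a literal dot, two digits.
PRICE_PATTERN = r"^£\d{2}\.\d{2}$"

prices = ["£51.33", "£20.66", "£17.46", "£13.99", "£57.25"]  # from the preview
bad = ["51.33", "£5.00", "£51,33"]  # hypothetical malformed values

# JSON Schema "pattern" is a search, not a full match, hence re.search
# combined with explicit ^ and $ anchors.
good_results = [bool(re.search(PRICE_PATTERN, p)) for p in prices]
bad_results = [bool(re.search(PRICE_PATTERN, p)) for p in bad]

print(all(good_results), any(bad_results))  # True False
```

Note that this pattern only accepts prices with exactly two digits before the decimal point, which matches the preview but may be too strict for other jobs.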

[14]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
[10]:
a.validate_with_json_schema()


RULE: JSON Schema Validation
(1 message(s))

2 items affected - description is not of type 'string': 980 162

If your job is really big, you can use a backend that is almost 100x faster:

[11]:
a.glance()


RULE: JSON Schema Validation
(1 message(s))

2 items affected - data.description must be string: 980 162

We already found something! Let’s run the whole report again to see how the category tag works.

[15]:
a.report_all()


Job Outcome:
        Finished

Job Errors:
        No errors

Responses Per Item Ratio:
        Number of responses / Number of scraped items - 1.05

Fields Coverage:
        0 totally empty field(s)

JSON Schema Validation:
      1000 items were checked, 1 error(s)

Tags:
        category, unique

Compare Price Was And Now:
        product_price_field or product_price_was_field tags were not found in schema

Uniqueness:
      'title' contains 1 duplicated value(s)

Duplicated Items:
        'name_field' and 'product_url_field' tags were not found in schema

Coverage For Scraped Categories:
        50 categories in 'category'




RULE: Fields Coverage
(1 message(s))

             Values Count  Percent
Field
description           998       99
_key                 1000      100
_type                1000      100
category             1000      100
price                1000      100
title                1000      100

RULE: JSON Schema Validation
(1 message(s))

2 items affected - description is not of type 'string': 980 162


RULE: Uniqueness
(1 message(s))

2 items affected - same 'The Star-Touched Queen' title: 220 396


RULE: Coverage For Scraped Categories
(1 message(s))
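The idea behind the Uniqueness and Coverage For Scraped Categories rules can be sketched with the standard library. This is an illustration under assumed sample data, not arche's implementation; `find_duplicates` and the lists below are hypothetical.

```python
from collections import Counter, defaultdict


def find_duplicates(values):
    """Return {value: [indices]} for every value that occurs more than once."""
    positions = defaultdict(list)
    for index, value in enumerate(values):
        positions[value].append(index)
    return {v: idx for v, idx in positions.items() if len(idx) > 1}


# Made-up sample; the real job has 1000 items and 50 categories.
titles = [
    "Set Me Free",
    "The Star-Touched Queen",
    "Shakespeare's Sonnets",
    "The Star-Touched Queen",
]
categories = ["Young Adult", "Default", "Poetry", "Default"]

duplicates = find_duplicates(titles)
coverage = Counter(categories)

print(duplicates)     # {'The Star-Touched Queen': [1, 3]}
print(len(coverage))  # 3
```

Reporting the indices of the affected items, as the rule output does, makes it easy to open the duplicated items directly.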