Basics
[13]:
from arche import *
Every project requires authentication, which is done with an API key set in the SH_APIKEY environment variable.
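A minimal sketch of checking for the key before any job is loaded. The helper name require_api_key is hypothetical and not part of arche; it only reads the SH_APIKEY variable mentioned above and fails early with a readable message.

```python
import os


def require_api_key(env=None):
    """Return the API key stored in SH_APIKEY, or raise a clear error.

    `env` is any mapping; it defaults to the process environment.
    This helper is illustrative only and is not provided by arche.
    """
    env = os.environ if env is None else env
    key = env.get("SH_APIKEY", "").strip()
    if not key:
        raise RuntimeError("SH_APIKEY is not set; export it before loading a job")
    return key
```

Calling require_api_key() at the top of a notebook surfaces a missing key before any request is made.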
[5]:
a = Arche("235801/1/15")
[6]:
a.report_all()
Job Outcome:
Finished
Job Errors:
No errors
Responses Per Item Ratio:
Number of responses / Number of scraped items - 1.05
Fields Coverage:
0 totally empty field(s)
RULE: Fields Coverage
(1 message(s))
Field | Values Count | Percent |
---|---|---|
description | 998 | 99 |
_key | 1000 | 100 |
_type | 1000 | 100 |
category | 1000 | 100 |
price | 1000 | 100 |
title | 1000 | 100 |
We just ran a minimal set of rules. Validation can be improved by adding a JSON schema, so let's infer one from the data we already have.
JSON schema
[7]:
basic_json_schema("235801/1/15")
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": false,
    "type": "object",
    "properties": {
        "category": {
            "type": "string"
        },
        "price": {
            "type": "string"
        },
        "_type": {
            "type": "string"
        },
        "description": {
            "type": [
                "null",
                "string"
            ]
        },
        "title": {
            "type": "string"
        },
        "_key": {
            "type": "string"
        }
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
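To make the idea of inferring a schema from data concrete, here is a rough sketch in plain Python. It is not how basic_json_schema is implemented; the function infer_basic_schema and the sample items are assumptions for illustration. It records the JSON type of every value seen per field and marks a field as required when it is present in every item.

```python
def infer_basic_schema(items):
    """Infer a draft-07 style schema from a list of JSON-like dicts.

    Illustrative only: assumes every value is a str, int, float, bool,
    None, list or dict.
    """
    type_names = {
        str: "string", int: "integer", float: "number", bool: "boolean",
        type(None): "null", list: "array", dict: "object",
    }
    seen = {}  # field name -> set of JSON type names observed
    for item in items:
        for field, value in item.items():
            seen.setdefault(field, set()).add(type_names[type(value)])

    properties = {}
    for field, types in seen.items():
        ordered = sorted(types)
        # A single type is written as a string, several as a list.
        properties[field] = {"type": ordered[0] if len(ordered) == 1 else ordered}

    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "additionalProperties": False,
        "type": "object",
        "properties": properties,
        # Required means the key exists in every item, even if its value is null.
        "required": sorted(f for f in seen if all(f in item for item in items)),
    }


# Hypothetical items shaped like the job data.
sample_items = [
    {"title": "A", "price": "£51.33", "description": "text"},
    {"title": "B", "price": "£20.66", "description": None},
]
sample_schema = infer_basic_schema(sample_items)
```

In this sketch a field that is sometimes null, such as description, ends up with the type list ["null", "string"], the same shape as in the inferred schema above.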
On its own, a basic schema checks little beyond field types and presence, but you can extend it.
[8]:
a.source_items.df.head()
[8]:
Field | _key | _type | category | description | price | title |
---|---|---|---|---|---|---|
0 | https://app.scrapinghub.com/p/235801/1/15/item/0 | dict | Politics | Libertarianism isn't about winning elections; ... | £51.33 | Libertarianism for Beginners |
1 | https://app.scrapinghub.com/p/235801/1/15/item/1 | dict | Poetry | This book is an important and complete collect... | £20.66 | Shakespeare's Sonnets |
2 | https://app.scrapinghub.com/p/235801/1/15/item/2 | dict | Young Adult | Aaron Ledbetter’s future had been planned out ... | £17.46 | Set Me Free |
3 | https://app.scrapinghub.com/p/235801/1/15/item/3 | dict | Default | Since her assault, Miss Annette Chetwynd has b... | £13.99 | Starving Hearts (Triangular Trade Trilogy, #1) |
4 | https://app.scrapinghub.com/p/235801/1/15/item/4 | dict | Music | This is the never-before-told story of the mus... | £57.25 | Our Band Could Be Your Life: Scenes from the A... |
It looks like price can be checked with a regex. Let's also add a category tag, which helps to see the distribution in categorical data, and a unique tag on title to ensure there are no duplicates.
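Before putting a regex into the schema, it can be tried against a few values. The sketch below uses the five prices from the table above; PRICE_PATTERN and invalid_prices are hypothetical names. It assumes every price is a pound sign followed by two digits, a literal decimal point and two more digits, so the dot is escaped.

```python
import re

# Pound sign, two digits, a literal dot, two digits.
PRICE_PATTERN = re.compile(r"^£\d{2}\.\d{2}$")


def invalid_prices(prices):
    """Return the prices that do not match PRICE_PATTERN."""
    return [p for p in prices if not PRICE_PATTERN.search(p)]


# Prices taken from the first five rows of the job.
sample_prices = ["£51.33", "£20.66", "£17.46", "£13.99", "£57.25"]
bad = invalid_prices(sample_prices + ["£5.00", "51.33", "£51,33"])
```

Here bad contains only the three extra values: one has a single leading digit, one lacks the currency sign, and one uses a comma instead of a dot.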
[14]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        # "category" tag: report the distribution of values in this field
        "category": {"type": "string", "tag": ["category"]},
        # Backslashes are doubled for a regular Python string; "\\." matches a literal dot
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        # "unique" tag: report duplicated values in this field
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
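The "tag" keyword is not part of JSON Schema itself; it is extra metadata attached to a property. As a rough illustration of how such tags could be read back from a schema, the sketch below walks the properties and groups field names by tag. The function collect_tags is hypothetical and does not claim to mirror arche's internals.

```python
def collect_tags(schema):
    """Map each tag name to the list of fields that carry it."""
    tags = {}
    for field, spec in schema.get("properties", {}).items():
        for tag in spec.get("tag", []):
            tags.setdefault(tag, []).append(field)
    return tags


# A cut-down version of the properties used in this section.
example_schema = {
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
    }
}
example_tags = collect_tags(example_schema)
```

For the cut-down schema, example_tags maps "category" to the category field and "unique" to the title field.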
[10]:
a.validate_with_json_schema()
RULE: JSON Schema Validation
(1 message(s))
Or, if your job is really big, you can use a backend that is almost 100x faster:
[11]:
a.glance()
RULE: JSON Schema Validation
(1 message(s))
We already got something! Let's run the whole report again to see how the category tag works.
[15]:
a.report_all()
Job Outcome:
Finished
Job Errors:
No errors
Responses Per Item Ratio:
Number of responses / Number of scraped items - 1.05
Fields Coverage:
0 totally empty field(s)
JSON Schema Validation:
1000 items were checked, 1 error(s)
Tags:
category, unique
Compare Price Was And Now:
product_price_field or product_price_was_field tags were not found in schema
Uniqueness:
'title' contains 1 duplicated value(s)
Duplicated Items:
'name_field' and 'product_url_field' tags were not found in schema
Coverage For Scraped Categories:
50 categories in 'category'
RULE: Fields Coverage
(1 message(s))
Field | Values Count | Percent |
---|---|---|
description | 998 | 99 |
_key | 1000 | 100 |
_type | 1000 | 100 |
category | 1000 | 100 |
price | 1000 | 100 |
title | 1000 | 100 |
RULE: JSON Schema Validation
(1 message(s))
RULE: Uniqueness
(1 message(s))
RULE: Coverage For Scraped Categories
(1 message(s))
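The Uniqueness and Coverage For Scraped Categories results can be approximated by hand with a counter. This is a sketch under assumptions: the items list is made up, and duplicated_values and category_counts are hypothetical helpers, not arche functions.

```python
from collections import Counter


def duplicated_values(items, field):
    """Return {value: occurrences} for values of `field` seen more than once."""
    counts = Counter(item[field] for item in items)
    return {value: n for value, n in counts.items() if n > 1}


def category_counts(items, field):
    """Return how many items fall into each value of `field`."""
    return Counter(item[field] for item in items)


# Made-up items for illustration only.
items = [
    {"title": "Book A", "category": "Poetry"},
    {"title": "Book B", "category": "Music"},
    {"title": "Book A", "category": "Poetry"},
]
dupes = duplicated_values(items, "title")
per_category = category_counts(items, "category")
```

With the made-up items, dupes reports that "Book A" appears twice, and per_category counts two Poetry items and one Music item.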