In Short
[1]:
%cd ../../../src
/Users/valery/Documents/_code/arche/src
[2]:
%load_ext autoreload
%autoreload 2
[3]:
from arche import *
[4]:
schema = {
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {
"float": {
"pattern": "^-?[0-9]+\\.[0-9]{2}$"
},
"url": {
"pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
}
},
"additionalProperties": False,
"type": "object",
"properties": {
"category": {"type": "string", "tag": ["category"]},
"price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
"_type": {"type": "string"},
"description": {"type": "string"},
"title": {"type": "string"},
"_key": {"type": "string"}
},
"required": [
"_key",
"_type",
"category",
"description",
"price",
"title"
]
}
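The `price` pattern in the schema is meant to accept values such as `£22.65`. The following is a minimal sketch, using only the standard `re` module, of what that pattern accepts and rejects. It assumes the dot is intended as a literal decimal point, and the helper name `price_matches` and the sample values are hypothetical, chosen only for illustration.

```python
import re

# Assumed intent of the schema's price pattern: pound sign, two digits,
# a literal dot, two digits. Written as a raw string to avoid escape issues.
PRICE_PATTERN = re.compile(r"^£\d{2}\.\d{2}$")


def price_matches(value: str) -> bool:
    """Return True when the value looks like '£22.65'."""
    # JSON Schema "pattern" uses search semantics; the anchors do the bounding.
    return PRICE_PATTERN.search(value) is not None


# Hypothetical sample values, not taken from the job data.
samples = ["£22.65", "£54.23", "£5.00", "22.65", "£22,65"]
results = {sample: price_matches(sample) for sample in samples}
print(results)
```

With an unescaped dot, a value like `£22,65` would also pass, which is why the sketch escapes it.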
[5]:
a = Arche("381798/1/2", schema=schema, target="381798/1/1")
[6]:
a.source_items.df.head()
[6]:
| | _key | _type | category | description | price | title |
---|---|---|---|---|---|---|
0 | https://app.scrapinghub.com/p/381798/1/2/item/0 | dict | Young Adult | Patient Twenty-nine.A monster roams the halls ... | £22.65 | The Requiem Red |
1 | https://app.scrapinghub.com/p/381798/1/2/item/1 | dict | History | From a renowned historian comes a groundbreaki... | £54.23 | Sapiens: A Brief History of Humankind |
2 | https://app.scrapinghub.com/p/381798/1/2/item/2 | dict | Mystery | WICKED above her hipbone, GIRL across her hear... | £47.82 | Sharp Objects |
3 | https://app.scrapinghub.com/p/381798/1/2/item/3 | dict | Fiction | Dans une France assez proche de la nôtre, un h... | £50.10 | Soumission |
4 | https://app.scrapinghub.com/p/381798/1/2/item/4 | dict | Historical Fiction | "Erotic and absorbing...Written with starling ... | £53.74 | Tipping the Velvet |
[9]:
a.report_all()
Job Outcome:
Finished
Job Errors:
No errors
Responses Per Item Ratio:
Number of responses / Number of scraped items - 1.05
Total Scraped Items:
Same number of items
Compare Runtime:
Similar or better runtime - 0:00:49.589000 and 0:00:55.089000
Finish Time:
Less than 1 day difference
Fields Coverage:
PASSED
Boolean Fields:
SKIPPED
JSON Schema Validation:
1000 items were checked, 1 error(s)
Tags:
Used - category
Not used - name_field, product_price_field, product_price_was_field, product_url_field, unique
Compare Price Was And Now:
product_price_field or product_price_was_field tags were not found in schema
Uniqueness:
'unique' tag was not found in schema
Duplicated Items:
'name_field' and 'product_url_field' tags were not found in schema
Coverage For Scraped Categories:
50 categories in 'category'
Compare Prices For Same Urls:
product_url_field tag is not set
Compare Names Per Url:
product_url_field tag is not set
Compare Prices For Same Names:
name_field tag is not set
Coverage Difference (1 message(s)):
Fields Coverage (1 message(s)):
JSON Schema Validation (1 message(s)):
Coverage For Scraped Categories (1 message(s)):
Category Coverage Difference (1 message(s)):
[8]:
find_duplicates_by(a.source_items.df, ["title", "price"]).show()
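Conceptually, a duplicate check on `title` and `price` groups items that share the same values in both fields. The sketch below shows that idea with plain dictionaries and the standard library. It is not arche's implementation of `find_duplicates_by`; the function name `find_duplicate_groups` and the sample items are hypothetical.

```python
from collections import defaultdict


def find_duplicate_groups(items, keys):
    """Map each combination of key values to the positions of items sharing it.

    Only combinations that occur more than once are returned.
    """
    groups = defaultdict(list)
    for position, item in enumerate(items):
        groups[tuple(item[key] for key in keys)].append(position)
    return {
        values: positions
        for values, positions in groups.items()
        if len(positions) > 1
    }


# Hypothetical items for illustration only, not real job output.
items = [
    {"title": "Sharp Objects", "price": "£47.82"},
    {"title": "Soumission", "price": "£50.10"},
    {"title": "Sharp Objects", "price": "£47.82"},
    {"title": "Soumission", "price": "£12.00"},
]

duplicates = find_duplicate_groups(items, ["title", "price"])
print(duplicates)
```

Items 0 and 2 share both fields and are reported together; the two `Soumission` items differ in price, so they are not treated as duplicates.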