Price scraping policy

To show up-to-date product prices in the visitor’s preferred currency, and to let us know when a product has been discontinued, Onahole.eu operates a basic price-fetching script. For lack of a unified API across shop CMSs, the script relies on a practice known as “content scraping”, which is often frowned upon in the web ecosystem.

The purpose of this page is to inform shops about this functionality, for transparency and safety reasons.

What does it do?

When a visitor reads one of our reviews, their web browser requests an SVG image from our script. The request includes three pieces of information:

  1. which store’s price is requested
  2. which product’s price is requested
  3. a security key

The purpose of the security key is to ensure our script is not abused (for example, by mass-requesting prices for products we never reviewed). Its formula is changed regularly to discourage hotlinking, and unusual referrer headers are monitored for the same reason.
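
The actual key formula is not published here, so the following is only a sketch of one common way such a key could work, assuming an HMAC over the store and product identifiers. The secret, the separator, the 16-character truncation and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; rotating it invalidates every previously issued key.
SECRET = b"rotate-me-regularly"


def make_key(store, product, secret=SECRET):
    """Derive a short key bound to one store/product pair."""
    message = f"{store}:{product}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def is_valid_request(store, product, key, secret=SECRET):
    """Check the key in constant time to avoid leaking partial matches."""
    expected = make_key(store, product, secret)
    return hmac.compare_digest(expected.encode("utf-8"), key.encode("utf-8"))
```

With a scheme like this, a key issued for one reviewed product cannot be reused to request the price of another product, and changing the secret retires all keys embedded in hotlinked images.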

When the script receives a valid request, it first checks whether the price has been cached for that product at that shop. If so, and the cache entry has not expired, the script reads the locally cached price, applies the currency conversion rate if necessary, and outputs the result.

Should the price have expired or not be cached at all, the script opens an external connection to the shop and requests the HTML page for that product.
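
The lookup described above can be sketched roughly as follows. This is an illustration rather than our production code: the `PriceCache` class, the in-memory dictionary and the injected `fetch_price` callable are assumptions, and the lifetime uses the once-per-day refresh limit stated on this page.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # refresh at most once per day


class PriceCache:
    """Hypothetical in-memory cache keyed by (store, product)."""

    def __init__(self, fetch_price, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._fetch_price = fetch_price  # called only on a miss or expiry
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # (store, product) -> (price, stored_at)

    def get(self, store, product, rate=1.0):
        """Return the price, converted with `rate`, contacting the shop only if needed."""
        now = self._clock()
        entry = self._entries.get((store, product))
        if entry is None or now - entry[1] >= self._ttl:
            entry = (self._fetch_price(store, product), now)
            self._entries[(store, product)] = entry
        return round(entry[0] * rate, 2)
```

Storing the price in the shop's own currency and converting on output means a change in exchange rate never triggers a new request to the shop.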

How does the script identify itself? (user-agent)

Sadly, some shop CMSs are very sensitive to user agents: when they do not recognize one, they cannot decide which HTTP protocol or HTML format to serve. Unrecognized user agents may also be blocked outright, precisely to guard against content scraping by spambots.

Thus, our script presents itself as Mozilla Firefox, which best follows W3C standards, and appends identifying information for administrators to its user-agent string:

Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0 (Onahole Review; PriceScraping/1.8; +https://blog.onahole.eu/price-script)

As much as possible, the script connects over HTTP/2 or HTTP/1.1, and TLS 1.2 or TLS 1.0, with a cURL session (GET request, follows redirects, 3 retries, 19-second timeout).
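
For illustration, a roughly equivalent request can be sketched with Python's standard library instead of cURL. The function names and the injectable `opener` parameter are assumptions made for this example; note also that `urllib` follows redirects by default but only speaks HTTP/1.1, so this does not reproduce the HTTP/2 behavior.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0 "
    "(Onahole Review; PriceScraping/1.8; +https://blog.onahole.eu/price-script)"
)


def build_request(url):
    """Build a GET request carrying the user-agent string quoted above."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT}, method="GET")


def fetch_page(url, retries=3, timeout=19, opener=urllib.request.urlopen):
    """Fetch a page, retrying on network errors.

    Assumption: "3 retries" means one initial attempt plus up to three more.
    `opener` is injectable so the retry logic can be exercised offline.
    """
    last_error = None
    for _attempt in range(retries + 1):
        try:
            with opener(build_request(url), timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except OSError as error:  # URLError, HTTPError and timeouts all derive from OSError
            last_error = error
    raise last_error
```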

What happens with the web page?

Once the script has fetched the web page, it processes its content, looking for the code pattern that matches the product price.

Ideally, the price is available in machine-readable form as a meta tag:

<meta property="product:price" content="269.95" />

Should no such meta tag be available, we pick the most relevant semantic element instead. This may be, for example:

<span id="price">$ 84</span>
<div class="price-box">41 €</div>
<td>15</td>  ← pretty please, don't do this

The script extracts the matching pattern and cleans its content to keep only a float value. This value is then stored in the cache and returned to the visitor’s web browser.
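
The extraction and cleaning steps could look something like the sketch below. It is a simplified illustration, not the per-shop patterns we actually maintain: the `PriceFinder` class, the `id`/`class` heuristic and the `clean_price` helper are assumptions, and the number parsing deliberately ignores thousands separators.

```python
import re
from html.parser import HTMLParser


def clean_price(text):
    """Reduce strings such as '$ 84' or '41 €' to a float, or None if no number is found."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


class PriceFinder(HTMLParser):
    """Prefer the product:price meta tag; fall back to an element marked as a price."""

    def __init__(self):
        super().__init__()
        self.meta_price = None
        self.fallback_price = None
        self._in_price_element = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "product:price":
            self.meta_price = clean_price(attributes.get("content") or "")
        elif attributes.get("id") == "price" or "price" in (attributes.get("class") or ""):
            self._in_price_element = True

    def handle_data(self, data):
        if self._in_price_element and self.fallback_price is None:
            self.fallback_price = clean_price(data)

    def handle_endtag(self, tag):
        self._in_price_element = False


def extract_price(html):
    """Return the price found in `html` as a float, or None."""
    finder = PriceFinder()
    finder.feed(html)
    finder.close()
    return finder.meta_price if finder.meta_price is not None else finder.fallback_price
```

An unmarked table cell gives a parser like this nothing to hold on to, which is why a meta tag or a clearly named element makes life easier for everyone.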

How long are prices cached?

One thing we absolutely do not want is to put unnecessary load on shops’ web servers. Since shops rarely change their prices more than occasionally (Amazon excepted), our script refreshes each price at most once per day.

This is significantly more server-friendly than, say, Google Shopping’s bots.

If the cached price hasn’t expired, the script simply uses its local cache.

Feel free to contact us for any additional information!