
Signs-of-life crawler performance report

Blog 26-10-2022

CENTR has been running the Signs-of-Life domain crawler (v1) in stable production for almost two years. We believe its data is important for assessing the general health of domains among ccTLD registries, so it matters that the crawler's accuracy is high. For this reason, we plan to provide reports on its performance. The first one is now complete and shows that the crawler has been performing very well, with an overall accuracy of 94% for ccTLDs and 96% for gTLDs. This report provides more detail, breaking down the elements of the crawler and presenting data science metrics such as precision, recall and the F1 score.

Benchmark dataset

To assess the performance of the crawler, domains are split into two groups: domains with no website, HTTP errors or a registrar classification on the one hand, and all other domains on the other. The first group is classified using rules and URL detection logic. These are explicitly defined calculations and thus not trainable, and their accuracy has already been assessed in earlier trials.

For the remaining domains, classification is done by a Machine Learning (ML) step, which requires retraining with more labelled data to improve performance. As of the date of this report, 3023 “ML” domains have been labelled, drawn from two sources:

  1. The manual labelling used to develop the initial version of the crawler: 2083 domains, exclusively from the .COM TLD.
  2. The domains labelled through the Labtool application. After filtering out some erroneous validations (including selections of the “unsure” button in the tool), 940 domains from various TLDs are used for retraining (domains extracted on 04/10/2022).

ML Performance

Performance on the “ML domains” was measured using a cross-validation approach (the domains used for training are kept separate from those used for prediction) and is presented in the tables below.
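To make the idea of separating training and prediction domains concrete, here is a minimal k-fold cross-validation sketch in plain Python. It is not the crawler's actual code: the function names, the number of folds, the toy data and the placeholder "majority class" model are all assumptions made for illustration. The point it shows is that every labelled domain is predicted by a model that never saw it during training.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_predictions(features, labels, train_fn, k=5, seed=0):
    """Predict each item with a model trained on the other folds only."""
    predictions = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held_out_set]
        model = train_fn(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions


def train_majority_class(train_features, train_labels):
    """Hypothetical placeholder model: always predicts the most common label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda feature: majority


# Toy data (hypothetical): 10 labelled domains, features are dummy values.
features = list(range(10))
labels = ["parked"] * 3 + ["high_content"] * 7

predictions = cross_validated_predictions(
    features, labels, train_majority_class, k=5
)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"cross-validated accuracy: {accuracy:.1%}")
```

In a real setting the placeholder model would be replaced by the actual classifier; the splitting logic stays the same.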

Performance

[Table image: 20221026 crawler accuracy 1]

Confusion Matrix

[Table image: 20221026 crawler accuracy 2]

Between the Parked page and High content categories, the crawler assigns the correct category 90.9% of the time. That corresponds to 275 errors out of 3023 domains, two thirds of which are High content pages predicted as Parked.
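As a rough illustration of how these figures relate, the sketch below recomputes the accuracy from the error count quoted above and gives the standard definitions of precision, recall and F1. The function and variable names are assumptions, and the counts passed to `precision_recall_f1` at the end are toy values, since the per-class counts of the confusion matrix are not reproduced in this text.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics from true positives, false positives
    and false negatives. Returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Figures quoted in the report.
TOTAL_DOMAINS = 3023
ERRORS = 275

accuracy = (TOTAL_DOMAINS - ERRORS) / TOTAL_DOMAINS
print(f"accuracy: {accuracy:.1%}")  # accuracy: 90.9%

# "Two thirds" is approximate, so these splits are estimates.
high_content_as_parked = round(ERRORS * 2 / 3)
parked_as_high_content = ERRORS - high_content_as_parked

# Toy counts (hypothetical, NOT taken from the report's confusion matrix).
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

For the Parked class, High content pages predicted as Parked count as false positives and Parked pages predicted as High content count as false negatives, so the error split above feeds directly into precision and recall once the number of correctly predicted Parked pages is known.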

Overall Performance

To get the overall performance, we take the weighted average of the accuracy on Errors, the accuracy on Registrars and the accuracy on ML domains, weighting each group by its share of domains. The results are shown in the tables below:

[Table image: 20221026 crawler accuracy 3]

Among ccTLDs, 29% of domains have errors and 15% are registrars; for gTLDs, the figures are 55% and 14% respectively. Accuracy on Errors and Registrars is much higher than on ML domains because those categories rely on exact methods. As of the date of this report, the overall accuracy of the crawler is 94.3% on ccTLDs and 96.4% on gTLDs.
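The weighted average can be sketched as follows. The group shares for ccTLDs come from the paragraph above, with the ML share taken as the remainder. The accuracies for Errors and Registrars are not stated in this text, so the values used here are placeholders, and the ML accuracy of 90.9% is the figure measured on all labelled domains, which may differ per TLD group. The function name is also an assumption.

```python
def overall_accuracy(shares, accuracies):
    """Weighted average of per-group accuracies, weighted by group share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(shares[group] * accuracies[group] for group in shares)


# ccTLD shares from the report; ML domains are the remaining share.
cctld_shares = {"errors": 0.29, "registrars": 0.15}
cctld_shares["ml"] = 1.0 - sum(cctld_shares.values())

# Placeholder accuracies (hypothetical): only the ML figure is quoted
# in the report, and it is not specific to ccTLDs.
assumed_accuracies = {"errors": 0.99, "registrars": 0.99, "ml": 0.909}

cctld_overall = overall_accuracy(cctld_shares, assumed_accuracies)
print(f"illustrative overall accuracy: {cctld_overall:.1%}")
```

Because gTLDs have a larger share of domains in the rule-based Errors group (55% against 29%), the same calculation gives them a higher overall figure even if the ML step performs identically on both.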

For more information, contact CENTR’s Data Analyst.

Published By Patrick Myles
Patrick Myles is the Data Analyst of CENTR, the European country code top-level domain association. Patrick is responsible for data management, member surveys and the development and technical maintenance of the CENTRstats data platform.