CENTR has been running the Signs-of-Life domain crawler (v1) in stable production for almost two years. We believe this data is important for assessing the general health of domains among ccTLD registries, so the crawler's accuracy needs to be high. For this reason, we plan to provide reports on its performance and have now completed the first one. It shows that the crawler has been performing very well, with an accuracy of 94% for ccTLDs and 96% for gTLDs. This report provides more detail, breaking performance down by element of the crawler and giving data science metrics such as precision, recall and the F1 score.
To assess the performance of the crawler, domains are split into two groups: domains classified as no website, HTTP error or registrar, and all the others. The first group is classified using rules and URL detection logic, which are explicitly defined calculations and thus not trainable. Their accuracy has already been assessed in earlier trials.
For the remaining domains, classification is done using a Machine Learning (ML) step, which requires retraining with more labelled data to improve performance. As of the date of this report, 3023 “ML” domains are labelled from two sources:
- The manual labelling used to develop the initial version of the crawler: 2083 domains, exclusively from the .COM TLD.
- The domains labelled through the Labtool application. After filtering out some erroneous validations (including selections of the “unsure” button in the tool), 940 domains from various TLDs are used for retraining (domains extracted on 04/10/2022).
Using a cross-validation approach (separation of training and prediction domains), the performance on the “ML domains” is presented in the tables below.
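The separation of training and prediction domains can be illustrated with a minimal k-fold sketch. The report does not state the number of folds or the model used, so `k=5`, the helper names `k_fold_splits` and `cross_validated_predictions`, and the `fit`/`predict` callables are all assumptions made for illustration, not the crawler's actual code.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validated_predictions(features, labels, fit, predict, k=5, seed=0):
    """Return one prediction per item, made by a model that never saw that item.

    `fit(train_features, train_labels)` and `predict(model, feature)` are
    hypothetical callables standing in for the (unspecified) ML step.
    """
    predictions = [None] * len(labels)
    for train_idx, test_idx in k_fold_splits(len(labels), k, seed):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, features[i])
    return predictions
```

Because each labelled domain is predicted only by a model trained on the other folds, the resulting error counts are not inflated by domains the model has already seen.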
Between the Parked page and High Content categories, the crawler assigns the correct category 90.9% of the time. This corresponds to 275 errors out of 3023 domains, two-thirds of which are High Content pages predicted as Parked.
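As a quick check of the arithmetic, the accuracy figure follows directly from the error count. The sketch below also includes the standard definitions of precision, recall and F1 for one class; the function names are illustrative, and no per-class counts are filled in because the text does not give them.

```python
def accuracy_from_errors(n_errors, n_total):
    """Share of correctly classified items given the number of errors."""
    return 1 - n_errors / n_total


def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Figures taken from the report: 275 errors out of 3023 labelled ML domains.
ml_accuracy = accuracy_from_errors(275, 3023)
print(f"{ml_accuracy:.1%}")  # 90.9%
```

Since most of the errors are High Content pages predicted as Parked, one would expect recall on High Content and precision on Parked to be the metrics most affected.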
To get the overall performance, we take the weighted average of the accuracy on Errors, the accuracy on Registrars and the accuracy on ML domains. The results are described in the tables below:
Among ccTLDs, 29% of domains are Errors and 15% are Registrars; among gTLDs, the figures are 55% and 14% respectively. The accuracy on Errors and Registrars is much higher than on ML domains, as these categories rely on exact methods. As of the date of the report, the overall accuracy of the crawler is 94.3% on ccTLDs and 96.4% on gTLDs.
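The weighted average can be sketched as follows. The Errors and Registrars shares come from the text, and the ML share is inferred as the remainder. The per-group accuracies for Errors and Registrars are not given here, so the value of 1.0 used below is a placeholder assumption, as is applying the 90.9% ML figure to both TLD groups; the function and variable names are illustrative.

```python
def overall_accuracy(shares, accuracies):
    """Weighted average of per-group accuracies, weighted by each group's share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(shares[group] * accuracies[group] for group in shares)


# Shares from the report; the ML share is inferred as what is left over.
cctld_shares = {"errors": 0.29, "registrars": 0.15, "ml": 1 - 0.29 - 0.15}
gtld_shares = {"errors": 0.55, "registrars": 0.14, "ml": 1 - 0.55 - 0.14}

# PLACEHOLDER accuracies: 1.0 for the rule-based groups is an assumption,
# not a reported figure. Only the 0.909 ML accuracy comes from the report.
placeholder_accuracies = {"errors": 1.0, "registrars": 1.0, "ml": 0.909}

cctld_overall = overall_accuracy(cctld_shares, placeholder_accuracies)
gtld_overall = overall_accuracy(gtld_shares, placeholder_accuracies)
```

The calculation makes one point clear: gTLDs have a larger share of domains in the rule-based groups, so their overall accuracy is pulled further toward the higher accuracy of those groups.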