Conversation

@garyillyes

Generalized Record Extraction: Updated the regex in parseRecords to capture any valid Key: Value pair from robots.txt, instead of only matching a predefined list of rule types.
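
A minimal sketch of what such a generalized capture could look like, assuming a helper along these lines (the exact regex and function shape in parseRecords may differ):

// Illustrative only: capture any "key: value" pair from a robots.txt line
// instead of matching a fixed list of directive names.
const RECORD_RE = /^\s*([A-Za-z0-9_-]+)\s*:\s*(.*)$/;

function parseRecordLine(line: string): { key: string; value: string } | null {
  const withoutComment = line.split('#')[0];   // comments carry no record
  const match = withoutComment.match(RECORD_RE);
  if (!match) {
    return null;                               // blank or malformed line
  }
  return { key: match[1].toLowerCase(), value: match[2].trim() };
}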

Dynamic Directive Counting: Modified the logic for populating record_counts.by_type to dynamically include and count any rule encountered in the file. Rules not present in RECORD_COUNT_TYPES are normalized (hyphens replaced with underscores) for consistent output keys.
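
As a rough illustration of the dynamic counting and key normalization (the object and function names here are assumptions, not the metric's actual identifiers):

// Illustrative only: count every directive encountered, normalizing hyphens
// to underscores so e.g. "crawl-delay" is reported as "crawl_delay".
function countByType(byType: Record<string, number>, key: string): void {
  const normalized = key.replace(/-/g, '_');
  byType[normalized] = (byType[normalized] || 0) + 1;
}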

Sitemap Exclusion: Added a condition to exclude sitemap directives from the by_useragent breakdown, as sitemaps are global and not specific to any user-agent.

Schema Consistency: Preserved the initialization of standard record types to 0 to ensure a consistent output schema even when those specific directives are absent.
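
Taken together, the last two points might look roughly like this (RECORD_COUNT_TYPES is named in the metric, but its exact contents and the helper below are assumptions):

// Illustrative only: pre-seed the standard types at 0 for a stable schema,
// and keep sitemap out of the per-user-agent breakdown since it is global.
const RECORD_COUNT_TYPES = ['user_agent', 'allow', 'disallow', 'sitemap', 'crawl_delay', 'noindex'];

const byType: Record<string, number> = {};
for (const type of RECORD_COUNT_TYPES) {
  byType[type] = 0;
}

function countForUserAgent(
  byUserAgent: Record<string, Record<string, number>>,
  userAgent: string,
  key: string
): void {
  if (key === 'sitemap') {
    return; // sitemaps are site-wide, not tied to a user-agent group
  }
  byUserAgent[userAgent] = byUserAgent[userAgent] || {};
  byUserAgent[userAgent][key] = (byUserAgent[userAgent][key] || 0) + 1;
}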

This update allows the metrics to correctly identify and report on custom robots.txt directives (like "unicorns") while maintaining backward compatibility with the existing reporting structure.


Test websites:

https://almanac.httparchive.org/en/2022/
https://garyillyes.com
https://johnmu.com
https://www.census.gov


@tunetheweb tunetheweb left a comment


This makes sense to me and LGTM.

Do have some comments to consider, but I'm happy to merge as is.

Tagging @jroakes @Tiggerito @chr156r33n as people who set up, enhanced and/or used this in recent Web Almanacs, but I don't think any of the changes here would prevent similar analysis (and in fact would only open up more analysis in the future!). But let me know if you spot something here I didn't.

garyillyes and others added 2 commits January 29, 2026 16:38
Co-authored-by: Barry Pollard <barrypollard@google.com>
Co-authored-by: Barry Pollard <barrypollard@google.com>
@max-ostapenko

Let's skip all the rows with a value of 0 when something doesn't exist.

It's a big object, so multiplied by 50M pages we will save analysts a lot of bytes by including only whatever exists.

Refactor record counting to be fully dynamic rather than relying on static dictionaries with additional dynamic counting.
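
A minimal sketch of that suggestion, assuming the records have already been parsed into key/value pairs (names are hypothetical): build the counts purely from what is seen, so zero-valued keys never appear in the output.

// Illustrative only: fully dynamic counting with no pre-seeded zero entries,
// so the output object contains only directives that actually occur.
function buildByType(records: Array<{ key: string }>): Record<string, number> {
  const byType: Record<string, number> = {};
  for (const { key } of records) {
    const normalized = key.replace(/-/g, '_');
    byType[normalized] = (byType[normalized] || 0) + 1;
  }
  return byType;
}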
@github-actions

https://almanac.httparchive.org/en/2022/

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "size": 76,
    "size_kib": 0.07421875,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "user_agent": 1,
        "allow": 1,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "allow": 1
        }
      }
    }
  }
}
https://garyillyes.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "size": 3226,
    "size_kib": 3.150390625,
    "over_google_limit": false,
    "comment_count": 7,
    "record_counts": {
      "by_type": {
        "other": 98,
        "user_agent": 2,
        "allow": 1,
        "unicorns": 1
      },
      "by_useragent": {
        "*": {
          "allow": 1
        },
        "garybot": {
          "unicorns": 1,
          "other": 94
        }
      }
    }
  }
}
https://johnmu.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "size": 1643661,
    "size_kib": 1605.1376953125,
    "over_google_limit": true,
    "comment_count": 97,
    "record_counts": {
      "by_type": {
        "other": 16686,
        "sitemap": 1,
        "user_agent": 11,
        "disallow": 15,
        "p": 1,
        "s": 1
      },
      "by_useragent": {
        "googlebot": {
          "disallow": 2
        },
        "bingbot": {
          "disallow": 2
        },
        "duckduckbot": {
          "disallow": 2
        },
        "slurp": {
          "disallow": 2
        },
        "semrushbot": {
          "disallow": 1
        },
        "dotbot": {
          "disallow": 1
        },
        "qwantify": {
          "disallow": 1
        },
        "zoominfobot": {
          "disallow": 1
        },
        "voltron": {
          "disallow": 1
        },
        "ahrefsbot": {
          "disallow": 1
        },
        "*": {
          "disallow": 7,
          "other": 16685,
          "p": 1,
          "s": 1
        }
      }
    }
  }
}
https://www.census.gov

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "size": 1131,
    "size_kib": 1.1044921875,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "user_agent": 6,
        "disallow": 21,
        "allow": 12,
        "crawl_delay": 3,
        "noindex": 1,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "disallow": 5,
          "allow": 3
        },
        "w3c-checklink": {
          "disallow": 5,
          "allow": 3
        },
        "googlebot": {
          "crawl_delay": 1,
          "disallow": 5,
          "allow": 3
        },
        "yahoo! slurp": {
          "crawl_delay": 1,
          "disallow": 5,
          "allow": 3
        },
        "bingbot": {
          "crawl_delay": 1,
          "disallow": 5,
          "allow": 3
        },
        "usasearch": {
          "disallow": 1,
          "noindex": 1
        }
      }
    }
  }
}

@tunetheweb

OK LGTM - thanks for addressing the comments.

Will leave it open for a couple of days in case @jroakes @Tiggerito or @chr156r33n have any comments, but make sure this is merged before the next crawl starts on 10th February.

@Tiggerito

LGTM
