Update robots_txt.js to extract unknown rules dynamically. #191
Conversation
tunetheweb left a comment
This makes sense to me and LGTM.
I do have some comments to consider, but I'm happy to merge as is.
Tagging @jroakes @Tiggerito @chr156r33n as people who set up, enhanced and/or used this in recent Web Almanacs, but I don't think any of the changes here would prevent similar analysis (and in fact would only open up more analysis in the future!). But let me know if you spot something here I didn't.
Co-authored-by: Barry Pollard <barrypollard@google.com>
Let's skip all the rows with value 0, when something doesn't exist. It's a big object, so multiplied by 50M, we will save analysts a lot of bytes by including only whatever exists.
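A minimal sketch of that suggestion, assuming the counters are plain objects of directive name to count; the helper name is hypothetical, not from the PR:

```js
// Hypothetical helper, not from robots_txt.js: drop zero-valued counters
// before the metric is serialized, so absent directives cost no bytes.
function stripZeroCounts(counts) {
  return Object.fromEntries(
    Object.entries(counts).filter(([, count]) => count > 0)
  );
}

// stripZeroCounts({ user_agent: 1, allow: 1, disallow: 0, sitemap: 1 })
//   -> { user_agent: 1, allow: 1, sitemap: 1 }
```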
Refactor record counting to be fully dynamic rather than relying on static dictionaries with additional dynamic counting.
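Roughly, "fully dynamic" counting means every directive key seen in the file becomes its own counter, with hyphens normalized to underscores. A simplified, assumed shape (not the PR's exact code):

```js
// Assumed shape of dynamic directive counting. `records` is taken to be an
// array of { type, value } objects produced by the robots.txt parser.
const byType = {};
for (const { type } of records) {
  const key = type.toLowerCase().replace(/-/g, '_'); // e.g. "Crawl-delay" -> "crawl_delay"
  byType[key] = (byType[key] || 0) + 1;              // unknown directives (e.g. "unicorns") count too
}
```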
https://almanac.httparchive.org/en/2022/
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"size": 76,
"size_kib": 0.07421875,
"over_google_limit": false,
"comment_count": 0,
"record_counts": {
"by_type": {
"user_agent": 1,
"allow": 1,
"sitemap": 1
},
"by_useragent": {
"*": {
"allow": 1
}
}
}
}
}

https://garyillyes.com
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"size": 3226,
"size_kib": 3.150390625,
"over_google_limit": false,
"comment_count": 7,
"record_counts": {
"by_type": {
"other": 98,
"user_agent": 2,
"allow": 1,
"unicorns": 1
},
"by_useragent": {
"*": {
"allow": 1
},
"garybot": {
"unicorns": 1,
"other": 94
}
}
}
}
}

https://johnmu.com
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"size": 1643661,
"size_kib": 1605.1376953125,
"over_google_limit": true,
"comment_count": 97,
"record_counts": {
"by_type": {
"other": 16686,
"sitemap": 1,
"user_agent": 11,
"disallow": 15,
"p": 1,
"s": 1
},
"by_useragent": {
"googlebot": {
"disallow": 2
},
"bingbot": {
"disallow": 2
},
"duckduckbot": {
"disallow": 2
},
"slurp": {
"disallow": 2
},
"semrushbot": {
"disallow": 1
},
"dotbot": {
"disallow": 1
},
"qwantify": {
"disallow": 1
},
"zoominfobot": {
"disallow": 1
},
"voltron": {
"disallow": 1
},
"ahrefsbot": {
"disallow": 1
},
"*": {
"disallow": 7,
"other": 16685,
"p": 1,
"s": 1
}
}
}
}
}

https://www.census.gov
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"size": 1131,
"size_kib": 1.1044921875,
"over_google_limit": false,
"comment_count": 0,
"record_counts": {
"by_type": {
"user_agent": 6,
"disallow": 21,
"allow": 12,
"crawl_delay": 3,
"noindex": 1,
"sitemap": 1
},
"by_useragent": {
"*": {
"disallow": 5,
"allow": 3
},
"w3c-checklink": {
"disallow": 5,
"allow": 3
},
"googlebot": {
"crawl_delay": 1,
"disallow": 5,
"allow": 3
},
"yahoo! slurp": {
"crawl_delay": 1,
"disallow": 5,
"allow": 3
},
"bingbot": {
"crawl_delay": 1,
"disallow": 5,
"allow": 3
},
"usasearch": {
"disallow": 1,
"noindex": 1
}
}
}
}
}
OK LGTM - thanks for addressing the comments. Will leave it open for a couple of days in case @jroakes @Tiggerito or @chr156r33n have any comments, but make sure this is merged before the next crawl starts on 10th February.
LGTM |
- Generalized Record Extraction: Updated the regex in parseRecords to capture any valid "Key: Value" pair from robots.txt, instead of only matching a predefined list of rule types.
- Dynamic Directive Counting: Modified the logic for populating record_counts.by_type to dynamically include and count any rule encountered in the file. Rules not present in RECORD_COUNT_TYPES are normalized (hyphens replaced with underscores) for consistent output keys.
- Sitemap Exclusion: Added a condition to exclude sitemap directives from the by_useragent breakdown, as sitemaps are global and not specific to any user-agent.
- Schema Consistency: Preserved the initialization of standard record types to 0 to ensure a consistent output schema even when those specific directives are absent.
This update allows the metrics to correctly identify and report on custom robots.txt directives (like "unicorns") while maintaining backward compatibility with the existing reporting structure.
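For readers who haven't opened the diff, here is a minimal sketch of the behaviour described above. parseRecords is the function named in robots_txt.js, but the regex, the single-active-user-agent simplification, and the countByUserAgent helper are illustrative assumptions rather than the PR's exact code:

```js
// Sketch only: generalized "Key: Value" capture plus the by_useragent
// breakdown with sitemap excluded. Real robots.txt groups can list several
// user-agents per group; this sketch tracks just one for brevity.
function parseRecords(text) {
  const records = [];
  for (const line of text.split(/\r?\n/)) {
    // Any "Key: Value" pair is captured, not just a predefined list of rules.
    const match = line.match(/^\s*([A-Za-z][\w-]*)\s*:\s*(.*)$/);
    if (match) {
      records.push({ type: match[1].toLowerCase(), value: match[2].trim() });
    }
  }
  return records;
}

function countByUserAgent(records) {
  const byUserAgent = {};
  let agent = null;
  for (const { type, value } of records) {
    const key = type.replace(/-/g, '_'); // normalize e.g. crawl-delay -> crawl_delay
    if (key === 'user_agent') {
      agent = value.toLowerCase();
      byUserAgent[agent] = byUserAgent[agent] || {};
    } else if (agent && key !== 'sitemap') {
      // Sitemaps are global, so they stay out of the per-user-agent counts.
      byUserAgent[agent][key] = (byUserAgent[agent][key] || 0) + 1;
    }
  }
  return byUserAgent;
}
```

With this shape, a custom directive such as "unicorns" lands in the per-user-agent counts just like allow or disallow, which is what the garyillyes.com test output above shows.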
Test websites: