Added a plugin to make the documentation LLM friendly #413
🚀 Preview Deployment: Your preview deployment is ready! 🔗 Preview URL: https://preview.harper-docs.stage.harperfabric.com/pr-413 This preview will update automatically when you push new commits.
package.json
Outdated
"@docusaurus/theme-search-algolia": "3.9.1",
"@easyops-cn/docusaurus-search-local": "0.52.1",
"@mdx-js/react": "3.1.1",
"@signalwire/docusaurus-plugin-llms-txt": "^1.2.2",
Let's pin this version.
Do we have a process for ensuring this stays up to date? Does that process provide some safety when combined with pinning? I'd think that we want to keep this pretty up to date, and, as always, defaulting to pinning seems like a risky choice.
I think this is the same as our other repos that rely on Socket / Renovate to update dependencies automatically. I just noticed that many of the other dependencies here are pinned, so I wanted to stick to that pattern.
I hadn't noticed any socket or renovate PRs and https://github.com/HarperFast/documentation/commits/main/package.json looks pretty quiet, which makes me question this.
Good point - we don't have those tools set up here yet, but we have pinned dependency versions. Let's get the tools enabled asap
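Until automated update tooling is enabled, the pinned-versus-ranged distinction discussed in this thread can be checked mechanically. The following is a minimal sketch, not an existing tool: `findUnpinned` and `EXACT_VERSION` are hypothetical names, and "pinned" is assumed to mean an exact `x.y.z` version with no range operator. The dependency versions are the four shown in the diff above.

```typescript
// Hypothetical helper for illustration; not part of npm, Renovate, or Socket.
// A version counts as pinned only when it is an exact x.y.z
// (optionally followed by a prerelease or build suffix).
const EXACT_VERSION = /^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$/;

function findUnpinned(deps: Record<string, string>): string[] {
  return Object.entries(deps)
    .filter(([, version]) => !EXACT_VERSION.test(version))
    .map(([name]) => name);
}

// Versions copied from the package.json diff under review.
const dependencies: Record<string, string> = {
  "@docusaurus/theme-search-algolia": "3.9.1",
  "@easyops-cn/docusaurus-search-local": "0.52.1",
  "@mdx-js/react": "3.1.1",
  "@signalwire/docusaurus-plugin-llms-txt": "^1.2.2",
};

// Lists every dependency whose version uses a range such as ^ or ~.
const unpinned = findUnpinned(dependencies);
console.log(unpinned);
```

A check like this only reports ranges; it says nothing about whether pinned versions are being kept current, which is the concern raised above.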
docusaurus.config.ts
Outdated
siteTitle: 'Harper Documentation',
siteDescription: 'Comprehensive guide to developing on and using the Harper platform',
depth: 2,
content: {
  includeBlog: true,
  includePages: true,
  enableLlmsFullTxt: true // Optional: generates llms-full.txt
}
Can you include some more comments, or add a section to a markdown doc, explaining what this configuration means? For example, what does depth: 2 do? Why not 3? And what about includeBlog? We don't really have a blog here.
Along those lines, how did the organization vary and improve at 2 vs 3? Claude is recommending 3 for docs (for better organization - in its 'opinion' 😄 )
oh, nvm, I see the comment now
here's claude's analysis after changing the depth to 3 and setting includeVersionedDocs: false:
⏺ Perfect! The changes are working beautifully. Here's the comparison:
📊 Before vs After Comparison
File Size Reduction
┌────────────────────────────┬─────────┬────────┬──────────────────────────┐
│ Metric │ Before │ After │ Change │
├────────────────────────────┼─────────┼────────┼──────────────────────────┤
│ llms.txt │ 220KB │ 67KB │ ✅ 69% smaller │
├────────────────────────────┼─────────┼────────┼──────────────────────────┤
│ llms-full.txt │ 5.4MB │ 1.2MB │ ✅ 78% smaller │
├────────────────────────────┼─────────┼────────┼──────────────────────────┤
│ Line count (llms.txt) │ 1,083 │ 1,528 │ More organized structure │
├────────────────────────────┼─────────┼────────┼──────────────────────────┤
│ Line count (llms-full.txt) │ 120,078 │ 27,732 │ ✅ 77% reduction │
└────────────────────────────┴─────────┴────────┴──────────────────────────┘
Content Filtering
┌─────────────────────────────┬────────┬───────┬──────────────────────────┐
│ Metric │ Before │ After │ Result │
├─────────────────────────────┼────────┼───────┼──────────────────────────┤
│ Documents processed │ ~730+ │ 317 │ ✅ Only current version │
├─────────────────────────────┼────────┼───────┼──────────────────────────┤
│ Old version links (4.1-4.6) │ 610 │ 0 │ ✅ Completely removed │
├─────────────────────────────┼────────┼───────┼──────────────────────────┤
│ Current version links │ 114 │ 114 │ ✅ Preserved │
├─────────────────────────────┼────────┼───────┼──────────────────────────┤
│ Routes excluded │ 0 │ 616 │ ✅ Old versions filtered │
└─────────────────────────────┴────────┴───────┴──────────────────────────┘
Hierarchy Structure with depth: 3
The depth: 3 setting creates a much better organized structure:
## docs ← Level 2: Main section
### administration ← Level 3: Category
#### harper-studio ← Level 4: Subcategory
- [Create Account](...)
- [Login](...)
- [Manage Databases](...)
#### logging ← Level 4: Subcategory
- [Standard Logging](...)
- [Audit Logging](...)
Hierarchy breakdown:
- 5 Level 2 sections (## - top level: docs, fabric, learn, etc.)
- 21 Level 3 sections (### - categories: administration, developers, deployments)
- 219 Level 4 sections (#### - subcategories: harper-studio, logging, cloning)
Why depth: 3 is Perfect
With your URL structure like /docs/administration/harper-studio/create-account:
- depth: 1 → Would group by /docs only (too broad)
- depth: 2 → Would group by /docs/administration (okay, but loses subcategory structure)
- depth: 3 ✅ → Groups by /docs/administration/harper-studio (ideal - preserves logical structure)
Benefits for LLMs
1. Cleaner context - Only current (4.7) documentation, no confusion from old versions
2. Smaller files - 78% reduction means faster processing
3. Better organization - Deep hierarchy helps LLMs understand document relationships
4. Focused information - 317 current docs vs 730+ mixed versions
Recommendation
These changes look excellent! The output is:
- ✅ Much smaller and faster to process
- ✅ Only includes current documentation
- ✅ Well-organized with proper hierarchy
- ✅ No old version confusion
This is exactly what you want for LLM-friendly documentation. The PR should include these configuration settings in the final version.
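The depth behavior described in the analysis above (grouping routes by their first N path segments) can be illustrated with a short sketch. This is not the plugin's implementation: `groupKey` and `groupByDepth` are hypothetical helpers, and apart from `/docs/administration/harper-studio/create-account`, which is quoted in the analysis, the route slugs are assumed for illustration.

```typescript
// Hypothetical helpers; they mirror the grouping described in the analysis,
// not the plugin's actual code.
function groupKey(routePath: string, depth: number): string {
  // Drop empty segments produced by leading or trailing slashes.
  const segments = routePath.split("/").filter((s) => s.length > 0);
  return "/" + segments.slice(0, depth).join("/");
}

function groupByDepth(routes: string[], depth: number): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const route of routes) {
    const key = groupKey(route, depth);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(route);
  }
  return groups;
}

// Only the first route appears in the thread; the other slugs are assumed.
const routes = [
  "/docs/administration/harper-studio/create-account",
  "/docs/administration/harper-studio/login",
  "/docs/administration/logging/standard-logging",
];

// Prints the group keys produced at each depth.
for (const depth of [1, 2, 3]) {
  console.log(depth, Object.keys(groupByDepth(routes, depth)));
}
```

Under these assumptions, depths 1 and 2 collapse all three routes into a single group, while depth 3 separates harper-studio from logging, which matches the trade-off the analysis describes.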
At a high level, the whole concept of creating documentation in markdown, converting it to HTML, and then converting it back to markdown is... 🤔. Can we simply make the original markdown publicly available? How much longer do we expect AI to actually prefer markdown to HTML? Or is that advice over a month old and no longer relevant in modern AI?
I'm assuming markdown is preferred due to its lower token consumption vs. HTML. That'd be my only guess 🤷🏾
There is token consumption in GEO? Whose tokens are being consumed?
I don't think it has much to do with token consumption but rather with the style of the input. These LLMs are only really good at words; they aren't servers. They can't necessarily parse HTML as easily as we assume. Since Markdown is much more readable, that is what LLMs prefer.
I also agree, though; why can't we just expose the markdown source for our pages rather than something else? But I don't really know what this is all about anyway, so maybe it's just the latest wave of how we can optimize our site for AI robots.
lol so what's with the craze to add markdown files for LLMs if this research says it's pointless? Just social media noise? TBH if I hadn't seen this PR and actively searched for it on LinkedIn, I wouldn't have known this was a recent change for doc sites.
It's still just a proposed standard. We could follow Anthropic's example (https://platform.claude.com/llms.txt). I asked Claude how it would discover Harper docs, and it said it would only probe for an llms.txt if explicitly asked, e.g. "here's the llms.txt for harper, help me build something" (this could be an instruction in a Claude skill, for example), or if it were configured in an IDE somehow.
It helps make our docs more discoverable by LLMs and more usable for folks who have a more integrated AI workflow.
These are publicly accessible docs, so LLMs should have already ingested this information right? Giving LLMs directives about where to look, would be for documentation that LLMs didn't previously know about?
You are pretty confident that Flavio is wrong? Or there are other LLM pathways?
A myth like this could gain traction in our industry? Inconceivable :)
I'll revisit this PR today, y'all. Sorry for the delay!
875e453 to 90b3159
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
  "lint": "echo 0;",
  "preview:pr": "node scripts/preview-pr.mjs"
},
"dependencies": {
If we go forward with this I don't think these are dependencies. I think they're all dev dependencies.
I'll be transparent and state I'm not confident Flavio is wrong. My assertion comes from diving into Mintlify, Vercel, and Appwrite with their use/implementation of
I can do more research and get back to you. I'd love to know if @Ethan-Arrowood @cb1kenobi know anyone at Vercel who could speak more to this? I can reach out to the head of devrel at Appwrite to see if they can share the metrics related to
I don't think I know anyone directly for you to speak with. If you identified someone in particular from Vercel, I could see if I can make an introduction (assuming I knew them). Otherwise, I asked Claude (ironic, I know) to help facilitate some additional research. Here is its response:
I've found a good collection of sources on the llm.txt debate. Here's what I found:
Against llm.txt effectiveness:
For/Neutral on llm.txt:
Key takeaway: The consensus from research (particularly the SE Ranking 300k domain study) shows no measurable impact on AI citations currently, though some argue it's worth implementing as low-risk future-proofing. Major platforms haven't confirmed support, and log analysis shows LLM crawlers aren't fetching these files.
Since there is no overly compelling research pointing to its effectiveness, my additional acceptance criteria are:
As long as this is not going to be a major thing to maintain, I'm fine with including it on the grounds of "low-risk future-proofing". Additionally, if we are to proceed with this, can we update it so it only uses this plugin for production deployments?
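The production-only request above could be sketched roughly as follows. This is an assumption-laden illustration rather than the repository's configuration: `llmsTxtPlugins` and `isProductionDeploy` are hypothetical names, the option values are taken from this thread, and how a build would distinguish a production deployment from a PR preview (for example, an environment variable set by CI) is left open because the thread does not say.

```typescript
// A plugin entry in the shape Docusaurus accepts: a name, or a [name, options] pair.
type PluginEntry = string | [string, Record<string, unknown>];

// Hypothetical helper: returns the llms.txt plugin entry only for production deploys.
function llmsTxtPlugins(isProductionDeploy: boolean): PluginEntry[] {
  if (!isProductionDeploy) {
    return [];
  }
  return [
    [
      "@signalwire/docusaurus-plugin-llms-txt",
      {
        // Option names and values as they appear in this PR's discussion.
        siteTitle: "Harper Documentation",
        depth: 3,
      },
    ],
  ];
}

// In docusaurus.config.ts the result might be spread into the existing list, e.g.
//   plugins: [...otherPlugins, ...llmsTxtPlugins(isProductionDeploy)]
console.log(llmsTxtPlugins(false).length, llmsTxtPlugins(true).length);
```

Keeping the decision in one small function means preview and local builds skip the plugin's work entirely, while the plugin options stay in a single place.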
The SE Ranking conclusion suggests (and provides evidence) that there is actually some risk (of decreased accuracy) though, right? I think this PR is actually conflating two different suggestions:
IMHO, #1, just adding an llms.txt at the root, without any new plugins, does seem relatively low-risk and I'm fine with that. On the other hand, the impact of #2 doesn't seem well defined or understood in this PR, and taking on additional dependencies, build overhead, and supply-chain risk without any clear evidence or understanding of what we are doing seems ill-advised to me.
That is a great point. I'm +1 to that plan.
I guess the suggested direction is me possibly creating a script to append all
Based on my reading, I would consider that to be a poor-quality llms.txt since it doesn't provide any information that the docs don't already provide. The guidance link that Ethan provided points to https://www.fastht.ml/docs/llms.txt as a good example. I would think this is also potentially close to what we want: https://github.com/HarperFast/application-template/blob/77c9b7845a2e3fc1057f8c53c61cab4621f1b23e/AGENTS.md
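To make the contrast with a bare link dump concrete, here is a minimal sketch of rendering a hand-curated llms.txt in the layout the proposed standard describes: an H1 title, a blockquote summary, then H2 sections of annotated links. `renderLlmsTxt` and its types are hypothetical, the title and summary reuse the `siteTitle` and `siteDescription` from this PR, and the section name and link note are invented for illustration only.

```typescript
interface LlmsLink {
  title: string;
  url: string;
  note: string; // curated context that a plain link list would not carry
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

// Hypothetical renderer; produces markdown in the llms.txt layout.
function renderLlmsTxt(title: string, summary: string, sections: LlmsSection[]): string {
  const lines: string[] = [`# ${title}`, "", `> ${summary}`];
  for (const section of sections) {
    lines.push("", `## ${section.heading}`, "");
    for (const link of section.links) {
      lines.push(`- [${link.title}](${link.url}): ${link.note}`);
    }
  }
  return lines.join("\n") + "\n";
}

const llmsTxt = renderLlmsTxt(
  "Harper Documentation",
  "Comprehensive guide to developing on and using the Harper platform",
  [
    {
      heading: "Administration",
      links: [
        {
          title: "Create Account",
          url: "/docs/administration/harper-studio/create-account",
          // Illustrative note, not taken from the Harper docs.
          note: "Start here when setting up Harper Studio for the first time.",
        },
      ],
    },
  ],
);

console.log(llmsTxt);
```

The value in this approach comes from the curated notes, so the file would be maintained by hand or from a small data file rather than generated from the site's routes.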
Closing and creating a different PR related to the suggestions! Thanks for the feedback, y'all!
🧹 Preview Cleanup: The preview deployment for this PR has been removed.
No description provided.