feat: add mmr deduplication #954
Open
hijzy wants to merge 44 commits into MemTensor:dev-20260126-v2.0.4 from hijzy:mmr
+420
−33
* feat: add timer for split text
* feat: add chat_handler log
* feat: add chat_handler log
* fix: chat in playground bug: use index in null list
* chore: deprecated warning
* fix: we don't use query when search in graph-db
# Conflicts: # src/memos/api/handlers/search_handler.py
Description
Currently, memory retrieval may recall the same fact or topic multiple times. Typical sources include:
These characteristics create systemic issues under queries with no explicit constraints: a redundant candidate set, reduced information density, and downstream generation that is more easily affected by repeated information.
This optimization focuses on deduplicating similar memories, reducing semantic redundancy within the candidate set and improving diversity and effective information density.
Given a query, the retrieval system returns a set of candidate memories (including relevance scores, embeddings, etc.). The goal is to select the Top-K results while maintaining overall relevance, minimizing semantic duplication within Top-K, and preserving coverage of useful information.
After search results are returned, an internal MMR (Maximal Marginal Relevance) deduplication function is invoked to perform subset selection and re-ranking. It balances relevance against diversity. It also keeps the diversity penalty from overwhelming the relevance term when relevance is in the long tail (close to zero), which could otherwise cause the selected set to drift away from the query intent.
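For reviewers unfamiliar with MMR, here is a minimal sketch of greedy MMR selection over candidates that carry a relevance score and an embedding. It is not the code in this PR: the names `cosine`, `mmr_select`, and `lam` are hypothetical, and the long-tail guard shown (scaling the diversity penalty by the candidate's own relevance, so the penalty can never exceed the relevance term) is only one assumed way to get the behavior described above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(candidates, top_k, lam=0.7):
    """Greedy MMR over a list of (relevance, embedding) pairs.

    Returns the indices of the selected candidates in selection order.
    `lam` trades relevance (1.0) against diversity (0.0).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            rel, emb = candidates[i]
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(emb, candidates[j][1]) for j in selected),
                default=0.0,
            )
            redundancy = max(redundancy, 0.0)  # ignore negative similarity
            # Assumed guard: the penalty is scaled by the candidate's own
            # relevance, so it shrinks as relevance approaches zero.
            score = lam * rel - (1.0 - lam) * rel * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Hypothetical candidates: index 1 is a near-duplicate of index 0.
candidates = [
    (0.90, [1.0, 0.0]),
    (0.85, [1.0, 0.01]),
    (0.60, [0.0, 1.0]),
    (0.05, [0.0, -1.0]),
]
print(mmr_select(candidates, top_k=2))
```

In this example the near-duplicate at index 1 is skipped in favor of the less relevant but distinct candidate at index 2, while the long-tail candidate at index 3 does not jump ahead merely because it is dissimilar. In a "retrieve more first" setup, `candidates` would hold more than `top_k` results from the search step.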
The overall implementation follows a “retrieve more first, then deduplicate” approach, with three key strategies:
Related Issue (Required):
New feature #978
How Has This Been Tested?
Test Script Or Test Steps:
Checklist
I have performed a self-review of my own code
I have commented my code in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have created related documentation issue/PR in MemOS-Docs (if applicable)
I have linked the issue to this PR (if applicable)
I have mentioned the person who will review this PR
Reviewer Checklist
closes #xxxx (Replace xxxx with the GitHub issue number)
Made sure Checks passed
Tests have been provided