Feat rdb summary wide table #2035

Merged 15 commits on Dec 18, 2024

Conversation

Contributor

@FOkvj FOkvj commented Sep 22, 2024

Description

When a relational database table is wide, it has a large number of fields. Besides containing many redundant fields, the full field list may exceed the maximum sequence length accepted by the embedding model when the summary is embedded, so the generated embedding cannot accurately reflect the semantics of the summary.

For wide tables, I therefore separate the fields from the table's basic information. If a table has too many fields, the fields are split into multiple chunks during summarization, and no chunk exceeds the embedding model's maximum sequence length. If the table is not wide, the summary is unchanged: the table name, table description, and fields all go in the same chunk.

At retrieval time, the table name is retrieved first. The table name (id) is then used as a filter while the query drives a vector search over the field chunks, and finally the table name and table description are assembled with the retrieved fields to form the result.
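For readers who want a concrete picture, here is a rough sketch of the chunking and two-stage retrieval described above. It is not the code in this PR. The names `Chunk`, `summarize_table`, `retrieve`, and `overlap_score` are hypothetical, character count (`len`) stands in for the embedding model's sequence length, and a toy word-overlap score stands in for vector similarity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Chunk:
    """A piece of summary text plus metadata used for filtering (hypothetical type)."""
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def summarize_table(
    name: str,
    description: str,
    columns: List[str],
    max_len: int,
    length_fn: Callable[[str], int] = len,  # assumption: stand-in for a token counter
) -> Tuple[Chunk, List[Chunk]]:
    """Return (table_chunk, field_chunks); field_chunks is empty for narrow tables."""
    header = f"table_name: {name}\ntable_description: {description}"
    full = header + "\nfields:\n" + "\n".join(columns)
    if length_fn(full) <= max_len:
        # Narrow table: name, description, and fields stay in one chunk.
        return Chunk(full, {"table_name": name, "part": "table"}), []

    field_chunks: List[Chunk] = []
    current: List[str] = []
    for col in columns:
        # Close the current chunk when adding this field would exceed the limit.
        # Note: a single field longer than max_len still yields an oversized chunk.
        if current and length_fn("\n".join(current + [col])) > max_len:
            field_chunks.append(
                Chunk("\n".join(current), {"table_name": name, "part": "field"})
            )
            current = []
        current.append(col)
    if current:
        field_chunks.append(
            Chunk("\n".join(current), {"table_name": name, "part": "field"})
        )
    return Chunk(header, {"table_name": name, "part": "table"}), field_chunks


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words (not a real embedding)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(
    query: str,
    table_chunks: List[Chunk],
    field_chunks: List[Chunk],
    field_top_k: int = 2,
) -> str:
    """Pick the best table, filter field chunks by its name, then assemble the result."""
    table = max(table_chunks, key=lambda c: overlap_score(query, c.content))
    table_name = table.metadata["table_name"]
    candidates = [c for c in field_chunks if c.metadata["table_name"] == table_name]
    candidates.sort(key=lambda c: overlap_score(query, c.content), reverse=True)
    return "\n".join([table.content] + [c.content for c in candidates[:field_top_k]])
```

In a real setup, `length_fn` would count tokens with the embedding model's tokenizer, and both lookups would go through the vector store, with the table name passed as a metadata filter in the second one.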

How Has This Been Tested?

Summarization and retrieval of wide tables are tested in dbgpt/rag/assembler/tests/test_db_struct_assembler.py and dbgpt/rag/assembler/tests/test_embedding_assembler.py, respectively.

Snapshots:

Checklist:

  • My code follows the style guidelines of this project
  • I have already rebased the commits and made the commit messages conform to the project standard.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published in downstream modules

@fangyinc
Collaborator

@Aries-ckt Please review it.

@intfish123

@Aries-ckt Is there any progress on this? When can it be merged? I'm currently running into the same problem.

@FOkvj
Contributor Author

FOkvj commented Dec 11, 2024

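> @Aries-ckt Is there any progress on this? When can it be merged? I'm currently running into the same problem.

You can try this branch in the meantime.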

@Aries-ckt
Collaborator

Hi, the functionality in this PR looks fine. My main concern is that existing users' table schema vector data might no longer be found.

@FOkvj
Contributor Author

FOkvj commented Dec 12, 2024

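> Hi, the functionality in this PR looks fine. My main concern is that existing users' table schema vector data might no longer be found.

To address this, I have modified the code so that users' existing vector data can still be retrieved normally.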

@Aries-ckt
Collaborator

@FOkvj Hi, why did you remove field_vector_connector?
image

@FOkvj
Contributor Author

FOkvj commented Dec 17, 2024

@Aries-ckt Here, I made it get created by default, so I think that code is no longer necessary.
image

@Aries-ckt
Collaborator

image
But the call path get_summary -> _similarity_search -> _retrieve_field will then fail, because self._field_vector_store_connector is None.

@FOkvj
Contributor Author

FOkvj commented Dec 18, 2024

@Aries-ckt fixed

@Aries-ckt
Collaborator

Test Success
image

Collaborator

@Aries-ckt Aries-ckt left a comment

LGTM.

@Aries-ckt
Collaborator

@FOkvj Are you in our WeChat group? If so, what's your WeChat alias?

@FOkvj
Contributor Author

FOkvj commented Dec 18, 2024

@Aries-ckt Yep, I go by Wooop! 😊

Collaborator

@csunny csunny left a comment

LGTM~

@csunny csunny merged commit 9b0161e into eosphoros-ai:main Dec 18, 2024
4 checks passed
5 participants