DynamicLLMPathExtractor for Entity Detection With a Schema inferred by LLMs on the fly #14566
Conversation
Cool PR! I'm kind of confused how this differs exactly from the schema graph extractor? I guess the main difference is that here we are using prompt output parsing instead of function calling, which will likely be less reliable. The schema extractor has a strict=True/False option to also allow ontology expansion, fyi
Thank you! Through extensive experimentation with SchemaLLMPathExtractor, I've observed several limitations:
AdvancedLLMPathExtractor addresses these limitations by:
While we could have modified SchemaLLMPathExtractor, I chose to create a separate extractor to preserve SchemaLLMPathExtractor's utility in scenarios where sticking to a predefined schema is crucial. Similarly, while customizing SimpleLLMPathExtractor (by passing it a custom prompt and an output parsing function) was an option, I believed a standalone extractor would be more user-friendly for those seeking to quickly generate rich, detailed graphs from raw text without a predefined ontology, which is something SimpleLLMPathExtractor fails to do out of the box.

In essence, AdvancedLLMPathExtractor is designed to leverage the LLM's ability to infer and expand schemas, even encouraging "hallucination" of fitting schemas when appropriate. This approach can lead to bigger, more diverse, and more comprehensive knowledge graphs, which is especially useful when the user doesn't have any expert knowledge about the raw text they want to generate a KG from.

Below are examples of testing the three different extractors on the first paragraph + History section of the "OpenAI" Wikipedia page: https://en.wikipedia.org/wiki/OpenAI

Input:
Parameters used for all three path extractors.
Path extractors used:
Chunk nodes have been hidden for easier readability.

SimpleLLMPathExtractor:

SchemaLLMPathExtractor:

AdvancedLLMPathExtractor:

I plan to add some Pydantic validation later, just to make sure that the JSON output returned is what we expect. So far, I haven't run into issues where the JSON output was incompatible with the parsing function. My testing was limited to the GPT models, Mixtral, and Llama 3, with low temperatures.
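To make that comparison concrete, here's a minimal sketch of how the three extractors might be instantiated (using the DynamicLLMPathExtractor name the class ends up with later in this thread; the model name, type lists, and parameter values are illustrative assumptions, not the exact ones used in the experiment above):

```python
from typing import Literal

from llama_index.core.indices.property_graph import (
    DynamicLLMPathExtractor,
    SchemaLLMPathExtractor,
    SimpleLLMPathExtractor,
)
from llama_index.llms.openai import OpenAI

# Shared LLM; the comment above mentions GPT models at low temperature.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.0)

# Untyped baseline: every entity gets a generic label.
simple = SimpleLLMPathExtractor(llm=llm, max_paths_per_chunk=20)

# Strict predefined schema: only these (assumed) types are allowed.
schema = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=Literal["PERSON", "ORGANIZATION", "PRODUCT"],
    possible_relations=Literal["FOUNDED", "WORKS_AT", "RELEASED"],
    strict=True,
)

# Dynamic: an optional seed ontology that the LLM may expand on the fly.
dynamic = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    allowed_entity_types=["PERSON", "ORGANIZATION"],
    allowed_relation_types=["FOUNDED", "WORKS_AT"],
)
```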
@Allaa-boutaleb alright, thanks for the detailed breakdown! I think I'm sold. But I think the name doesn't quite feel right. What if a more advanced extractor comes along? 😅 Would you consider changing it? Maybe something like
Lastly, can we add an example notebook showing how to use this? It would be super helpful for showing off the feature. (In terms of future work, one thing that people have also been asking for is some kind of dedicated property extraction per KG node, rather than inheriting all metadata from the source chunk as properties. That could be an expansion of this, or a separate module. Maybe a problem to tackle next if you are interested 💪🏻)
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
Hello again, I agree. "Advanced" doesn't really fit as a name. I've changed it to DynamicLLMPathExtractor and split the prompt into a TMPL string and a PromptTemplate to be more consistent with how other prompts are defined. I've also added a notebook comparing the three different path extractors, directly highlighting the issues I've stated above. I also agree about the need to extract certain information as node metadata instead of treating everything as entity/relation pairs. I've done some initial experiments before, but it was nothing more than some clever prompt engineering, which isn't always reliable, though I'll definitely look into it 😁
@logan-markewich Apologies for the frequent commits. This is the final one. I had to delete some temporary files that got created out of nowhere alongside the notebook when trying to visualize the KGs with pyvis. I've also renamed/improved the output parser to make it less rigid and more adaptive to the LLM's outputs. I've cleaned up the notebook as well and made it more compact. You can test the notebook for yourself here: Dynamic_KG_Extraction.ipynb. Please let me know if there's anything else that needs to be done for this to be eligible for a merge. Thank you!
Hello,
Here's a notebook I made which demonstrates how to use the DynamicLLMPathExtractor:
https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/property_graph/Dynamic_KG_Extraction.ipynb
The llama-index documentation needs to be updated.
The prompts used can be found here: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/prompts/default_prompts.py under the names `DEFAULT_DYNAMIC_EXTRACT_PROPS_PROMPT` and `DEFAULT_DYNAMIC_EXTRACT_PROMPT`. You can always use a different prompt, but make sure to pass an adequate output parser with it, otherwise it won't work. If you have any further questions, don't hesitate to reach out and I'll help. :)
Have a good day.
On Fri, Jul 26, 2024 at 4:10 AM Fernanda De León wrote:

Do you have any examples on how to use extract_prompt in DynamicPathExtractor? I've only found an example of extract_prompt usage here in SimpleLLMPathExtractor: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/

When I try to use the same prompt as shown in the link above with Dynamic I get this result:

image.png <https://github.com/user-attachments/assets/4c084a6a-8d05-4135-864c-17f7a245bdd1>

If I remove it my graph looks perfect. I also saw the knowledge-graphs default_prompts.py from the library and followed a similar pattern but still no success.
issue: #14975

Hi! So I think I'm parsing it correctly: template = PromptTemplate("my prompt using as reference

The issue is that it works correctly with

Also I have a question regarding

Thanks for taking the time to answer!!
Hey! So when you visualize the knowledge graph and only see the chunk node with no entity nodes, it means there was probably an output parsing problem. Unfortunately, the way I've coded it here is very simple and has a lot of room for improvement, as we rely on output parsing to handle the LLM's output instead of function calling. So if the LLM doesn't stick to the format given in the prompt and instead generates output in an unexpected format, the parsing can fail: even if the LLM did extract triplets, the parsing function may fail to retrieve them over something as small as a minor typo. Try passing this function as an output parser through the `parse_fn` parameter:

```python
import re
from typing import List, Tuple

from loguru import logger  # pip install loguru, for debug purposes
from llama_index.core.graph_stores.types import EntityNode, Relation


def default_parse_dynamic_triplets(
    llm_output: str,
) -> List[Tuple[EntityNode, Relation, EntityNode]]:
    """
    Parse the LLM output and convert it into a list of
    entity-relation-entity triplets. This function is flexible
    and can handle various output formats.

    Args:
        llm_output (str): The output from the LLM, which may be
            JSON-like or plain text.

    Returns:
        List[Tuple[EntityNode, Relation, EntityNode]]: A list of triplets.
    """
    logger.debug(f"LLM Output: {llm_output}")
    triplets = []
    # Regular expression matching the structure of each dict in the list
    pattern = r"{'head': '(.*?)', 'head_type': '(.*?)', 'relation': '(.*?)', 'tail': '(.*?)', 'tail_type': '(.*?)'}"
    # Find all matches in the output
    matches = re.findall(pattern, llm_output)
    for match in matches:
        head, head_type, relation, tail, tail_type = match
        head_node = EntityNode(name=head, label=head_type)
        tail_node = EntityNode(name=tail, label=tail_type)
        relation_node = Relation(
            source_id=head_node.id, target_id=tail_node.id, label=relation
        )
        triplets.append((head_node, relation_node, tail_node))
    logger.warning(f"Triplets extracted: {triplets}")
    return triplets
```

Through the debug outputs, if you notice that `triplets` is an empty list while the LLM output wasn't empty, that means there was a parsing issue somewhere and you'll need to pass a potentially better parsing function. You can copy the debug output and give it to ChatGPT or Claude, or any other competent LLM out there, and they'll propose a better parsing function. For either DynamicLLMPathExtractor or SchemaLLMPathExtractor, it's up to you to define which entities/relationships you want to use (PERSON, ORGANIZATION, FRUIT...), whatever you like; there is no obligation to follow a specific format, as long as you pass a list of strings. Hope this helps!
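As a minimal sketch of wiring that debug parser in (assuming `llm` is your LLM instance and a version where the parameter is named `parse_fn`):

```python
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

# Pass the debug parser above via parse_fn; its loguru output then shows
# whether the LLM produced triplets that the parsing step failed to recover.
extractor = DynamicLLMPathExtractor(
    llm=llm,  # assumed to be defined
    parse_fn=default_parse_dynamic_triplets,
)
```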
Hello again, I've fixed the issue and the DynamicLLMPathExtractor should function properly now. It was indeed an issue with the parse function being used, which fails to extract anything if there are inconsistencies in the LLM's JSON output (single quotation marks vs. double quotation marks, etc.). You can find more details about that, as well as the new functional parse_fn, in a notebook right here: Pull Request to fix DynamicLLMPathExtractor. I'm currently waiting for the pull request to be merged into the main repo. In the meantime, I suggest you copy the new parse functions and pass them directly into your DynamicLLMPathExtractor. Hope this helps.
Description
This pull request introduces a new `DynamicLLMPathExtractor` in the `dynamic_llm.py` file, which improves upon the existing `SimpleLLMPathExtractor`. The new extractor offers enhanced functionality for knowledge graph construction by detecting entity types (labels) instead of simply labeling them as "__ entity __" and "chunk", as well as allowing for ontology expansion.

Key improvements:

`SchemaLLMPathExtractor`.

This new extractor provides a middle ground between `SimpleLLMPathExtractor` and `SchemaLLMPathExtractor`, offering more flexibility in knowledge graph construction while still providing guidance through an initial ontology.
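As a rough illustration of that middle ground (a sketch, not code from this PR; `documents` and `llm` are assumed to be defined, and the type names are made up for the example):

```python
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

# Seed an initial ontology; unlike a strict schema extractor, the LLM may
# add new entity/relation types on the fly when the text calls for them.
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    allowed_entity_types=["PERSON", "ORGANIZATION"],
    allowed_relation_types=["FOUNDED", "WORKS_AT"],
)

index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[kg_extractor],
)
```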
New Package?

Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

Version Bump?
Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Notebook of comparison between DynamicLLMPathExtractor, SchemaLLMPathExtractor, and SimpleLLMPathExtractor: Dynamic_KG_Extraction.ipynb
Suggested Checklist:
`make format; make lint` to appease the lint gods