
DynamicLLMPathExtractor for Entity Detection With a Schema inferred by LLMs on the fly #14566

Merged · 13 commits into run-llama:main · Jul 5, 2024

Conversation

Allaa-boutaleb
Contributor

@Allaa-boutaleb Allaa-boutaleb commented Jul 4, 2024

Description

This pull request introduces a new DynamicLLMPathExtractor in the dynamic_llm.py file, which improves upon the existing SimpleLLMPathExtractor. The new extractor enhances knowledge graph construction by detecting entity types (labels) instead of simply labeling everything as `__entity__` and `chunk`, and by allowing for ontology expansion.

Key improvements:

  1. Detects entity types (labels) instead of labeling them generically as "entity" and "chunk".
  2. Accepts an initial ontology as input, specifying desired nodes and relationships.
  3. Encourages ontology expansion through its prompt design, unlike the strict schema adherence of SchemaLLMPathExtractor.

This new extractor provides a middle ground between SimpleLLMPathExtractor and SchemaLLMPathExtractor, offering more flexibility in knowledge graph construction while still providing guidance through an initial ontology.
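The ontology-seeding idea described above can be sketched in a few lines. This is an illustrative, hypothetical prompt builder (not the actual template shipped in dynamic_llm.py): an initial ontology is offered as a baseline, and the instructions explicitly invite the LLM to expand it when the text calls for types outside that baseline.

```python
# Hypothetical sketch of the "guided but expandable" prompting strategy.
# The template text and helper below are illustrative, not the package's own.
EXTRACT_TMPL = (
    "Extract up to {max_triplets_per_chunk} knowledge triplets from the text.\n"
    "Use these entity types as a starting point: {allowed_entity_types}.\n"
    "Use these relation types as a starting point: {allowed_relation_types}.\n"
    "You MAY introduce new types when none of the above fit the text.\n"
    "Return a JSON list of objects with keys: head, head_type, relation, tail, tail_type.\n"
    "Text: {text}\n"
)


def build_prompt(text, entity_types=None, relation_types=None, max_triplets=20):
    """Format the extraction prompt; empty ontology lists leave the schema fully open."""
    return EXTRACT_TMPL.format(
        max_triplets_per_chunk=max_triplets,
        allowed_entity_types=entity_types or "none provided (infer your own)",
        allowed_relation_types=relation_types or "none provided (infer your own)",
        text=text,
    )


prompt = build_prompt(
    "OpenAI was founded in December 2015.", ["ORGANIZATION"], ["FOUNDED_IN"]
)
```

Passing empty lists (or None) leaves the schema fully open, which matches the out-of-the-box behavior described above.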

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Notebook comparing DynamicLLMPathExtractor, SchemaLLMPathExtractor, and SimpleLLMPathExtractor: Dynamic_KG_Extraction.ipynb

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@logan-markewich
Collaborator

logan-markewich commented Jul 4, 2024

Cool PR!

I'm kind of confused how this differs exactly from the schema graph extractor? I guess the main difference is here we are using prompt output parsing instead of function calling, which I guess will be less reliable.

The schema extractor has a strict=True/False option, to also allow ontology expansion, fyi

@Allaa-boutaleb
Contributor Author

Thank you!

Through extensive experimentation with SchemaLLMPathExtractor, I've observed several limitations:

  • Without user-defined entities and relationships, it defaults to a predefined list.
  • Providing empty lists ([] instead of None to avoid falling back to default ontology) results in no meaningful extraction, only chunks.
  • Even with strict=False and a provided list, the LLM tends to stick closely to the given entities and relationships, rarely expanding beyond them. Powerful LLMs are good at following instructions, and the instruction in SchemaLLMPathExtractor is to follow the given ontology (or the default one when no ontology is passed in).

AdvancedLLMPathExtractor addresses these limitations by:

  • Encouraging the LLM to infer its own schema from the text.
  • Using any provided entities and relations as a baseline, while still promoting the discovery of additional entities and relationships.

While we could have modified SchemaLLMPathExtractor, I chose to create a separate extractor to preserve SchemaLLMPathExtractor's utility in scenarios where sticking to a predefined schema is crucial. Similarly, while customizing SimpleLLMPathExtractor (by passing a custom prompt and an output parsing function) was an option, I believed a standalone extractor would be more user-friendly for those seeking to quickly generate rich, detailed graphs from raw text without a predefined ontology, which is something SimpleLLMPathExtractor fails to do out of the box.

In essence, AdvancedLLMPathExtractor is designed to leverage the LLM's ability to infer and expand schemas, even encouraging it to "hallucinate" fitting schemas when appropriate. This approach can lead to larger, more diverse, and more comprehensive knowledge graphs, which is especially useful when the user has no expert knowledge of the raw text they want to build a KG from.

Below are examples of testing the three different extractors on the first paragraph and History section of the "OpenAI" Wikipedia page: https://en.wikipedia.org/wiki/OpenAI

Input:

OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco. Its mission is to develop .......... In February 2019, GPT-2 was announced, which gained attention for its ability to generate human-like text.

Parameters used for all three path extractors:

  • GPT-3.5 Turbo, temperature = 0.
  • max_triplets_per_chunk = 20.

Path extractors used:

  • SimpleLLMPathExtractor with its default parameters.
  • SchemaLLMPathExtractor with its default parameters (default list of entities and relations, strict=False).
  • AdvancedLLMPathExtractor with its default parameters (empty list of possible entities and relations).

Chunk nodes have been hidden for easier readability.

SimpleLLMPathExtractor:

(images: simple_llm_KG, simple_llm_entities_relations)

SchemaLLMPathExtractor:

(images: schema_llm_graph, schema_llm_entities_relations)

AdvancedLLMPathExtractor:

(images: advanced_llm_graph, advanced_llm_entities_relations)

I plan to add some Pydantic validation later, just to make sure the returned JSON output is what we expect. So far, I haven't run into cases where the JSON output was incompatible with the parsing function; my testing was limited to the GPT models, Mixtral, and Llama 3, all with low temperatures.
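The planned validation could look roughly like the sketch below. It uses only stdlib dataclasses as a dependency-free stand-in for the Pydantic approach mentioned above; the field names mirror the head/head_type/relation/tail/tail_type keys used by the extractor's prompt, and validate_triplets is a hypothetical helper, not part of this PR.

```python
from dataclasses import dataclass, fields


@dataclass
class Triplet:
    """One extracted triplet; mirrors the JSON keys the prompt asks for."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str

    def __post_init__(self):
        # Reject empty or non-string fields instead of silently building bad nodes.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"invalid {f.name!r}: {value!r}")


def validate_triplets(raw: list) -> list:
    """Keep only well-formed dicts; drop malformed entries instead of failing the chunk."""
    valid = []
    for item in raw:
        try:
            valid.append(Triplet(**item))
        except (TypeError, ValueError):
            continue
    return valid
```

Dropping malformed entries rather than raising keeps a single bad triplet from discarding the rest of the chunk's extractions.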

@logan-markewich
Collaborator

@Allaa-boutaleb alright, thanks for the detailed breakdown! I think I'm sold.

But, I think the name doesn't quite feel right. What if a more advanced extractor comes along? 😅 Would you consider changing it?

Maybe something like

  • DynamicLLMPathExtractor
  • ExpansiveLLMPathExtractor
  • AdaptiveLLMPathExtractor
  • Something else?

Lastly, can we add an example notebook showing how to use this? Would be super helpful for showing off the feature

(In terms of future work, one thing that people have been asking for as well is some kind of dedicated property extraction per kg node, rather than inheriting all metadata from the source chunk as properties. Could be an expansion of this, could be a separate module. Maybe a problem to tackle next if you are interested 💪🏻)


@Allaa-boutaleb
Contributor Author


Hello again, I agree. "Advanced" doesn't really fit as a name. I've changed it to DynamicLLMPathExtractor and split the prompt into a TMPL string and a PromptTemplate to be more consistent with how other prompts are defined. I've also added a notebook comparing the three different path extractors, directly highlighting the issues I've stated above.

I also agree about the need to extract certain information as node metadata instead of treating everything as entity/relation pairs. I've done some initial experiments before, but it amounted to little more than clever prompt engineering, which isn't always reliable. I'll definitely look into it, though 😁

@Allaa-boutaleb Allaa-boutaleb changed the title AdvancedLLMPathExtractor for Automatic Entity Detection and Labeling in Property Graphs DynamicLLMPathExtractor for Entity Detection With a Schema inferred by LLMs on the fly Jul 4, 2024
@Allaa-boutaleb
Contributor Author

Allaa-boutaleb commented Jul 4, 2024

@logan-markewich Apologies for the frequent commits; this is the final one. I had to delete some temporary files that were unexpectedly created alongside the notebook when visualizing the KGs with pyvis. I've also renamed and improved the output parser to make it less rigid and more adaptive to the LLM's outputs, and I've cleaned up the notebook and made it more compact.

You can test the notebook for yourself here: Dynamic_KG_Extraction.ipynb

Please let me know if anything else needs to be done for this to become eligible for a merge. Thank you!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 5, 2024
@logan-markewich logan-markewich merged commit 8e9fab5 into run-llama:main Jul 5, 2024
8 checks passed
@ferdeleong

issue: #14975

Hi!

So I think I'm parsing it correctly:

template = PromptTemplate("my prompt using as reference default_prompt.py")
formatted_template = template.format(max_triplets_per_chunk=max_triplets_per_chunk, text=text)

The issue is that it works correctly with SimpleLLMPathExtractor but when I use a prompt (even a simple one) with DynamicLLMPathExtractor or SchemaLLMPathExtractor it stops working entirely.

Also, I have a question regarding SchemaLLMPathExtractor and how entity properties are defined. Do you know where the appropriate place to ask for help is?

Thanks for taking the time to answer!!

@Allaa-boutaleb
Contributor Author


Hey! When you visualize the knowledge graph and only see the chunk node with no entity nodes, it usually means there was an output parsing problem. Unfortunately, the way I've coded it here is very simple and has a lot of room for improvement: we rely on output parsing rather than function calling to handle the LLM's response. If the LLM doesn't stick to the format given in the prompt and instead generates output in an unexpected format, parsing can fail entirely; even if the LLM did extract triplets, the parsing function may fail to retrieve them over something as small as a minor quoting difference.

Try passing this function as an output parser through the parse_fn parameter when initializing the DynamicLLMPathExtractor (it's the default parser currently used by DynamicLLMPathExtractor, with additional debugging via loguru):

import re
from typing import List, Tuple

from llama_index.core.graph_stores.types import EntityNode, Relation
from loguru import logger  # pip install loguru (used here for debugging only)


def default_parse_dynamic_triplets(
    llm_output: str,
) -> List[Tuple[EntityNode, Relation, EntityNode]]:
    """
    Parse the LLM output and convert it into a list of entity-relation-entity triplets.

    Note: the regex below assumes single-quoted, Python-dict-style output;
    double-quoted JSON will not match.

    Args:
        llm_output (str): The output from the LLM, which may be JSON-like or plain text.

    Returns:
        List[Tuple[EntityNode, Relation, EntityNode]]: A list of triplets.
    """
    logger.debug(f"LLM Output: {llm_output}")

    triplets = []

    # Regular expression matching each dictionary in the list (single quotes only)
    pattern = r"{'head': '(.*?)', 'head_type': '(.*?)', 'relation': '(.*?)', 'tail': '(.*?)', 'tail_type': '(.*?)'}"

    # Find all matches in the output
    matches = re.findall(pattern, llm_output)

    for match in matches:
        head, head_type, relation, tail, tail_type = match
        head_node = EntityNode(name=head, label=head_type)
        tail_node = EntityNode(name=tail, label=tail_type)
        relation_node = Relation(
            source_id=head_node.id, target_id=tail_node.id, label=relation
        )
        triplets.append((head_node, relation_node, tail_node))

    logger.warning(f"Triplets extracted: {triplets}")  

    return triplets

Through the debug outputs, if you notice that triplets is an empty list while the LLM output wasn't empty, there was a parsing issue somewhere and you'll need to pass a better parsing function. You can copy the debug output and give it to ChatGPT, Claude, or any other competent LLM, and it will propose a better parse_fn for you!
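To make this failure mode concrete, here is a small standalone demo (with made-up sample strings) showing how a single-quote pattern of the kind the default parser relies on misses the very same triplet once the LLM emits double-quoted JSON:

```python
import re

# Single-quote pattern, as used by the default dynamic-triplet parser.
pattern = r"{'head': '(.*?)', 'head_type': '(.*?)', 'relation': '(.*?)', 'tail': '(.*?)', 'tail_type': '(.*?)'}"

# The same (made-up) triplet, in two equally plausible LLM output styles.
single = "{'head': 'OpenAI', 'head_type': 'ORGANIZATION', 'relation': 'FOUNDED_IN', 'tail': '2015', 'tail_type': 'DATE'}"
double = single.replace("'", '"')  # identical content, but valid double-quoted JSON

print(len(re.findall(pattern, single)))  # 1: the single-quoted form matches
print(len(re.findall(pattern, double)))  # 0: valid JSON, yet the pattern misses it
```

So the extraction itself can succeed while the parser silently returns nothing, which is exactly the "only chunk nodes" symptom described above.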

For either DynamicLLMPathExtractor or SchemaLLMPathExtractor, it's up to you to define the entities/relationships you want to use (PERSON, ORGANIZATION, FRUIT, whatever you like). There is no obligation to follow a specific format, as long as you pass a list of strings.

Hope this helps!

@Allaa-boutaleb
Contributor Author


Hello again,

I've fixed the issue, and DynamicLLMPathExtractor should now function properly. It was indeed an issue with the parse function, which fails to extract anything when there are inconsistencies in the LLM's JSON output (single vs. double quotation marks, etc.).
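One tolerant approach, sketched here with only the standard library (the actual fix lives in the linked pull request and may differ), is to try strict JSON first and fall back to ast.literal_eval, which accepts Python-style single quotes:

```python
import ast
import json

# Made-up sample: a single-quoted "JSON" list, as some LLMs tend to emit.
llm_output = "[{'head': 'OpenAI', 'head_type': 'ORGANIZATION', 'relation': 'FOUNDED_IN', 'tail': '2015', 'tail_type': 'DATE'}]"

try:
    triplets = json.loads(llm_output)  # strict JSON: single quotes raise JSONDecodeError
except json.JSONDecodeError:
    triplets = ast.literal_eval(llm_output)  # tolerates Python-style single quotes

print(triplets[0]["head"])  # OpenAI
```

Either branch yields the same list of dicts, so downstream EntityNode/Relation construction doesn't need to care which quoting style the LLM chose.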

You can find more details, as well as the new working parse_fn, in a notebook here: Pull Request to fix DynamicLLMPathExtractor

Currently waiting for the pull request to be merged into the main repo. In the meantime, I suggest you copy the new parse functions and pass them directly into your DynamicLLMPathExtractor.

Hope this helps.
