
DynamicLLMPathExtractor for Entity Detection With a Schema inferred by LLMs on the fly #14566

Merged · 13 commits into run-llama:main · Jul 5, 2024

Conversation

Allaa-boutaleb
Contributor

@Allaa-boutaleb Allaa-boutaleb commented Jul 4, 2024

Description

This pull request introduces a new DynamicLLMPathExtractor in the dynamic_llm.py file, which improves upon the existing SimpleLLMPathExtractor. The new extractor enhances knowledge graph construction by detecting entity types (labels) instead of simply labeling everything as `__entity__` and `chunk`, and by allowing for ontology expansion.

Key improvements:

  1. Detects entity types (labels) instead of labeling them generically as "entity" and "chunk".
  2. Accepts an initial ontology as input, specifying desired nodes and relationships.
  3. Encourages ontology expansion through its prompt design, unlike the strict schema adherence of SchemaLLMPathExtractor.

This new extractor provides a middle ground between SimpleLLMPathExtractor and SchemaLLMPathExtractor, offering more flexibility in knowledge graph construction while still providing guidance through an initial ontology.
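The ontology-seeding idea described above can be sketched in a few lines. This is an illustrative, hypothetical prompt builder (not the actual template shipped in dynamic_llm.py): an initial ontology is offered as a baseline, and the instructions explicitly invite the LLM to expand it when the text calls for types outside that baseline.

```python
# Hypothetical sketch of the "guided but expandable" prompting strategy.
# The template text and helper below are illustrative, not the package's own.
EXTRACT_TMPL = (
    "Extract up to {max_triplets_per_chunk} knowledge triplets from the text.\n"
    "Use these entity types as a starting point: {allowed_entity_types}.\n"
    "Use these relation types as a starting point: {allowed_relation_types}.\n"
    "You MAY introduce new types when none of the above fit the text.\n"
    "Return a JSON list of objects with keys: head, head_type, relation, tail, tail_type.\n"
    "Text: {text}\n"
)


def build_prompt(text, entity_types=None, relation_types=None, max_triplets=20):
    """Format the extraction prompt; empty ontology lists leave the schema fully open."""
    return EXTRACT_TMPL.format(
        max_triplets_per_chunk=max_triplets,
        allowed_entity_types=entity_types or "none provided (infer your own)",
        allowed_relation_types=relation_types or "none provided (infer your own)",
        text=text,
    )


prompt = build_prompt(
    "OpenAI was founded in December 2015.", ["ORGANIZATION"], ["FOUNDED_IN"]
)
```

Passing empty lists (or None) leaves the schema fully open, which matches the out-of-the-box behavior described above.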

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Notebook comparing DynamicLLMPathExtractor, SchemaLLMPathExtractor, and SimpleLLMPathExtractor: Dynamic_KG_Extraction.ipynb

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@logan-markewich
Collaborator

logan-markewich commented Jul 4, 2024

Cool PR!

I'm kind of confused how this differs exactly from the schema graph extractor? I guess the main difference is here we are using prompt output parsing instead of function calling, which I guess will be less reliable.

The schema extractor has a strict=True/False option, to also allow ontology expansion, fyi

@Allaa-boutaleb
Contributor Author

Thank you!

Through extensive experimentation with SchemaLLMPathExtractor, I've observed several limitations:

  • Without user-defined entities and relationships, it defaults to a predefined list.
  • Providing empty lists ([] instead of None to avoid falling back to default ontology) results in no meaningful extraction, only chunks.
  • Even with strict=False and a provided list, the LLM tends to stick closely to the given entities and relationships, rarely expanding beyond them. Powerful LLMs are good at following instructions, and the instruction in SchemaLLMPathExtractor is to follow the given ontology (or the default one when no ontology is passed in).

AdvancedLLMPathExtractor addresses these limitations by:

  • Encouraging the LLM to infer its own schema from the text.
  • Using any provided entities and relations as a baseline, while still promoting the discovery of additional entities and relationships.

While we could have modified SchemaLLMPathExtractor, I chose to create a separate extractor to preserve SchemaLLMPathExtractor's utility in scenarios where sticking to a predefined schema is crucial. Similarly, while customizing SimpleLLMPathExtractor (by passing a custom prompt and an output parsing function) was an option, I believed a standalone extractor would be more user-friendly for those seeking to quickly generate rich, detailed graphs from raw text without a predefined ontology, which is something SimpleLLMPathExtractor fails to do out of the box.

In essence, AdvancedLLMPathExtractor is designed to leverage the LLM's ability to infer and expand schemas, even encouraging it to "hallucinate" fitting schemas when appropriate. This approach can lead to larger, more diverse, and more comprehensive knowledge graphs, which is especially useful when the user has no expert knowledge of the raw text they want to build a KG from.

Below are examples of testing the three different extractors on the first paragraph and History section of the "OpenAI" Wikipedia page: https://en.wikipedia.org/wiki/OpenAI

Input:

OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco. Its mission is to develop .......... In February 2019, GPT-2 was announced, which gained attention for its ability to generate human-like text.

Parameters used for all three path extractors:

  • GPT-3.5 Turbo, temperature = 0.
  • max_triplets_per_chunk = 20.

Path extractors used:

  • SimpleLLMPathExtractor with its default parameters.
  • SchemaLLMPathExtractor with its default parameters (default list of entities and relations, strict=False).
  • AdvancedLLMPathExtractor with its default parameters (empty list of possible entities and relations).

Chunk nodes have been hidden for easier readability.

SimpleLLMPathExtractor:

(images: simple_llm_KG, simple_llm_entities_relations)

SchemaLLMPathExtractor:

(images: schema_llm_graph, schema_llm_entities_relations)

AdvancedLLMPathExtractor:

(images: advanced_llm_graph, advanced_llm_entities_relations)

I plan to add some Pydantic validation later, just to make sure the returned JSON output is what we expect. So far, I haven't run into cases where the JSON output was incompatible with the parsing function; my testing was limited to the GPT models, Mixtral, and Llama 3, all with low temperatures.
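The planned validation could look roughly like the sketch below. It uses only stdlib dataclasses as a dependency-free stand-in for the Pydantic approach mentioned above; the field names mirror the head/head_type/relation/tail/tail_type keys used by the extractor's prompt, and validate_triplets is a hypothetical helper, not part of this PR.

```python
from dataclasses import dataclass, fields


@dataclass
class Triplet:
    """One extracted triplet; mirrors the JSON keys the prompt asks for."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str

    def __post_init__(self):
        # Reject empty or non-string fields instead of silently building bad nodes.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"invalid {f.name!r}: {value!r}")


def validate_triplets(raw: list) -> list:
    """Keep only well-formed dicts; drop malformed entries instead of failing the chunk."""
    valid = []
    for item in raw:
        try:
            valid.append(Triplet(**item))
        except (TypeError, ValueError):
            continue
    return valid
```

Dropping malformed entries rather than raising keeps a single bad triplet from discarding the rest of the chunk's extractions.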

@logan-markewich
Collaborator

@Allaa-boutaleb alright, thanks for the detailed breakdown! I think I'm sold.

But, I think the name doesn't quite feel right. What if a more advanced extractor comes along? 😅 Would you consider changing it?

Maybe something like

  • DynamicLLMPathExtractor
  • ExpansiveLLMPathExtractor
  • AdaptiveLLMPathExtractor
  • Something else?

Lastly, can we add an example notebook showing how to use this? Would be super helpful for showing off the feature

(In terms of future work, one thing that people have been asking for as well is some kind of dedicated property extraction per kg node, rather than inheriting all metadata from the source chunk as properties. Could be an expansion of this, could be a separate module. Maybe a problem to tackle next if you are interested 💪🏻)


@Allaa-boutaleb
Contributor Author


Hello again, I agree. "Advanced" doesn't really fit as a name. I've changed it to DynamicLLMPathExtractor and split the prompt into a TMPL string and a PromptTemplate to be more consistent with how other prompts are defined. I've also added a notebook comparing the three different path extractors, directly highlighting the issues I've stated above.

I also agree about the need to extract certain information as node metadata instead of treating everything as entity/relation pairs. I've done some initial experiments before, but it amounted to little more than clever prompt engineering, which isn't always reliable. I'll definitely look into it, though 😁

@Allaa-boutaleb Allaa-boutaleb changed the title AdvancedLLMPathExtractor for Automatic Entity Detection and Labeling in Property Graphs DynamicLLMPathExtractor for Entity Detection With a Schema inferred by LLMs on the fly Jul 4, 2024
@Allaa-boutaleb
Contributor Author

Allaa-boutaleb commented Jul 4, 2024

@logan-markewich Apologies for the frequent commits; this is the final one. I had to delete some temporary files that were unexpectedly created alongside the notebook when visualizing the KGs with pyvis. I've also renamed and improved the output parser to make it less rigid and more adaptive to the LLM's outputs, and I've cleaned up the notebook and made it more compact.

You can test the notebook for yourself here: Dynamic_KG_Extraction.ipynb

Please let me know if anything else needs to be done for this to become eligible for a merge. Thank you!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 5, 2024
@logan-markewich logan-markewich merged commit 8e9fab5 into run-llama:main Jul 5, 2024
8 checks passed
@ferdeleong

issue: #14975

Hi!

So I think I'm parsing it correctly:

template = PromptTemplate("my prompt using as reference default_prompt.py")
formatted_template = template.format(max_triplets_per_chunk=max_triplets_per_chunk, text=text)

The issue is that it works correctly with SimpleLLMPathExtractor but when I use a prompt (even a simple one) with DynamicLLMPathExtractor or SchemaLLMPathExtractor it stops working entirely.

Also, I have a question regarding SchemaLLMPathExtractor and how entity properties are defined. Do you know where the appropriate place to ask for help is?

Thanks for taking the time to answer!!

@Allaa-boutaleb
Contributor Author


Hey! When you visualize the knowledge graph and only see the chunk node with no entity nodes, it usually means there was an output parsing problem. Unfortunately, the way I've coded it here is very simple and has a lot of room for improvement: we rely on output parsing rather than function calling to handle the LLM's response. If the LLM doesn't stick to the format given in the prompt and instead generates output in an unexpected format, parsing can fail entirely; even if the LLM did extract triplets, the parsing function may fail to retrieve them over something as small as a minor quoting difference.

Try passing this function as an output parser through the parse_fn parameter when initializing the DynamicLLMPathExtractor (it's the default parser currently used by DynamicLLMPathExtractor, with additional debugging via loguru):

import re
from typing import List, Tuple

from llama_index.core.graph_stores.types import EntityNode, Relation
from loguru import logger  # pip install loguru (used here for debugging only)


def default_parse_dynamic_triplets(
    llm_output: str,
) -> List[Tuple[EntityNode, Relation, EntityNode]]:
    """
    Parse the LLM output and convert it into a list of entity-relation-entity triplets.

    Note: the regex below assumes single-quoted, Python-dict-style output;
    double-quoted JSON will not match.

    Args:
        llm_output (str): The output from the LLM, which may be JSON-like or plain text.

    Returns:
        List[Tuple[EntityNode, Relation, EntityNode]]: A list of triplets.
    """
    logger.debug(f"LLM Output: {llm_output}")

    triplets = []

    # Regular expression matching each dictionary in the list (single quotes only)
    pattern = r"{'head': '(.*?)', 'head_type': '(.*?)', 'relation': '(.*?)', 'tail': '(.*?)', 'tail_type': '(.*?)'}"

    # Find all matches in the output
    matches = re.findall(pattern, llm_output)

    for match in matches:
        head, head_type, relation, tail, tail_type = match
        head_node = EntityNode(name=head, label=head_type)
        tail_node = EntityNode(name=tail, label=tail_type)
        relation_node = Relation(
            source_id=head_node.id, target_id=tail_node.id, label=relation
        )
        triplets.append((head_node, relation_node, tail_node))

    logger.warning(f"Triplets extracted: {triplets}")  

    return triplets

Through the debug outputs, if you notice that triplets is an empty list while the LLM output wasn't empty, there was a parsing issue somewhere and you'll need to pass a better parsing function. You can copy the debug output and give it to ChatGPT, Claude, or any other competent LLM, and it will propose a better parse_fn for you!
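To make this failure mode concrete, here is a small standalone demo (with made-up sample strings) showing how a single-quote pattern of the kind the default parser relies on misses the very same triplet once the LLM emits double-quoted JSON:

```python
import re

# Single-quote pattern, as used by the default dynamic-triplet parser.
pattern = r"{'head': '(.*?)', 'head_type': '(.*?)', 'relation': '(.*?)', 'tail': '(.*?)', 'tail_type': '(.*?)'}"

# The same (made-up) triplet, in two equally plausible LLM output styles.
single = "{'head': 'OpenAI', 'head_type': 'ORGANIZATION', 'relation': 'FOUNDED_IN', 'tail': '2015', 'tail_type': 'DATE'}"
double = single.replace("'", '"')  # identical content, but valid double-quoted JSON

print(len(re.findall(pattern, single)))  # 1: the single-quoted form matches
print(len(re.findall(pattern, double)))  # 0: valid JSON, yet the pattern misses it
```

So the extraction itself can succeed while the parser silently returns nothing, which is exactly the "only chunk nodes" symptom described above.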

For either DynamicLLMPathExtractor or SchemaLLMPathExtractor, it's up to you to define the entities/relationships you want to use (PERSON, ORGANIZATION, FRUIT, whatever you like). There is no obligation to follow a specific format, as long as you pass a list of strings.

Hope this helps!

@Allaa-boutaleb
Contributor Author


Hello again,

I've fixed the issue, and DynamicLLMPathExtractor should now function properly. It was indeed an issue with the parse function, which fails to extract anything when there are inconsistencies in the LLM's JSON output (single vs. double quotation marks, etc.).
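One tolerant approach, sketched here with only the standard library (the actual fix lives in the linked pull request and may differ), is to try strict JSON first and fall back to ast.literal_eval, which accepts Python-style single quotes:

```python
import ast
import json

# Made-up sample: a single-quoted "JSON" list, as some LLMs tend to emit.
llm_output = "[{'head': 'OpenAI', 'head_type': 'ORGANIZATION', 'relation': 'FOUNDED_IN', 'tail': '2015', 'tail_type': 'DATE'}]"

try:
    triplets = json.loads(llm_output)  # strict JSON: single quotes raise JSONDecodeError
except json.JSONDecodeError:
    triplets = ast.literal_eval(llm_output)  # tolerates Python-style single quotes

print(triplets[0]["head"])  # OpenAI
```

Either branch yields the same list of dicts, so downstream EntityNode/Relation construction doesn't need to care which quoting style the LLM chose.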

You can find more details, as well as the new working parse_fn, in a notebook here: Pull Request to fix DynamicLLMPathExtractor

Currently waiting for the pull request to be merged into the main repo. In the meantime, I suggest you copy the new parse functions and pass them directly into your DynamicLLMPathExtractor.

Hope this helps.
