
[BUG] Inconsistent NaN Feedback Results from Defined Functions #1717

Open
HannaHUp opened this issue Jan 9, 2025 · 12 comments
Labels: bug (Something isn't working)
HannaHUp commented Jan 9, 2025

Bug Description
I have defined four feedback functions:

f_similarity
f_qa_relevance
f_context_relevance
f_groundedness_cot
However, when I check the results in Snowflake, only one or two results appear, and which ones appear is inconsistent. Sometimes feedback results are missing, and in other cases they are marked as failed with errors.

When I print the leaderboard_df, some feedbacks give me "NaN".
[screenshot] As you can see, the defined functions return NaN feedback results inconsistently.
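(For context, a minimal sketch of the leaderboard read, assuming the TruSession (`session`) created in the code under Additional context; the empty app_ids filter is illustrative:)

```python
# Illustrative: read aggregate feedback scores per app from the session.
leaderboard_df = session.get_leaderboard(app_ids=[])
print(leaderboard_df)
```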

To Reproduce
This behavior happens inconsistently, and I cannot reliably reproduce it on demand.

Expected behavior
All defined feedback functions (f_similarity, f_qa_relevance, f_context_relevance, and f_groundedness_cot) should generate results in Snowflake without any missing or failed feedback.

Relevant Logs/Tracebacks
No logs.

Environment:

  • OS: not specified
  • Python Version: 3.11 (per the tracebacks in the comments below)
  • TruLens version: not specified
  • Versions of other relevant installed libraries: not specified

Additional context

```python
# Imports added for completeness; trulens 1.x import paths assumed.
from typing import Tuple

from snowflake.snowpark import Session
from trulens.apps.custom import TruCustomApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core import Feedback, TruSession


def initialize_sessions(sf_settings: dict, password: str) -> Tuple[Session, TruSession]:
    sf_config = sf_settings.copy()
    sf_config.update({
        "connection_timeout": 300,  # 5 minutes
        "client_session_keep_alive": True,
        "session_timeout": 600,  # 10 minutes
    })

    # Note: sf_config is built above, but sf_settings is what gets passed here.
    snowpark_session = Session.builder.configs(sf_settings).create()
    print("Snowpark session created successfully.")

    conn = SnowflakeConnector(
        snowpark_session=snowpark_session,
        password=password,
        database_redact_keys=True,
    )

    session = TruSession(connector=conn)
    print("TruSession initialized.")
    return snowpark_session, session


custom_provider = CustomFeedbackProvider(embedding_model=embedding_model)

f_similarity = Feedback(
    custom_provider.correctness_feedback,
    name="Answer Correctness"
).on(selectors['output']).on(selectors['expected_output'])

f_context_relevance = Feedback(
    provider.context_relevance,
    name="Context Relevance"
).on_input().on(selectors['retrieved_context'])

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

f_groundedness_cot = Feedback(
    provider.groundedness_measure_with_cot_reasons,
    name="Groundedness"
).on(selectors['retrieved_context'].collect()).on_output()

tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_similarity, f_qa_relevance, f_context_relevance, f_groundedness_cot],
)

with tru_app:
    response = retrieval_app.ask_question(
        query=qa.get("question"),
    )
    print(f"Response: {response}")
```

HannaHUp added the bug label Jan 9, 2025
HannaHUp changed the title from "[BUG] Inconsistent Feedback Results in Snowflake: Missing Feedback from Defined Functions" to "[BUG] Inconsistent NaN Feedback Results from Defined Functions" Jan 9, 2025
sfc-gh-jreini (Contributor) commented

Hi @HannaHUp - for the NaNs, are you seeing any errors in stdout?

Some of the feedback results currently showing NaN could still be computing. Can you try refreshing to see if more results are available?

HannaHUp (Author) commented Jan 9, 2025

Hi @sfc-gh-jreini.
No, I don't see any errors in stdout.
I did notice that I should wait a bit, so in my code I wait 5 minutes before calling get_records_and_feedback.
I do see errors in Snowflake when they occur.
[screenshot of the Snowflake error]
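(For reference, a minimal sketch of that wait-then-read pattern, assuming the TruSession from the issue code; the fixed 5-minute sleep and empty app_ids filter are illustrative, not a recommendation:)

```python
import time

# Illustrative only: wait for deferred feedback to finish computing,
# then read the records and their feedback columns back.
time.sleep(300)  # 5 minutes; may not be enough for many records/feedbacks
records_df, feedback_cols = session.get_records_and_feedback(app_ids=[])
print(records_df[feedback_cols])
```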

sfc-gh-jreini (Contributor) commented Jan 9, 2025

Thanks - can you share the full traceback?

Btw - 5 minutes may not be enough, depending on how many records and feedbacks you're evaluating. If you want the app call to block until the feedback is computed, you may want to try out the "with_app" feedback mode:

```python
TruCustomApp(
    app,
    ...,
    feedback_mode="with_app",
)
```
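(Applied to the code from the issue, that would look roughly like this sketch, reusing the names from the original post:)

```python
# Sketch: same wrapper as in the issue, but blocking on feedback computation.
tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_similarity, f_qa_relevance, f_context_relevance, f_groundedness_cot],
    feedback_mode="with_app",
)
```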

HannaHUp (Author) commented Jan 9, 2025

What would be a good waiting time for 4 records and 4 feedbacks? I will try the "with_app" feedback mode.
Also, I don't hit the error very often.

Here are example errors:

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 900, in run
    core_endpoint.Endpoint.track_all_costs_tally(
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 1673, in groundedness_measure_with_cot_reasons
    futures = [
              ^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 1674, in <listcomp>
    executor.submit(evaluate_hypothesis, i, hypothesis)
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/utils/threading.py", line 79, in submit
    return super().submit(
           ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 915, in run
    raise RuntimeError(
RuntimeError: Evaluation of Groundedness failed on inputs:
{'source': [['Overview\n'
             'The problem of determining the most.
```

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 900, in run
    core_endpoint.Endpoint.track_all_costs_tally(
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 617, in relevance_with_cot_reasons
    return self.generate_score_and_reasons(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 217, in generate_score_and_reasons
    response = self.endpoint.run_in_pace(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 344, in run_in_pace
    raise RuntimeError(
RuntimeError: Endpoint LiteLLMEndpoint request failed 4 time(s):
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 915, in run
    raise RuntimeError(
RuntimeError: Evaluation of Answer Relevance failed on inputs:
{'prompt': 'What is the primary problem addressed by the Project?',
 'response.
```

HannaHUp (Author) commented Jan 9, 2025

@sfc-gh-jreini
Hi. I added feedback_mode="with_app" and tested it with 10 and 15 minutes of waiting. After 10 minutes I got 12 feedbacks, but after 15 minutes I only got 7. It seems to take a long time to process 4 records with 4 metrics.

Is there something wrong, or can this be improved?

sfc-gh-jreini (Contributor) commented

This seems particularly slow. What feedback provider are you using for feedback computation? I don't see it specified in the code you shared.

If you're using Snowflake for feedback computation, I'd suggest trying SnowflakeFeedback as well, which runs server-side. A link to the example notebook is below.

The two key changes to enable this are:

1. Add the "init_server_side" parameter to the connection:

```python
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.session import TruSession

connection_params = {
    "account": "...",
    "user": "...",
    "password": "...",
    "database": "...",
    "schema": "...",
    "warehouse": "...",
    "role": "...",
    "init_server_side": True,  # Set to True to enable server side feedback functions
}

connector = SnowflakeConnector(**connection_params)
```
2. Use SnowflakeFeedback in place of the Feedback class, together with the Cortex feedback provider:

```python
import numpy as np
from trulens.core import Select
from trulens.core.feedback.feedback import SnowflakeFeedback
from trulens.providers.cortex import Cortex

provider = Cortex(
    snowpark_session,
    model_engine="mistral-large2",
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    SnowflakeFeedback(
        provider.relevance_with_cot_reasons, name="Answer Relevance"
    )
    .on_input()
    .on_output()
)

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    SnowflakeFeedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets)
    .aggregate(np.mean)
)

f_groundedness = (
    SnowflakeFeedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness",
        use_sent_tokenize=False,
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets.collect())
)
```
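(These server-side feedbacks are then attached to the app the same way as before; a sketch reusing names from the thread:)

```python
# Sketch: register the SnowflakeFeedback objects with the app wrapper.
tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)
```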

https://github.com/truera/trulens/blob/6bece5f1b844d93e99d378a2126ef78b07338189/examples/experimental/snowflake_feedbacks.ipynb

HannaHUp (Author) commented Jan 9, 2025

@sfc-gh-jreini
I'm using LiteLLM from trulens.providers.litellm because we are using a Gemini model.

sfc-gh-jreini (Contributor) commented

Can you try a different model and see if it's still slow? I haven't experimented much with Gemini, and I wonder if that could be the issue.

HannaHUp (Author) commented Jan 9, 2025

Can I use Cortex with Gemini?
Code:

```python
provider = Cortex(
    snowpark_session,
    model_engine="gemini-1.5-flash-002",
)
```

I got this error:

```
RuntimeError: Endpoint CortexEndpoint request failed 4 time(s):
400 Client Error: Bad Request for url: https://pg.us-east-1.snowflakecomputing.com/api/v2/cortex/inference:complete
```

sfc-gh-jreini (Contributor) commented

No - you can see the models available in Cortex here: https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability
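(For example, a sketch that swaps in a model from that availability list; model support varies by Snowflake region:)

```python
# Sketch: Cortex only accepts models on the availability page,
# e.g. mistral-large2 (region-dependent); Gemini is not among them.
provider = Cortex(
    snowpark_session,
    model_engine="mistral-large2",
)
```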

HannaHUp (Author) commented Jan 9, 2025

I will check what other models I can use and update you later.
Thank you!

HannaHUp (Author) commented

@sfc-gh-jreini
The issue is fixed by downgrading to snowflake-sqlalchemy==1.7.1!
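(For anyone hitting the same thing, the pin would look like:)

```
pip install "snowflake-sqlalchemy==1.7.1"
```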
