
[BUG] Inconsistent NaN Feedback Results from Defined Functions #1717

Open
HannaHUp opened this issue Jan 9, 2025 · 12 comments
Labels: bug (Something isn't working)
HannaHUp commented Jan 9, 2025

Bug Description
I have defined four feedback functions:

f_similarity
f_qa_relevance
f_context_relevance
f_groundedness_cot
However, when I check the results in Snowflake, only one or two results appear, and which ones appear is inconsistent. Sometimes feedback results are missing, and in other cases they are marked as failed with errors.

When I print the leaderboard_df, some feedbacks give me "NaN".
[screenshot] As you can see, the defined functions return NaN feedback results inconsistently.
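(For context, a minimal sketch of the leaderboard read, assuming the TruSession (`session`) created in the code under Additional context; the empty app_ids filter is illustrative:)

```python
# Illustrative: read aggregate feedback scores per app from the session.
leaderboard_df = session.get_leaderboard(app_ids=[])
print(leaderboard_df)
```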

To Reproduce
This behavior happens inconsistently, and I cannot reliably reproduce it on demand.

Expected behavior
All defined feedback functions (f_similarity, f_qa_relevance, f_context_relevance, and f_groundedness_cot) should generate results in Snowflake without any missing or failed feedback.

Relevant Logs/Tracebacks
No logs.

Environment:

  • OS: not specified
  • Python Version: 3.11 (per the tracebacks in the comments below)
  • TruLens version: not specified
  • Versions of other relevant installed libraries: not specified

Additional context

```python
# Imports added for completeness; trulens 1.x import paths assumed.
from typing import Tuple

from snowflake.snowpark import Session
from trulens.apps.custom import TruCustomApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core import Feedback, TruSession


def initialize_sessions(sf_settings: dict, password: str) -> Tuple[Session, TruSession]:
    sf_config = sf_settings.copy()
    sf_config.update({
        "connection_timeout": 300,  # 5 minutes
        "client_session_keep_alive": True,
        "session_timeout": 600,  # 10 minutes
    })

    # Note: sf_config is built above, but sf_settings is what gets passed here.
    snowpark_session = Session.builder.configs(sf_settings).create()
    print("Snowpark session created successfully.")

    conn = SnowflakeConnector(
        snowpark_session=snowpark_session,
        password=password,
        database_redact_keys=True,
    )

    session = TruSession(connector=conn)
    print("TruSession initialized.")
    return snowpark_session, session


custom_provider = CustomFeedbackProvider(embedding_model=embedding_model)

f_similarity = Feedback(
    custom_provider.correctness_feedback,
    name="Answer Correctness"
).on(selectors['output']).on(selectors['expected_output'])

f_context_relevance = Feedback(
    provider.context_relevance,
    name="Context Relevance"
).on_input().on(selectors['retrieved_context'])

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

f_groundedness_cot = Feedback(
    provider.groundedness_measure_with_cot_reasons,
    name="Groundedness"
).on(selectors['retrieved_context'].collect()).on_output()

tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_similarity, f_qa_relevance, f_context_relevance, f_groundedness_cot],
)

with tru_app:
    response = retrieval_app.ask_question(
        query=qa.get("question"),
    )
    print(f"Response: {response}")
```

HannaHUp added the bug label Jan 9, 2025
HannaHUp changed the title from "[BUG] Inconsistent Feedback Results in Snowflake: Missing Feedback from Defined Functions" to "[BUG] Inconsistent NaN Feedback Results from Defined Functions" Jan 9, 2025
sfc-gh-jreini (Contributor) commented

Hi @HannaHUp - for the NaNs, are you seeing any errors in stdout?

Some of the feedback results currently showing NaN could still be computing. Can you try refreshing to see if more results are available?

HannaHUp (Author) commented Jan 9, 2025

Hi @sfc-gh-jreini.
No, I don't see any errors in stdout.
I did notice that I should wait a bit, so in my code I wait 5 minutes before calling get_records_and_feedback.
I do see errors in Snowflake when they occur.
[screenshot of the Snowflake error]
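(For reference, a minimal sketch of that wait-then-read pattern, assuming the TruSession from the issue code; the fixed 5-minute sleep and empty app_ids filter are illustrative, not a recommendation:)

```python
import time

# Illustrative only: wait for deferred feedback to finish computing,
# then read the records and their feedback columns back.
time.sleep(300)  # 5 minutes; may not be enough for many records/feedbacks
records_df, feedback_cols = session.get_records_and_feedback(app_ids=[])
print(records_df[feedback_cols])
```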

sfc-gh-jreini (Contributor) commented Jan 9, 2025

Thanks - can you share the full traceback?

Btw - 5 minutes may not be enough, depending on how many records and feedbacks you're evaluating. If you want the app call to block until the feedback is computed, you may want to try out the "with_app" feedback mode:

```python
TruCustomApp(
    app,
    ...,
    feedback_mode="with_app",
)
```
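(Applied to the code from the issue, that would look roughly like this sketch, reusing the names from the original post:)

```python
# Sketch: same wrapper as in the issue, but blocking on feedback computation.
tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_similarity, f_qa_relevance, f_context_relevance, f_groundedness_cot],
    feedback_mode="with_app",
)
```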

HannaHUp (Author) commented Jan 9, 2025

What would be a good waiting time for 4 records and 4 feedbacks? I will try the "with_app" feedback mode.
Also, I don't hit the error very often.

Here are example errors:

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 900, in run
    core_endpoint.Endpoint.track_all_costs_tally(
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 1673, in groundedness_measure_with_cot_reasons
    futures = [
              ^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 1674, in <listcomp>
    executor.submit(evaluate_hypothesis, i, hypothesis)
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/utils/threading.py", line 79, in submit
    return super().submit(
           ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 915, in run
    raise RuntimeError(
RuntimeError: Evaluation of Groundedness failed on inputs:
{'source': [['Overview\n'
             'The problem of determining the most.
```

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 900, in run
    core_endpoint.Endpoint.track_all_costs_tally(
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 617, in relevance_with_cot_reasons
    return self.generate_score_and_reasons(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/feedback/llm_provider.py", line 217, in generate_score_and_reasons
    response = self.endpoint.run_in_pace(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/endpoint.py", line 344, in run_in_pace
    raise RuntimeError(
RuntimeError: Endpoint LiteLLMEndpoint request failed 4 time(s):
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown
cannot schedule new futures after interpreter shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/trulens/core/feedback/feedback.py", line 915, in run
    raise RuntimeError(
RuntimeError: Evaluation of Answer Relevance failed on inputs:
{'prompt': 'What is the primary problem addressed by the Project?',
 'response.
```

HannaHUp (Author) commented Jan 9, 2025

@sfc-gh-jreini
Hi. I added feedback_mode="with_app" and tested it with 10 and 15 minutes of waiting. After 10 minutes I got 12 feedbacks, but after 15 minutes I only got 7. It seems to take a long time to process 4 records with 4 metrics.

Is there something wrong, or can this be improved?

sfc-gh-jreini (Contributor) commented

This seems particularly slow. What feedback provider are you using for feedback computation? I don't see it specified in the code you shared.

If you're using Snowflake for feedback computation, I'd suggest trying SnowflakeFeedback as well, which runs server-side. A link to the example notebook is below.

The two key changes to enable this are:

1. Add the "init_server_side" parameter to the connection:

```python
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.session import TruSession

connection_params = {
    "account": "...",
    "user": "...",
    "password": "...",
    "database": "...",
    "schema": "...",
    "warehouse": "...",
    "role": "...",
    "init_server_side": True,  # Set to True to enable server side feedback functions
}

connector = SnowflakeConnector(**connection_params)
```
2. Use SnowflakeFeedback in place of the Feedback class, together with the Cortex feedback provider:

```python
import numpy as np
from trulens.core import Select
from trulens.core.feedback.feedback import SnowflakeFeedback
from trulens.providers.cortex import Cortex

provider = Cortex(
    snowpark_session,
    model_engine="mistral-large2",
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    SnowflakeFeedback(
        provider.relevance_with_cot_reasons, name="Answer Relevance"
    )
    .on_input()
    .on_output()
)

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    SnowflakeFeedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets)
    .aggregate(np.mean)
)

f_groundedness = (
    SnowflakeFeedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness",
        use_sent_tokenize=False,
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets.collect())
)
```
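(These server-side feedbacks are then attached to the app the same way as before; a sketch reusing names from the thread:)

```python
# Sketch: register the SnowflakeFeedback objects with the app wrapper.
tru_app = TruCustomApp(
    retrieval_app,
    app_name=app_name,
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)
```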

https://github.com/truera/trulens/blob/6bece5f1b844d93e99d378a2126ef78b07338189/examples/experimental/snowflake_feedbacks.ipynb

HannaHUp (Author) commented Jan 9, 2025

@sfc-gh-jreini
I'm using LiteLLM from trulens.providers.litellm because we are using a Gemini model.

sfc-gh-jreini (Contributor) commented

Can you try a different model and see if it's still slow? I haven't experimented much with Gemini, and I wonder if that could be the issue.

HannaHUp (Author) commented Jan 9, 2025

Can I use Cortex with Gemini?
Code:

```python
provider = Cortex(
    snowpark_session,
    model_engine="gemini-1.5-flash-002",
)
```

I got this error:

```
RuntimeError: Endpoint CortexEndpoint request failed 4 time(s):
400 Client Error: Bad Request for url: https://pg.us-east-1.snowflakecomputing.com/api/v2/cortex/inference:complete
```

sfc-gh-jreini (Contributor) commented

No - you can see the models available in Cortex here: https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability
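(For example, a sketch that swaps in a model from that availability list; model support varies by Snowflake region:)

```python
# Sketch: Cortex only accepts models on the availability page,
# e.g. mistral-large2 (region-dependent); Gemini is not among them.
provider = Cortex(
    snowpark_session,
    model_engine="mistral-large2",
)
```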

HannaHUp (Author) commented Jan 9, 2025

I will check what other models I can use and update you later.
Thank you!

HannaHUp (Author) commented

@sfc-gh-jreini
The issue is fixed by downgrading to snowflake-sqlalchemy==1.7.1!
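(For anyone hitting the same thing, the pin would look like:)

```
pip install "snowflake-sqlalchemy==1.7.1"
```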
