[BUG] Inconsistent NaN Feedback Results from Defined Functions #1717
Comments
Hi @HannaHUp - for the NaNs, are you seeing any errors in the stdout? Some of the feedback results currently showing NaN could still be computing. Can you try refreshing to see if more results are available?
Hi @sfc-gh-jreini.
Thanks - can you share the full traceback? Btw - 5 minutes may not be enough depending on how many records you're evaluating/how many feedbacks. If you want to wait for the app to produce a response until after the feedback is computed, you may want to try out the "with app" feedback mode:
What would be a good waiting time for 4 records and 4 feedbacks? I will try the "with app" feedback mode. Here are example errors (only these truncated fragments were captured): "The above exception was the direct cause of the following exception: Traceback (most recent call last):" (repeated several times).
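As an aside, the "keep checking for results" pattern discussed above can be sketched generically as a polling loop with a timeout. The function names here are hypothetical stand-ins, not TruLens APIs:

```python
import time

def wait_for_results(fetch_results, expected_count, timeout_s=600, poll_s=10):
    """Poll fetch_results() until it returns expected_count non-None
    results or the timeout elapses; return whatever was available."""
    deadline = time.monotonic() + timeout_s
    results = fetch_results()
    while sum(r is not None for r in results) < expected_count:
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
        results = fetch_results()
    return results

# Stubbed usage: results "arrive" on the second poll.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return [0.9, 0.8, None, None] if calls["n"] == 1 else [0.9, 0.8, 0.7, 0.6]

print(wait_for_results(fake_fetch, expected_count=4, poll_s=0))
```

With deferred feedback, the practical takeaway is to bound the wait explicitly rather than sleeping for a fixed 5 minutes.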
@sfc-gh-jreini Is there something wrong, or can this be improved?
This seems particularly slow. What feedback provider are you using for feedback computation? I don't see it specified in the code you shared. If you're using Snowflake for feedback computation, I'd suggest trying SnowflakeFeedback as well, which runs server-side. Below is the link to the notebook. The two key changes to the code to enable this are:

```python
from trulens.connectors.snowflake import SnowflakeConnector  # import added for completeness
from trulens.core.session import TruSession

connection_params = {
    "account": "...",
    "user": "...",
    "password": "...",
    "database": "...",
    "schema": "...",
    "warehouse": "...",
    "role": "...",
    "init_server_side": True,  # set to True to enable server-side feedback functions
}
connector = SnowflakeConnector(**connection_params)

import numpy as np
from trulens.core import Select
from trulens.core.feedback.feedback import SnowflakeFeedback
from trulens.providers.cortex import Cortex

provider = Cortex(
    snowpark_session,  # an existing snowflake.snowpark Session
    model_engine="mistral-large2",
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    SnowflakeFeedback(
        provider.relevance_with_cot_reasons, name="Answer Relevance"
    )
    .on_input()
    .on_output()
)

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    SnowflakeFeedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets)
    .aggregate(np.mean)
)

# Groundedness of the answer against the retrieved context.
f_groundedness = (
    SnowflakeFeedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness",
        use_sent_tokenize=False,
    )
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets.collect())
)
```
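One thing worth noting about `.aggregate(np.mean)` above: NumPy's mean propagates NaN, so a single failed context-chunk score makes the whole aggregated Context Relevance NaN. A minimal, self-contained illustration:

```python
import numpy as np

scores = [0.8, 0.9, float("nan")]  # suppose one chunk's feedback call failed
print(np.mean(scores))     # NaN - one failure poisons the aggregate
print(np.nanmean(scores))  # mean of the successful scores only
```

If per-chunk failures are expected, aggregating with `np.nanmean` (a judgment call, not the TruLens default) would at least surface the successful scores instead of NaN.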
@sfc-gh-jreini
Can you try a different model and see if it's still slow? I haven't experimented much with Gemini; I wonder if that could be the issue.
Can I use Cortex with Gemini?
I got this error: "RuntimeError: Endpoint CortexEndpoint request failed 4 time(s):" (the rest was cut off).
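The "request failed 4 time(s)" wording suggests the client already retries the endpoint before giving up. For reference, a generic retry-with-backoff wrapper (illustrative only, not the actual CortexEndpoint internals) looks like:

```python
import time

def call_with_retries(fn, attempts=4, base_delay=0.5):
    """Retry fn() with exponential backoff, re-raising the last error."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Stub: fails twice, then succeeds on the third attempt.
state = {"n": 0}
def flaky():
    state["n"] += 1
    if state["n"] < 3:
        raise RuntimeError("endpoint unavailable")
    return "ok"

print(call_with_retries(flaky, base_delay=0))
```

If all attempts fail, the final exception propagates, which matches the error seen here: the retries were exhausted because the model name was not valid for Cortex.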
No - you can see the models available in Cortex here: https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability
I will check what other models I can use and update you later.
@sfc-gh-jreini
Bug Description
What happened?
I have defined four feedback functions:
f_similarity
f_qa_relevance
f_context_relevance
f_groundedness_cot
However, when I check the results in Snowflake, only one or two results appear, and which ones appear varies from run to run. Sometimes feedback results are missing entirely; in other cases they are marked as failed with errors.
When I print the leaderboard_df, some feedbacks give me "NaN".
As you can see, the NaN feedback results from the defined functions are inconsistent.
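A quick way to see which feedbacks are missing in a leaderboard-style DataFrame (the column names and values below are illustrative, not the exact TruLens schema):

```python
import numpy as np
import pandas as pd

# Illustrative leaderboard with some feedback results missing (NaN).
leaderboard_df = pd.DataFrame(
    {
        "Answer Relevance": [0.9, np.nan],
        "Context Relevance": [np.nan, np.nan],
        "Groundedness": [0.8, 0.7],
    },
    index=["app_v1", "app_v2"],
)

# Count missing feedback results per feedback function.
print(leaderboard_df.isna().sum())
```

Columns that are entirely NaN usually point at a feedback function that never ran or always failed, while scattered NaNs suggest per-record failures or results still being computed.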
To Reproduce
This behavior happens inconsistently, and I cannot reliably reproduce it on demand.
Expected behavior
All defined feedback functions (f_similarity, f_qa_relevance, f_context_relevance, and f_groundedness_cot) should generate results in Snowflake without any missing or failed feedback.
Relevant Logs/Tracebacks
No Log
Environment:
Additional context
```python
from typing import Tuple

from snowflake.snowpark import Session
from trulens.apps.custom import TruCustomApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.session import TruSession

def initialize_sessions(sf_settings: dict, password: str) -> Tuple[Session, TruSession]:
    sf_config = sf_settings.copy()
    sf_config.update({
        "connection_timeout": 300,         # 5 minutes
        "client_session_keep_alive": True,
        "session_timeout": 600,            # 10 minutes
    })
    # Note: use sf_config here; the original passed sf_settings,
    # so the timeout settings above were never applied.
    snowpark_session = Session.builder.configs(sf_config).create()
    print("Snowpark session created successfully.")

    conn = SnowflakeConnector(
        snowpark_session=snowpark_session,
        password=password,
        database_redact_keys=True,
    )
    session = TruSession(connector=conn)
    print("TruSession initialized.")
    return snowpark_session, session

tru_app = TruCustomApp(
    retrieval_app,  # the instrumented app, defined elsewhere
    app_name=app_name,
    feedbacks=[f_similarity, f_qa_relevance, f_context_relevance, f_groundedness_cot],
)

with tru_app:
    response = retrieval_app.ask_question(
        query=qa.get("question"),  # qa is defined elsewhere
    )
    print(f"Response: {response}")
```
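With 4 records and 4 feedback functions, 16 feedback results are expected. A simple completeness check over a records-style table (the structure and values below are assumed for illustration) can show how many were actually computed and which (record, feedback) pairs are still NaN:

```python
import numpy as np
import pandas as pd

feedback_cols = [
    "f_similarity", "f_qa_relevance", "f_context_relevance", "f_groundedness_cot",
]
# Illustrative per-record feedback scores; NaN marks a missing/failed result.
records_df = pd.DataFrame(
    np.array([
        [0.9, 0.8, np.nan, 0.7],
        [0.6, np.nan, np.nan, 0.5],
        [0.9, 0.9, 0.8, 0.8],
        [np.nan, 0.7, 0.6, 0.9],
    ]),
    columns=feedback_cols,
)

expected = records_df.shape[0] * len(feedback_cols)  # 4 records x 4 feedbacks
computed = int(records_df.notna().sum().sum())
print(f"{computed}/{expected} feedback results computed")

# List the (record, feedback) pairs that are still missing.
missing = records_df.isna().stack()
print(missing[missing].index.tolist())
```

Running a check like this after each batch makes "inconsistent NaN" concrete: it separates results that are merely still pending from ones that failed outright.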