added validation using tonic validate #141

Open · wants to merge 2 commits into main
Changes from 1 commit
4 changes: 2 additions & 2 deletions api/Pipfile
@@ -6,7 +6,7 @@ name = "pypi"
 [packages]
 stampy-chat = {editable = true, path = "."}
 # ---- <flask stuff> ----
-flask = "==1.1.2"
+flask = ">=2.2.2"
 gunicorn = "==20.0.4"
 jinja2 = "==2.11.3"
 markupsafe = "==1.1.1"
@@ -25,7 +25,7 @@ requests = "*"
 alembic = "*"
 sqlalchemy = "*"
 mysql-connector-python = "*"
-langchain = "*"
+langchain = "==0.1.20"
 transformers = "*"
 langchain-anthropic = "*"
35 changes: 34 additions & 1 deletion api/src/stampy_chat/chat.py
@@ -1,5 +1,8 @@
 from typing import Any, Callable, Dict, List
 
+from tonic_validate import BenchmarkItem, ValidateScorer, LLMResponse
+from tonic_validate.metrics import AnswerConsistencyMetric, RetrievalPrecisionMetric, AugmentationAccuracyMetric
+
 from langchain.chains import LLMChain, OpenAIModerationChain
 from langchain_community.chat_models import ChatOpenAI
 from langchain_anthropic import ChatAnthropic
@@ -17,7 +20,7 @@
 from stampy_chat.settings import Settings, MODELS, OPENAI, ANTRHROPIC
 from stampy_chat.callbacks import StampyCallbackHandler, BroadcastCallbackHandler, LoggerCallbackHandler
 from stampy_chat.followups import StampyChain
-from stampy_chat.citations import make_example_selector
+from stampy_chat.citations import make_example_selector, get_top_k_blocks
 
 from langsmith import Client
@@ -322,6 +325,7 @@ def run_query(session_id: str, query: str, history: List[Dict], settings: Settin
     :param Callable[[Any], None] callback: an optional callback that will be called at various key parts of the chain
     :returns: the result of the chain
     """
+
     callbacks = [LoggerCallbackHandler(session_id=session_id, query=query, history=history)]
     if callback:
         callbacks += [BroadcastCallbackHandler(callback)]
@@ -340,9 +344,38 @@ def run_query(session_id: str, query: str, history: List[Dict], settings: Settin
         prompt=make_prompt(settings, chat_model, callbacks),
         memory=make_memory(settings, history, callbacks)
     )
+
     if followups:
         chain = chain | StampyChain(callbacks=callbacks)
+
     result = chain.invoke({"query": query, 'history': history}, {'callbacks': []})
+
+    #Validate results
+    contexts=[c['text'] for c in get_top_k_blocks(query, 5)]
Collaborator: you're querying the vectorstore twice for each request, which will make things a lot slower - how hard would it be to reuse the previously fetched examples?

Author: Yes, you are right. I had a lot of challenges in this part and it took the majority of my time on this ticket.

Unfortunately, I could not figure out how to get a hook into the LangChain Semantic Similarity Selector. Without a hook, there is no way to get the previously fetched examples. I also looked at the source code of RAGAs (another RAG validation framework) and saw that they too could not figure out how to hook into LangChain, and they stopped supporting it last year.

I tested the latency and saw that querying the vectorstore again does not add much latency, and unlike querying an LLM, there is no additional cost. This is why I decided to query the vector store again. I think we can put this in the backlog and monitor whether LangChain updates their API.

Collaborator: I'm guessing this shouldn't be executed for each query. How about adding a flag to the settings object?

Author: I think this is required for each query because tonic_validate checks whether the LLM's answer is consistent with the context that was retrieved.

Collaborator: Yes, but the question is whether it should always do that. It makes the query slower, so for now I'd just use it for testing rather than running it every time.

Author: I get what you mean. I will add a flag to the settings object to prevent it from running all the time.
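A minimal sketch of what that gate could look like; run_validation is a hypothetical attribute on the Settings object (not an existing field), and the scoring calls simply reuse the tonic_validate objects already shown in this diff:

    # Hypothetical settings flag: skip the extra vector store query and the
    # LLM-based scoring unless validation has been explicitly enabled.
    if getattr(settings, 'run_validation', False):
        contexts = [c['text'] for c in get_top_k_blocks(query, 5)]
        llm_response = LLMResponse(
            llm_answer=result['text'],
            llm_context_list=contexts,
            benchmark_item=BenchmarkItem(question=query),
        )
        scorer = ValidateScorer([
            AnswerConsistencyMetric(),
            AugmentationAccuracyMetric(),
            RetrievalPrecisionMetric(),
        ])
        run = scorer.score_responses([llm_response])
        print(run.overall_scores)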

+    rag_response={
+        "llm_answer": result['text'],
+        "llm_context_list": contexts
+    }
+
+    benchmark = BenchmarkItem(question=query)
+
+    llm_response = LLMResponse(
+        llm_answer=result['text'],
+        llm_context_list=contexts,
+        benchmark_item=benchmark
+    )
+
+    # Score the responses
+    scorer = ValidateScorer([
+        AnswerConsistencyMetric(),
+        AugmentationAccuracyMetric(),
+        RetrievalPrecisionMetric()
+    ])
+    run = scorer.score_responses([llm_response])
Collaborator: Is this actually returned to the user? If not, then I'd suggest doing it after notifying the frontend that everything is done, i.e. moving the if callback (...) clause before these lines.

Author: Makes sense. I will make this change. The reason I kept it this way is that I was worried a "fast" user might ask the next question before the evaluation finished running, which might overwhelm the system and spawn a lot of processes.

I don't think this will be useful for our users. We may want to track it internally first and then decide the best way to display it to them. For instance, I don't think users will know what "context precision" is or care about it.
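For illustration, a rough sketch of the reordering being suggested, reusing the scorer and llm_response objects defined above (not the final change):

    # notify the frontend first so the user is not blocked on validation
    if callback:
        callback({'state': 'done'})
        callback(None)  # make sure the callback handler knows that things have ended

    # run the slower scoring afterwards and keep the scores internal for now
    run = scorer.score_responses([llm_response])
    print(run.overall_scores)  # or log them, as suggested below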

+    print(run.overall_scores)
Collaborator:

from stampy_chat import logging
logging.info(run.overall_scores)

Collaborator: you can also send messages to discord with the logger

Author: That would be really cool! I will add that in!




     if callback:
         callback({'state': 'done'})
         callback(None) # make sure the callback handler know that things have ended