Merge pull request #26 from parkervg/feature/vqa-model-integration
`TransformersVisionModel`, an `ImageCaption` ingredient, and new `default_model` behavior: `blend()`'s `blender` argument is renamed to `default_model`, and an ingredient that carries its own `model` overrides it
parkervg authored Jun 21, 2024
2 parents 2e8c511 + 7612050 · commit 637141f
Showing 32 changed files with 434 additions and 185 deletions.
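In practical terms, the user-facing surface after this merge is: `ImageCaption` is exported alongside the other builtin ingredients, and `blend()` accepts `default_model=` where it previously took `blender=`. Below is a minimal sketch of that calling convention, pieced together from the hunks that follow; the `TransformersLLM`/`Pandas` import paths, the toy table, and the query text are illustrative assumptions rather than contents of this diff.

```python
import pandas as pd

# Top-level exports, per the blendsql/__init__.py hunk in this commit
from blendsql import blend, LLMMap, LLMQA, LLMJoin, ImageCaption  # ImageCaption is new here
# Assumed import paths; they are not shown in this diff
from blendsql.models import TransformersLLM
from blendsql.db import Pandas

# Hypothetical single-table database standing in for the README examples
db = Pandas(
    {"w": pd.DataFrame({"city": ["bathurst", "sydney"], "score": ["11-0", "23-10"]})}
)

model = TransformersLLM("Qwen/Qwen1.5-0.5B")  # model id taken from the README hunk

smoothie = blend(
    query="SELECT * FROM w WHERE {{LLMMap('Did the home side score more than 20?', 'w::score')}} = TRUE",
    db=db,
    ingredients={LLMMap, LLMQA, LLMJoin},
    default_model=model,  # this commit renames the former `blender=` keyword
    infer_gen_constraints=True,
    verbose=True,
)
print(smoothie.df)
```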
121 changes: 73 additions & 48 deletions README.md
@@ -53,46 +53,68 @@ BlendSQL allows us to ask the following questions by injecting "ingredients", wh

_Which parks don't have park facilities?_
```sql
SELECT * FROM parks
WHERE NOT {{
LLMValidate(
'Does this location have park facilities?',
context=(SELECT "Name" AS "Park", "Description" FROM parks),
)
}}
SELECT "Name", "Description" FROM parks
WHERE {{
LLMMap(
'Does this location have park facilities?',
context='parks::Description'
)
}} = FALSE
```
| Name | Description |
|:----------------|:---------------------------------------------------------------------------------------------------------------------------------------|
| Everglades | The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities. |
<hr>

_What does the largest park in Alaska look like?_

```sql
SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
WHERE "Location" = 'Alaska'
ORDER BY {{
LLMMap(
'Size in km2?',
'parks::Area'
)
}} LIMIT 1
SELECT "Name",
{{ImageCaption('parks::Image')}} as "Image Description",
{{
LLMMap(
question='Size in km2?',
context='parks::Area'
)
}} as "Size in km" FROM parks
WHERE "Location" = 'Alaska'
ORDER BY "Size in km" DESC LIMIT 1
```

| Name | Image Description | Size in km |
|:-----------|:--------------------------------------------------------|-------------:|
| Everglades | A forest of tall trees with a sunset in the background. | 30448.1 |

<hr>

_Which state is the park in that protects an ash flow?_

```sql
SELECT "Location" FROM parks WHERE "Name" = {{
LLMQA(
'Which park protects an ash flow?',
context=(SELECT "Name", "Description" FROM parks),
options="parks::Name"
)
}}
SELECT "Location", "Name" AS "Park Protecting Ash Flow" FROM parks
WHERE "Name" = {{
LLMQA(
'Which park protects an ash flow?',
context=(SELECT "Name", "Description" FROM parks),
options="parks::Name"
)
}}
```
| Location | Park Protecting Ash Flow |
|:-----------|:---------------------------|
| Alaska | Katmai |

<hr>

_How many parks are located in more than 1 state?_

```sql
SELECT COUNT(*) FROM parks
WHERE {{LLMMap('How many states?', 'parks::Location')}} > 1
```
| Count |
|--------:|
| 1 |
<hr>

Now, we have an intermediate representation for our LLM to use that is explainable, debuggable, and [very effective at hybrid question-answering tasks](https://arxiv.org/abs/2402.17882).

@@ -124,25 +146,28 @@ model = TransformersLLM('Qwen/Qwen1.5-0.5B')

# Prepare our local database
db = Pandas(
{
"w": pd.DataFrame(
(
['11 jun', 'western districts', 'bathurst', 'bathurst ground', '11-0'],
['12 jun', 'wallaroo & university nsq', 'sydney', 'cricket ground',
'23-10'],
['5 jun', 'northern districts', 'newcastle', 'sports ground', '29-0']
),
columns=['date', 'rival', 'city', 'venue', 'score']
),
"documents": pd.DataFrame(
(
['bathurst, new south wales', 'bathurst /ˈbæθərst/ is a city in the central tablelands of new south wales , australia . it is about 200 kilometres ( 120 mi ) west-northwest of sydney and is the seat of the bathurst regional council .'],
['sydney', 'sydney ( /ˈsɪdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in australia and oceania . located on australia s east coast , the metropolis surrounds port jackson.'],
['newcastle, new south wales', 'the newcastle ( /ˈnuːkɑːsəl/ new-kah-səl ) metropolitan area is the second most populated area in the australian state of new south wales and includes the newcastle and lake macquarie local government areas .']
),
columns=['title', 'content']
)
}
{
"w": pd.DataFrame(
(
['11 jun', 'western districts', 'bathurst', 'bathurst ground', '11-0'],
['12 jun', 'wallaroo & university nsq', 'sydney', 'cricket ground',
'23-10'],
['5 jun', 'northern districts', 'newcastle', 'sports ground', '29-0']
),
columns=['date', 'rival', 'city', 'venue', 'score']
),
"documents": pd.DataFrame(
(
['bathurst, new south wales',
'bathurst /ˈbæθərst/ is a city in the central tablelands of new south wales , australia . it is about 200 kilometres ( 120 mi ) west-northwest of sydney and is the seat of the bathurst regional council .'],
['sydney',
'sydney ( /ˈsɪdni/ ( listen ) sid-nee ) is the state capital of new south wales and the most populous city in australia and oceania . located on australia s east coast , the metropolis surrounds port jackson.'],
['newcastle, new south wales',
'the newcastle ( /ˈnuːkɑːsəl/ new-kah-səl ) metropolitan area is the second most populated area in the australian state of new south wales and includes the newcastle and lake macquarie local government areas .']
),
columns=['title', 'content']
)
}
)

# Write BlendSQL query
@@ -157,13 +182,13 @@ WHERE city = {{
}}
"""
smoothie = blend(
query=blendsql,
db=db,
ingredients={LLMMap, LLMQA, LLMJoin},
blender=model,
# Optional args below
infer_gen_constraints=True,
verbose=True
query=blendsql,
db=db,
ingredients={LLMMap, LLMQA, LLMJoin},
default_model=model,
# Optional args below
infer_gen_constraints=True,
verbose=True
)
print(smoothie.df)
# ┌────────┬───────────────────┬──────────┬─────────────────┬─────────┐
2 changes: 1 addition & 1 deletion app.py
@@ -59,7 +59,7 @@ async def main(message: cl.Message):
query=blendsql_query,
db=db,
ingredients={LLMMap, LLMQA, LLMJoin},
blender=blender_model,
default_model=blender_model,
infer_gen_constraints=True,
verbose=False,
)
2 changes: 1 addition & 1 deletion benchmark/run.py
@@ -40,7 +40,7 @@
smoothie = blend(
query=query,
db=db,
blender=MODEL,
default_model=MODEL,
verbose=False,
ingredients=ingredients,
)
2 changes: 1 addition & 1 deletion blendsql/__init__.py
@@ -1,5 +1,5 @@
__version__ = "0.0.18"


from .ingredients.builtin import LLMMap, LLMQA, LLMJoin, LLMValidate
from .ingredients.builtin import LLMMap, LLMQA, LLMJoin, LLMValidate, ImageCaption
from .blend import blend
6 changes: 4 additions & 2 deletions blendsql/_program.py
@@ -1,5 +1,5 @@
from __future__ import annotations
from typing import Tuple, Type
from typing import Tuple, Type, List, Union
import inspect
import ast
import textwrap
@@ -54,7 +54,9 @@ def __new__(
return self.__call__(self, model, **kwargs)

@abstractmethod
def __call__(self, model: Model, *args, **kwargs) -> Tuple[str, str]:
def __call__(
self, model: Model, *args, **kwargs
) -> Tuple[Union[str, List[str]], str]:
"""Logic for formatting prompt and calling the underlying model.
Should return tuple of (response, prompt).
"""
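The substantive change in `_program.py` above is the widened return annotation: a `Program.__call__` may now return either a single response string or a list of strings. A standalone sketch of a conforming subclass is below; the `Program`/`Model` import paths and the `model.generate(...)` call are assumptions for illustration and are not shown in this diff.

```python
from typing import List, Tuple, Union

from blendsql._program import Program  # assumed import path
from blendsql.models import Model      # assumed import path


class BatchCaptionProgram(Program):
    """Hypothetical program that produces one response per prompt."""

    def __call__(
        self, model: Model, prompts: List[str], **kwargs
    ) -> Tuple[Union[str, List[str]], str]:
        # `model.generate` is a stand-in for whatever the Model object actually
        # exposes; the real call is not visible in this hunk.
        responses: List[str] = [model.generate(p) for p in prompts]
        # Per the docstring above, return (response, prompt); the response side
        # may now be a List[str] under the widened signature.
        return responses, "\n".join(prompts)
```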
58 changes: 35 additions & 23 deletions blendsql/blend.py
@@ -147,13 +147,13 @@ def autowrap_query(
return query


def preprocess_blendsql(query: str, blender: Model) -> Tuple[str, dict, set]:
def preprocess_blendsql(query: str, default_model: Model) -> Tuple[str, dict, set]:
"""Parses BlendSQL string with our pyparsing grammar and returns objects
required for interpretation and execution.
Args:
query: The BlendSQL query to preprocess
blender: Model object, which we attach to each parsed_dict
default_model: Model object, which we attach to each parsed_dict
Returns:
Tuple, containing:
@@ -166,7 +166,7 @@ def preprocess_blendsql(query: str, blender: Model) -> Tuple[str, dict, set]:
```python
preprocess_blendsql(
query="SELECT * FROM documents JOIN {{LLMJoin(left_on='w::player', right_on='documents::title')}} WHERE rank = 2",
blender=blender
default_model=default_model
)
```
```text
```
@@ -236,7 +236,7 @@ def preprocess_blendsql(query: str, blender: Model) -> Tuple[str, dict, set]:
# So we need to parse by indices in dict expression
# maybe if I was better at pp.Suppress we wouldn't need this
kwargs_dict = {x[0]: x[-1] for x in parsed_results_dict["kwargs"]}
kwargs_dict[IngredientKwarg.MODEL] = blender
kwargs_dict[IngredientKwarg.MODEL] = default_model
context_arg = kwargs_dict.get(
IngredientKwarg.CONTEXT,
(
@@ -279,7 +279,7 @@ def materialize_cte(
query_context: QueryContextManager,
aliasname: str,
db: Database,
blender: Model,
default_model: Model,
ingredient_alias_to_parsed_dict: Dict[str, dict],
**kwargs,
) -> pd.DataFrame:
@@ -288,7 +288,7 @@
ingredient_alias_to_parsed_dict=ingredient_alias_to_parsed_dict,
query=str_subquery,
db=db,
blender=blender,
default_model=default_model,
aliasname=aliasname,
**kwargs,
).df
@@ -382,7 +382,7 @@ def disambiguate_and_submit_blend(
def _blend(
query: str,
db: Database,
blender: Optional[Model] = None,
default_model: Optional[Model] = None,
ingredients: Optional[Collection[Type[Ingredient]]] = None,
verbose: bool = False,
infer_gen_constraints: bool = True,
@@ -411,7 +411,7 @@ def _blend(
query,
ingredient_alias_to_parsed_dict,
tables_in_ingredients,
) = preprocess_blendsql(query=query, blender=blender)
) = preprocess_blendsql(query=query, default_model=default_model)
query = autowrap_query(
query=query,
kitchen=kitchen,
@@ -435,11 +435,13 @@ def _blend(
df=db.execute_to_df(query_context.to_string()),
meta=SmoothieMeta(
num_values_passed=0,
prompt_tokens=blender.prompt_tokens if blender is not None else 0,
prompt_tokens=default_model.prompt_tokens
if default_model is not None
else 0,
completion_tokens=(
blender.completion_tokens if blender is not None else 0
default_model.completion_tokens if default_model is not None else 0
),
prompts=blender.prompts if blender is not None else [],
prompts=default_model.prompts if default_model is not None else [],
ingredients=[],
query=original_query,
db_url=str(db.db_url),
@@ -524,7 +526,7 @@ def _blend(
query_context=query_context,
subquery=aliased_subquery,
aliasname=tablename,
blender=blender,
default_model=default_model,
db=db,
ingredient_alias_to_parsed_dict=ingredient_alias_to_parsed_dict,
# Below are in case we need to call blend() again
@@ -566,7 +568,7 @@ def _blend(
query_context=query_context,
subquery=aliased_subquery,
aliasname=aliasname,
blender=blender,
default_model=default_model,
db=db,
ingredient_alias_to_parsed_dict=ingredient_alias_to_parsed_dict,
# Below are in case we need to call blend() again
@@ -620,7 +622,12 @@ def _blend(

if table_to_title is not None:
kwargs_dict["table_to_title"] = table_to_title

# Heuristic check to see if we should snag the singleton arg as context
if (
len(parsed_results_dict["args"]) == 1
and "::" in parsed_results_dict["args"][0]
):
kwargs_dict[IngredientKwarg.CONTEXT] = parsed_results_dict["args"].pop()
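Read on its own, the heuristic above is what allows an ingredient call with a single bare 'table::column' positional argument, such as the README's `{{ImageCaption('parks::Image')}}`, to have that argument promoted to the `context` keyword. A standalone illustration of the promotion (a plain string stands in for `IngredientKwarg.CONTEXT`; this is not library code):

```python
# Hypothetical mirror of the check above, outside of blend.py
args = ["parks::Image"]             # e.g. from {{ImageCaption('parks::Image')}}
kwargs = {}
if len(args) == 1 and "::" in args[0]:
    kwargs["context"] = args.pop()  # the lone 'table::column' arg becomes the context
assert kwargs == {"context": "parks::Image"} and args == []
```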
# Optionally, recursively call blend() again to get subtable from args
# This applies to `context` and `options`
for i, unpack_kwarg in enumerate(
@@ -644,7 +651,7 @@ def _blend(
_smoothie = _blend(
query=unpack_value,
db=db,
blender=blender,
default_model=default_model,
ingredients=ingredients,
infer_gen_constraints=infer_gen_constraints,
table_to_title=table_to_title,
@@ -664,7 +671,8 @@ def _blend(
kwargs_dict[unpack_kwarg] = subtable
# Below, we can remove the optional `context` arg we passed in args
parsed_results_dict["args"] = parsed_results_dict["args"][:1]

if getattr(ingredient, "model", None) is not None:
kwargs_dict["model"] = ingredient.model
# Execute our ingredient function
function_out = ingredient(
*parsed_results_dict["args"],
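The two added lines above (`if getattr(ingredient, "model", None) is not None: ...`) are the core of the new `default_model` behavior: every ingredient call falls back to the model passed to `blend()` unless the ingredient instance carries its own `model`, in which case that one wins. A standalone sketch of that resolution order (a hypothetical helper, not library code):

```python
from typing import Optional


def resolve_model(
    ingredient_model: Optional["Model"],  # "Model" is blendsql's model type, not imported here
    default_model: Optional["Model"],
) -> Optional["Model"]:
    """Mirrors the hunk above: a model bound to the ingredient overrides
    blend()'s default_model; otherwise the default is used for the call."""
    return ingredient_model if ingredient_model is not None else default_model
```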
@@ -832,9 +840,13 @@ def _blend(
]
)
+ _prev_passed_values,
prompt_tokens=blender.prompt_tokens if blender is not None else 0,
completion_tokens=blender.completion_tokens if blender is not None else 0,
prompts=blender.prompts if blender is not None else [],
prompt_tokens=default_model.prompt_tokens
if default_model is not None
else 0,
completion_tokens=default_model.completion_tokens
if default_model is not None
else 0,
prompts=default_model.prompts if default_model is not None else [],
ingredients=ingredients,
query=original_query,
db_url=str(db.db_url),
Expand All @@ -845,7 +857,7 @@ def _blend(
def blend(
query: str,
db: Database,
blender: Optional[Model] = None,
default_model: Optional[Model] = None,
ingredients: Optional[Collection[Type[Ingredient]]] = None,
verbose: bool = False,
infer_gen_constraints: bool = True,
@@ -861,7 +873,7 @@ def blend(
db: Database connector object
ingredients: Collection of ingredient objects, to use in interpreting BlendSQL query
verbose: Boolean defining whether to run with logger in debug mode
blender: Which BlendSQL model to use in performing ingredient tasks in the current query
default_model: Which BlendSQL model to use in performing ingredient tasks in the current query
infer_gen_constraints: Optionally infer the output format of an `IngredientMap` call, given the predicate context
For example, in `{{LLMMap('convert to date', 'w::listing date')}} <= '1960-12-31'`
We can infer the output format should look like '1960-12-31' and both:
@@ -926,7 +938,7 @@ def blend(
query=blendsql,
db=db,
ingredients={LLMMap, LLMQA, LLMJoin},
blender=model,
default_model=model,
# Optional args below
infer_gen_constraints=True,
verbose=True
@@ -959,7 +971,7 @@ def blend(
smoothie = _blend(
query=query,
db=db,
blender=blender,
default_model=default_model,
ingredients=ingredients,
infer_gen_constraints=infer_gen_constraints,
table_to_title=table_to_title,