Merge branch 'master' into josiah/completions
jwlee64 authored Nov 5, 2024
2 parents 7e0783d + 69b2e1b commit 5fb659d
Showing 146 changed files with 11,255 additions and 2,520 deletions.
10 changes: 8 additions & 2 deletions .pre-commit-config.yaml
@@ -12,14 +12,20 @@ repos:
- id: ruff-format
types_or: [python, pyi, jupyter]
- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v1.10.0'
rev: "v1.10.0"
hooks:
- id: mypy
additional_dependencies:
[types-pkg-resources==0.1.3, types-all, wandb>=0.15.5]
# Note: You have to update pyproject.toml[tool.mypy] too!
args: ['--config-file=pyproject.toml']
args: ["--config-file=pyproject.toml"]
exclude: (.*pyi$)|(weave_query)|(tests)|(examples)
- repo: https://github.com/RobertCraigie/pyright-python
rev: v1.1.387
hooks:
- id: pyright
additional_dependencies: [".[tests]"]

# This is legacy Weave when we were building a notebook product - should be removed
- repo: local
hooks:
6 changes: 5 additions & 1 deletion Makefile
@@ -14,4 +14,8 @@ docs:
build:
uv build

prepare-release: docs build
prepare-release: docs build

synchronize-base-object-schemas:
cd weave && make generate_base_object_schemas && \
cd ../weave-js && yarn generate-schemas
218 changes: 218 additions & 0 deletions dev_docs/BaseObjectClasses.md
@@ -0,0 +1,218 @@
# BaseObjectClasses

## Refresher on Objects and object storage

In Weave, we have a general-purpose data storage system for objects.
The payloads themselves are completely free-form - basically anything that can be JSON-serialized.
Users can "publish" runtime objects to Weave using `weave.publish`.
For example:

```python
config = {"model_name": "my_model", "model_version": "1.0"}
ref = weave.publish(config, name="my_model_config")
```

This will create a new object "version" in the collection called "my_model_config".
These can then be retrieved using `weave.ref().get()`:

```python
config = weave.ref("my_model_config").get()
```

Sometimes users are working with standard structured classes like `dataclasses` or `pydantic.BaseModel`.
In such cases, we have special serialization and deserialization logic that allows for cleaner serialization patterns.
For example, let's say the user does:

```python
class ModelConfig(weave.Object):
    model_name: str
    model_version: str
```

Then the user can publish an instance of `ModelConfig` as follows:

```python
config = ModelConfig(model_name="my_model", model_version="1.0")
ref = weave.publish(config)
```

This will result in an on-disk payload that looks like:

```json
{
"model_name": "my_model",
"model_version": "1.0",
"_type": "ModelConfig",
"_class_name": "ModelConfig",
"_bases": ["Object", "BaseModel"]
}
```

And additionally, the user can query for all objects of the `ModelConfig` class using the `base_object_classes` filter in `objs_query` or `POST objs/query`.
Effectively, this is like creating a virtual table for that class.

**Terminology**: We use the term "weave Object" (capital "O") to refer to instances of classes that subclass `weave.Object`.

**Technical note**: the `base_object_class` is the class that directly subclasses `Object`, which is not necessarily the object's own `_class_name`.
For example, let's say the class hierarchy is:
* `A -> Object -> BaseModel`, then the `base_object_class` filter will be "A".
* `B -> A -> Object -> BaseModel`, then the `base_object_class` filter will still be "A"!
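
The rule above can be sketched in a few lines of Python. This is only an illustration, assuming the lineage is available as the `_class_name` and `_bases` fields shown in the payload earlier. `base_object_class_name` is a hypothetical helper, not part of the Weave API, and the way Weave actually computes the value may differ.

```python
from typing import Optional


def base_object_class_name(class_name: str, bases: list[str]) -> Optional[str]:
    """Return the name of the class that directly subclasses "Object".

    `class_name` and `bases` mirror the `_class_name` and `_bases` payload
    fields. Returns None when "Object" is not in the lineage, or when the
    class is "Object" itself (an assumption made for this sketch).
    """
    # Full lineage, most-derived class first, e.g. ["B", "A", "Object", "BaseModel"].
    lineage = [class_name, *bases]
    if "Object" not in lineage:
        return None
    object_index = lineage.index("Object")
    if object_index == 0:
        return None
    # The entry just before "Object" is its direct subclass.
    return lineage[object_index - 1]


# The two hierarchies from the note above both resolve to "A".
print(base_object_class_name("A", ["Object", "BaseModel"]))
print(base_object_class_name("B", ["A", "Object", "BaseModel"]))
```

Because `B` resolves to `A`, instances of both classes land in the same virtual table when filtering on `A`.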

Finally, the Weave library itself uses this mechanism for common objects like `Model`, `Dataset`, and `Evaluation`.
This allows the user to subclass these objects to add additional metadata or functionality, while categorizing them in the same virtual table.

## Validated Base Objects

While many Weave Objects are free-form and user-defined, there is often a need for well-defined schemas for configuration objects that are tightly defined by Weave itself. The BaseObject system provides a way to define these schemas once and use them consistently across the entire stack.

### Key Features

1. **Single Source of Truth**: Define your schema once using Pydantic models
2. **Full Stack Integration**: The schema is used for:
   - Python SDK validation
   - Server-side HTTP API validation
   - Frontend UI validation with generated TypeScript types
   - Future: OpenAPI schema generation
   - Future: TypeScript SDK type generation

### Usage Example

Here's how to define and use a validated base object:

1. **Define your schema** (in `weave/trace_server/interface/base_object_classes/your_schema.py`):

```python
from pydantic import BaseModel
from weave.trace_server.interface.base_object_classes import base_object_def

class NestedConfig(BaseModel):
    setting_a: int

class MyConfig(base_object_def.BaseObject):
    name: str
    nested: NestedConfig
    reference: base_object_def.RefStr

__all__ = ["MyConfig"]
```

2. **Use in Python**:
```python
# Publishing
ref = weave.publish(MyConfig(...))

# Fetching (maintains type)
config = ref.get()
assert isinstance(config, MyConfig)
```

3. **Use via HTTP API**:
```bash
# Creating
curl -X POST 'https://trace.wandb.ai/obj/create' \
-H 'Content-Type: application/json' \
-d '{
"obj": {
"project_id": "user/project",
"object_id": "my_config",
"val": {...},
"set_base_object_class": "MyConfig"
}
}'

# Querying
curl -X POST 'https://trace.wandb.ai/objs/query' \
-d '{
"project_id": "user/project",
"filter": {
"base_object_classes": ["MyConfig"]
}
}'
```

4. **Use in React**:
```typescript
// Read with type safety
const result = useBaseObjectInstances("MyConfig", ...);

// Write with validation
const createFn = useCreateBaseObjectInstance("MyConfig");
createFn({...}); // TypeScript enforced schema
```
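
For callers that issue the HTTP requests from step 3 programmatically, the request bodies can be assembled as plain JSON. The sketch below only builds the bodies and sends nothing. The helper names `build_create_body` and `build_query_body` are hypothetical, the field values in the usage lines are illustrative placeholders, and authentication is out of scope here.

```python
import json
from typing import Any


def build_create_body(
    project_id: str, object_id: str, val: dict[str, Any], base_object_class: str
) -> str:
    """Build the JSON body for `POST obj/create`, mirroring the curl example."""
    return json.dumps(
        {
            "obj": {
                "project_id": project_id,
                "object_id": object_id,
                "val": val,
                # Asks the server to validate `val` against the named schema.
                "set_base_object_class": base_object_class,
            }
        }
    )


def build_query_body(project_id: str, base_object_classes: list[str]) -> str:
    """Build the JSON body for `POST objs/query`, filtered by base object class."""
    return json.dumps(
        {
            "project_id": project_id,
            "filter": {"base_object_classes": base_object_classes},
        }
    )


# Illustrative values only; the reference string is a placeholder, not a real ref.
example_val = {
    "name": "example",
    "nested": {"setting_a": 1},
    "reference": "<ref-string>",
}
create_body = build_create_body("user/project", "my_config", example_val, "MyConfig")
query_body = build_query_body("user/project", ["MyConfig"])
```

Keeping body construction separate from transport makes it easy to check the payload shape in tests before any request is sent.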

### Keeping Frontend Types in Sync

Run `make synchronize-base-object-schemas` to ensure the frontend TypeScript types are up to date with your Pydantic schemas.

### Implementation Notes

- Base objects are pure data schemas (fields only)
- The system is designed to work independently of the weave SDK to maintain clean separation of concerns
- Server-side validation ensures data integrity
- Client-side validation (both Python and TypeScript) provides early feedback
- Generated TypeScript types ensure type safety in the frontend
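
To make the "look up the schema by name, then validate" idea concrete, here is a deliberately simplified sketch. It uses standard-library dataclasses as a stand-in for Pydantic so that it runs without dependencies. `SCHEMA_REGISTRY`, `register_schema`, and `validate_payload` are hypothetical names for illustration and do not reflect how Weave's registry is implemented.

```python
from dataclasses import dataclass, fields
from typing import Any

# Hypothetical registry mapping class name -> schema class (illustration only).
SCHEMA_REGISTRY: dict[str, type] = {}


def register_schema(cls: type) -> type:
    """Record a schema class under its class name and return it unchanged."""
    SCHEMA_REGISTRY[cls.__name__] = cls
    return cls


@register_schema
@dataclass(frozen=True)
class MyConfig:
    # Pure data schema: fields only, no behavior.
    name: str
    reference: str


def validate_payload(val: dict[str, Any], base_object_class: str) -> Any:
    """Validate a raw payload against the schema registered under the given name."""
    cls = SCHEMA_REGISTRY.get(base_object_class)
    if cls is None:
        raise ValueError(f"Unknown base object class: {base_object_class!r}")
    expected = {f.name: f.type for f in fields(cls)}
    if set(val) != set(expected):
        raise ValueError(f"Expected fields {sorted(expected)}, got {sorted(val)}")
    for field_name, field_type in expected.items():
        if not isinstance(val[field_name], field_type):
            raise TypeError(f"Field {field_name!r} must be {field_type.__name__}")
    return cls(**val)


config = validate_payload({"name": "example", "reference": "<ref-string>"}, "MyConfig")
```

A real Pydantic model handles nested models, coercion, and richer error reporting. The point here is only that a single registered definition can back validation wherever a payload arrives with just a class name attached.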

### Architecture Flow

1. Define your schema in a Python file in the `weave/trace_server/interface/base_object_classes/` directory. See `weave/trace_server/interface/base_object_classes/test_only_example.py` as an example.
2. Make sure to register your schemas in `weave/trace_server/interface/base_object_classes/base_object_registry.py` by calling `register_base_object`.
3. Run `make synchronize-base-object-schemas` to generate the frontend types.
   * The first step (`make generate_base_object_schemas`) runs `weave/scripts/generate_base_object_schemas.py` to generate a JSON schema in `weave/trace_server/interface/base_object_classes/generated/generated_base_object_class_schemas.json`.
   * The second step (`yarn generate-schemas`) reads this file and uses it to generate the frontend types located in `weave-js/src/components/PagePanelComponents/Home/Browse3/pages/wfReactInterface/generatedBaseObjectClasses.zod.ts`.
4. Now, each use case uses different parts:
   1. `Python Writing`. Users can directly import these classes and use them as normal Pydantic models, which get published with `weave.publish`. The Python client correctly builds the requisite payload.
   2. `Python Reading`. Users can call `weave.ref().get()` and the Weave Python SDK will return the instance with the correct type. Note: we do some special handling such that the returned object is not a WeaveObject, but an instance of the exact Pydantic class.
   3. `HTTP Writing`. In cases where the client/user does not want to add the special type information, users can publish base objects by setting the `set_base_object_class` setting on `POST obj/create` to the name of the class. The Weave server will validate the object against the schema, update the metadata fields, and store the object.
   4. `HTTP Reading`. When querying for objects, the server will return the object with the correct type if the `base_object_class` metadata field is set.
   5. `Frontend`. The frontend will read the zod schema from `weave-js/src/components/PagePanelComponents/Home/Browse3/pages/wfReactInterface/generatedBaseObjectClasses.zod.ts` and use that to provide compile-time type safety when using `useBaseObjectInstances` and runtime type safety when using `useCreateBaseObjectInstance`.
   * Note: it is critical that all techniques produce the same digest for the same data, which is covered by the tests. This way versions are not thrashed by different clients/users.
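
The digest requirement can be illustrated with a content hash over a canonical serialization. This is a generic sketch and not Weave's actual digest algorithm, which this document does not describe. It assumes SHA-256 over JSON with sorted keys, and `canonical_digest` is a hypothetical helper.

```python
import hashlib
import json
from typing import Any


def canonical_digest(val: Any) -> str:
    """Hash a JSON-serializable value so that key order does not affect the result."""
    canonical = json.dumps(val, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same data, as two different clients might happen to order it.
from_python_client = {"name": "example", "nested": {"setting_a": 1}}
from_http_client = {"nested": {"setting_a": 1}, "name": "example"}

same = canonical_digest(from_python_client) == canonical_digest(from_http_client)
print(same)
```

If the publishing paths disagreed on field order or on which metadata fields are included, identical data would hash differently and each client would create a spurious new version.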

```mermaid
graph TD
subgraph Schema Definition
F["weave/trace_server/interface/<br>base_object_classes/your_schema.py"] --> |defines| P[Pydantic BaseObject]
P --> |register_base_object| R["base_object_registry.py"]
end
subgraph Schema Generation
M["make synchronize-base-object-schemas"] --> G["make generate_base_object_schemas"]
G --> |runs| S["weave/scripts/<br>generate_base_object_schemas.py"]
R --> |import registered classes| S
S --> |generates| J["generated_base_object_class_schemas.json"]
M --> |yarn generate-schemas| Z["generatedBaseObjectClasses.zod.ts"]
J --> Z
end
subgraph "Trace Server"
subgraph "HTTP API"
R --> |validates using| HW["POST obj/create<br>set_base_object_class"]
HW --> DB[(Weave Object Store)]
HR["POST objs/query<br>base_object_classes"] --> |Filters base_object_class| DB
end
end
subgraph "Python SDK"
PW[Client Code] --> |import & publish| W[weave.publish]
W --> |store| HW
R --> |validates using| W
PR["weave ref get()"] --> |queries| HR
R --> |deserializes using| PR
end
subgraph "Frontend"
Z --> |import| UBI["useBaseObjectInstances"]
Z --> |import| UCI["useCreateBaseObjectInstance"]
UBI --> |Filters base_object_class| HR
UCI --> |set_base_object_class| HW
UI[React UI] --> UBI
UI --> UCI
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style P fill:#f9f,stroke:#333,stroke-width:2px
style R fill:#bbf,stroke:#333,stroke-width:2px
style DB fill:#dfd,stroke:#333,stroke-width:2px
style J fill:#ffd,stroke:#333,stroke-width:2px
style Z fill:#ffd,stroke:#333,stroke-width:2px
style M fill:#faa,stroke:#333,stroke-width:4px
```
6 changes: 6 additions & 0 deletions docs/Makefile
@@ -11,6 +11,12 @@ generate_python_sdk_docs:
mkdir -p ./docs/reference/python-sdk
python scripts/generate_python_sdk_docs.py

generate_typescript_sdk_docs:
mkdir -p ./docs/reference/typescript-sdk
rm -rf ./docs/reference/typescript-sdk
mkdir -p ./docs/reference/typescript-sdk
bash scripts/generate_typescript_sdk_docs.sh

generate_notebooks_docs:
mkdir -p ./docs/reference/gen_notebooks
rm -rf ./docs/reference/gen_notebooks
72 changes: 54 additions & 18 deletions docs/docs/guides/core-types/datasets.md
@@ -1,3 +1,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Datasets

`Dataset`s enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Use this to download the latest version locally with a simple API.
@@ -10,25 +13,58 @@ This guide will show you how to:

## Sample code

```python
import weave
from weave import Dataset
# Initialize Weave
weave.init('intro-example')
<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
```python
import weave
from weave import Dataset
# Initialize Weave
weave.init('intro-example')

# Create a dataset
dataset = Dataset(
name='grammar',
rows=[
{'id': '0', 'sentence': "He no likes ice cream.", 'correction': "He doesn't like ice cream."},
{'id': '1', 'sentence': "She goed to the store.", 'correction': "She went to the store."},
{'id': '2', 'sentence': "They plays video games all day.", 'correction': "They play video games all day."}
]
)

# Publish the dataset
weave.publish(dataset)

# Retrieve the dataset
dataset_ref = weave.ref('grammar').get()

# Access a specific example
example_label = dataset_ref.rows[2]['sentence']
```

</TabItem>
<TabItem value="typescript" label="TypeScript">
```typescript
import * as weave from 'weave';

// Initialize Weave
await weave.init('intro-example');

# Create a dataset
dataset = Dataset(name='grammar', rows=[
{'id': '0', 'sentence': "He no likes ice cream.", 'correction': "He doesn't like ice cream."},
{'id': '1', 'sentence': "She goed to the store.", 'correction': "She went to the store."},
{'id': '2', 'sentence': "They plays video games all day.", 'correction': "They play video games all day."}
])
// Create a dataset
const dataset = new weave.Dataset({
name: 'grammar',
rows: [
{id: '0', sentence: "He no likes ice cream.", correction: "He doesn't like ice cream."},
{id: '1', sentence: "She goed to the store.", correction: "She went to the store."},
{id: '2', sentence: "They plays video games all day.", correction: "They play video games all day."}
]
});

# Publish the dataset
weave.publish(dataset)
// Publish the dataset
await dataset.save();

# Retrieve the dataset
dataset_ref = weave.ref('grammar').get()
// Access a specific example
const exampleLabel = dataset.getRow(2).sentence;
```

# Access a specific example
example_label = dataset_ref.rows[2]['sentence']
```
</TabItem>
</Tabs>