Commit

updated docs

diptanu committed May 26, 2024
1 parent b628868 commit 5cfaedd
Showing 2 changed files with 19 additions and 73 deletions.
67 changes: 0 additions & 67 deletions docs/docs/apis/custom_extractors.md

This file was deleted.

25 changes: 19 additions & 6 deletions docs/docs/apis/develop_extractors.md
The content object has the following properties -

* **data** - The unstructured data encoded as raw bytes.
* **content_type** - The mime type of the data. For example, `text/plain`, `image/png`, etc. This allows you to decode the bytes correctly.
* **labels** - Optional key-value metadata associated with the content, provided by users or added by Indexify. Labels are used to filter content when deciding which bindings are invoked on it, or to store user-defined opaque metadata.
* **Feature** - Optional feature associated with the content, such as an embedding or JSON metadata. Embeddings are stored in indexes in the vector store, and JSON metadata is stored in a structured store such as Postgres. Features are searchable: if a feature is an embedding you can perform KNN search on the resulting index, and if it is JSON you can run JSON path queries on it.
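A minimal sketch of the shape described above can make the properties concrete. The field names below mirror the list but are illustrative stand-ins, not the SDK's exact definitions (see the linked source for the real class):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Content:
    # Raw bytes of the unstructured data.
    data: bytes
    # Mime type used to decode the bytes, e.g. "text/plain" or "image/png".
    content_type: str
    # Optional key-value metadata for filtering or user-defined bookkeeping.
    labels: Dict[str, str] = field(default_factory=dict)
    # Optional extracted features (embeddings or JSON metadata).
    features: List[Any] = field(default_factory=list)

doc = Content(
    data="Indexify ingests unstructured data.".encode("utf-8"),
    content_type="text/plain",
    labels={"source": "docs"},
)
```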

The **Content** object is [defined here](https://github.com/tensorlakeai/indexify/blob/11346c29055f16d397fc0901ec10139cdc945134/indexify_extractor_sdk/base_extractor.py#L48)

### Feature
A Feature is some form of information extracted from unstructured data. Embeddings and JSON metadata are the possible features for now. Extracted features are indexed and searchable.
Features can be easily constructed from [helper methods](https://github.com/tensorlakeai/indexify/blob/11346c29055f16d397fc0901ec10139cdc945134/indexify_extractor_sdk/base_extractor.py#L37)
You can optionally give features a name, such as `my_custom_text_embedding`; we use the names as suffixes of index names.
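To illustrate the helper-method idea, here is a hypothetical sketch of embedding and metadata feature constructors. The real helpers live in the SDK at the link above; the function names and dictionary fields here are assumptions, not the SDK's API:

```python
import json
from typing import Dict, List

def embedding_feature(values: List[float], name: str = "embedding") -> Dict:
    # An embedding feature: indexed in the vector store for KNN search.
    # The optional name becomes a suffix of the resulting index name.
    return {"feature_type": "embedding", "name": name, "value": values}

def metadata_feature(value: Dict, name: str = "metadata") -> Dict:
    # A JSON metadata feature: stored in the structured store (e.g. Postgres).
    return {"feature_type": "metadata", "name": name, "value": json.dumps(value)}

emb = embedding_feature([0.1, 0.2, 0.3], name="my_custom_text_embedding")
meta = metadata_feature({"entity": "Indexify"})
```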

## Install the Extractor SDK
```shell
pip install indexify-extractor-sdk
The following command will create a template for a new extractor in the current directory.

```shell
curl https://codeload.github.com/tensorlakeai/indexify-extractor-template/tar.gz/main | tar -xz indexify-extractor-template-main
```

## Implement the Extractor
```python
def extract(self, content: Content) -> List[Content]:
    # ... chunking, embedding, and NER steps are elided in this diff ...
    metadata_chunk = Content.from_text(
        text=chunk, feature=Feature.metadata(json.dumps(entities), name="metadata")
    )
    output.extend([embed_chunk, metadata_chunk])
    return output
```

**extract** - Takes a `Content` object, which has the bytes of unstructured data and the mime type. You can pass JSON, text, video, audio, and documents into the extract method. It should return a list of transformed or derived content, or a list of features.
Examples:
- Text Chunking: Input(Text) -> List(Text)
- Audio Transcription: Input(Audio) -> List(Text)
- Speaker Diarization: Input(Audio) -> List(JSON of text and corresponding speaker ids)
- PDF Extraction: Input(PDF) -> List(Text, Images and JSON representation of tables)
- PDF to Markdown: Input(PDF) -> List(Markdown)

In this example, we iterate over a list of content, chunk each one, run a NER model and an embedding model over each chunk, and return the chunks along with their features.
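The chunking step itself can be as simple as a sliding window over the decoded text. A plain-Python sketch (the chunk size and overlap are arbitrary choices for illustration, not SDK defaults):

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    # Slide a fixed-size window over the text, overlapping consecutive
    # chunks so sentences cut at a boundary still appear whole somewhere.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("".join(str(i % 10) for i in range(250)))
```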

!!! note "Extractor Dependencies"

    You can use any Python or native system dependencies in your extractors; we package them into a container when deploying them to production.


## Extractor Metadata
**sample_input**: A sample input your extractor can process, along with a sample input config. This is run when the extractor starts up to verify that the extractor is functioning properly.

### List dependencies
Add a `requirements.txt` file to the folder if it has any python dependencies.
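For example, a minimal `requirements.txt` for the extractor sketched above might contain (package choices are illustrative, matching the example dependencies mentioned below; versions omitted):

```text
torch
transformers
```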

## Extractor Description
Add a name to your extractor, a description of what it does, and its Python and system dependencies. These go in attributes/properties of your Extractor class:

* **name** - The name of the extractor. We use the name of the extractor also to name the container package.
* **description** - Long description of the extractor
* **python_dependencies** - List of python dependencies that you are importing in the extractor. Example - `["torch", "transformers"]`
* **system_dependencies** - List of system dependencies of the extractor such as any native dependencies of the model or packages you are using. Example - `["curl", "protobuf-compiler"]`
* **input_mime_types** - The list of input data types the extractor can handle. We use standard mime types as the API. The default is `["text/plain"]`; you can override it or specify which types your extractor supports from the [list here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types)
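Putting the attributes together, a hypothetical extractor class might declare the following. A plain class stands in here so the sketch is self-contained; in a real extractor these attributes would live on your subclass of the SDK's Extractor base class, and the name and description values are invented examples:

```python
class MyExtractor:
    # In a real extractor, subclass the SDK's Extractor base class.
    name = "my-org/my-extractor"  # also used to name the container package
    description = "Chunks text and attaches embeddings and NER metadata."
    python_dependencies = ["torch", "transformers"]
    system_dependencies = ["curl", "protobuf-compiler"]
    input_mime_types = ["text/plain"]
```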

#### Test the extractor locally

Extractors are just Python modules, so you can write a unit test for them like any other Python module. You should also test the extractor using the indexify binary to make sure it works as expected.

```shell
indexify-extractor describe custom_extractor:MyExtractor
indexify-extractor run-local custom_extractor:MyExtractor --text "hello world"
```
