docs: enabled listing for docs snippets #1143

Merged
merged 3 commits into from
Mar 25, 2024
11 changes: 11 additions & 0 deletions docs/tools/utils.py
@@ -5,12 +5,15 @@


 DOCS_DIR = "../website/docs"
+BLOG_DIR = "../website/blog"


 def collect_markdown_files(verbose: bool) -> List[str]:
     """
     Discovers all docs markdown files
     """
+
+    # collect docs pages
     markdown_files: List[str] = []
     for path, _, files in os.walk(DOCS_DIR):
         if "api_reference" in path:
@@ -23,6 +26,14 @@ def collect_markdown_files(verbose: bool) -> List[str]:
                 if verbose:
                     fmt.echo(f"Discovered {os.path.join(path, file)}")
+
+    # collect blog pages
+    for path, _, files in os.walk(BLOG_DIR):
+        for file in files:
+            if file.endswith(".md"):
+                markdown_files.append(os.path.join(path, file))
+                if verbose:
+                    fmt.echo(f"Discovered {os.path.join(path, file)}")
Contributor

Why don't you use glob?

glob.glob(f'{BLOG_DIR}/**/*.md', recursive=True)

https://docs.python.org/3/library/glob.html
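
The suggestion above can be sketched as follows. This is a minimal, self-contained comparison, not the code that was merged: it builds a throwaway directory tree (the file names are made up for illustration) and checks that `glob.glob` with `recursive=True` finds the same `.md` files as the `os.walk` loop in the diff. One caveat worth keeping in mind: by default `glob` skips files and directories whose names start with a dot, while `os.walk` visits them.

```python
import glob
import os
import tempfile
from typing import List


def collect_with_walk(root: str) -> List[str]:
    """Collect .md files the way the diff does, with os.walk."""
    found = []
    for path, _, files in os.walk(root):
        for file in files:
            if file.endswith(".md"):
                found.append(os.path.join(path, file))
    return sorted(found)


def collect_with_glob(root: str) -> List[str]:
    """Collect .md files with a recursive glob pattern."""
    # "**" matches the root itself and any depth of subdirectories
    pattern = os.path.join(root, "**", "*.md")
    return sorted(glob.glob(pattern, recursive=True))


def make_sample_tree(root: str) -> None:
    """Create a few hypothetical blog files for the comparison."""
    for rel in ("a.md", "notes.txt", os.path.join("2023", "b.md")):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w", encoding="utf-8") as handle:
            handle.write("sample\n")


with tempfile.TemporaryDirectory() as blog_dir:
    make_sample_tree(blog_dir)
    # Both approaches should report the same two markdown files
    print(collect_with_walk(blog_dir) == collect_with_glob(blog_dir))
```

Either approach works for the blog folder; the glob version is shorter, while the `os.walk` version makes it easier to skip whole subtrees, as the docs loop does for `api_reference`.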


     if len(markdown_files) < 50:  # sanity check
         fmt.error("Found too few files. Something went wrong.")
         exit(1)
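
The sanity check guards against a wrong `DOCS_DIR` or `BLOG_DIR` silently producing an almost empty run. A minimal sketch of the same idea as a reusable helper is shown here; `ensure_min_files` and `min_count` are hypothetical names, and `sys.exit` stands in for the bare `exit`, which is installed by the `site` module and intended for interactive use.

```python
import sys
from typing import List


def ensure_min_files(files: List[str], min_count: int = 50) -> None:
    """Abort with exit code 1 when suspiciously few files were discovered."""
    if len(files) < min_count:
        print("Found too few files. Something went wrong.", file=sys.stderr)
        sys.exit(1)


# Hypothetical file list that is long enough to pass the check
ensure_min_files([f"page_{i}.md" for i in range(60)])
print("enough files")
```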
19 changes: 11 additions & 8 deletions docs/website/blog/2023-06-14-dlthub-gpt-accelerated learning_01.md
@@ -47,9 +47,11 @@ The code provided below demonstrates training a chat-oriented GPT model using th



-```python
-!python3 -m pip install --upgrade langchain deeplake openai tiktoken
+```sh
+python -m pip install --upgrade langchain deeplake openai tiktoken
+```

+```py
# Create accounts on platform.openai.com and deeplake.ai. After registering, retrieve the access tokens for both platforms and store them securely. Enter the tokens when prompted.

import os
@@ -65,7 +67,7 @@ embeddings = OpenAIEmbeddings(disallowed_special=())

#### 2. Create a directory to store the code for training the model. Clone the desired repositories into that.

-```python
+```sh
# making a new directory named dlt-repo
!mkdir dlt-repo
# changing the directory to dlt-repo
@@ -80,7 +82,7 @@ embeddings = OpenAIEmbeddings(disallowed_special=())
```

#### 3. Load the files from the directory
-```python
+```py
import os
from langchain.document_loaders import TextLoader

@@ -95,7 +97,7 @@ for dirpath, dirnames, filenames in os.walk(root_dir):
pass
```
#### 4. Load the files from the directory
-```python
+```py
import os
from langchain.document_loaders import TextLoader

@@ -111,15 +113,16 @@ for dirpath, dirnames, filenames in os.walk(root_dir):
```

#### 5. Splitting files to chunks
-```python
+```py
# This code uses CharacterTextSplitter to split documents into smaller chunks based on character count and stores the resulting chunks in the texts variable.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```
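
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a plain-Python sketch of fixed-size character chunking. It is not LangChain's `CharacterTextSplitter` (the real splitter also takes a separator into account); `split_by_chars` is a hypothetical helper that only illustrates what the two parameters control.

```python
from typing import List


def split_by_chars(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_by_chars(sample, chunk_size=4))
print(split_by_chars(sample, chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in the blog snippet, every character lands in exactly one chunk; a positive overlap repeats the tail of each chunk at the head of the next.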
#### 6. Create Deeplake dataset
-```python
+
+```sh
# Set up your deeplake dataset by replacing the username with your Deeplake account and setting the dataset name. For example if the deeplakes username is “your_name” and the dataset is “dlt-hub-dataset”

username = "your_deeplake_username" # replace with your username from app.activeloop.ai
@@ -138,7 +141,7 @@ retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
```
#### 7. Initialize the GPT model
-```python
+```py
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

2 changes: 1 addition & 1 deletion docs/website/blog/2023-08-14-dlt-motherduck-blog.md
@@ -70,7 +70,7 @@ This is a perfect problem to test out my new super simple and highly customizabl
`dlt init bigquery duckdb`

This creates a folder with the directory structure
-```
+```text
├── .dlt
│ ├── config.toml
│ └── secrets.toml
2 changes: 1 addition & 1 deletion docs/website/blog/2023-08-21-dlt-lineage-support.md
@@ -63,7 +63,7 @@ By combining row and column level lineage, you can have an easy overview of wher

After a pipeline run, the schema evolution info gets stored in the load info.
Load it back to the database to persist the column lineage:
-```python
+```py
load_info = pipeline.run(data,
write_disposition="append",
table_name="users")
6 changes: 3 additions & 3 deletions docs/website/blog/2023-08-24-dlt-etlt.md
@@ -83,7 +83,7 @@ This engine is configurable in both how it works and what it does,
you can read more here: [Normaliser, schema settings](https://dlthub.com/docs/general-usage/schema#data-normalizer)

Here is a usage example (it's built into the pipeline):
-```python
+```py

import dlt

@@ -119,7 +119,7 @@ Besides your own customisations, `dlt` also supports injecting your transform co

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

-```python
+```py
import dlt
import hashlib

@@ -168,7 +168,7 @@ load_info = pipeline.run(data_source)
Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools.
`dlt` offers a dbt runner to get you started easily with your transformation package.

-```python
+```py
pipeline = dlt.pipeline(
pipeline_name='pipedrive',
destination='bigquery',
22 changes: 11 additions & 11 deletions docs/website/blog/2023-09-05-mongo-etl.md
@@ -139,29 +139,29 @@ Here's a code explanation of how it works under the hood:
example of how this nested data could look:

```json
-data = {
-    'id': 1,
-    'name': 'Alice',
-    'job': {
+{
+    "id": 1,
+    "name": "Alice",
+    "job": {
         "company": "ScaleVector",
-        "title": "Data Scientist",
+        "title": "Data Scientist"
     },
-    'children': [
+    "children": [
         {
-            'id': 1,
-            'name': 'Eve'
+            "id": 1,
+            "name": "Eve"
         },
         {
-            'id': 2,
-            'name': 'Wendy'
+            "id": 2,
+            "name": "Wendy"
         }
     ]
 }
```
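
A rough sketch of what unpacking such a document into relational tables can look like. This is an illustration only, not `dlt`'s normaliser: the `__` separator, the `users` table name and the `_parent_id` column are assumptions chosen for the example, and the sketch assumes every list holds dictionaries and every parent has an `id` field.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def flatten(record: Dict[str, Any], table: str, tables: Dict[str, List[Row]],
            parent_id: Any = None) -> None:
    """Flatten nested dicts into columns and move lists of dicts to child tables."""
    row: Row = {}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # hypothetical link back to the parent row

    def add_columns(prefix: str, value: Dict[str, Any]) -> None:
        for key, item in value.items():
            name = f"{prefix}{key}"
            if isinstance(item, dict):
                add_columns(f"{name}__", item)  # nested dict -> prefixed columns
            elif isinstance(item, list):
                for child in item:  # list of dicts -> rows in a child table
                    flatten(child, f"{table}__{name}", tables, record.get("id"))
            else:
                row[name] = item

    add_columns("", record)
    tables.setdefault(table, []).append(row)


data = {
    "id": 1,
    "name": "Alice",
    "job": {"company": "ScaleVector", "title": "Data Scientist"},
    "children": [{"id": 1, "name": "Eve"}, {"id": 2, "name": "Wendy"}],
}

tables: Dict[str, List[Row]] = {}
flatten(data, "users", tables)
for table_name, rows in tables.items():
    print(table_name, rows)
```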

1. We can load the data to a supported destination declaratively:

-```python
+```py
import dlt

pipeline = dlt.pipeline(
29 changes: 14 additions & 15 deletions docs/website/blog/2023-09-26-verba-dlt-zendesk.md
@@ -40,7 +40,7 @@ In this blog post, we'll guide you through the process of building a RAG applica

Create a new folder for your project and install Verba:

-```bash
+```sh
mkdir verba-dlt-zendesk
cd verba-dlt-zendesk
python -m venv venv
@@ -50,7 +50,7 @@ pip install goldenverba

To configure Verba, we need to set the following environment variables:

-```bash
+```sh
VERBA_URL=https://your-cluster.weaviate.network # your Weaviate instance URL
VERBA_API_KEY=F8...i4WK # the API key of your Weaviate instance
OPENAI_API_KEY=sk-...R # your OpenAI API key
@@ -61,13 +61,13 @@ You can put them in a `.env` file in the root of your project or export them in

Let's test that Verba is installed correctly:

-```bash
+```sh
verba start
```

You should see the following output:

-```bash
+```sh
INFO: Uvicorn running on <http://0.0.0.0:8000> (Press CTRL+C to quit)
ℹ Setting up client
✔ Client connected to Weaviate Cluster
@@ -88,23 +88,23 @@ If you try to ask a question now, you'll get an error in return. That's because

We get our data from Zendesk using dlt. Let's install it along with the Weaviate extra:

-```bash
+```sh
pip install "dlt[weaviate]"
```

This also installs a handy CLI tool called `dlt`. It will help us initialize the [Zendesk verified data source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk)—a connector to Zendesk Support API.

Let's initialize the verified source:

-```bash
+```sh
dlt init zendesk weaviate
```

`dlt init` pulls the latest version of the connector from the [verified source repository](https://github.com/dlt-hub/verified-sources) and creates a credentials file for it. The credentials file is called `secrets.toml` and it's located in the `.dlt` directory.

To make things easier, we'll use the email address and password authentication method for Zendesk API. Let's add our credentials to `secrets.toml`:

-```yaml
+```toml
[sources.zendesk.credentials]
password = "your-password"
subdomain = "your-subdomain"
@@ -113,14 +113,13 @@ email = "[email protected]"

We also need to specify the URL and the API key of our Weaviate instance. Copy the credentials for the Weaviate instance you created earlier and add them to `secrets.toml`:

-```yaml
+```toml
[destination.weaviate.credentials]
url = "https://your-cluster.weaviate.network"
api_key = "F8.....i4WK"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "sk-....."
-
```

All the components are now in place and configured. Let's set up a pipeline to import data from Zendesk.
@@ -129,7 +128,7 @@ All the components are now in place and configured. Let's set up a pipeline to i

Open your favorite text editor and create a file called `zendesk_verba.py`. Add the following code to it:

-```python
+```py
import itertools

import dlt
@@ -217,13 +216,13 @@ Finally, we run the pipeline and print the load info.

Let's run the pipeline:

-```bash
+```sh
python zendesk_verba.py
```

You should see the following output:

-```bash
+```sh
Pipeline zendesk_verba completed in 8.27 seconds
1 load package(s) were loaded to destination weaviate and into dataset None
The weaviate destination used <https://your-cluster.weaviate.network> location to store data
@@ -235,13 +234,13 @@ Verba is now populated with data from Zendesk Support. However there are a coupl

Run the following command:

-```bash
+```sh
verba init
```

You should see the following output:

-```bash
+```sh
===================== Creating Document and Chunk class =====================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
@@ -264,7 +263,7 @@ Document class already exists, do you want to overwrite it? (y/n): n

We're almost there! Let's start Verba:

-```bash
+```sh
verba start
```
