Move Cloud tutorials to Guide #70

Merged
merged 3 commits into from
May 2, 2024
10 changes: 8 additions & 2 deletions docs/domain/document/index.md
@@ -1,5 +1,4 @@
(document)=
(object)=
# Document Store

Learn how to efficiently store JSON or other structured data, also nested, and
@@ -9,8 +8,15 @@ Storing documents in CrateDB provides the same development convenience like the
document-oriented storage layer of Lotus Notes / Domino, CouchDB, MongoDB, and
PostgreSQL's `JSON(B)` types.

- [](inv:cloud#object)
- [](#object-marketing)
- [Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]


[Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]: https://youtu.be/S_RHmdz2IQM?feature=shared

```{toctree}
:maxdepth: 1
:hidden:

object
```
128 changes: 128 additions & 0 deletions docs/domain/document/object.md
@@ -0,0 +1,128 @@
(object-marketing)=

# Objects: Analyzing Marketing Data

Marketers often need to handle multi-structured data from different platforms.
CrateDB's dynamic `OBJECT` data type allows us to store and analyze this complex,
nested data efficiently. In this tutorial, we'll explore how to leverage this
feature in marketing data analysis, along with the use of generated columns to
parse and manage URLs.

Consider marketing data that captures details of various campaigns.

:::{code} json
{
"campaign_id": "c123",
"source": "Google Ads",
"metrics": {
"clicks": 500,
"impressions": 10000,
"conversion_rate": 0.05
},
"landing_page_url": "https://example.com/products?utm_source=google"
}
:::

To begin, let's create the schema for this dataset.

## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the marketing data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command:

:::{code} sql
CREATE TABLE marketing_data (
campaign_id TEXT PRIMARY KEY,
source TEXT,
metrics OBJECT(DYNAMIC) AS (
clicks INTEGER,
impressions INTEGER,
conversion_rate DOUBLE PRECISION
),
landing_page_url TEXT,
url_parts GENERATED ALWAYS AS parse_url(landing_page_url)
);
:::

Let's highlight two features in this table definition:

:metrics: An `OBJECT` column featuring a dynamic structure for
performing flexible queries on its nested attributes like
clicks, impressions, and conversion rate.
:url_parts: A generated column that parses the URL in the
`landing_page_url` column into its components. This makes it convenient
to query specific parts of the URL later on.

The table is designed to accommodate both fixed and dynamic attributes,
providing a robust and flexible structure for storing your marketing data.
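
As a minimal sketch of what the dynamic column policy means in practice, the
following statements insert a hypothetical campaign whose `metrics` object
carries an extra `bounce_rate` attribute that is not declared in the table
definition. The campaign values and the `bounce_rate` attribute are
illustrative assumptions, not part of the tutorial dataset. With
`OBJECT(DYNAMIC)`, CrateDB should add the new attribute to the schema on
insert, so it can be queried like the declared ones.

:::{code} sql
-- Hypothetical row: `bounce_rate` is not declared in the table definition.
INSERT INTO marketing_data (campaign_id, source, metrics, landing_page_url)
VALUES (
  'c999',
  'Newsletter',
  {clicks = 120, impressions = 3000, conversion_rate = 0.04, bounce_rate = 0.35},
  'https://example.com/offers?utm_source=newsletter'
);

-- Make the new row visible to subsequent queries.
REFRESH TABLE marketing_data;

-- The dynamically added attribute is accessible with bracket notation.
SELECT metrics['bounce_rate']
FROM marketing_data
WHERE campaign_id = 'c999';
:::

If you try this, consider deleting the row again with
`DELETE FROM marketing_data WHERE campaign_id = 'c999';` so that later results
reflect only the imported dataset.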


## Inserting Data

Now, insert the data using the `COPY FROM` SQL statement.

:::{code} sql
COPY marketing_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_marketing.json.gz'
WITH (format = 'json', compression='gzip');
:::
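
To check that the import worked, one option is to refresh the table and count
its rows. This is a sketch rather than a required step, and the expected row
count depends on the dataset, so it is not stated here.

:::{code} sql
-- Make the imported rows visible to subsequent queries.
REFRESH TABLE marketing_data;

-- Count the imported records.
SELECT COUNT(*)
FROM marketing_data;
:::

Appending `RETURN SUMMARY` to the `COPY FROM` statement is another way to see
how many records were imported or rejected.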

## Analyzing Data

Start with a basic `SELECT` statement on the `metrics` column, and limit the
output to 10 records to quickly explore a small sample of the data.

:::{code} sql
SELECT metrics
FROM marketing_data
LIMIT 10;
:::

You can see that the `metrics` column returns an object, rendered as JSON.
If you just want to return a single property of this object, you can adjust the
query slightly by adding the property to the selection using bracket notation.

:::{code} sql
SELECT metrics['clicks']
FROM marketing_data
LIMIT 10;
:::

It's helpful to select individual properties from a nested object, but what if
you also want to filter results based on these properties? For instance, to find
`campaign_id` and `source` where `conversion_rate` exceeds `0.09`, employ
the same bracket notation for filtering as well.

:::{code} sql
SELECT campaign_id, source
FROM marketing_data
WHERE metrics['conversion_rate'] > 0.09
LIMIT 50;
:::

This allows you to narrow down the query results while still leveraging CrateDB's
ability to query nested objects effectively.
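
Nested properties can also be used in aggregations. The following sketch,
which is not part of the original tutorial, groups campaigns by `source` and
summarizes two of the declared `metrics` attributes. The column aliases are
chosen here for illustration.

:::{code} sql
-- Average conversion rate and total clicks per traffic source.
SELECT
  source,
  AVG(metrics['conversion_rate']) AS avg_conversion_rate,
  SUM(metrics['clicks']) AS total_clicks
FROM marketing_data
GROUP BY source
ORDER BY avg_conversion_rate DESC;
:::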

Finally, let's explore data aggregation based on UTM source parameters. The
`url_parts` generated column, which is populated using the `parse_url()`
function, automatically splits the URL into its constituent parts upon data
insertion.
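
To get a feel for what the generated column contains, you can call
`parse_url()` directly on a literal. The sketch below uses the example URL
from the beginning of this tutorial. The exact shape of the returned object is
best checked against your CrateDB version, so no output is shown here.

:::{code} sql
-- Inspect the object that parse_url() produces for a single URL.
SELECT parse_url('https://example.com/products?utm_source=google') AS url_parts;
:::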

To analyze the UTM source, you can directly query these parsed parameters. The
goal is to count the occurrences of each UTM source and sort them in descending
order. This lets you easily gauge marketing effectiveness for different sources,
all while taking advantage of CrateDB's powerful generated columns feature.

:::{code} sql
SELECT
url_parts['parameters']['utm_source'] AS utm_source,
COUNT(*)
FROM marketing_data
GROUP BY 1
ORDER BY 2 DESC;
:::

In this tutorial, we explored the versatility and power of CrateDB's dynamic
`OBJECT` data type for handling complex, nested marketing data.
121 changes: 111 additions & 10 deletions docs/domain/search/index.md
@@ -1,17 +1,118 @@
(fts)=
(full-text)=
# Full-Text Search

Learn how to set up your database for full-text search, how to create the
relevant indices, and how to query your text data efficiently. A must-read
for anyone looking to make sense of large volumes of unstructured text data.
# Full-Text: Exploring the Netflix Catalog

- [](inv:cloud#full-text)


:::{note}
CrateDB is an exceptional choice for handling complex queries and large-scale
data sets. One of its standout features are its full-text search capabilities,
built on top of the powerful Lucene library. This makes it a great fit for
data sets. One of its standout features is its full-text search capabilities,
using the BM25 ranking algorithm for information retrieval, built on top of
the powerful Lucene indexing library. This makes CrateDB an excellent fit for
organizing, searching, and analyzing extensive datasets.

In this tutorial, we will explore how to manage a dataset of Netflix titles,
making use of CrateDB Cloud's full-text search capabilities.
Each entry in our imaginary dataset will have the following attributes:

:show_id: A unique identifier for each show or movie.
:type: Specifies whether the title is a movie, TV show, or another format.
:title: The title of the movie or show.
:director: The name of the director.
:cast: An array listing the cast members.
:country: The country where the title was produced.
:date_added: A timestamp indicating when the title was added to the catalog.
:release_year: The year the title was released.
:rating: The content rating (e.g., PG, R, etc.).
:duration: The duration of the title in minutes or seasons.
:listed_in: An array containing genres that the title falls under.
:description: A textual description of the title, indexed using full-text search.

To begin, let's create the schema for this dataset.


## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command.

Importantly, you will also take advantage
of CrateDB's full-text search capabilities by setting up a full-text index on
the description column. This will enable you to perform complex textual queries
later on.

:::{code} sql
CREATE TABLE "netflix_catalog" (
"show_id" TEXT PRIMARY KEY,
"type" TEXT,
"title" TEXT,
"director" TEXT,
"cast" ARRAY(TEXT),
"country" TEXT,
"date_added" TIMESTAMP,
"release_year" TEXT,
"rating" TEXT,
"duration" TEXT,
"listed_in" ARRAY(TEXT),
"description" TEXT INDEX using fulltext
);
:::

Run the above SQL command in CrateDB to set up your table. With the table ready,
you’re now set to insert the dataset.

## Inserting Data

Now, insert data into the table you just created, by using the `COPY FROM`
SQL statement.

:::{code} sql
COPY netflix_catalog
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_netflix.json.gz'
WITH (format = 'json', compression='gzip');
:::

Run the above SQL command in CrateDB to import the dataset. After the command
finishes, you are ready to start querying the dataset.

## Using Full-text Search

Start with a basic `SELECT` statement on all columns, and limit the output to
10 records to quickly explore a small sample of the data.

:::{code} sql
SELECT *
FROM netflix_catalog
LIMIT 10;
:::

CrateDB Cloud’s full-text search can be leveraged to find specific entries based
on text matching. In this query, you are using the `MATCH` function on the
`description` field to find all movies or TV shows that contain the word "love".
The results can be sorted by relevance score by using the synthetic `_score` column.

:::{code} sql
SELECT title, description
FROM netflix_catalog
WHERE MATCH(description, 'love')
ORDER BY _score DESC
LIMIT 10;
:::
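
To see how the relevance ranking behaves, it can help to return the `_score`
column itself and to combine `MATCH` with an ordinary predicate. The following
sketch assumes the dataset stores the value `'Movie'` in the `type` column,
which is an assumption about the data rather than something stated above.

:::{code} sql
-- Show the relevance score and restrict the full-text matches to movies.
SELECT title, _score
FROM netflix_catalog
WHERE MATCH(description, 'love')
  AND "type" = 'Movie'
ORDER BY _score DESC
LIMIT 10;
:::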

While full-text search is incredibly powerful, you can still perform more
traditional types of queries. For example, to find all titles directed by
"Kirsten Johnson", and sort them by release year, you can use:

:::{code} sql
SELECT title, release_year
FROM netflix_catalog
WHERE director = 'Kirsten Johnson'
ORDER BY release_year DESC;
:::

This query uses the conventional `WHERE` clause to find movies directed by
Kirsten Johnson, and the `ORDER BY` clause to sort them by their release year
in descending order.
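
The array columns can be filtered in a similar way. As a sketch, the following
query uses the `ANY` operator to find titles whose `listed_in` array contains
a given genre. The genre value `'Comedies'` is an assumption about the
dataset, so adjust it to a value that actually occurs in your data.

:::{code} sql
-- Find titles that list a specific genre in their `listed_in` array.
SELECT title, listed_in
FROM netflix_catalog
WHERE 'Comedies' = ANY(listed_in)
LIMIT 10;
:::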

Through these examples, you can see that CrateDB Cloud offers you a wide array
of querying possibilities, from basic SQL queries to advanced full-text
searches, making it a versatile choice for managing and querying your datasets.
6 changes: 4 additions & 2 deletions docs/domain/timeseries/index.md
@@ -6,15 +6,17 @@ Learn how to optimally use CrateDB for time series use-cases.
- [](#timeseries-basics)
- [](#timeseries-normalize)
- [Financial data collection and processing using pandas]
- [](inv:cloud#time-series)
- [](inv:cloud#time-series-advanced)
- [](#time-series-guide)
- [](#time-series-advanced-guide)
- [Time-series data: From raw data to fast analysis in only three steps]

:::{toctree}
:hidden:

generate/index
normalize-intervals
time-series
time-series-advanced
:::

[Financial data collection and processing using pandas]: https://community.cratedb.com/t/automating-financial-data-collection-and-storage-in-cratedb-with-python-and-pandas-2-0-0/916