From 5212b7fa36d1dfd0a182e195dbe7376a75292c66 Mon Sep 17 00:00:00 2001
From: dat-a-man
Date: Fri, 13 Oct 2023 13:45:54 +0530
Subject: [PATCH 1/5] Added Inbox docs

---
 .../dlt-ecosystem/verified-sources/inbox.md   | 288 ++++++++++++++++++
 docs/website/sidebars.js                      |   1 +
 2 files changed, 289 insertions(+)
 create mode 100644 docs/website/docs/dlt-ecosystem/verified-sources/inbox.md

diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md
new file mode 100644
index 0000000000..bc91897cb3
--- /dev/null
+++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md
@@ -0,0 +1,288 @@
+---
+title: Inbox
+description: dlt verified source for Mail Inbox
+keywords: [inbox, inbox verified source, inbox mail, email]
+---
+
+# Inbox
+
+:::info Need help deploying these sources, or figuring out how to run them in your data stack?
+
+[Join our Slack community](https://dlthub-community.slack.com/join/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
+or [book a call](https://calendar.app.google/kiLhuMsWKpZUpfho6) with our support engineer Adrian.
+:::
+
+This source collects inbox emails, retrieves attachments, and stores relevant email data. It uses the imaplib library for IMAP interactions and the dlt library for data processing.
+
+This Inbox `dlt` verified source and
+[pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/inbox_pipeline.py)
+loads data using “Inbox” verified source to the destination of your choice.
+
+Sources and resources that can be loaded using this verified source are:
+
+| Name              | Description                                         |
+|-------------------|-----------------------------------------------------|
+| inbox_source      | Gathers inbox emails and saves attachments locally  |
+| get_messages_uids | Retrieves messages UUIDs from the mailbox           |
+| get_messages      | Retrieves emails from the mailbox using given UIDs  |
+| get_attachments   | Downloads attachments from emails using given UIDs  |
+
+## Setup Guide
+
+### Grab credentials
+
+1. For verified source configuration, you need:
+   - "host": IMAP server hostname (e.g., Gmail: imap.gmail.com, Outlook: imap.gmail.com).
+   - "email_account": Associated email account name.
+   - "password": APP password (for third-party clients) from the email provider.
+
+1. Host addresses and APP password procedures vary by provider and can be found via a quick Google search. For Google Mail's app password, read [here](https://support.google.com/mail/answer/185833?hl=en#:~:text=An%20app%20password%20is%20a,2%2DStep%20Verification%20turned%20on).
+
+1. However, this guide covers Gmail inbox configuration; similar steps apply to other providers.
+
+### Accessing Gmail Inbox
+
+1. IMAP server address: 'imap.gmail.com' for Gmail.
+1. Port: 993 (Internet Message Access Protocol over TLS/SSL).
+
+### Grab App password for Gmail
+
+1. An app password is a 16-character code allowing less secure apps/devices to access your Google Account, available only with 2-Step Verification activated.
+
+#### Steps to Create and Use App Passwords:
+
+1. Visit your Google Account > Security.
+1. Under "How you sign in to Google", enable 2-Step Verification.
+1. Choose App passwords at the bottom.
+1. Name the device for reference.
+1. Click Generate.
+1. Input the generated 16-character app password as prompted.
+1. Click Done.
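+
+Before wiring these credentials into `dlt`, you can check that the host, account, and app password work with a short `imaplib` snippet (the same standard-library module this source uses for IMAP interactions). This is an optional, illustrative check: the host and port below are the Gmail defaults described above, and the account and password are placeholders to replace with your own values.
+
+```python
+import imaplib
+
+host = "imap.gmail.com"            # IMAP server from the section above
+email_account = "you@example.com"  # placeholder, use your own account
+password = "abcdefghijklmnop"      # placeholder 16-character app password, spaces removed
+
+with imaplib.IMAP4_SSL(host, 993) as client:
+    client.login(email_account, password)
+    status, data = client.select("INBOX", readonly=True)
+    print(status, data)  # "OK" plus a message count means the credentials work
+```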
+
+Read more in [this article](https://pythoncircle.com/post/727/accessing-gmail-inbox-using-python-imaplib-module/) or [Google official documentation.](https://support.google.com/mail/answer/185833#zippy=%2Cwhy-you-may-need-an-app-password)
+
+### Initialize the verified source
+
+To get started with your data pipeline, follow these steps:
+
+1. Enter the following command:
+
+   ```bash
+   dlt init inbox duckdb
+   ```
+
+   [This command](../../reference/command-line-interface) will initialize
+   [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/inbox_pipeline.py)
+   with Inbox as the [source](../../general-usage/source) and
+   [duckdb](../destinations/duckdb.md) as the [destination](../destinations).
+
+1. If you'd like to use a different destination, simply replace `duckdb` with the name of your
+   preferred [destination](../destinations).
+
+1. After running this command, a new directory will be created with the necessary files and
+   configuration settings to get started.
+
+For more information, read the
+[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+
+### Add credentials
+
+1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can
+   securely store your access tokens and other sensitive information. It's important to handle this
+   file with care and keep it safe. Here's what the file looks like:
+
+   ```toml
+   # put your secret values and credentials here
+   # do not share this file and do not push it to github
+   [sources.inbox]
+   host = "Please set me up!" # The host address of the email service provider.
+   email_account = "Please set me up!" # Email account associated with the service.
+   password = "Please set me up!" # APP Password for the above email account.
+   ```
+
+1. Replace the host, email, and password values with the [previously copied ones](#grab-credentials)
+   to ensure secure access to your Inbox resources.
+   > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be "abcdefghijklmnop".
+
+1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to
+   add credentials for your chosen destination, ensuring proper routing of your data to the final
+   destination.
+
+## Run the pipeline
+
+1. Before running the pipeline, ensure that you have installed all the necessary dependencies by
+   running the command:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+   Prerequisites for fetching messages differ by provider. For Gmail:
+
+   - Python 3.x
+   - dlt library: pip install dlt
+   - PyPDF2: pip install PyPDF2
+   - Specific destinations, e.g., duckdb: pip install duckdb
+   - (Note: Confirm based on your service provider.)
+
+1. Once the pipeline has finished running, you can verify that everything loaded correctly by using
+   the following command:
+   ```bash
+   dlt pipeline <pipeline_name> show
+   ```
+   For example, the `pipeline_name` for the above pipeline example is `standard_inbox`; you may also
+   use any custom name instead.
+
+For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+
+## Sources and resources
+
+`dlt` works on the principle of [sources](../../general-usage/source) and
+[resources](../../general-usage/resource).
+
+### Source `inbox_source`
+
+This function fetches inbox emails, saves attachments locally, and returns uids, messages, and attachments as resources.
+ +```python +@dlt.source +def inbox_source( + host: str = dlt.secrets.value, + email_account: str = dlt.secrets.value, + password: str = dlt.secrets.value, + folder: str = "INBOX", + gmail_group: Optional[str] = GMAIL_GROUP, + start_date: pendulum.DateTime = DEFAULT_START_DATE, + filter_emails: Sequence[str] = None, + filter_by_mime_type: Sequence[str] = None, + chunksize: int = DEFAULT_CHUNK_SIZE, +) -> Sequence[DltResource]: +``` + +`host` : IMAP server hostname. Default: 'dlt.secrets.value'. + +`email_account`: Email login. Default: 'dlt.secrets.value'. + +`password`: Email App password. Default: 'dlt.secrets.value'. + +`folder`: Mailbox folder for collecting emails. Default: 'INBOX'. + +`gmail_group`: Google Group email for filtering. Default: settings 'GMAIL_GROUP'. + +`start_date`: Start date to collect emails. Default: `/inbox/settings.py` 'DEFAULT_START_DATE'. + +`filter_emails`:Email addresses for 'FROM' filtering. Default: settings 'FILTER_EMAILS'. + +`filter_by_mime_type`: MIME types for attachment filtering. Default: []. + +`chunksize`: UIDs collected per batch. Default: settings 'DEFAULT_CHUNK_SIZE'. + +### Resource `get_messages_uids` + +This function retrieves email message UIDs (Unique IDs) from the mailbox. + +```python + @dlt.resource(name="uids") + def get_messages_uids( + initial_message_num: Optional[ + dlt.sources.incremental[int] + ] = dlt.sources.incremental("message_uid", initial_value=1), + ) -> TDataItem: +``` + +`initial_message_num`: provides incremental loading on UID. + +### Resource `get_messages` + +This function reads mailbox emails using the given UIDs. + +```python +@dlt.transformer(name="messages", primary_key="message_uid") +def get_messages( + items: TDataItems, + include_body: bool = True, +) -> TDataItem: +``` + +`items` (TDataItems): An iterable of dictionaries with 'message_uid' for email UIDs. + +`include_body` (bool, optional): Includes email body if True. Default: True. + +Similar to the previous resources, resource `get_attachments` downloads email attachments using the provided UIDs. + +## Customization + +### Create your own pipeline + +If you wish to create your own pipelines, you can leverage source and resource methods from this +verified source. + +1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: + + ```python + pipeline = dlt.pipeline( + pipeline_name="standard_inbox", # Use a custom name if desired + destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) + dataset_name="standard_inbox_data" # Use a custom name if desired + full_refresh=True, + ) + ``` + To read more about pipeline configuration, please refer to our + [documentation](../../general-usage/pipeline). + +1. To load messages from "mycreditcard@bank.com" starting "2023-10-1": + + - Set DEFAULT_START_DATE = pendulum.datetime(2023, 10, 1) in "./inbox/settings.py". + - Use the following code: + ```python + # Retrieve messages from the specified email address. + messages = inbox_source(filter_emails=("mycreditcard@bank.com",)).messages + # Configure messages to exclude body and name the result "my_inbox". + messages = messages(include_body=False).with_name("my_inbox") + # Execute the pipeline and load messages to the "my_inbox" table. + load_info = pipeline.run(messages) + # Print the loading details. + print(load_info) + # Return the configured pipeline. + return pipeline + ``` +1. 
To load messages from multiple emails, including "community@dlthub.com": + + ```python + messages = inbox_source(filter_emails=("mycreditcard@bank.com", "community@dlthub.com.")).messages + + --- rest of the code + ``` + +1. In "inbox_pipeline.py", the "pdf_to_text" transformer extracts text from PDFs, treating each page as a separate data item. + Here is the code for the "pdf_to_text" transformer function: + ```python + @dlt.transformer(primary_key="file_hash", write_disposition="merge") + def pdf_to_text(file_items: Sequence[FileItemDict]) -> Iterator[Dict[str, Any]]: + # extract data from PDF page by page + for file_item in file_items: + with file_item.open() as file: + reader = PdfReader(file) + for page_no in range(len(reader.pages)): + # add page content to file item + page_item = {} + page_item["file_hash"] = file_item["file_hash"] + page_item["text"] = reader.pages[page_no].extract_text() + page_item["subject"] = file_item["message"]["Subject"] + page_item["page_id"] = file_item["file_name"] + "_" + str(page_no) + # TODO: copy more info from file_item + yield page_item + ``` +1. Using the "pdf_to_text" function to load parsed pdfs from mail to the database: + + ```python + filter_emails = ["mycreditcard@bank.com", "community@dlthub.com."] # Email senders + attachments = inbox_source( + filter_emails=filter_emails, filter_by_mime_type=["application/pdf"] + ).attachments + + # Process attachments through PDF parser and save to 'my_pages' table. + load_info = pipeline.run((attachments | pdf_to_text).with_name("my_pages")) + # Display loaded data details. + print(load_info) + return pipeline + ``` \ No newline at end of file diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 80e0fbd19d..fadb2397fb 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -44,6 +44,7 @@ const sidebars = { 'dlt-ecosystem/verified-sources/google_analytics', 'dlt-ecosystem/verified-sources/google_sheets', 'dlt-ecosystem/verified-sources/hubspot', + 'dlt-ecosystem/verified-sources/inbox', 'dlt-ecosystem/verified-sources/jira', 'dlt-ecosystem/verified-sources/matomo', 'dlt-ecosystem/verified-sources/mongodb', From 49c2d125f6db5e92a2ef3dbaa301ee00aeb5956b Mon Sep 17 00:00:00 2001 From: dat-a-man Date: Sun, 15 Oct 2023 10:54:43 +0530 Subject: [PATCH 2/5] Added Inbox docs --- docs/website/docs/dlt-ecosystem/verified-sources/inbox.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index bc91897cb3..d0c71be78f 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -178,7 +178,7 @@ def inbox_source( ### Resource `get_messages_uids` -This function retrieves email message UIDs (Unique IDs) from the mailbox. +This resource collects email message UIDs (Unique IDs) from the mailbox. ```python @dlt.resource(name="uids") @@ -193,7 +193,7 @@ This function retrieves email message UIDs (Unique IDs) from the mailbox. ### Resource `get_messages` -This function reads mailbox emails using the given UIDs. +This resource retrieves emails by UID (Unique IDs), yielding a dictionary with metadata like UID, ID, sender, subject, dates, content type, and body. ```python @dlt.transformer(name="messages", primary_key="message_uid") @@ -207,7 +207,7 @@ def get_messages( `include_body` (bool, optional): Includes email body if True. Default: True. 
-Similar to the previous resources, resource `get_attachments` downloads email attachments using the provided UIDs. +Similar to the previous resources, resource `get_attachments` extracts email attachments by UID from the IMAP server. It yields file items with attachments in the file_content field and the original email in the message field. ## Customization @@ -245,6 +245,7 @@ verified source. # Return the configured pipeline. return pipeline ``` + > Please refer to inbox_source() docstring for email filtering options by sender, date, or mime type. 1. To load messages from multiple emails, including "community@dlthub.com": ```python From f3e8dbb0eba6a4bdc514e6687b46f0ab0307ac33 Mon Sep 17 00:00:00 2001 From: AstrakhantsevaAA Date: Wed, 18 Oct 2023 15:28:20 +0200 Subject: [PATCH 3/5] update --- .../dlt-ecosystem/verified-sources/inbox.md | 105 +++++++++--------- 1 file changed, 51 insertions(+), 54 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index d0c71be78f..514eab03c8 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -20,20 +20,20 @@ loads data using “Inbox” verified source to the destination of your choice. Sources and resources that can be loaded using this verified source are: -| Name | Description | -|-------------------|----------------------------------------------------| -| inbox_source | Gathers inbox emails and saves attachments locally | -| get_messages_uids | Retrieves messages UUIDs from the mailbox | -| get_messages | Retrieves emails from the mailbox using given UIDs | -| get_attachments | Downloads attachments from emails using given UIDs | +| Name | Type | Description | +|-------------------|----------------------|----------------------------------------------------| +| inbox_source | source | Gathers inbox emails and saves attachments locally | +| get_messages_uids | resource | Retrieves messages UUIDs from the mailbox | +| get_messages | resource-transformer | Retrieves emails from the mailbox using given UIDs | +| get_attachments | resource-transformer | Downloads attachments from emails using given UIDs | ## Setup Guide ### Grab credentials 1. For verified source configuration, you need: - - "host": IMAP server hostname (e.g., Gmail: imap.gmail.com, Outlook: imap.gmail.com). - - "email_account": Associated email account name. + - "host": IMAP server hostname (e.g., Gmail: imap.gmail.com, Outlook: imap-mail.outlook.com). + - "email_account": Associated email account name (e.g. dlthub@dlthub.com). - "password": APP password (for third-party clients) from the email provider. 1. Host addresses and APP password procedures vary by provider and can be found via a quick Google search. For Google Mail's app password, read [here](https://support.google.com/mail/answer/185833?hl=en#:~:text=An%20app%20password%20is%20a,2%2DStep%20Verification%20turned%20on). @@ -116,13 +116,15 @@ For more information, read the pip install -r requirements.txt ``` - Prerequisites for fetching messages differ by provider. For Gmail: + Prerequisites for fetching messages differ by provider. - - Python 3.x - - dlt library: pip install dlt - - PyPDF2: pip install PyPDF2 - - Specific destinations, e.g., duckdb: pip install duckdb - - (Note: Confirm based on your service provider.) 
+ For Gmail: + - `pip install google-api-python-client>=2.86.0` + - `pip install google-auth-oauthlib>=1.0.0` + - `pip install google-auth-httplib2>=0.1.0` + + For pdf parsing: + - PyPDF2: `pip install PyPDF2` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: @@ -166,27 +168,27 @@ def inbox_source( `folder`: Mailbox folder for collecting emails. Default: 'INBOX'. -`gmail_group`: Google Group email for filtering. Default: settings 'GMAIL_GROUP'. +`gmail_group`: Google Group email for filtering. Default: `/inbox/settings.py` 'GMAIL_GROUP'. `start_date`: Start date to collect emails. Default: `/inbox/settings.py` 'DEFAULT_START_DATE'. -`filter_emails`:Email addresses for 'FROM' filtering. Default: settings 'FILTER_EMAILS'. +`filter_emails`:Email addresses for 'FROM' filtering. Default: `/inbox/settings.py` 'FILTER_EMAILS'. -`filter_by_mime_type`: MIME types for attachment filtering. Default: []. +`filter_by_mime_type`: MIME types for attachment filtering. Default: None. -`chunksize`: UIDs collected per batch. Default: settings 'DEFAULT_CHUNK_SIZE'. +`chunksize`: UIDs collected per batch. Default: `/inbox/settings.py` 'DEFAULT_CHUNK_SIZE'. ### Resource `get_messages_uids` This resource collects email message UIDs (Unique IDs) from the mailbox. ```python - @dlt.resource(name="uids") - def get_messages_uids( - initial_message_num: Optional[ - dlt.sources.incremental[int] - ] = dlt.sources.incremental("message_uid", initial_value=1), - ) -> TDataItem: +@dlt.resource(name="uids") +def get_messages_uids( + initial_message_num: Optional[ + dlt.sources.incremental[int] + ] = dlt.sources.incremental("message_uid", initial_value=1), +) -> TDataItem: ``` `initial_message_num`: provides incremental loading on UID. @@ -203,11 +205,27 @@ def get_messages( ) -> TDataItem: ``` -`items` (TDataItems): An iterable of dictionaries with 'message_uid' for email UIDs. +`items` (TDataItems): An iterable containing dictionaries with 'message_uid' representing the email message UIDs. + +`include_body` (bool): Includes email body if True. Default: True. + +### Resource `get_attachments_by_uid` + +Similar to the previous resources, resource `get_attachments` extracts email attachments by UID from the IMAP server. +It yields file items with attachments in the file_content field and the original email in the message field. -`include_body` (bool, optional): Includes email body if True. Default: True. +```python +@dlt.transformer( + name="attachments", + primary_key="file_hash", +) +def get_attachments( + items: TDataItems, +) -> Iterable[List[FileItem]]: +``` +`items` (TDataItems): An iterable containing dictionaries with 'message_uid' representing the email message UIDs. -Similar to the previous resources, resource `get_attachments` extracts email attachments by UID from the IMAP server. It yields file items with attachments in the file_content field and the original email in the message field. +We use the document hash as a primary key to avoid duplicating them in tables. ## Customization @@ -223,7 +241,6 @@ verified source. pipeline_name="standard_inbox", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) dataset_name="standard_inbox_data" # Use a custom name if desired - full_refresh=True, ) ``` To read more about pipeline configuration, please refer to our @@ -231,7 +248,7 @@ verified source. 1. 
To load messages from "mycreditcard@bank.com" starting "2023-10-1": - - Set DEFAULT_START_DATE = pendulum.datetime(2023, 10, 1) in "./inbox/settings.py". + - Set `DEFAULT_START_DATE = pendulum.datetime(2023, 10, 1)` in `./inbox/settings.py`. - Use the following code: ```python # Retrieve messages from the specified email address. @@ -242,38 +259,18 @@ verified source. load_info = pipeline.run(messages) # Print the loading details. print(load_info) - # Return the configured pipeline. - return pipeline ``` > Please refer to inbox_source() docstring for email filtering options by sender, date, or mime type. 1. To load messages from multiple emails, including "community@dlthub.com": ```python - messages = inbox_source(filter_emails=("mycreditcard@bank.com", "community@dlthub.com.")).messages - - --- rest of the code + messages = inbox_source( + filter_emails=("mycreditcard@bank.com", "community@dlthub.com.") + ).messages ``` -1. In "inbox_pipeline.py", the "pdf_to_text" transformer extracts text from PDFs, treating each page as a separate data item. - Here is the code for the "pdf_to_text" transformer function: - ```python - @dlt.transformer(primary_key="file_hash", write_disposition="merge") - def pdf_to_text(file_items: Sequence[FileItemDict]) -> Iterator[Dict[str, Any]]: - # extract data from PDF page by page - for file_item in file_items: - with file_item.open() as file: - reader = PdfReader(file) - for page_no in range(len(reader.pages)): - # add page content to file item - page_item = {} - page_item["file_hash"] = file_item["file_hash"] - page_item["text"] = reader.pages[page_no].extract_text() - page_item["subject"] = file_item["message"]["Subject"] - page_item["page_id"] = file_item["file_name"] + "_" + str(page_no) - # TODO: copy more info from file_item - yield page_item - ``` -1. Using the "pdf_to_text" function to load parsed pdfs from mail to the database: +1. In `inbox_pipeline.py`, the `pdf_to_text` transformer extracts text from PDFs, treating each page as a separate data item. + Using the `pdf_to_text` function to load parsed pdfs from mail to the database: ```python filter_emails = ["mycreditcard@bank.com", "community@dlthub.com."] # Email senders From 77b892405f24fc9771605da43f4b806981848872 Mon Sep 17 00:00:00 2001 From: AstrakhantsevaAA Date: Wed, 18 Oct 2023 20:28:38 +0200 Subject: [PATCH 4/5] update --- docs/website/docs/dlt-ecosystem/verified-sources/inbox.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index 514eab03c8..8cc395dac5 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -282,5 +282,4 @@ verified source. load_info = pipeline.run((attachments | pdf_to_text).with_name("my_pages")) # Display loaded data details. 
print(load_info) - return pipeline ``` \ No newline at end of file From 26cc4edcb24bd07a8a576bca02a9f2dedac9671d Mon Sep 17 00:00:00 2001 From: AstrakhantsevaAA Date: Wed, 18 Oct 2023 20:35:51 +0200 Subject: [PATCH 5/5] update --- docs/website/docs/dlt-ecosystem/verified-sources/inbox.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index 8cc395dac5..5de31ce086 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -205,9 +205,9 @@ def get_messages( ) -> TDataItem: ``` -`items` (TDataItems): An iterable containing dictionaries with 'message_uid' representing the email message UIDs. +`items`: An iterable containing dictionaries with 'message_uid' representing the email message UIDs. -`include_body` (bool): Includes email body if True. Default: True. +`include_body`: Includes email body if True. Default: True. ### Resource `get_attachments_by_uid` @@ -223,7 +223,7 @@ def get_attachments( items: TDataItems, ) -> Iterable[List[FileItem]]: ``` -`items` (TDataItems): An iterable containing dictionaries with 'message_uid' representing the email message UIDs. +`items`: An iterable containing dictionaries with 'message_uid' representing the email message UIDs. We use the document hash as a primary key to avoid duplicating them in tables.
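+
+The three resources work together: `uids` feeds both `messages` and `attachments`, the incremental
+`message_uid` cursor means repeated runs only fetch mail that arrived since the last load, and the
+`file_hash` primary key avoids duplicating re-sent attachments. A minimal sketch of that behavior,
+assuming the `inbox` folder created by `dlt init inbox duckdb` and the pipeline settings used elsewhere
+on this page:
+
+```python
+import dlt
+from inbox import inbox_source  # the verified-source package created by `dlt init inbox duckdb`
+
+pipeline = dlt.pipeline(
+    pipeline_name="standard_inbox",
+    destination="duckdb",
+    dataset_name="standard_inbox_data",
+)
+
+# First run: loads all e-mails from the configured start date onwards.
+print(pipeline.run(inbox_source().messages))
+
+# Later runs: only e-mails with UIDs above the last loaded one are fetched,
+# because the "uids" resource tracks `message_uid` incrementally.
+print(pipeline.run(inbox_source().messages))
+```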