Example/Documentation unclear for low level reading of file size. #255

Open
heijligers opened this issue Dec 5, 2023 · 6 comments

@heijligers

I'm trying to use GPT-4 to implement a Python SMB crawler that has to connect over a VERY SLOW connection to a Synology NAS with MILLIONS of files. Luckily I only need a subset of the folders and of the file types.
Can someone help me get a basic version up and running? Both with various GPT tools and by reading the low-level source code myself, I haven't managed to get the following software design and reference implementation to work:

Prototype 3:

  • Samba Client Library: smbprotocol
  • Configuration Parser: PyYAML
  • Logging Framework: loguru
  • Threading/Concurrency Approach: single-threaded
  • Error Handling and Retry Logic: tenacity (for retry logic)

Yaml.conf:

top_folder_filter: P100*
file_copy_extention_filter:
  - .xml
  - .dat
  - .txt
  - .doc
  - .docx
local_directory: ../download

Intended pseudocode:

  • Initialize:

    • Load configuration from 'config.yaml'
    • Establish connection to Samba server using 'server_ip', 'server_user', 'server_password'
    • Initialize logging framework

  • Main Process:

    • Query the Samba share using 'top_folder_filter' pattern
    • For each folder returned from the query:
      • Call 'Recursive Folder Crawl' with folder as argument
  • Recursive Folder Crawl (folder):

    • Query the current folder with a pattern matching either files with an extension present in 'file_copy_extention_filter' or matching the name of a subfolder
    • For each file returned from the query:
      • Download file to 'local_directory', maintaining the folder structure found on the remote Samba share
      • Log download status, original filesize, creation date, and modification date
    • For each subfolder returned from the query:
      • Call 'Recursive Folder Crawl' with subfolder as argument
  • Error Handling:

    • If any error occurs during the process:
      • Log the error
      • If error is recoverable (e.g., temporary network issue), retry the operation using 'tenacity'
      • If error is not recoverable, terminate the process and log the final state
  • Finalize:

    • Close connection to Samba server
    • Log final state
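
The "maintaining the folder structure" step in the pseudocode could be sketched as below. The helper name `local_path_for` and the example paths are hypothetical; the function only does the path arithmetic and touches neither the network nor the disk.

```python
from pathlib import Path, PureWindowsPath


def local_path_for(remote_path: str, share_root: str, local_directory: str) -> Path:
    """Map a remote UNC path to the matching path below local_directory."""
    remote = PureWindowsPath(remote_path)
    root = PureWindowsPath(share_root)
    # Keep only the part below the share root; this raises ValueError
    # when remote_path does not live inside share_root.
    relative = remote.relative_to(root)
    return Path(local_directory).joinpath(*relative.parts)


if __name__ == "__main__":
    # Hypothetical server and share names, for illustration only.
    print(local_path_for(r"\\nas\share\P100a\sub\f.xml", r"\\nas\share", "../download"))
```

The caller would still need to create the parent directories (for example with `Path.mkdir(parents=True, exist_ok=True)`) before writing the file.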

Attempt 1:

import fnmatch
import logging
import os
import threading
import uuid

import yaml
from smbprotocol.connection import Connection
from smbprotocol.exceptions import SMBResponseException
from smbprotocol.open import Open, CreateDisposition, FileAttributes, CreateOptions, DirectoryAccessMask
from smbprotocol.session import Session
from smbprotocol.tree import TreeConnect
from tenacity import retry, stop_after_attempt, wait_fixed

# Configure logging
logging.basicConfig(level=logging.INFO)


def main():
    # Load configuration from YAML file
    with open("config.yaml", "r") as file:
        config = yaml.safe_load(file)

    # Samba client configuration
    server_ip = config['server_ip']
    username = config['server_user']
    password = config['server_password']
    share_name = config['share_name']
    top_folder_filter = config['top_folder_filter']
    file_copy_extention_filter = config['file_copy_extention_filter']
    try:
        guid = uuid.uuid4()
        connection = Connection(guid, server_ip)
        connection.connect()
        session = Session(connection, username, password)
        session.connect()

        # Ensure the share name is correctly formatted as '\\server\share' before passing it to TreeConnect
        formatted_share_name = rf"\\{server_ip}\{share_name}"
        logging.info(f"Formatted Share Name: {formatted_share_name}")

        tree = TreeConnect(session, formatted_share_name)

        try:
            tree.connect()
        except SMBResponseException as e:
            logging.error(f"Error connecting to share: {e}")
            raise

        # Retry strategy for file download
        @retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
        def download_file(tree, file_path, local_path):
            try:
                with Open(tree, file_path) as file:
                    file.read(local_path)
                    logging.info(f"Downloaded file: {file_path}")
                    track_matched_file(file_path)
            except Exception as e:
                logging.error(f"Error downloading file {file_path}: {e}")
                raise

        # Function to check file extension
        def should_copy_file(file_name):
            return any(fnmatch.fnmatch(file_name, '*' + ext) for ext in file_copy_extention_filter)

        # Function to crawl directories
        def crawl_directory(tree, path=""):
            try:
                with Open(tree, path, desired_access=DirectoryAccessMask.FILE_LIST_DIRECTORY) as dir:
                    for file_info in dir.query_directory("*"):
                        file_path = os.path.join(path, file_info['file_name'])
                        if file_info['file_attributes'] & FileAttributes.FILE_ATTRIBUTE_DIRECTORY:
                            if fnmatch.fnmatch(file_info['file_name'], top_folder_filter):
                                crawl_directory(tree, file_path)
                        elif should_copy_file(file_info['file_name']):
                            download_file(tree, file_path, f"local_directory/{file_info['file_name']}")
            except Exception as e:
                logging.error(f"Error crawling directory {path}: {e}")

        # Function to track matching files
        matched_files = []

        def track_matched_file(file_path):
            matched_files.append(file_path)
            logging.info(f"Tracking file: {file_path}")

        crawl_directory(tree)
    except Exception as e:
        logging.error(f"Error in main: {e}")


if __name__ == "__main__":
    main()

Attempt 2 (incomplete):

import sys
import uuid
from contextlib import contextmanager
from io import BytesIO

import yaml
from loguru import logger
from tenacity import retry, stop_after_attempt, wait_exponential
from smbprotocol.connection import Connection
from smbprotocol.open import (
    CreateDisposition, CreateOptions, DirectoryAccessMask, FileAttributes,
    FileInformationClass, FilePipePrinterAccessMask, ImpersonationLevel, Open,
    ShareAccess,
)
from smbprotocol.session import Session
from smbprotocol.tree import TreeConnect

@contextmanager
def smb_b_open(tree, file_path, mode='r', share='r', username=None, password=None, encrypt=True):
    """
    Functions similar to the builtin open() method where it will create an open handle to a file over SMB. This can be
    used to read and/or write data to the file using the methods exposed by the Open() class in smbprotocol. Read and
    write operations only support bytes and not text strings.

    :param tree: smbprotocol tree object
    :param file_path: The path of the file to open, relative to the root of the share
    :param mode: The mode in which the file is to be opened, can be set to one of the following;
        'r': Opens the file for reading (default)
        'w': Opens the file for writing, truncating first
        'x': Create a new file and open it for writing, fail if the file already exists
    :param share: The SMB sharing mode to set for the opened file handle, can be set to one or more of the following:
        'r': Allows other handles to read from the file (default)
        'w': Allows other handles to write to the file
        'd': Allows other handles to delete the file
    :param username: Optional username to use for authentication, required if Kerberos is not used.
    :param password: Optional password to use for authentication, required if Kerberos is not used.
    :param encrypt: Whether to use encryption or not, Must be set to False if using an older SMB Dialect.
    :return: The opened smbprotocol Open() obj that has a read, write, and flush functions.
    """
    if mode == 'r':
        create_disposition = CreateDisposition.FILE_OPEN
        access_mask = FilePipePrinterAccessMask.GENERIC_READ
    elif mode == 'w':
        create_disposition = CreateDisposition.FILE_OVERWRITE_IF
        access_mask = FilePipePrinterAccessMask.GENERIC_WRITE
    elif mode == 'x':
        create_disposition = CreateDisposition.FILE_CREATE
        access_mask = FilePipePrinterAccessMask.GENERIC_WRITE
    else:
        raise ValueError("Invalid mode value specified.")

    share_map = {
        'r': ShareAccess.FILE_SHARE_READ,
        'w': ShareAccess.FILE_SHARE_WRITE,
        'd': ShareAccess.FILE_SHARE_DELETE,
    }
    share_access = 0
    for s in share:
        share_access |= share_map[s]

    obj = Open(tree, file_path)
    obj.create(
        ImpersonationLevel.Impersonation,
        access_mask,
        FileAttributes.FILE_ATTRIBUTE_NORMAL,
        share_access,
        create_disposition,
        0,
    )

    # Hand the open handle to the caller and always close it afterwards.
    try:
        yield obj
    finally:
        obj.close()

class FileEntry(object):

    def __init__(self, path, file_directory_info):
        self.name = file_directory_info['file_name'].value.decode('utf-16-le')
        self.path = r"%s\%s" % (path, self.name)
        self.ctime = file_directory_info['creation_time'].value
        self.atime = file_directory_info['last_access_time'].value
        self.wtime = file_directory_info['last_write_time'].value
        self.size = file_directory_info['allocation_size'].value
        self.attributes = file_directory_info['file_attributes'].value

        self.is_archive = self._flag_set(FileAttributes.FILE_ATTRIBUTE_ARCHIVE)
        self.is_compressed = self._flag_set(FileAttributes.FILE_ATTRIBUTE_COMPRESSED)
        self.is_directory = self._flag_set(FileAttributes.FILE_ATTRIBUTE_DIRECTORY)
        self.is_hidden = self._flag_set(FileAttributes.FILE_ATTRIBUTE_HIDDEN)
        self.is_normal = self._flag_set(FileAttributes.FILE_ATTRIBUTE_NORMAL)
        self.is_readonly = self._flag_set(FileAttributes.FILE_ATTRIBUTE_READONLY)
        self.is_reparse_point = self._flag_set(FileAttributes.FILE_ATTRIBUTE_REPARSE_POINT)
        self.is_system = self._flag_set(FileAttributes.FILE_ATTRIBUTE_SYSTEM)
        self.is_temporary = self._flag_set(FileAttributes.FILE_ATTRIBUTE_TEMPORARY)

    def _flag_set(self, attribute):
        return self.attributes & attribute == attribute

# Define _listdir helper function for applying a filter pattern and recursion to listing the content of a samba share,
# specified by the tree variable
def _listdir(tree, path, pattern, recurse):
    full_path = tree.share_name
    if path != "":
        full_path += r"\%s" % path

    # We create a compound request that does the following;
    #     1. Opens a handle to the directory
    #     2. Runs a query on the directory to list all the files
    #     3. Closes the handle of the directory
    # This is done in a compound request so we send 1 packet instead of 3 at the expense of more complex code.
    directory = Open(tree, path)
    query = [
        directory.create(
            ImpersonationLevel.Impersonation,
            DirectoryAccessMask.FILE_LIST_DIRECTORY,
            FileAttributes.FILE_ATTRIBUTE_DIRECTORY,
            ShareAccess.FILE_SHARE_READ | ShareAccess.FILE_SHARE_WRITE,
            CreateDisposition.FILE_OPEN,
            CreateOptions.FILE_DIRECTORY_FILE,
            send=False
        ),
        directory.query_directory(
            pattern,
            FileInformationClass.FILE_DIRECTORY_INFORMATION,
            send=False
        ),
        directory.close(False, send=False)
    ]

    query_reqs = tree.session.connection.send_compound(
        [x[0] for x in query],
        tree.session.session_id,
        tree.tree_connect_id,
        related=True
    )

    # Process the result of the create and close request before parsing the files.
    query[0][1](query_reqs[0])
    query[2][1](query_reqs[2])

    # Parse the queried files and repeat if the entry is a directory and recurse=True. We ignore . and .. as they are
    # not directories inside the queried dir.
    entries = []
    ignore_entries = [".".encode('utf-16-le'), "..".encode('utf-16-le')]
    for file_entry in query[1][1](query_reqs[1]):
        if file_entry['file_name'].value in ignore_entries:
            continue

        fe = FileEntry(full_path, file_entry)
        entries.append(fe)

        if fe.is_directory and recurse:
            dir_path = r"%s\%s" % (path, fe.name) if path != "" else fe.name
            # Pass the pattern along so the recursive call matches the signature.
            entries += _listdir(tree, dir_path, pattern, recurse)

    return entries

def main1():
    # Load configuration
    with open('config.yaml', 'r') as file:
        config = yaml.safe_load(file)
    # Samba client configuration
    server_ip = config['server_ip']
    username = config['server_user']
    password = config['server_password']
    share_name = config['share_name']
    top_folder_filter = config['top_folder_filter']
    file_copy_extention_filter = config['file_copy_extention_filter']

    # Initialize logging
    logger.add("file.log")
    logger.add(sys.stderr, format="{time} {level} {message}", filter="my_module", level="INFO")

    # Establish connection to Samba server
    # Here we will use the Connection, Session, and TreeConnect classes from smbprotocol to establish a connection to the Samba server.
    # Initialize connection
    connection = Connection(uuid.uuid4(), config['server_ip'], 445)
    connection.connect()
    session = Session(connection, config['server_user'], config['server_password'])
    session.connect()
    tree = TreeConnect(session, rf"\\{config['server_ip']}\{config['share_name']}")
    tree.connect()

    # Query the Samba share for top level folders qualifying the top_folder_filter
    entries = _listdir(tree, "", top_folder_filter, False)
    # prepare downloading file
    create_disposition = CreateDisposition.FILE_OPEN
    access_mask = FilePipePrinterAccessMask.GENERIC_READ
    share_map = {
        'r': ShareAccess.FILE_SHARE_READ,
        'w': ShareAccess.FILE_SHARE_WRITE,
        'd': ShareAccess.FILE_SHARE_DELETE,
    }
    share_access = 0
    share = 'r'
    for s in share:
        share_access |= share_map[s]

    # For each folder returned from the query, call a recursive function to crawl the folder.
    # Here we will define a recursive function that takes a folder as an argument.
    # This function will query the current folder using the method _listdir.
    # For each file, it will download it using smb_b_open and log the download status.
    for entry in entries:
        subentries = _listdir(tree, entry.name, "*", True)
        for subentry in subentries:
            if subentry.name.split('.')[-1] in file_copy_extention_filter:
                obj = Open(tree, subentry.path)
                obj.create(
                    ImpersonationLevel.Impersonation,
                    access_mask,
                    FileAttributes.FILE_ATTRIBUTE_NORMAL,
                    share_access,
                    create_disposition,
                    0,
                )
                file_info = obj.query_info(FileInformationClass.FILE_STANDARD_INFORMATION)
                file_size = file_info['end_of_file'].get_value()
                file_contents = obj.read(0, file_size)
                with open(subentry.name, 'wb') as local_file:
                    local_file.write(file_contents)

    # If an error occurs, log the error and if it's recoverable, retry the operation using tenacity.
    # Here we will use the retry decorator from tenacity to automatically retry operations in case of recoverable errors. We can customize the retry logic by specifying the number of attempts, wait time, etc.

    # Finally, close the connection to the Samba server and log the final state.
    # Here we will use the disconnect method of the Connection class from smbprotocol to close the connection to the Samba server. We will also log the final state, which could include the number of files downloaded, the number of errors encountered, and the last directory or file that was processed.


if __name__ == "__main__":
    main1()


Software Design Specification for a Remote Samba Share Crawler

Overview

The Remote Samba Share Crawler is designed to connect to a Samba share, crawl through its directories and files, and download specified files to a local directory. It supports various features like recursive crawling, threading, logging, and error handling.

Functional Requirements

  1. Connection Management: Establish and manage a connection to a Samba share using server IP, user credentials, and share name.
  2. Directory Crawling: Recursively list directories and files in the Samba share, starting from a specified base directory.
  3. File Downloading: Download files from the Samba share to a local directory, with support for retries and throttling.
  4. Logging: Log various operations and errors for debugging and monitoring.
  5. State Management: Maintain and save the state of crawling and downloading operations, allowing resumption from the last state in case of interruption.
  6. Configuration Management: Load and use configuration from an external file, allowing easy modification of parameters.
  7. Error Handling: Handle and log errors, particularly in connection establishment, file listing, and file downloading.
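
The state management requirement (item 5) might be approached as in the following sketch. The class name `CrawlState` and the JSON file layout are assumptions made for illustration; nothing here is part of smbprotocol.

```python
import json
import os
import tempfile
from pathlib import Path


class CrawlState:
    """Remembers which remote paths were already downloaded (hypothetical helper)."""

    def __init__(self, state_file: str):
        self.state_file = Path(state_file)
        self.done = set()
        if self.state_file.exists():
            # Resume: load the paths recorded by an earlier run.
            self.done = set(json.loads(self.state_file.read_text(encoding="utf-8")))

    def is_done(self, remote_path: str) -> bool:
        return remote_path in self.done

    def mark_done(self, remote_path: str) -> None:
        self.done.add(remote_path)
        self._save()

    def _save(self) -> None:
        # Write to a temporary file in the same directory, then rename it over
        # the real file, so an interruption cannot leave a half-written state.
        fd, tmp_name = tempfile.mkstemp(dir=self.state_file.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(sorted(self.done), tmp)
        os.replace(tmp_name, self.state_file)
```

This version rewrites the whole file on every `mark_done` call, which is simple but probably too slow for millions of entries; an append-only log or an SQLite table would likely scale better.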

Non-functional Requirements

  1. Modularity: Code should be structured into distinct classes and functions for ease of maintenance and scalability.
  2. Performance: Efficient crawling and downloading, with the option to use threading to improve performance.
  3. Security: Secure handling of credentials and encryption of the connection where necessary.
  4. Flexibility: Ability to easily change the underlying Samba client library or logging framework.

Proposed Architecture

1. Classes and Modules

  • Crawler: Main class handling connection, crawling, downloading, and state management.
  • FileEntry: Class representing a file or directory in the Samba share.
  • Configuration loader module (e.g., using yaml or json).
  • Logging module (e.g., using Python's logging module or an alternative).

2. External Libraries

  • Samba client library (e.g., smbprotocol, pysmb, or an equivalent).
  • Configuration parser (e.g., PyYAML or json).
  • Logging framework (e.g., Python's built-in logging module or an equivalent like loguru).

3. Configuration

  • Use an external configuration file (YAML, JSON, or .env) for setting parameters like server IP, user credentials, share name, local directory, and other crawler settings.

4. Logging

  • Implement a logging system that can be easily replaced or modified. It should support different log levels and output to multiple destinations (console, file, etc.).

5. Error Handling and Retry Logic

  • Implement comprehensive error handling throughout the application.
  • Include retry logic for file downloads with exponential backoff.
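
Retry with exponential backoff can be illustrated without any third-party package. The sketch below is a standard-library stand-in, not tenacity itself (tenacity expresses the same idea with `stop_after_attempt` and `wait_exponential`). The function name `retry_with_backoff` and the default of retrying on `OSError` are assumptions; which smbprotocol errors count as recoverable is left to the caller.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=1.0, max_delay=30.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...)
    and is capped at max_delay. Exceptions not listed in retry_on are
    treated as unrecoverable and propagate immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests or dry runs.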

6. Threading and Concurrency

  • Optional use of threading for parallel processing of file downloading to enhance performance.
  • Thread safety considerations in state management and logging.
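
A minimal sketch of the optional threaded download step, with a lock around the shared result containers. `download_all` and `download_one` are hypothetical names; `download_one` stands for whatever function fetches a single file. Whether parallel downloads help on a very slow link, and whether one SMB session may be shared across threads, are open assumptions that would need checking.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def download_all(remote_paths, download_one, max_workers=4):
    """Run download_one(path) for every path on a small thread pool.

    Returns (succeeded, failed), where failed maps each path to the
    exception that download_one raised for it.
    """
    succeeded, failed = [], {}
    lock = threading.Lock()  # guards the two result containers

    def worker(path):
        try:
            download_one(path)
        except Exception as exc:  # record the failure, keep the other downloads going
            with lock:
                failed[path] = exc
        else:
            with lock:
                succeeded.append(path)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so every task has finished before returning.
        list(pool.map(worker, remote_paths))
    return succeeded, failed
```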
@jborean93
Owner

I would highly recommend you use the high level API, specifically smbclient.scandir, to enumerate the entries in a directory. There's not much you really gain by using the low level API here, as I've tried to make the high level one as efficient as possible for the operations needed. Even things like opening a file/directory can be done with the high level API, and the raw file open object can then be used for low level operations that might not be exposed in the high level API.

Ultimately I can't help you write your actual application. I can help if you have specific questions about smbprotocol, but that's about it. If you don't have a specific question or query then I'll close this issue tomorrow.
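
A crawl built on the high level API could look roughly like the sketch below. It assumes that the entries yielded by `smbclient.scandir` offer `name`, `path`, `is_dir()` and `stat()` in the same way `os.DirEntry` does; the function name `crawl` and the idea of passing `scandir` in as a parameter are choices made here so the logic can be tried against a local directory first.

```python
import fnmatch


def crawl(root, top_folder_filter, extensions, scandir):
    """Yield (path, size) for wanted files below matching top-level folders.

    scandir is injected: pass os.scandir to try this locally, or
    smbclient.scandir to run it against a share.
    """
    wanted = tuple(ext.lower() for ext in extensions)

    def walk(directory):
        for entry in scandir(directory):
            if entry.is_dir():
                yield from walk(entry.path)
            elif entry.name.lower().endswith(wanted):
                yield entry.path, entry.stat().st_size

    # Only top-level folders matching the filter are descended into.
    for top in scandir(root):
        if top.is_dir() and fnmatch.fnmatch(top.name, top_folder_filter):
            yield from walk(top.path)


# Possible use against a share (needs the smbprotocol package and a reachable
# server; server name, share name and credentials are placeholders):
#   import smbclient
#   smbclient.register_session("server", username="user", password="pass")
#   for path, size in crawl(r"\\server\share", "P100*", [".xml", ".dat"], smbclient.scandir):
#       print(path, size)
```

The size reported here comes from `stat()` on each entry, so no separate low level query is written by hand.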

@heijligers
Author

heijligers commented Dec 5, 2023 via email

@jborean93
Owner

Yep, the search_pattern kwarg

def scandir(path, search_pattern="*", **kwargs):
supports the normal server side filtering with * and ? that the underlying SMB server supports.
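
Because a wildcard pattern covers a single expression, filtering on several extensions server side presumably means one query per extension. The sketch below shows that loop; `list_by_extension` is a hypothetical helper, and `scandir` is passed in so that any callable accepting a `search_pattern` keyword (as in the signature quoted above) can be used.

```python
def list_by_extension(directory, extensions, scandir):
    """Ask the server for one wildcard pattern per extension.

    Returns the matching non-directory entries, without duplicates.
    """
    found = {}
    for ext in extensions:
        for entry in scandir(directory, search_pattern="*" + ext):
            if not entry.is_dir():
                # Keyed by path so overlapping patterns do not add an entry twice.
                found[entry.path] = entry
    return list(found.values())
```

Subfolders will generally not match a pattern such as `*.xml`, so finding them for recursion would still take a separate listing of the directory.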

@heijligers
Author

heijligers commented Dec 6, 2023 via email

@jborean93
Owner

The closest thing there is is the "Owner" of the file in the security descriptor. Unfortunately it's not reliable, as on Windows this could be the Administrators group or whatever is set in the user's group SIDs as the owner. Plus, getting that value will only give you the SID string in Python; you still need a separate process to translate that to an account name, which this library does not do.

@heijligers
Author

heijligers commented Dec 12, 2023 via email
