Skip to content

Jobs (Functions) (Syntax and Usage)

cybersader edited this page Jun 7, 2023 · 7 revisions

Job: search_and_flatten_csv

Overview

The search_and_flatten_csv function is a type of job that allows you to flatten JSON data from files into a CSV format based on the search configurations defined. The purpose of the search_and_flatten_csv function is to simplify this process of JSON to CSV while also having low memory and the ability to handle files that are HUGE (even 100GB) by using a "streaming" approach. This allows you to analyze and process your JSON data more effectively, especially when dealing with large datasets that can't easily be explored in local tools.

JSON Config File Syntax

{
    "jobs": [
        {
            "type": "search_and_flatten_csv",
            "name": "search",
            "search_config_path": "searches.json",
            "searchconfigs": {
                "search_and_flatten1": "example1.json",
                "search_and_flatten2": "example2.json"
            },
            "delimiter": ",",
            "mode": "normal",
            "num_test_rows": 100,
            "verbose": false
        },
        ...more jobs here...
    ]
}

Configuration Parameters

Below is an explanation of the configuration parameters in the jobs configuration file:

  • type: The type of job. For this function, it should be "search_and_flatten_csv".
  • name: The name of the job.
  • search_config_path: The path to the JSON file that contains the search configuration.
  • searchconfigs: An object containing pairs of search configuration names and corresponding JSON file paths. These files contain the specific search criteria for each job.
  • delimiter: The delimiter used in the resulting CSV file. Typically this is a comma (,) but can be set to other characters as required.
  • mode: The running mode. Can be "normal" or "test".
  • num_test_rows: The number of test rows to be generated when the mode is "test".
  • verbose: A boolean value indicating whether to log additional information during the execution. Set to true to enable, false to disable.

Search Configuration

The search configuration is defined in a separate JSON file as specified in search_config_path. This JSON file contains the keys to be flattened from the JSON data.

Below are the properties used in the search configuration:

  • root_key: The root key from where the search should begin in the JSON structure. If this is an empty string, the search begins from the top level. If the input JSON is one single object, this will result in one row in a CSV. If the root key points to an array of objects, then the result will be multiple rows in the outputted CSV.
  • search_config: An array of keys or a string "*" indicating that all keys should be included. This can also be an object that defines specific properties for each key.
  • similarity_threshold: A decimal value ranging from 0 to 1 indicating the threshold of similarity when comparing keys.
  • array_handling: Defines how array data should be handled. The "stringify" option converts the entire array into a single string, while "explode" treats each element in the array as a separate entity, and "vertical" makes new columns for up to the max # of items in that array over the whole JSON file.
  • object_handling: Defines how object data should be handled. The "stringify" option converts the applicable object or value into a single string, while "explode" flattens the object into all of its children keys (using dot notation - "parent_key.child.grandchild").
  • allow_dot_notation: A boolean indicating whether dot notation should be use to find keys. If true, keys can be accessed as "object.key". If false, the program will go slower, but it can utilize similarity_threshold to use "jaccard index" to find keys that approximately match.
  • max_string_length: The maximum string length allowed for a value. If a string exceeds this length, it will be handled according to the long_string_handling option.
  • long_string_handling: Defines how strings exceeding max_string_length should be handled. "truncate" will cut off the string at the maximum length, while "split" will break the string into multiple keys.

Example of searches.json

look at search_config options (below) for configuring the search in detail

{
    "search_and_flatten1":{
        "root_key":"",
        "search_config":[     # list search
            "account_name",
            "asset_unique_id",
            ...
        ],
        ...
    },
    "search_and_flatten2":{
        "root_key":"",
        "search_config":"*",  # wildcard search
        ...
    },
    "search_and_flatten3":{
        "root_key":"",
        "search_config":      # granular search
            "key1": {
                "similarity_threshold": 1.0,
                "array_handling": "stringify",
                "object_handling": "stringify",
                "allow_dot_notation": true
            },
            "key2": {
                "similarity_threshold": 0.9,
                "array_handling": "stringify",
                "object_handling": "recurse",
                "allow_dot_notation": false
            },
        ...
    },
    ...
}

In this example, 3 separate configurations for search_config have been chosen: one that "tabulizes" all of the keys in the JSON file, and the other which does it with a select list of keys.

search_config options

This is the most important parameter for search configuration!

The search_config parameter in the search configuration can take a wildcard ("*"), list, or granular (dict) search options.

Wildcard Search

The wildcard search configuration uses an asterisk ("*") to expand all keys in the JSON depending on the base level options. This is a quick way to flatten all data, but is not recommended for large JSON files with many keys due to performance considerations and issues that still haven't been solved with the "DynamicDictWriter" function and "searchAndFlatten".

{
    "all_the_data": {
        "root_key": "",
        "search_config": "*",
        "similarity_threshold": 1.0,
        "array_handling": "stringify",
        "object_handling": "recurse",
        "allow_dot_notation": true,
        "max_string_length": 32750,
        "long_string_handling": "truncate"
    }
}

List Search

The list search configuration allows you to apply base level options to a specific list of keys. This is useful when you want to flatten a subset of the data.

{
    "specific_keys_only": {
        "root_key": "",
        "search_config": [
            "key1",
            "key2",
            "key3",
            "key4.subkey",
            "key5.subkey.subsubkey"
        ],
        "similarity_threshold": 1.0,
        "array_handling": "stringify",
        "object_handling": "stringify",
        "allow_dot_notation": true,
        "max_string_length": 32750,
        "long_string_handling": "truncate"
    }
}

Granular Search

The granular search configuration applies specific options to each key that is searched for. This provides the most flexibility as you can customize how each key is handled.

{
    "granular_control": {
        "root_key": "",
        "search_config": {
            "key1": {
                "similarity_threshold": 1.0,
                "array_handling": "stringify",
                "object_handling": "stringify",
                "allow_dot_notation": true
            },
            "key2": {
                "similarity_threshold": 0.9,
                "array_handling": "stringify",
                "object_handling": "recurse",
                "allow_dot_notation": false
            }
        },
        "similarity_threshold": 1.0,
        "array_handling": "stringify",
        "object_handling": "recurse",
        "allow_dot_notation": true,
        "max_string_length": 32750,
        "long_string_handling": "truncate"
    }
}

In the granular search, you can still specify the base level options, which will act as the default for all keys. However, the specific options within each key will override these base level options.