add other AI providers (#53)
* add other AI providers #34

* fix loglevel

* use DEBUG log for exceptions

* change info to debug

* add --ai_settings_extractions and --ai_settings_relationships #55

* fix build error

* clearer ai_setting error

* adding env markdown

* Update README.md

* updating tests to match new AI modes

* updating aimodel settings

* cleaning up doc files

* fix relationship_mode not getting considered before throwing ai_settings_relationships is required

* updating gemini

* fixing docs for multiple ai extraction providers

* fixing legacy paths in docs

---------

Co-authored-by: David G <[email protected]>
fqrious and himynamesdave authored Nov 11, 2024
1 parent 6095983 commit 31f7e3d
Showing 20 changed files with 2,152 additions and 1,805 deletions.
37 changes: 37 additions & 0 deletions .env.markdown
@@ -0,0 +1,37 @@
# Environment file info

If you're running in production, you should set these values securely.

However, if you just want to experiment, the following values will get you started:

## AI Settings

* `INPUT_TOKEN_LIMIT`: `15000`
	* (REQUIRED IF USING AI MODES) Ensure the input/output token count meets requirements and is supported by the model selected. Files containing more tokens than the specified limit will not be processed.
* `TEMPERATURE`: `0.0`
* The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness in responses.
* `OPENAI_API_KEY`: YOUR_API_KEY
* (REQUIRED IF USING OPENAI MODELS IN AI MODES) get it from https://platform.openai.com/api-keys
* `ANTHROPIC_API_KEY`: YOUR_API_KEY
* (REQUIRED IF USING ANTHROPIC MODELS IN AI MODES) get it from https://console.anthropic.com/settings/keys
* `GOOGLE_API_KEY`:
* (REQUIRED IF USING GOOGLE GEMINI MODELS IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)
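Each AI provider above only needs its key when models from that provider are selected. A minimal sketch of that check, assuming the `.env` values have been loaded into the process environment (e.g. via python-dotenv) — the helper names `required_key_for` and `check_provider` are illustrative, not part of txt2stix:

```python
# Sketch: map each AI provider to the env var it requires, and fail
# early if the chosen provider's key is missing. Illustrative only.
import os

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def required_key_for(provider: str) -> str:
    """Return the env var a provider needs, or raise for unknown providers."""
    try:
        return PROVIDER_KEYS[provider]
    except KeyError:
        raise ValueError(f"unknown AI provider: {provider}") from None

def check_provider(provider: str) -> None:
    """Raise if the key required by `provider` is unset or empty."""
    key = required_key_for(provider)
    if not os.environ.get(key):
        raise RuntimeError(f"{key} must be set to use {provider} models")
```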

## BIN List

* `BIN_LIST_API_KEY`: BLANK
	* (OPTIONAL) used to enrich credit card extractions. You can get an API key here: https://rapidapi.com/trade-expanding-llc-trade-expanding-llc-default/api/bin-ip-checker

## CTIBUTLER

txt2stix requires [ctibutler](https://github.com/muchdogesec/ctibutler) to look up ATT&CK, CAPEC, CWE, ATLAS, and locations in the input.

* `CTIBUTLER_HOST`: `'http://host.docker.internal:8006'`
* If you are running CTI Butler locally, be sure to set `'http://host.docker.internal:8006'` in the `.env` file otherwise you will run into networking errors.

## VULMATCH FOR CVE AND CPE LOOKUPS

txt2stix requires [vulmatch](https://github.com/muchdogesec/vulmatch) to look up CVEs and CPEs in the input.

* `VULMATCH_HOST`: `'http://host.docker.internal:8005'`
* If you are running vulmatch locally, be sure to set `'http://host.docker.internal:8005'` in the `.env` file otherwise you will run into networking errors.
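A small sketch of how a host value like the ones above might be read and normalised before API paths are appended — `service_host` is an illustrative helper, not actual txt2stix code:

```python
# Sketch: read a service host from the environment, falling back to a
# default, and strip any trailing slash so paths join cleanly.
import os

def service_host(var: str, default: str) -> str:
    """Return the configured host without a trailing slash."""
    host = os.environ.get(var) or default
    return host.rstrip("/")

# e.g. when running CTI Butler locally from inside Docker:
# service_host("CTIBUTLER_HOST", "http://host.docker.internal:8006")
```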
22 changes: 12 additions & 10 deletions .env.sample
@@ -1,10 +1,12 @@
INPUT_TOKEN_LIMIT=50 # [REQUIRED] for AI modes. keep in mind the token limit for selected model (which includes both input AND output tokens). For example, if your input limit is 50,000 characters, this could incur up to 25,000 tokens. Assuming your selected model allows for 64,000 tokens, you will therefore be able to obtain an output of over 39,000 tokens.
OPENAI_API_KEY= # [REQUIRED IF USING AI MODES] get it from https://platform.openai.com/api-keys
OPENAI_MODEL=gpt-4 # [REQUIRED IF USING AI MODES] choose an OpenAI model of your choice. Ensure the input/output token count meets requirements (and adjust INPUT_TOKEN_LIMIT accordingly). List of models here: https://platform.openai.com/docs/models
BIN_LIST_API_KEY= #[OPTIONAL] needed for extracting credit card information
## CTIBUTLER FOR ATT&CK, CAPEC, AND CWE LOOKUPS
CTIBUTLER_HOST= # [REQUIRED] e.g. http://localhost:8006/
CTIBUTLER_APIKEY= #[OPTIONAL] if using https://app.ctibutler.com
## VULMATCH FOR CVE AD CPE LOOKUPS
VULMATCH_HOST= # [REQUIRED] e.g. http://localhost:8005/
VULMATCH_APIKEY= #[OPTIONAL] if using https://app.vulmatch.com
## AI Settings
INPUT_TOKEN_LIMIT=
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
TEMPERATURE=
## BIN LIST
BIN_LIST_API_KEY=
## CTIBUTLER
CTIBUTLER_HOST=
## VULMATCH
VULMATCH_HOST=
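The `INPUT_TOKEN_LIMIT` budgeting described in the sample comment (50,000 input characters incurring up to 25,000 tokens, leaving 39,000 of a 64,000-token model limit for output) can be sketched as simple arithmetic, assuming the rough ~2-characters-per-token rule of thumb that comment uses:

```python
# Sketch of the token budgeting for INPUT_TOKEN_LIMIT; the 2 chars/token
# ratio is the rule of thumb from the sample comment, not an exact tokenizer.
CHARS_PER_TOKEN = 2

def output_token_budget(input_chars: int, model_token_limit: int) -> int:
    """Tokens left for the model's output after budgeting for the input."""
    input_tokens = input_chars // CHARS_PER_TOKEN
    return model_token_limit - input_tokens
```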
6 changes: 4 additions & 2 deletions Pipfile
@@ -4,8 +4,6 @@ verify_ssl = true
name = "pypi"

[packages]
# langchain = {extras = ["openai"], version = "*"}
# langchain_openai = "*"
openai = "*"
stix2 = "*"
phonenumbers = "*"
@@ -20,6 +18,10 @@ python-dotenv = "*"
llama-index = "*"
stix2extensions = {file = "https://github.com/muchdogesec/stix2extensions/archive/main.zip"}
base58 = "*"
llama-index-llms-gemini = "*"
llama-index-core = "*"
llama-index-llms-openai = "*"
llama-index-llms-anthropic = "*"

[dev-packages]
autopep8 = "*"
2,646 changes: 1,547 additions & 1,099 deletions Pipfile.lock

Large diffs are not rendered by default.

89 changes: 56 additions & 33 deletions README.md
@@ -11,9 +11,9 @@ The general design goal of txt2stix was to keep it flexible, but simple, so that
In short, txt2stix:

1. takes a txt file input
2. (optional) rewrites file with enabled aliases
2. (OPTIONAL) rewrites file with enabled aliases
3. extracts observables for enabled extractions (ai, pattern, or lookup)
4. (optional) removes any extractions that match whitelists
4. (OPTIONAL) removes any extractions that match whitelists
5. converts extracted observables to STIX 2.1 objects
6. generates the relationships between extracted observables (ai, standard)
7. converts extracted relationships to STIX 2.1 SRO objects
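The steps above can be sketched end-to-end. Everything below is a toy stand-in for the real txt2stix internals, shown only to make the flow of steps 2-7 concrete (a single IPv4 pattern extractor and dict-shaped stand-ins for STIX objects):

```python
# Toy sketch of the txt2stix pipeline steps; illustrative only.
import re

def toy_pipeline(text, aliases, whitelist):
    # step 2 (OPTIONAL): rewrite the input with enabled aliases
    for value, alias in aliases.items():
        text = text.replace(value, alias)
    # step 3: extract observables (one pattern extractor here: IPv4)
    found = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)
    # step 4 (OPTIONAL): remove extractions that match whitelists
    found = [f for f in found if f not in whitelist]
    # steps 5-7: convert to (stand-in) STIX objects and relate each one
    # back to a master Report object, as in standard relationship mode
    report = {"type": "report", "name": "demo"}
    objects = [{"type": "ipv4-addr", "value": f} for f in found]
    relationships = [{"type": "relationship", "source": report["name"],
                      "target": o["value"]} for o in objects]
    return objects, relationships
```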
@@ -39,59 +39,82 @@ cd txt2stix
python3 -m venv txt2stix-venv
source txt2stix-venv/bin/activate
# install requirements
pip3 install -r requirements.txt
pip install .
```

Now copy the `.env` file to set your config:
### Set variables

txt2stix has various settings that are defined in an `.env` file.

To create a template for the file:

```shell
cp .env.sample .env
cp .env.example .env
```

You can new set the correct values in `.env`.
To see more information about how to set the variables, and what they do, read the `.env.markdown` file.

### Usage

```shell
python3 txt2stix.py \
--relationship_mode MODE \
--input_file FILE.txt \
--name NAME \
--tlp_level TLP_LEVEL \
--confidence CONFIDENCE_SCORE \
--labels label1,label2 \
--created DATE \
--use_identity \{IDENTITY JSON\} \
--use_extractions EXTRACTION1,EXTRACTION2 \
--use_aliases ALIAS1,ALIAS2 \
--use_whitelist WHITELIST1,WHITELIST2
...
```

* `--relationship_mode` (required): either.
* `ai`: AI provider must be enabled. extractions performed by either regex or AI for extractions user selected. Rich relationships created from AI provider from extractions.
* `standard`: extractions performed by either regex or AI (AI provider must be enabled) for extractions user selected. Basic relationships created from extractions back to master Report object generated.
* `--input_file` (required): the file to be converted. Must be `.txt`
* `--name` (required): name of file, max 72 chars. Will be used in the STIX Report Object created.
* `--report_id` (optional): Sometimes it is required to control the id of the `report` object generated. You can therefore pass a valid UUIDv4 in this field to be assigned to the report. e.g. passing `2611965-930e-43db-8b95-30a1e119d7e2` would create a STIX object id `report--2611965-930e-43db-8b95-30a1e119d7e2`. If this argument is not passed, the UUID will be randomly generated.
* `--tlp_level` (optional): Options are `clear`, `green`, `amber`, `amber_strict`, `red`. Default if not passed, is `clear`.
* `--confidence` (optional): value between 0-100. Default if not passed is null.
* `--labels` (optional): comma seperated list of labels. Case-insensitive (will all be converted to lower-case). Allowed `a-z`, `0-9`. e.g.`label1,label2` would create 2 labels.
* `--created` (optional): by default all object `created` times will take the time the script was run. If you want to explicitly set these times you can do so using this flag. Pass the value in the format `YYYY-MM-DDTHH:MM:SS.sssZ` e.g. `2020-01-01T00:00:00.000Z`
* `--use_identity` (optional): can pass a full STIX 2.1 identity object (make sure to properly escape). Will be validated by the STIX2 library.
* `--external_refs` (optional): txt2stix will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
* `--use_extractions` (required): if you only want to use certain extraction types, you can pass their slug found in either `ai/config.yaml`, `lookup/config.yaml` `regex/config.yaml` (e.g. `regex_ipv4_address_only`). Default if not passed, no extractions applied.
The following arguments are available:

#### Input settings

* `--input_file` (REQUIRED): the file to be converted. Must be `.txt`

#### STIX Report generation settings


* `--name` (REQUIRED): name of file, max 72 chars. Will be used in the STIX Report Object created.
* `--report_id` (OPTIONAL): Sometimes it is required to control the id of the `report` object generated. You can therefore pass a valid UUIDv4 in this field to be assigned to the report. e.g. passing `2611965-930e-43db-8b95-30a1e119d7e2` would create a STIX object id `report--2611965-930e-43db-8b95-30a1e119d7e2`. If this argument is not passed, the UUID will be randomly generated.
* `--tlp_level` (OPTIONAL): Options are `clear`, `green`, `amber`, `amber_strict`, `red`. Default if not passed, is `clear`.
* `--confidence` (OPTIONAL): value between 0-100. Default if not passed is null.
* `--labels` (OPTIONAL): comma-separated list of labels. Case-insensitive (will all be converted to lower-case). Allowed characters: `a-z`, `0-9`. e.g. `label1,label2` would create 2 labels.
* `--created` (OPTIONAL): by default all object `created` times will take the time the script was run. If you want to explicitly set these times you can do so using this flag. Pass the value in the format `YYYY-MM-DDTHH:MM:SS.sssZ` e.g. `2020-01-01T00:00:00.000Z`
* `--use_identity` (OPTIONAL): can pass a full STIX 2.1 identity object (make sure to properly escape). Will be validated by the STIX2 library.
* `--external_refs` (OPTIONAL): txt2stix will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
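The `source_name=external_id` pairs accepted by `--external_refs` map directly onto STIX `external_references` entries. A sketch of that mapping — `parse_external_refs` is an illustrative helper, not the actual txt2stix code:

```python
# Sketch: turn --external_refs arguments like "txt2stix=demo1" into
# STIX external_references dicts. Illustrative only.
def parse_external_refs(args):
    refs = []
    for arg in args:
        # split on the first "=" so external_id may itself contain "="
        source_name, _, external_id = arg.partition("=")
        refs.append({"source_name": source_name, "external_id": external_id})
    return refs
```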

#### Output settings

How the extractions are performed

* `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in `ai/config.yaml`, `lookup/config.yaml`, or `regex/config.yaml` (e.g. `regex_ipv4_address_only`). If not passed, no extractions are applied.
    * Important: if using any AI extractions, you must set the API key for the chosen provider in your `.env` file
    * Important: if you are using any MITRE ATT&CK, CAPEC, CWE, ATLAS, or Location extractions you must set the `CTIBUTLER` settings in your `.env` file, and if you are using any NVD CPE or CVE extractions you must set the `VULMATCH` settings
* `--use_aliases` (optional): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no extractions applied.
* `--use_whitelist` (optional): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no extractions applied.
* `--use_aliases` (OPTIONAL): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no aliases applied.
* `--use_whitelist` (OPTIONAL): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no whitelists applied.
* `--relationship_mode` (REQUIRED): either:
    * `ai`: an AI provider must be configured. Extractions are performed by regex or AI, depending on the extractions selected. Rich relationships between the extractions are then created by the AI provider.
    * `standard`: extractions are performed by regex or AI (an AI provider must be configured if AI extractions are selected). Basic relationships are created from the extractions back to the master Report object generated.

#### AI settings

If any AI extractions are used, or the AI relationship mode is set, you must also set the following accordingly:

* `--ai_settings_extractions`:
    * defines the `provider:model` to be used for extractions. You can supply more than one provider, separated by a space (e.g. `openai:gpt-4o anthropic:claude-3-opus-latest`). If more than one provider is passed, txt2stix will take extractions from all models, de-duplicate them, and then package them in the output. Currently supports:
* Provider: `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
* Provider: `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
* Provider: `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
* See `tests/manual-tests/cases-ai-extraction-type.md` for some examples
* `--ai_settings_relationships`:
    * similar to `--ai_settings_extractions`, but defines the model used to generate relationships. Only one model can be provided, passed in the same format as `--ai_settings_extractions`
* See `tests/manual-tests/cases-ai-relationships.md` for some examples
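Splitting a `provider:model` value on the first colon keeps model names like `models/gemini-1.5-pro-latest` intact. A sketch of that parsing — `parse_ai_setting` is an illustrative helper, not the actual txt2stix code:

```python
# Sketch: parse "provider:model" strings as passed to
# --ai_settings_extractions / --ai_settings_relationships.
def parse_ai_setting(value: str) -> tuple:
    # partition splits on the FIRST colon only, so the model part may
    # itself contain colons or slashes (e.g. gemini's "models/..." names)
    provider, sep, model = value.partition(":")
    if not sep or not model:
        raise ValueError(f"expected provider:model, got {value!r}")
    return provider, model
```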

## Adding new extractions/lookups/aliases

It is very likely you'll want to extend txt2stix to include new extractions, aliases, and/or lookups. The following is possible:

* Add a new lookup extraction: add your lookup to `lookups` as a `.txt` file. Lookups should be a list of items seperated by new lines to be searched for in documents. Once this is added, update `extactions/lookups/config.yaml` with a new record pointing to your lookup. You can now use this lookup time at script run-time.
* Add a new AI extraction: Edit `extactions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
* Add a new alias: add a your alias to `aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the document in the original document to replace and `alias` is the value it should be replaced with.
* Add a new lookup extraction: add your lookup to `includes/lookups` as a `.txt` file. Lookups should be a list of items separated by new lines to be searched for in documents. Once this is added, update `includes/extractions/lookup/config.yaml` with a new record pointing to your lookup. You can now use this lookup at script run-time.
* Add a new AI extraction: Edit `includes/extractions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
* Add a new alias: add your alias to `includes/aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the value in the original document to replace and `alias` is the value it should be replaced with. Once this is added, update `includes/extractions/alias/config.yaml` with a new record pointing to your alias. You can now use this alias at script run-time.
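A sketch of how a `value,alias` CSV like the ones in `includes/aliases` could be applied to a document (the optional rewrite step of the pipeline); `apply_alias_csv` is illustrative, and assumes a headerless two-column file:

```python
# Sketch: apply a value,alias CSV to a document via find-and-replace.
import csv
import io

def apply_alias_csv(text: str, csv_text: str) -> str:
    reader = csv.reader(io.StringIO(csv_text))
    for value, alias in reader:
        text = text.replace(value, alias)
    return text
```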

Currently it is not possible to easily add any other types of extractions (without modifying the logic at a code level).

