add other AI providers (#53)
* add other AI providers #34

* fix loglevel

* use DEBUG log for exceptions

* change info to debug

* add --ai_settings_extractions and --ai_settings_relationships #55

* fix build error

* clearer ai_setting error

* adding env markdown

* Update README.md

* updating tests to match new AI modes

* updating aimodel settings

* cleaning up doc files

* fix relationship_mode not getting considered before throwing ai_settings_relationships is required

* updating gemini

* fixing docs for multiple ai extraction providers

* fixing legacy paths in docs

---------

Co-authored-by: David G <[email protected]>
fqrious and himynamesdave authored Nov 11, 2024
1 parent 6095983 commit 31f7e3d
Showing 20 changed files with 2,152 additions and 1,805 deletions.
37 changes: 37 additions & 0 deletions .env.markdown
@@ -0,0 +1,37 @@
# Environment file info

If you're running in production, you should set these values securely.

However, if you just want to experiment, the following values will get you started:

## AI Settings

* `INPUT_TOKEN_LIMIT`: `15000`
	* (REQUIRED IF USING AI MODES) Ensure the input/output token count meets requirements and is supported by the model selected. Files containing more tokens than the specified limit will not be processed.
* `TEMPERATURE`: `0.0`
* The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness in responses.
* `OPENAI_API_KEY`: YOUR_API_KEY
* (REQUIRED IF USING OPENAI MODELS IN AI MODES) get it from https://platform.openai.com/api-keys
* `ANTHROPIC_API_KEY`: YOUR_API_KEY
* (REQUIRED IF USING ANTHROPIC MODELS IN AI MODES) get it from https://console.anthropic.com/settings/keys
* `GOOGLE_API_KEY`:
* (REQUIRED IF USING GOOGLE GEMINI MODELS IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)
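Each AI provider above only needs its key when models from that provider are selected. A minimal sketch of that check, assuming the `.env` values have been loaded into the process environment (e.g. via python-dotenv) — the helper names `required_key_for` and `check_provider` are illustrative, not part of txt2stix:

```python
# Sketch: map each AI provider to the env var it requires, and fail
# early if the chosen provider's key is missing. Illustrative only.
import os

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def required_key_for(provider: str) -> str:
    """Return the env var a provider needs, or raise for unknown providers."""
    try:
        return PROVIDER_KEYS[provider]
    except KeyError:
        raise ValueError(f"unknown AI provider: {provider}") from None

def check_provider(provider: str) -> None:
    """Raise if the key required by `provider` is unset or empty."""
    key = required_key_for(provider)
    if not os.environ.get(key):
        raise RuntimeError(f"{key} must be set to use {provider} models")
```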

## BIN List

* `BIN_LIST_API_KEY`: BLANK
	* (OPTIONAL) used to enrich credit card extractions. You can get an API key here: https://rapidapi.com/trade-expanding-llc-trade-expanding-llc-default/api/bin-ip-checker

## CTIBUTLER

txt2stix requires [ctibutler](https://github.com/muchdogesec/ctibutler) to look up ATT&CK, CAPEC, CWE, ATLAS, and locations in the input.

* `CTIBUTLER_HOST`: `'http://host.docker.internal:8006'`
* If you are running CTI Butler locally, be sure to set `'http://host.docker.internal:8006'` in the `.env` file otherwise you will run into networking errors.

## VULMATCH FOR CVE AND CPE LOOKUPS

txt2stix requires [vulmatch](https://github.com/muchdogesec/vulmatch) to look up CVEs and CPEs in the input.

* `VULMATCH_HOST`: `'http://host.docker.internal:8005'`
* If you are running vulmatch locally, be sure to set `'http://host.docker.internal:8005'` in the `.env` file otherwise you will run into networking errors.
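A small sketch of how a host value like the ones above might be read and normalised before API paths are appended — `service_host` is an illustrative helper, not actual txt2stix code:

```python
# Sketch: read a service host from the environment, falling back to a
# default, and strip any trailing slash so paths join cleanly.
import os

def service_host(var: str, default: str) -> str:
    """Return the configured host without a trailing slash."""
    host = os.environ.get(var) or default
    return host.rstrip("/")

# e.g. when running CTI Butler locally from inside Docker:
# service_host("CTIBUTLER_HOST", "http://host.docker.internal:8006")
```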
22 changes: 12 additions & 10 deletions .env.sample
@@ -1,10 +1,12 @@
INPUT_TOKEN_LIMIT=50 # [REQUIRED] for AI modes. keep in mind the token limit for selected model (which includes both input AND output tokens). For example, if your input limit is 50,000 characters, this could incur up to 25,000 tokens. Assuming your selected model allows for 64,000 tokens, you will therefore be able to obtain an output of over 39,000 tokens.
OPENAI_API_KEY= # [REQUIRED IF USING AI MODES] get it from https://platform.openai.com/api-keys
OPENAI_MODEL=gpt-4 # [REQUIRED IF USING AI MODES] choose an OpenAI model of your choice. Ensure the input/output token count meets requirements (and adjust INPUT_TOKEN_LIMIT accordingly). List of models here: https://platform.openai.com/docs/models
BIN_LIST_API_KEY= #[OPTIONAL] needed for extracting credit card information
## CTIBUTLER FOR ATT&CK, CAPEC, AND CWE LOOKUPS
CTIBUTLER_HOST= # [REQUIRED] e.g. http://localhost:8006/
CTIBUTLER_APIKEY= #[OPTIONAL] if using https://app.ctibutler.com
## VULMATCH FOR CVE AD CPE LOOKUPS
VULMATCH_HOST= # [REQUIRED] e.g. http://localhost:8005/
VULMATCH_APIKEY= #[OPTIONAL] if using https://app.vulmatch.com
## AI Settings
INPUT_TOKEN_LIMIT=
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
TEMPERATURE=
## BIN LIST
BIN_LIST_API_KEY=
## CTIBUTLER
CTIBUTLER_HOST=
## VULMATCH
VULMATCH_HOST=
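The `INPUT_TOKEN_LIMIT` budgeting described in the sample comment (50,000 input characters incurring up to 25,000 tokens, leaving 39,000 of a 64,000-token model limit for output) can be sketched as simple arithmetic, assuming the rough ~2-characters-per-token rule of thumb that comment uses:

```python
# Sketch of the token budgeting for INPUT_TOKEN_LIMIT; the 2 chars/token
# ratio is the rule of thumb from the sample comment, not an exact tokenizer.
CHARS_PER_TOKEN = 2

def output_token_budget(input_chars: int, model_token_limit: int) -> int:
    """Tokens left for the model's output after budgeting for the input."""
    input_tokens = input_chars // CHARS_PER_TOKEN
    return model_token_limit - input_tokens
```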
6 changes: 4 additions & 2 deletions Pipfile
@@ -4,8 +4,6 @@ verify_ssl = true
name = "pypi"

[packages]
# langchain = {extras = ["openai"], version = "*"}
# langchain_openai = "*"
openai = "*"
stix2 = "*"
phonenumbers = "*"
@@ -20,6 +18,10 @@ python-dotenv = "*"
llama-index = "*"
stix2extensions = {file = "https://github.com/muchdogesec/stix2extensions/archive/main.zip"}
base58 = "*"
llama-index-llms-gemini = "*"
llama-index-core = "*"
llama-index-llms-openai = "*"
llama-index-llms-anthropic = "*"

[dev-packages]
autopep8 = "*"
2,646 changes: 1,547 additions & 1,099 deletions Pipfile.lock

Large diffs are not rendered by default.

89 changes: 56 additions & 33 deletions README.md
@@ -11,9 +11,9 @@ The general design goal of txt2stix was to keep it flexible, but simple, so that
In short, txt2stix:

1. takes a txt file input
2. (optional) rewrites file with enabled aliases
2. (OPTIONAL) rewrites file with enabled aliases
3. extracts observables for enabled extractions (ai, pattern, or lookup)
4. (optional) removes any extractions that match whitelists
4. (OPTIONAL) removes any extractions that match whitelists
5. converts extracted observables to STIX 2.1 objects
6. generates the relationships between extracted observables (ai, standard)
7. converts extracted relationships to STIX 2.1 SRO objects
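The steps above can be sketched end-to-end. Everything below is a toy stand-in for the real txt2stix internals, shown only to make the flow of steps 2-7 concrete (a single IPv4 pattern extractor and dict-shaped stand-ins for STIX objects):

```python
# Toy sketch of the txt2stix pipeline steps; illustrative only.
import re

def toy_pipeline(text, aliases, whitelist):
    # step 2 (OPTIONAL): rewrite the input with enabled aliases
    for value, alias in aliases.items():
        text = text.replace(value, alias)
    # step 3: extract observables (one pattern extractor here: IPv4)
    found = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)
    # step 4 (OPTIONAL): remove extractions that match whitelists
    found = [f for f in found if f not in whitelist]
    # steps 5-7: convert to (stand-in) STIX objects and relate each one
    # back to a master Report object, as in standard relationship mode
    report = {"type": "report", "name": "demo"}
    objects = [{"type": "ipv4-addr", "value": f} for f in found]
    relationships = [{"type": "relationship", "source": report["name"],
                      "target": o["value"]} for o in objects]
    return objects, relationships
```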
@@ -39,59 +39,82 @@ cd txt2stix
python3 -m venv txt2stix-venv
source txt2stix-venv/bin/activate
# install requirements
pip3 install -r requirements.txt
pip install .
```

Now copy the `.env` file to set your config:
### Set variables

txt2stix has various settings that are defined in an `.env` file.

To create a template for the file:

```shell
cp .env.sample .env
cp .env.example .env
```

You can new set the correct values in `.env`.
To see more information about how to set the variables, and what they do, read the `.env.markdown` file.

### Usage

```shell
python3 txt2stix.py \
--relationship_mode MODE \
--input_file FILE.txt \
--name NAME \
--tlp_level TLP_LEVEL \
--confidence CONFIDENCE_SCORE \
--labels label1,label2 \
--created DATE \
--use_identity \{IDENTITY JSON\} \
--use_extractions EXTRACTION1,EXTRACTION2 \
--use_aliases ALIAS1,ALIAS2 \
--use_whitelist WHITELIST1,WHITELIST2
...
```

* `--relationship_mode` (required): either.
* `ai`: AI provider must be enabled. extractions performed by either regex or AI for extractions user selected. Rich relationships created from AI provider from extractions.
* `standard`: extractions performed by either regex or AI (AI provider must be enabled) for extractions user selected. Basic relationships created from extractions back to master Report object generated.
* `--input_file` (required): the file to be converted. Must be `.txt`
* `--name` (required): name of file, max 72 chars. Will be used in the STIX Report Object created.
* `--report_id` (optional): Sometimes it is required to control the id of the `report` object generated. You can therefore pass a valid UUIDv4 in this field to be assigned to the report. e.g. passing `2611965-930e-43db-8b95-30a1e119d7e2` would create a STIX object id `report--2611965-930e-43db-8b95-30a1e119d7e2`. If this argument is not passed, the UUID will be randomly generated.
* `--tlp_level` (optional): Options are `clear`, `green`, `amber`, `amber_strict`, `red`. Default if not passed, is `clear`.
* `--confidence` (optional): value between 0-100. Default if not passed is null.
* `--labels` (optional): comma seperated list of labels. Case-insensitive (will all be converted to lower-case). Allowed `a-z`, `0-9`. e.g.`label1,label2` would create 2 labels.
* `--created` (optional): by default all object `created` times will take the time the script was run. If you want to explicitly set these times you can do so using this flag. Pass the value in the format `YYYY-MM-DDTHH:MM:SS.sssZ` e.g. `2020-01-01T00:00:00.000Z`
* `--use_identity` (optional): can pass a full STIX 2.1 identity object (make sure to properly escape). Will be validated by the STIX2 library.
* `--external_refs` (optional): txt2stix will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
* `--use_extractions` (required): if you only want to use certain extraction types, you can pass their slug found in either `ai/config.yaml`, `lookup/config.yaml` `regex/config.yaml` (e.g. `regex_ipv4_address_only`). Default if not passed, no extractions applied.
The following arguments are available:

#### Input settings

* `--input_file` (REQUIRED): the file to be converted. Must be `.txt`

#### STIX Report generation settings


* `--name` (REQUIRED): name of file, max 72 chars. Will be used in the STIX Report Object created.
* `--report_id` (OPTIONAL): Sometimes it is required to control the id of the `report` object generated. You can therefore pass a valid UUIDv4 in this field to be assigned to the report. e.g. passing `2611965-930e-43db-8b95-30a1e119d7e2` would create a STIX object id `report--2611965-930e-43db-8b95-30a1e119d7e2`. If this argument is not passed, the UUID will be randomly generated.
* `--tlp_level` (OPTIONAL): Options are `clear`, `green`, `amber`, `amber_strict`, `red`. Default if not passed, is `clear`.
* `--confidence` (OPTIONAL): value between 0-100. Default if not passed is null.
* `--labels` (OPTIONAL): comma-separated list of labels. Case-insensitive (will all be converted to lower-case). Allowed characters: `a-z`, `0-9`. e.g. `label1,label2` would create 2 labels.
* `--created` (OPTIONAL): by default all object `created` times will take the time the script was run. If you want to explicitly set these times you can do so using this flag. Pass the value in the format `YYYY-MM-DDTHH:MM:SS.sssZ` e.g. `2020-01-01T00:00:00.000Z`
* `--use_identity` (OPTIONAL): can pass a full STIX 2.1 identity object (make sure to properly escape). Will be validated by the STIX2 library.
* `--external_refs` (OPTIONAL): txt2stix will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
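The `source_name=external_id` pairs accepted by `--external_refs` map directly onto STIX `external_references` entries. A sketch of that mapping — `parse_external_refs` is an illustrative helper, not the actual txt2stix code:

```python
# Sketch: turn --external_refs arguments like "txt2stix=demo1" into
# STIX external_references dicts. Illustrative only.
def parse_external_refs(args):
    refs = []
    for arg in args:
        # split on the first "=" so external_id may itself contain "="
        source_name, _, external_id = arg.partition("=")
        refs.append({"source_name": source_name, "external_id": external_id})
    return refs
```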

#### Output settings

How the extractions are performed

* `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in `ai/config.yaml`, `lookup/config.yaml`, or `regex/config.yaml` (e.g. `regex_ipv4_address_only`). If not passed, no extractions are applied.
    * Important: if using any AI extractions, you must set the API key for the chosen provider in your `.env` file
    * Important: if you are using any MITRE ATT&CK, CAPEC, CWE, ATLAS, or Location extractions you must set the `CTIBUTLER` settings in your `.env` file, and if you are using any NVD CPE or CVE extractions you must set the `VULMATCH` settings
* `--use_aliases` (optional): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no extractions applied.
* `--use_whitelist` (optional): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no extractions applied.
* `--use_aliases` (OPTIONAL): if you want to apply aliasing to the input doc (find and replace strings) you can pass their slug found in `aliases/config.yaml` (e.g. `country_iso3_to_iso2`). Default if not passed, no aliases applied.
* `--use_whitelist` (OPTIONAL): if you want to get the script to ignore certain values that might create extractions you can specify using `whitelist/config.yaml` (e.g. `alexa_top_1000`) related to the whitelist file you want to use. Default if not passed, no whitelists applied.
* `--relationship_mode` (REQUIRED): either:
    * `ai`: an AI provider must be configured. Extractions are performed by regex or AI, depending on the extractions selected. Rich relationships between the extractions are then created by the AI provider.
    * `standard`: extractions are performed by regex or AI (an AI provider must be configured if AI extractions are selected). Basic relationships are created from the extractions back to the master Report object generated.

#### AI settings

If any AI extractions are used, or the AI relationship mode is set, you must also set the following accordingly:

* `--ai_settings_extractions`:
    * defines the `provider:model` to be used for extractions. You can supply more than one provider, separated by a space (e.g. `openai:gpt-4o anthropic:claude-3-opus-latest`). If more than one provider is passed, txt2stix will take extractions from all models, de-duplicate them, and then package them in the output. Currently supports:
* Provider: `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
* Provider: `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
* Provider: `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
* See `tests/manual-tests/cases-ai-extraction-type.md` for some examples
* `--ai_settings_relationships`:
    * similar to `--ai_settings_extractions`, but defines the model used to generate relationships. Only one model can be provided, passed in the same format as `--ai_settings_extractions`
* See `tests/manual-tests/cases-ai-relationships.md` for some examples
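Splitting a `provider:model` value on the first colon keeps model names like `models/gemini-1.5-pro-latest` intact. A sketch of that parsing — `parse_ai_setting` is an illustrative helper, not the actual txt2stix code:

```python
# Sketch: parse "provider:model" strings as passed to
# --ai_settings_extractions / --ai_settings_relationships.
def parse_ai_setting(value: str) -> tuple:
    # partition splits on the FIRST colon only, so the model part may
    # itself contain colons or slashes (e.g. gemini's "models/..." names)
    provider, sep, model = value.partition(":")
    if not sep or not model:
        raise ValueError(f"expected provider:model, got {value!r}")
    return provider, model
```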

## Adding new extractions/lookups/aliases

It is very likely you'll want to extend txt2stix to include new extractions, aliases, and/or lookups. The following is possible:

* Add a new lookup extraction: add your lookup to `lookups` as a `.txt` file. Lookups should be a list of items seperated by new lines to be searched for in documents. Once this is added, update `extactions/lookups/config.yaml` with a new record pointing to your lookup. You can now use this lookup time at script run-time.
* Add a new AI extraction: Edit `extactions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
* Add a new alias: add a your alias to `aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the document in the original document to replace and `alias` is the value it should be replaced with.
* Add a new lookup extraction: add your lookup to `includes/lookups` as a `.txt` file. Lookups should be a list of items separated by new lines to be searched for in documents. Once this is added, update `includes/extractions/lookup/config.yaml` with a new record pointing to your lookup. You can now use this lookup at script run-time.
* Add a new AI extraction: Edit `includes/extractions/ai/config.yaml` with a new record for your extraction. You can craft the prompt used in the config to control how the LLM performs the extraction.
* Add a new alias: add your alias to `includes/aliases` as a `.csv` file. Alias files should have two columns `value,alias`, where `value` is the value in the original document to replace and `alias` is the value it should be replaced with. Once this is added, update `includes/extractions/alias/config.yaml` with a new record pointing to your alias. You can now use this alias at script run-time.
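A sketch of how a `value,alias` CSV like the ones in `includes/aliases` could be applied to a document (the optional rewrite step of the pipeline); `apply_alias_csv` is illustrative, and assumes a headerless two-column file:

```python
# Sketch: apply a value,alias CSV to a document via find-and-replace.
import csv
import io

def apply_alias_csv(text: str, csv_text: str) -> str:
    reader = csv.reader(io.StringIO(csv_text))
    for value, alias in reader:
        text = text.replace(value, alias)
    return text
```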

Currently it is not possible to easily add any other types of extractions (without modifying the logic at a code level).

