Towards Unified Semantic Log Parsing Framework for Temporal Causal Graph Construction
- correlate data from multiple sources (network traffic, system/application/service logs, process execution status)
- automatically recognize the log format, and calculate the depth and similarity threshold
- extract entities (obj, sub, action) and their dependency relationships from events (both structured and unstructured logs)
- provenance graph construction from multi-source logs
- measure the delay for log fusion
- interfaces for optimized temporal graph query and graph community detection
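As a rough illustration of the extraction and graph-construction goals above, the sketch below turns (sub, action, obj) events from several logs into a time-ordered adjacency list. The `Event` class, its field names, `build_graph`, and the sample events are all hypothetical and are not taken from this repository's code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One parsed log event; field names are illustrative, not the project's schema."""
    ts: datetime
    sub: str      # acting entity, e.g. a process or a source host
    action: str   # what happened, e.g. "query" or "write"
    obj: str      # entity acted upon, e.g. a domain or a file
    source: str   # which log the event came from


def build_graph(events):
    """Return sub -> [(ts, action, obj, source)] with edges in time order."""
    graph = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.ts):
        graph[ev.sub].append((ev.ts, ev.action, ev.obj, ev.source))
    return graph


# Made-up events from three different log sources.
events = [
    Event(datetime(2022, 1, 15, 10, 17, 2), "apache", "write", "/tmp/upload", "access"),
    Event(datetime(2022, 1, 15, 10, 17, 1), "10.0.0.5", "query", "example.com", "dns"),
    Event(datetime(2022, 1, 15, 10, 17, 3), "apache", "exec", "sh", "audit"),
]
graph = build_graph(events)
for ts, action, obj, source in graph["apache"]:
    print(ts.time(), action, obj, source)
```

Sorting by timestamp before inserting keeps each node's outgoing edges in temporal order, which is the property a causal traversal would rely on.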
- core:
  - entity_reco: custom entity extraction from the unified output
  - graph_create: the module that builds causal graphs
  - graph_label: labelling of temporal graphs
  - logparse: multiple log parsers
  - pattern: the rules used to build the unified output and the graph
- eval: benchmark testing
- eval_data: the code to generate evaluation data
- src: the main running interface
- unit_test: unit tests for the core modules
- utils: utility functions to support processing
- config: the config file including regexes, defined poi, etc
- preparation
# avoid Python version conflicts by using pyenv
brew install pyenv-virtualenv
brew install pyenv
pyenv install 3.10
pyenv global 3.10
pyenv virtualenv 3.10 UTLParser
# activate the environment
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
pyenv local UTLParser
pyenv activate UTLParser
pip3 install -r requirements.txt
# download the large English spaCy model
python -m spacy download en_core_web_lg
- how to use
# single log source processing
python3 main.py -a dns -i /xxx/UTLParser/unit_test/data/dns.log
# multiple log sources processing --- fused graph
python3 main.py -f True -al 'dns,error,access,audit'
# temporal graph query
python3 main.py -al 'dns,error,access,audit' -t "2022-Jan-15 10:17:01.246000"
# assign labels to fused graphs
python3 main.py -l True
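The timestamp passed to `-t` above follows the pattern year, abbreviated month name, day, then time with microseconds. The sketch below parses that pattern and filters a list of edges around the queried time. How `main.py` actually matches timestamps is not stated here, so the one-second window, the `edges_near` helper, and the sample edges are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Format of the -t argument shown above, e.g. "2022-Jan-15 10:17:01.246000".
# %b expects English month abbreviations under the default C locale.
TS_FORMAT = "%Y-%b-%d %H:%M:%S.%f"


def parse_ts(text):
    """Parse a query timestamp string into a datetime."""
    return datetime.strptime(text, TS_FORMAT)


def edges_near(edges, when, window=timedelta(seconds=1)):
    """Keep (ts, src, dst) edges whose timestamp lies within `window` of `when`."""
    return [edge for edge in edges if abs(edge[0] - when) <= window]


# Illustrative edges, not taken from any dataset.
edges = [
    (parse_ts("2022-Jan-15 10:17:01.100000"), "10.0.0.5", "example.com"),
    (parse_ts("2022-Jan-15 10:17:01.900000"), "apache", "/tmp/upload"),
    (parse_ts("2022-Jan-15 10:20:00.000000"), "sshd", "root"),
]
hits = edges_near(edges, parse_ts("2022-Jan-15 10:17:01.246000"))
print(len(hits))
```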
- custom running
  - add poi and IOCs for custom logs inside config.py
  - repeat the steps above
- IOCs: Timestamp, Src_IP, Dst_IP, Proto or Application, Domain, PacketSize, ParaPair (tuple)
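One possible way to carry these IOC fields through the pipeline is a single record type with optional members, since not every log source provides every field. The `IOCRecord` class below is a hypothetical sketch: its attribute names mirror the list above, but it is not the structure defined in config.py, and the sample values are made up.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class IOCRecord:
    """Hypothetical container for the IOC fields; only the timestamp is required."""
    timestamp: str
    src_ip: Optional[str] = None
    dst_ip: Optional[str] = None
    proto: Optional[str] = None          # protocol or application name
    domain: Optional[str] = None
    packet_size: Optional[int] = None
    para_pair: Optional[Tuple[str, str]] = None  # one (key, value) parameter pair


# A DNS-style record: no destination IP or packet size available.
record = IOCRecord(
    timestamp="2022-Jan-15 10:17:01.246000",
    src_ip="10.0.0.5",
    proto="dns",
    domain="example.com",
    para_pair=("type", "A"),
)

# Drop the fields this log source did not provide.
present = {key: value for key, value in asdict(record).items() if value is not None}
print(sorted(present))
```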
- AIT (fox) --- pure unstructured logs:
  - used for intrusion detection systems, federated learning, alert aggregation
  - includes logs from all hosts (apache, error, authentication, DNS/VPN, audit, network traffic, syslog, system monitoring logs)
  - ground truth labels for events
  - details:
    - host logs: gather/<host name>/logs
    - labels directory: labelling information
    - rules directory: how the labels are assigned
  - launched attacks:
    - scans
    - webshell upload --- apache
    - password cracking
    - privilege escalation --- dnsmasq, apache, audit (internal_server), system.cpu
    - remote command execution --- dnsmasq, apache, audit (internal_server), system.cpu
    - data exfiltration --- dnsmasq, audit (internal_share)
- Sysdig Process:
# events follow the format: evt.num, evt.time, evt.cpu, proc.name, thread.tid, evt.dir, evt.type, evt.args
# e.g. 123 23:40:09.105899621 3 httpd (28599) > switch next=0 pgft_maj=3 pgft_min=619 vm_size=442720 vm_rss=668 vm_swap=7004
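A line in the Sysdig format above can be split into its eight fields with one regular expression. This is a minimal sketch, not the parser shipped in logparse: the `SYSDIG_LINE` pattern and `parse_sysdig` helper are hypothetical, and the argument splitting assumes simple `key=value` tokens without embedded spaces.

```python
import re

# evt.num evt.time evt.cpu proc.name (thread.tid) evt.dir evt.type evt.args
SYSDIG_LINE = re.compile(
    r"^(?P<num>\d+)\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<cpu>\d+)\s+"
    r"(?P<proc>\S+)\s+"
    r"\((?P<tid>\d+)\)\s+"
    r"(?P<dir>[<>])\s+"
    r"(?P<type>\S+)"
    r"(?:\s+(?P<args>.*))?$"
)


def parse_sysdig(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = SYSDIG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Naive key=value split; values that contain spaces are not handled.
    args = {}
    for token in (fields["args"] or "").split():
        key, sep, value = token.partition("=")
        if sep:
            args[key] = value
    fields["args"] = args
    return fields


sample = ("123 23:40:09.105899621 3 httpd (28599) > switch next=0 pgft_maj=3 "
          "pgft_min=619 vm_size=442720 vm_rss=668 vm_swap=7004")
event = parse_sysdig(sample)
print(event["proc"], event["type"], event["args"]["vm_rss"])
```

All values are kept as strings here; converting `tid`, `cpu`, or the memory counters to integers is left to the caller.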
- IoT23 (structured logs) --- network traffic:
  - label information
    - Attack (part of APT): indicates that there was some type of attack from the infected device to another host
    - C & C (part of APT): the infected device was connected to a C&C server
    - DDoS: a DDoS attack is being executed by the infected device
    - FileDownload (part of APT): a file is being downloaded to the infected device
    - HeartBeat (periodic similar connections): packets sent on this connection are used to keep track of the infected host
    - Mirai (botnet): similar patterns
    - Okiru (botnet): same parameters
    - PortScan (part of APT)
    - Torii (botnet): same parameters
  - related fields and their column numbers
    - id.resp_h (5) ----> C & C
    - id.resp_p (6) ----> Malware, HeartBeat, PortScan
    - conn_state (12) ----> PortScan
  - chosen fields for feature extraction
    - ts? -- time series --- dynamic Bayesian network
    - id.orig_h, id.orig_p, id.resp_h, id.resp_p
    - resp_bytes ---- FileDownload
    - conn_state ---- PortScan
    - feature analysis? --- other features
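The chosen fields can be pulled out of a tab-separated Zeek conn.log line by column position. The sketch below uses the 1-based column numbers given above for id.resp_h (5), id.resp_p (6), and conn_state (12). The positions of the remaining fields assume the standard Zeek conn.log column order, and the `extract_features` helper and the sample line are made up for illustration.

```python
# 1-based column positions. 5, 6 and 12 come from the notes above;
# the rest assume the standard Zeek conn.log column order.
FIELD_POS = {
    "ts": 1,
    "id.orig_h": 3,
    "id.orig_p": 4,
    "id.resp_h": 5,
    "id.resp_p": 6,
    "resp_bytes": 11,
    "conn_state": 12,
}


def extract_features(line):
    """Return the chosen fields from one conn.log line, or None for headers/short lines."""
    if line.startswith("#"):
        return None  # Zeek header and footer lines start with '#'
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        return None
    row = {name: cols[pos - 1] for name, pos in FIELD_POS.items()}
    row["ts"] = float(row["ts"])
    row["id.orig_p"] = int(row["id.orig_p"])
    row["id.resp_p"] = int(row["id.resp_p"])
    # Zeek writes '-' for unset values.
    row["resp_bytes"] = None if row["resp_bytes"] == "-" else int(row["resp_bytes"])
    return row


# Synthetic line, not copied from the dataset.
sample = "\t".join([
    "1545403816.962094", "Cabc123", "192.168.1.10", "41040",
    "203.0.113.7", "23", "tcp", "-", "3.0", "0", "-", "S0",
])
features = extract_features(sample)
print(features["id.resp_h"], features["id.resp_p"], features["conn_state"])
```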
- Build Temporal Graph Neural Networks
  - reduce the graph size to some extent, so that training is feasible at low memory cost
  - capable of processing heterogeneous graph attributes
  - capable of capturing the changes between temporal graphs
  - capable of measuring normal and abnormal behaviour in an unsupervised way