Skip to content

Latest commit

 

History

History
115 lines (74 loc) · 4.73 KB

README.md

File metadata and controls

115 lines (74 loc) · 4.73 KB

Pysharkfeat

Pysharkfeat is a TLS encrypted traffic feature extraction tool from pcaps written in Python by using Wireshark's command line tshark.

Pysharkfeat is derived from an academic research on malicious encrypted traffic analysis. Compare with other feature extraction tools such as Flowmeter and Joy, Pysharkfeat is easier to setup and use while providing rich features.

Features

  • Parse a single pcap or directory to generate meta and statistical features
  • Export features in JSON files
  • Support logging

Traffic features include:

  • Meta: 5-tuple(src ip, src port, dest ip, dest port, timestamp), duration, stream index
  • Statistical:
    • Bidirectional packet len and inter-arrival-time sum/max/min/mean/std
    • SPLT(Markov sequence of pkt len and time)
    • Byte distribution, payload std and entropy.
  • TLS: todo.

Full features can be found in feat.py or feature JSON file.

Environment

  • Operating systems: Linux, Mac OSX. (Windows is not supported in current version, see Known issues)
  • Language: Python3.8, 3.9
  • Dependence: Wireshark

Installation

Install pysharkfeat from pip

pip3 install pysharkfeat

Install Wireshark(tshark)

Test tshark

tshark --version

For Windows, make sure tshark can be called by command line by adding tshark to the environment path.

Use case

Pysharkfeat can be used for machine learning research and threat analysis.

There are several feature files in tests/output generated from pcaps at Malware Traffic Analysis, and you can immediately start analyzing them.

Example

This code snippet can be found in tests/demo.py.

from pysharkfeat.featextractor import FeatureExtractor
import json, os

# specify pcaps and output dir
pcap_dir = "./pcaps/2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap"
output_dir="./output"

extractor = FeatureExtractor(pcap_path=pcap_dir, output_dir=output_dir)
summary = extractor.main_extract_pcaps_feat()

print(summary)

# read feature files
feat_file = os.path.join(output_dir, "2021-01-04-Emotet-infection-with-Trickbot-traffic.json")
f = open(feat_file)
stream_feats = json.load(f)
for feat in stream_feats :
    print("%s,  stream_index:%s,  byte dist entropy:%s" % (feat["pcap_name"], feat["stream_index"], feat["bd_entropy"]))


# display stream index and byte distribution entropy features, and bd entropies are very close.

    2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap,  stream_index:3,  byte dist entropy:7.999464797314957
    2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap,  stream_index:7,  byte dist entropy:7.903172099500442
    2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap,  stream_index:9,  byte dist entropy:7.9876935373284805
    ...

Performance consideration

Time

Pysharkfeat is built on tshark, which may incur substantial overhead. The following table shows some test results for meta freatures on OSX(CPU i5, 16GB RAM).

pcap name pcap size num of TLS streams time(sec)
2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap 5.4MB 10 10.8
2021-01-05-PurpleFox-EK-and-post-infection-traffic.pcap 9.5MB 8 11.5
2021-01-15-Emotet-epoch-1-infection-traffic.pcap 5.9MB 40 38.2
2021-02-24-Qakbot-infection-with-spambot-traffic.pcap 21.1MB 94 213.9
20 pcaps collected from MalwareTrafficAnalsysis 1503 3342.3

In the last row of the table, we tested Pysharfeat with 20 pcaps from MalwareTrafficAnalysis (from Jan 2021 to Feb 2021) on a workstation using Centos8.5, CPU i7 and 32GBM RAM. It took 55 minutes to extract features.

Storage

The feature file of a single TLS stream has approximately 16KB. If a pcap has 100 TLS streams, the storage will be roughly 1.6MB.

Feedback

You are welcome to post a issue or feature request, or send email to the author [email protected].

License

Pysharkfeat is open source and free to use under GPL V3 license. See LICENSE for more details.

Known issues

  • Running Pysharkfeat on Windows may encounter syntax parsing error with tshark, probably Windows has a different usage of ""(quotes).