USDA Application Rationalization Challenge

Overview

The USDA CEC Team partnered with Harvard Computer Society's Tech for Social Good (T4SG) to explore application installation data gathered using SCCM and Tanium. The algorithms that T4SG developed are made available in the Python scripts agency_ids.py and usages.py.

agency_ids.py generates reports analyzing the Agency IDs that SCCM and Tanium report for each workstation. The reports detail cases where the Agency ID classifications match or differ, as well as each tool's coverage of individual agencies.

usages.py generates reports and visualizations on application usage levels within each agency and mission area. These reports are based on the data reported by Tanium.

Installation

Install Python 3.9 or later.

Use pip (Python's built-in package manager) to install the following libraries:

Matplotlib: run pip install matplotlib, see https://matplotlib.org/stable/users/installing/index.html.
Numpy: run pip install numpy, see https://numpy.org/install/.
OpenPyXL: run pip install openpyxl, see https://openpyxl.readthedocs.io/en/stable/#installation.
Pandas: run pip install pandas, see https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html.
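
As a quick sanity check after installation, a short script along the following lines can confirm that the four libraries are importable. This is only a sketch and not part of the T4SG scripts; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names of the libraries listed above (each matches its pip name).
REQUIRED = ["matplotlib", "numpy", "openpyxl", "pandas"]


def missing_packages(names):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any package reported as missing can then be installed with the corresponding pip command above.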

Expected Input

SCCM

The table below shows the expected structure of the SCCM dataset, which should be placed in the project directory, and the lines in the Python scripts where these expected names can be modified.

Description	Expected name	`agency_ids.py`
SCCM dataset Excel filename	`sccm.xlsx`	line 28
Workstation identifier column*	`Encrypted Workstation Name`	line 7
SCCM Agency ID column	`Agency`	line 13

* The workstation identifier column name must be the same for both the SCCM and Tanium datasets.

All SCCM Agency IDs are expected to have the following schema: XXX, where XXX is an alphabetic agency identifier of any length.
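
Before running the scripts, it can help to confirm that a dataset has the expected columns and that its Agency IDs fit the schema. The sketch below is illustrative and not taken from agency_ids.py; the function names are hypothetical, and it assumes that "alphabetic" means a non-empty string of letters only.

```python
# Column names taken from the table above; adjust them if the scripts are modified.
WORKSTATION_COL = "Encrypted Workstation Name"
SCCM_AGENCY_COL = "Agency"


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def is_valid_sccm_agency_id(value):
    """Assumption: a valid SCCM Agency ID is a non-empty string of letters only."""
    return isinstance(value, str) and value.isalpha()
```

With pandas, the column check could be applied to the columns of the DataFrame returned by `pd.read_excel("sccm.xlsx")`.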

Tanium

The table below shows the expected structure of the Tanium dataset, which should be placed in the project directory, and the lines in the Python scripts where these expected names can be modified.

Description	Expected name	`agency_ids.py`	`usages.py`
Tanium dataset Excel filename	`tanium.xlsx`	line 29	line 24
Workstation identifier column*	`Encrypted Workstation Name`	line 7
Tanium application name column	`Name`		line 7
Tanium operating system column	`Operating System`	line 9
Tanium usage level column	`Usage`	line 11	line 9
Tanium Agency ID columns	`Asset - Custom Tags.2.1` `Asset - Custom Tags.2.2.1` `Asset - Custom Tags.2.2.2.1` `Asset - Custom Tags.2.2.2.2.1` `Asset - Custom Tags.2.2.2.2.2.1` `Asset - Custom Tags.2.2.2.2.2.2.1` `Asset - Custom Tags.2.2.2.2.2.2.2.1` `Asset - Custom Tags.2.2.2.2.2.2.2.2.2.1`	lines 15-24	lines 11-20

* The workstation identifier column name must be the same for both the SCCM and Tanium datasets.

All Tanium Agency IDs are expected to have the following schema: AgencyID-XXX, where XXX is an alphabetic agency identifier of any length. agency_ids.py ignores any tag that does not have the prefix AgencyID-. usages.py creates datasets and visualizations for all tags, including those not prefixed by AgencyID-.
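
The prefix rule can be illustrated with a small helper. This is a minimal sketch of the filtering described above, not the code used in agency_ids.py; the function name `agency_ids_from_tags` is hypothetical, and the sample tags in the usage note are made up.

```python
PREFIX = "AgencyID-"


def agency_ids_from_tags(tags):
    """Return the agency identifiers from tags that carry the AgencyID- prefix.

    Tags without the prefix, and non-string values such as empty cells,
    are skipped.
    """
    ids = []
    for tag in tags:
        if isinstance(tag, str) and tag.startswith(PREFIX):
            ids.append(tag[len(PREFIX):])
    return ids
```

For example, `agency_ids_from_tags(["AgencyID-ABC", "Site-1", None])` keeps only the first tag and returns its identifier without the prefix.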

Execution

Once the required packages are installed, open the Terminal and navigate to the project directory that contains both scripts.

To generate reports 1 - 7, run python agency_ids.py in the Terminal.

To generate Tanium usage reports and figures by Agency ID and Mission Area, run python usages.py in the Terminal.

The script prints its progress as it runs and generates its respective reports. Note that importing and exporting Excel files is resource-intensive. If a script appears to stall, it may simply be reading or writing an Excel file.

Reports for `agency_ids.py`

The report files generated by this script are stored in the data/ folder of the project directory by default. To change this behavior, update both the folder creation in line 13 of agency_ids.py and the output filepath of each individual report.

In all reports, "Tanium Agency IDs" refers to every Agency ID that Tanium reports for a workstation (only tags with the AgencyID- prefix are kept), concatenated in the format XXX-YYY-....
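
The concatenation can be sketched with pandas as follows. This is an illustration under stated assumptions rather than the script's actual code: the tag column names `Tag A` and `Tag B`, the output column name `Tanium Agency IDs`, and the sample rows are all hypothetical, and the sketch assumes the identifiers are joined in column order.

```python
import pandas as pd

PREFIX = "AgencyID-"


def concat_agency_ids(row):
    """Join the AgencyID- tags in one row of tag columns as XXX-YYY-..."""
    ids = [
        tag[len(PREFIX):]
        for tag in row
        if isinstance(tag, str) and tag.startswith(PREFIX)
    ]
    return "-".join(ids)


# Hypothetical sample; the real data uses the Tanium tag columns listed earlier.
tanium = pd.DataFrame(
    {
        "Encrypted Workstation Name": ["ws1", "ws2"],
        "Tag A": ["AgencyID-ABC", "Site-1"],
        "Tag B": ["AgencyID-DEF", None],
    }
)
tag_cols = ["Tag A", "Tag B"]

# Apply the helper to each row of tag columns to build the concatenated ID.
tanium["Tanium Agency IDs"] = tanium[tag_cols].apply(concat_agency_ids, axis=1)
```

In this sample, the first workstation has two prefixed tags, which are joined with a hyphen, while the second has none and receives an empty string.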

Report 1: Matching classification report (`matching_raw.xlsx`)

This report compiles encrypted workstations present in both SCCM and Tanium datasets where all Agency ID tags match between datasets.
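
The selection behind this report can be approximated with an inner join followed by an equality filter. The sketch below is not the implementation in agency_ids.py: the sample data and the `Tanium Agency IDs` column name are hypothetical, and it assumes "match" means the SCCM Agency ID equals the concatenated Tanium Agency IDs exactly.

```python
import pandas as pd

KEY = "Encrypted Workstation Name"

# Hypothetical sample data standing in for the two datasets.
sccm = pd.DataFrame(
    {KEY: ["ws1", "ws2", "ws3"], "Agency": ["ABC", "DEF", "GHI"]}
)
tanium = pd.DataFrame(
    {KEY: ["ws1", "ws2", "ws4"], "Tanium Agency IDs": ["ABC", "XYZ", "GHI"]}
)

# An inner join keeps only workstations present in both datasets.
both = sccm.merge(tanium, on=KEY, how="inner")

# Assumption: a match is exact equality between the two Agency ID values.
matching = both[both["Agency"] == both["Tanium Agency IDs"]]
```

Here only ws1 survives: ws2 appears in both datasets but with different IDs, and ws3 and ws4 each appear in only one dataset. The result could then be written out with `matching.to_excel(...)`, which relies on OpenPyXL for .xlsx files.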