Harmonic-NAS is a novel framework for designing multimodal neural networks on resource-constrained devices. It employs a two-tier optimization strategy: a first-stage evolutionary search for the unimodal backbone networks and a second-stage differentiable search for the multimodal fusion network architecture. Harmonic-NAS also brings the hardware dimension into the optimization procedure by treating inference latency and energy consumption as optimization objectives, enabling optimal deployment on the targeted edge devices.
Please find our arXiv version here for the full paper with additional results. Our paper has been accepted for publication at the 15th Asian Conference on Machine Learning (ACML 2023).
- Python version: tested with Python 3.8.10
- Install the software environment from the YAML file environment.yml
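For example, with conda (a minimal sketch; the environment name is whichever one is declared inside environment.yml):

$ conda env create -f environment.yml

$ conda activate <env-name-from-environment.yml>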
Harmonic-NAS/
├── backbones/
| ├── maxout/ --- Our Maxout network configuration
| └── ofa/ --- Essential scripts from once-for-all for supernet specifications
|
├── configs/ --- Running configs for Harmonic-NAS search
├── data/ --- Essential scripts for data loading for our various datasets
├── evaluate/
| ├── backbone_eval/
| | ├── accuracy/ --- Essential scripts for evaluating the accuracy of the explored uni/multi-modal models
| | └── efficiency/ --- LUTs for evaluating the efficiency of our modality-specific supernets on the targeted Edge devices
| └── fusion_eval/ --- LUTs for evaluating the efficiency of our fusion operators on the targeted Edge devices
|
├── fusion_search/ --- Scripts for the second-stage of optimization (fusion search)
├── saved_supernets/ --- Pretrained supernets for different modalities/datasets
├── utils/ --- Essential scripts for managing distributed training/evaluation across multiple GPUs
├── best_mm_model.py --- Script for the fusion micro-architecture search for our best-found multimodal models
└── search_algo.py --- Main script for Harmonic-NAS search
The following table provides a list of the employed backbones and supernets with their weights:
Dataset | Modality | Baseline Model Architecture | Max subnet Accuracy | Pretrained weights |
---|---|---|---|---|
AV-MNIST | Image | ofa_mbv3_d234_e346_k357_w1.0 | TOP1-Acc: 86.44% | Link |
AV-MNIST | Audio | ofa_mbv3_d234_e346_k357_w1.0 | TOP1-Acc: 88.22% | Link |
MM-IMDB | Image | ofa_mbv3_d234_e346_k357_w1.2 | F1-W: 46.26% | Link |
MM-IMDB | Text | Maxout | F1-W: 61.21% | Link |
Memes_Politics | Image | ofa_mbv3_d234_e346_k357_w1.0 | TOP1-Acc: 84.78% | Link |
Memes_Politics | Text | Maxout | TOP1-Acc: 83.38% | Link |
Download the AV-MNIST dataset by following the instructions provided in SMIL, or download it directly from Here.
Download the multimodal_imdb.hdf5 file from the original repo of MM-IMDB using the Link.
Use the pre-processing script to split the dataset.
$ python data/mmimdb/prepare_mmimdb.py
Download the following files for the meme images and annotations:
Harm-P: Link
Entity features: Link
ROI features: Link
To download the required vocabulary file:
$ wget https://openaipublic.azureedge.net/clip/bpe_simple_vocab_16e6.txt.gz -O bpe_simple_vocab_16e6.txt.gz
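Optionally, a quick sanity check (a minimal Python sketch, not part of the Harmonic-NAS pipeline) that the downloaded BPE vocabulary file decompresses correctly:

```python
import gzip

# The vocabulary is a gzip-compressed text file of BPE merges.
with gzip.open("bpe_simple_vocab_16e6.txt.gz", "rt", encoding="utf-8") as f:
    num_lines = sum(1 for _ in f)
print(f"Read {num_lines} lines from the BPE vocabulary file")
```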
In Harmonic-NAS, we conducted experiments within a distributed environment (i.e., clusters of GPUs). To replicate these experiments, follow these steps:
1. Modify the configuration file located in ./configs to match your custom settings.
2. Run the following command to initiate the Harmonic-NAS search:
$ python search_algo_DATASET.py
To reproduce the results achieved by our top-performing multimodal models without undergoing the entire Harmonic-NAS search process, simply specify the desired backbone architectures and the fusion macro-architecture (as detailed in Best Models Configuration) within the following script:
$ python best_mm_model_DATASET.py
The following tables give the architectural configurations of our top-performing multimodal models and their efficiency on the NVIDIA Jetson TX2, as described in our paper Harmonic-NAS: Hardware-Aware Multimodal Neural Architecture Search on Resource-constrained Devices. In the backbone columns, K, E, and D denote the kernel-size, expansion-ratio, and depth settings of the selected subnets (following once-for-all's subnet encoding); Cells and Nodes describe the fusion macro-architecture; Latency and Energy report the measured inference latency and energy consumption.
Best multimodal models on AV-MNIST (image + audio):

Image Acc | Image K | Image E | Image D | Audio Acc | Audio K | Audio E | Audio D | Fusion Cells | Fusion Nodes | Multimodal Acc | Latency | Energy |
---|---|---|---|---|---|---|---|---|---|---|---|---|
79.77 | [5,5,5,5] | [3,3,4,3] | [2] | 85.55 | [3,3,7,3] | [3,3,3,6] | [2] | 2 | 1 | 92.88 | 8.96 | 13.93
77.55 | [3,5,7,3] | [3,3,3,6] | [2] | 85.77 | [3,5,5,5] | [3,3,3,3] | [2] | 3 | 4 | 95.55 | 14.41 | 25.49
82.66 | [5,5,5,7] | [3,6,4,3] | [2] | 85.55 | [3,3,7,5] | [3,3,3,6] | [2] | 2 | 1 | 95.33 | 9.11 | 13.88
Best multimodal models on MM-IMDB (image + text):

Image F1-W | Image K | Image E | Image D | Text F1-W | Text Maxout Config | Fusion Cells | Fusion Nodes | Multimodal F1-W | Latency | Energy |
---|---|---|---|---|---|---|---|---|---|---|
44.69 | [3,3,5,7,3,7,7,5,7,7,7,7,5,3,3,5,5,5,3,5] | [3,3,6,6,4,4,4,3,3,4,6,6,4,3,6,3,6,4,3,3] | [2,2,3,2,2] | 61.18 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 2 | 1 | 63.61 | 21.37 | 113.99
45.22 | [5,5,5,3,7,7,7,3,7,7,5,7,5,3,5,7,7,5,7,5] | [6,4,4,3,4,4,3,6,4,3,3,4,6,3,4,3,6,4,4,6] | [4,2,3,2,3] | 61.18 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 1 | 1 | 64.36 | 28.68 | 163.04
44.96 | [3,3,3,5,5,7,5,3,3,5,7,7,5,3,3,5,7,5,5,5] | [4,3,3,4,6,4,3,3,6,4,3,3,4,4,6,6,6,4,4,6] | [2,2,3,2,3] | 61.18 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 1 | 1 | 64.27 | 23.67 | 121.75
Best multimodal models on Memes_Politics (image + text):

Image Acc | Image K | Image E | Image D | Text Acc | Text Maxout Config | Fusion Cells | Fusion Nodes | Multimodal Acc | Latency | Energy |
---|---|---|---|---|---|---|---|---|---|---|
86.19 | [3,3,3,3,3,5,3,3,3,7,3,5] | [4,3,4,6,6,6,3,6,3,6,6,6] | [2,2,2] | 83.38 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 1 | 2 | 88.45 | 10.51 | 25.63
85.91 | [3,3,3,3,5,3,3,3,5,5,3,5] | [4,3,4,6,4,4,3,6,6,6,3,4] | [2,3,2] | 83.38 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 2 | 3 | 90.42 | 12.47 | 31.92
85.91 | [3,3,3,7,5,5,3,3,7,7,3,3] | [4,4,3,4,6,3,4,3,4,6,3,6] | [2,2,2] | 83.38 | hidden_features: 128, n_blocks: 2, factor_multiplier: 2 | 2 | 2 | 90.14 | 11.11 | 26.63
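The K, E, and D lists above follow the once-for-all subnet encoding (per-layer kernel sizes, per-layer expansion ratios, and per-block depths); their lengths depend on the supernet used for each dataset. As an illustrative sketch only, assuming the public once-for-all model zoo API (ofa_net, set_active_subnet, get_active_subnet) rather than the weights shipped in saved_supernets/, the image backbone from the first MM-IMDB row could be extracted as a standalone model like this:

```python
from ofa.model_zoo import ofa_net

# Load the width-1.2 MobileNetV3 supernet from the once-for-all model zoo
# (illustration only; Harmonic-NAS loads its own supernets from saved_supernets/).
supernet = ofa_net("ofa_mbv3_d234_e346_k357_w1.2", pretrained=True)

# K and E are per-layer (5 stages x 4 layers = 20 entries), D is per-stage (5 entries).
supernet.set_active_subnet(
    ks=[3, 3, 5, 7, 3, 7, 7, 5, 7, 7, 7, 7, 5, 3, 3, 5, 5, 5, 3, 5],
    e=[3, 3, 6, 6, 4, 4, 4, 3, 3, 4, 6, 6, 4, 3, 6, 3, 6, 4, 3, 3],
    d=[2, 2, 3, 2, 2],
)
subnet = supernet.get_active_subnet(preserve_weight=True)  # standalone PyTorch model
```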
To visualize our multimodal models, we employ the BM-NAS plotter tool. You can visualize the found fusion architectures by setting plot_arch=True when calling train_darts_model().
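A minimal sketch of enabling the plot (the remaining arguments of train_darts_model() are elided here because they depend on your dataset and search configuration; check the fusion search code under fusion_search/ for the actual signature):

```python
# Only the plot_arch flag below is documented in this README; the other
# arguments are placeholders to be filled in from your own setup.
train_darts_model(
    ...,             # dataloaders, backbone features, and search hyperparameters
    plot_arch=True,  # also plot the fusion architecture found by the search
)
```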
If you find this implementation helpful, please consider citing our work:
@inproceedings{ghebriout2024harmonic,
title={Harmonic-NAS: Hardware-Aware Multimodal Neural Architecture Search on Resource-constrained Devices},
author={Ghebriout, Mohamed Imed Eddine and Bouzidi, Halima and Niar, Smail and Ouarnoughi, Hamza},
booktitle={Asian Conference on Machine Learning},
pages={374--389},
year={2024},
organization={PMLR}
}