A research-driven project that generates accessibility trees for macOS applications from screenshots using computer vision and deep learning. Read more about the project in our paper.
- macOS
- Python (recommended ≥ 3.11)
- Conda
- Pip
Create and activate the project environment, then install the dependencies:
conda create -n screen2ax python=3.11
conda activate screen2ax
pip install -r requirements.txt
⚠️ The first run may take longer due to model downloads and initial setup.
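Optionally, you can sanity-check the environment before the first run. The snippet below is only a minimal sketch: the exact dependency names come from requirements.txt, and ultralytics/transformers are assumed here solely because the project uses YOLO and BLIP models.

# check_env.py — optional sanity check (package names are assumptions; adjust to requirements.txt)
import sys

assert sys.version_info >= (3, 11), "Python >= 3.11 is recommended"

try:
    import ultralytics   # assumed: YOLO models for UI element / group detection
    import transformers  # assumed: BLIP model for UI element captioning
except ImportError as exc:
    raise SystemExit(f"Missing dependency '{exc.name}'. Did you run `pip install -r requirements.txt`?")

print("Environment looks good: Python", sys.version.split()[0])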
Run the accessibility generation script:
python -m hierarchy_dl.hierarchy --help
usage: hierarchy.py [-h] [--image IMAGE] [--save] [--filename FILENAME] [--save_dir SAVE_DIR] [--flat]
options:
-h, --help show this help message and exit
--image IMAGE Path to the image
--save Save the result
--filename FILENAME Filename to save the result
--save_dir SAVE_DIR Directory to save the result. Default is './results/'
--flat Generate flat hierarchy (no groups)
Run the accessibility generation script on a screenshot of the Spotify app:
python -m hierarchy_dl.hierarchy --image ./screenshots/spotify.png --save --filename spotify.json
This will generate a JSON file with the accessibility hierarchy of the app in the results folder (./results/ by default).
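Once the file is written, you can inspect it like any other JSON document. The snippet below is only a sketch: the field names (role, description, children) are assumptions for illustration, not the documented schema — open the generated file to see the actual keys.

import json

# Load the hierarchy generated above (default save_dir plus the --filename we passed)
with open("./results/spotify.json") as f:
    hierarchy = json.load(f)

def walk(node, depth=0):
    # "role", "description" and "children" are assumed field names; check the
    # generated JSON for the real schema before relying on them.
    print("  " * depth + f"{node.get('role', '?')}: {node.get('description', '')}")
    for child in node.get("children", []):
        walk(child, depth + 1)

# The top level may be a single root node or a list of nodes.
for root in hierarchy if isinstance(hierarchy, list) else [hierarchy]:
    walk(root)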
Run the screen reader:
python -m screen_reader.screen_reader --help
usage: screen_reader.py [-h] [-b BUNDLE_ID] [-n NAME] [-dw] [-dh] [-r RATE] [-v VOICE] [-sa] [-sk SKIP_GROUPS]
options:
-h, --help show this help message and exit
-b, --bundle_id BUNDLE_ID Bundle ID of the target application
-n, --name NAME Name of the target application (alternative to bundle_id)
-dw, --deactivate_welcome Skip the "Welcome to the ScreenReader." message
-dh, --deactivate_help Skip reading the help message on startup
-r, --rate RATE Set speech rate for macOS `say` command (default: 190)
-v, --voice VOICE Set voice for macOS `say` command (see `say -v "?" | grep en`)
-sa, --system_accessibility Use macOS system accessibility data instead of vision-generated
-sk, --skip-groups N Skip groups with fewer than N children (default: 5)
Run the screen reader for the Spotify app:
python -m screen_reader.screen_reader --name Spotify
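If you prefer to drive the screen reader from Python (for example, from a test harness), it can be launched through its CLI. This is just a convenience sketch that reuses the flags documented above; it makes no claim about a public Python API.

import subprocess
import sys

# Launch the screen reader for Spotify, skipping the welcome message and
# raising the speech rate. Only flags listed in --help are used.
subprocess.run(
    [
        sys.executable, "-m", "screen_reader.screen_reader",
        "--name", "Spotify",
        "-dw",            # skip the "Welcome to the ScreenReader." message
        "--rate", "220",  # speech rate for the macOS `say` command
    ],
    check=True,
)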
The YOLO models used for UI elements and UI groups detection are licensed under the GNU Affero General Public License (AGPL). This is inherited from the original YOLO model licensing.
The BLIP model for captioning UI elements is provided under the MIT License.
All datasets (Screen2AX-Tree, Screen2AX-Element, Screen2AX-Group, Screen2AX-Task) are released under the Apache 2.0 license.
All source code in this repository is licensed under the MIT License. See the LICENSE file for full terms and conditions.
If you use this code in your research, please cite our paper:
@misc{muryn2025screen2axvisionbasedapproachautomatic,
title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation},
author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
year={2025},
eprint={2507.16704},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.16704},
}
We would like to express our deepest gratitude to the Armed Forces of Ukraine. Your courage and unwavering defense of our country make it possible for us to live, work, and create in freedom. This work would not be possible without your sacrifice. Thank you.
Visit our site to learn more.