PyWinAssistant: An artificial assistant – MIT Licensed | Public Release: December 31, 2023 | Complies with federal AI standards for Complex Adaptive Systems, the Asilomar AI Principles, and the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.
PyWinAssistant is the first open-source Artificial Narrow Intelligence (ANI) to elicit spatial reasoning and perception as a generalist agentic Computer-Using-Agent framework that fully operates graphical user interfaces (GUIs) on Windows 10/11 through direct, OS-native semantic interaction. It functions as a Computer-Using-Agent / Large-Action-Model and forms the foundation of a purely symbolic spatial-cognition framework: a computer can be operated with natural language alone, without relying on computer vision, OCR, or pixel-level imaging. PyWinAssistant emulates, plans, and simulates synthetic Human-Interface-Device (HID) interactions through native Windows Accessibility APIs, eliciting human-like abstraction across geometric, hierarchical, and temporal dimensions at the operating-system level. This OS-integrated approach to simulating spatial use of a computer provides a future-proof, generalized, modular, and dynamic ANI orchestration framework for multi-agent automation, marking an important step in symbolic reasoning towards AGI.
Key Features:
- No Mandatory Imaging Pipeline: Operates exclusively through Windows UI Automation (UIA) and programmatic GUI semantics, enabling universal workflow orchestration.
- Symbolic Spatial Mapping: Hierarchical element tracking via OS-native parent/child relationships and coordinate systems.
- Non-Visual Perception: Real-time interface understanding through direct metadata extraction (control types, states, positions); a minimal sketch follows this list.
- Visual Perception (optional): A single screenshot can elicit comprehension and perception with attention to detail, visualizing goal intent and environment changes in spatial terms over time; it can be fine-tuned to look for visual cues, bugs, causal-reasoning errors, static, semantic-grounding issues, and corruption.
- Unified Automation: Automatic element detection. Combines GUI, system, and web automation under one Python API. Eliminates context-switching between tools.
- AI-Powered Script Generation: Translates natural language or demonstrations into any kind of code, inside any IDE or text-editing area.
- Self-Healing Workflows: Auto-adjusts to UI changes (e.g., element ID shifts), reducing maintenance overhead and making PyWinAssistant's algorithm future-proof.
- AI/ML Integration: Uses NLP to generate scripts (e.g., “Automate Application” → a JSON plan of test-execution steps) with self-correcting selectors.
- Cross-Context Automation: Seamlessly combines GUI, web, and API workflows in a Pythonic way, unifying disjointed automation methods into a single framework.
- Accessibility: Enhances accessibility for users with different needs, enabling voice or simple text commands to control complex actions.
- Generalization: Elicits spatial cognition to understand and execute a wide range of commands in a natural, intuitive manner.
- Small and Compact: PyWinAssistant serves as an example of a modular, generalized computer-assistant framework that elicits spatial cognition.
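The non-visual perception described above can be pictured in a few lines. The sketch below walks the OS-native UIA hierarchy and prints each control's type, name, and position, with no screenshots involved; it uses the third-party uiautomation pip package as an assumption, since PyWinAssistant's own traversal lives in its core modules.
# Minimal sketch of non-visual perception via the UIA tree (assumes the
# `uiautomation` pip package; not PyWinAssistant's internal code).
import uiautomation as auto

def dump_tree(control, depth=0, max_depth=3):
    # Recursively print control type, name, and geometry from OS metadata.
    if depth > max_depth:
        return
    rect = control.BoundingRectangle  # coordinates without any imaging
    print("  " * depth + f"{control.ControlTypeName} '{control.Name}' at ({rect.left}, {rect.top})")
    for child in control.GetChildren():  # OS-native parent/child links
        dump_tree(child, depth + 1, max_depth)

dump_tree(auto.GetRootControl())  # walk the desktop's UIA hierarchy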
PyWinAssistant ships with its own set of reasoning agents, using Visualization-of-Thought (VoT) and Chain-of-Thought (CoT) prompting to enhance generalization, and dynamically simulates actions over abstract GUI semantic dimensions rather than through visual processing, making it future-proof for next-generation LLMs. By visualizing interface contents to dynamically plan and simulate actions over abstract GUI semantic dimensions, concepts, and differentials, PyWinAssistant redefines computer-vision automation, enabling high-efficiency visual processing at a fraction of traditional computational cost. It achieves real-time spatial perception at the operating-system level, allowing memorization of visual cues and tracking of on-screen changes over time.
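To illustrate the VoT idea, here is a hedged sketch of a Visualization-of-Thought style prompt: the model is asked to "draw" the interface state as text before reasoning step by step about an action. The prompt wording and element list are illustrative assumptions, not the shipped prompts; api_call mirrors the helper shown later in this README.
# Hedged sketch of a VoT/CoT prompt (illustrative wording, not the real one).
vot_messages = [
    {"role": "system", "content":
        "Visualize the current window as a text grid of its UIA elements, "
        "then reason step by step about which element advances the goal."},
    {"role": "user", "content":
        "Goal: Pause the music on Spotify.\n"
        "Elements: [Play/Pause @ (640, 980), Next @ (700, 980), Like @ (580, 980)]"},
]
# reply = api_call(vot_messages, model_name="gpt-4-1106-preview", max_tokens=200, temperature=0.0)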
Released before key breakthroughs in AI for Spatial Reasoning, it predates:
- Microsoft’s Visualization-of-Thought research paper (April 4, 2024)
- Anthropic Claude’s Computer-Use Agent (October 22, 2024)
- OpenAI ChatGPT’s Operator Computer-Using Agent (CUA) (January 23, 2025)
PyWinAssistant represents a major paradigm shift in AI and automation by pioneering pure symbolic computer interaction, bridging human intent with GUI automation at the OS level, through these breakthroughs:
- First Agent to bypass OCR/imaging for Computer-Using-Agent GUI automation.
- First Framework using Windows UIA as the primary spatial perception channel.
- First System demonstrating OS-native hierarchical-temporal reasoning.
Traditional Approach:
Automation tools require scripting (e.g., AutoHotkey) or API integration (e.g., Selenium).
PyWinAssistant Breakthrough:
# True generalization for natural language directly driving UI actions
assistant("Play Daft Punk on Spotify and email the lyrics to my friend")
# The agent chooses a fitting item according to context to satisfy the user's intent.
Mechanism: Combines UIAutomation’s GUI control detection with LLMs (a sketch follows this list) to:
- Parse intent ("play", "email lyrics")
- Map to UI elements (Spotify play button, Outlook compose window)
- Generate adaptive workflows
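A minimal sketch of the parse step above, assuming the OpenAI Python client; parse_intent and the JSON step fields are hypothetical names for illustration, not PyWinAssistant's actual API.
# Hypothetical sketch of the "parse intent" stage described above.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def parse_intent(goal: str) -> list:
    # Ask the LLM to decompose a natural-language goal into ordered steps.
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content":
                "Decompose the user's goal into a JSON array of steps, "
                'each shaped like {"app": ..., "action": ..., "target": ...}. '
                "Reply with JSON only."},
            {"role": "user", "content": goal},
        ],
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)  # assumes a bare JSON reply

steps = parse_intent("Play Daft Punk on Spotify and email the lyrics to my friend")
# Each step is then grounded against the UIA tree of the named app.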
PyWinAssistant Innovation: Eliminates the need for:
- Predefined API integrations
- XPath/CSS selector knowledge
- Manual error handling
Traditional Limitation:
Tools operate in app silos (e.g., Power Automate connectors).
PyWinAssistant Innovation:
# Notes:
# Full step-plan generation by the Assistant works flawlessly, but the in-step modifier and memory-content retrieval were purposely disabled and commented out in the code (see [def act()](https://github.com/a-real-ai/pywinassistant/blob/6aae4e514a0dc661f7ed640181663f483972bc1e/core/driver.py#L648C1-L648C8))
# to comply with federal AI standards for Complex Adaptive Systems, the Asilomar AI Principles, and the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.
# Accurately maintains context and intent across apps using the UIA tree and spatial memory (examples for further development):
assistant("Find the best and cheapest flight to Mexico, also look for local hotels, and suggest the best cultural options in new tabs")
assistant("Look for various pizza coupons for anything but pineapple, fill in the details to order, and show me the results")
# PyWinAssistant is highly modular (example):
def workflow():
    song = assistant(goal="get the current track")  # UIA
    write_action(f"Review '{song}': Great bassline!", app="Notepad")  # Win32
    assistant(goal="Post on twitter the written text from notepad")  # Web
# The previous set of actions can also be executed with plain natural language:
assistant("Get the current song playing, and in Notepad put the title as 'Review <song name>: Great bassline', write about why it is a great bassline, then post it on twitter", assistant_identity="You're an expert music critic")
Key Advancements:
- Unified Control Graph: Treats all apps as nodes in a single UIA-accessible graph
- State Transfer: Passes data between apps via clipboard/UIA properties (see the sketch after this list)
- Semantic Transfer: Passes the semantics of the goal intent across all steps
- Error Recovery: Uses agentic reasoning systems to avoid failing actions
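As a hedged illustration of the state-transfer channel named in this list, the lines below stage a value on the clipboard between apps; pyperclip is an assumption here, as the project may use Win32 clipboard APIs directly.
# Illustrative state transfer via the clipboard (pyperclip is an assumption).
import pyperclip

track = "Robot Rock - Daft Punk"  # value read from the source app via UIA
pyperclip.copy(track)             # stage it for the target app
# ...focus the target's compose field via UIA, then send Ctrl+V to paste...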
Traditional Model:
Deterministic scripts fail on UI changes.
PyWinAssistant’s Solution:
# Adaptive element discovery
def fast_action(goal):
    speaker("Clicking onto the element without visioning context. No imaging is required.")
    analyzed_ui = analyze_app(application=ai_choosen_app, additional_search_options=generated_keywords)
    gen_coordinates = [
        {"role": "assistant",
         "content": "You are an AI Windows Mouse Agent that can interact with the mouse. Only respond with the "
                    "predicted coordinates of the mouse click position to the center of the element object "
                    "\"x=, y=\" to achieve the goal."},
        {"role": "system", "content": f"Goal: {single_step}\n\nContext: {original_goal}\n{analyzed_ui}"}]
    coordinates = api_call(gen_coordinates, model_name="gpt-4-1106-preview", max_tokens=100, temperature=0.0)
    print(f"AI decision coordinates: '{coordinates}'")
Revolutionary Features:
- Semantic Search by Thinking: e.g. synonyms("download") → ["save", "export", "↓ icon"] (a sketch follows this list)
- Spatial Probability: Prioritizes elements by using sets of self-reasoning agents for synthetic execution of actions
- Spatial Prevention: Senses and prevents bad actions or misaligned step execution by using sets of self-reasoning agents
- Self-Healing: Automatically chooses a plan that executes without failing its step reasoning, by using sets of self-reasoning agents
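The semantic-search item above can be pictured with a short sketch that expands a target term into UI synonyms via an LLM call. It reuses the api_call helper from the fast_action excerpt; the prompt wording is an assumption.
# Sketch of "semantic search by thinking" (prompt wording is illustrative).
def synonyms(term: str) -> list:
    messages = [
        {"role": "system", "content":
            "List short UI synonyms or icon hints for the given action, "
            "comma-separated, nothing else."},
        {"role": "user", "content": term},
    ]
    reply = api_call(messages, model_name="gpt-4-1106-preview", max_tokens=50, temperature=0.0)
    return [s.strip() for s in reply.split(",")]

print(synonyms("download"))  # e.g. ['save', 'export', 'down-arrow icon']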
Task: Automate saving a song in the Spotify GUI.
Before:
Automation required:
WinWait, Spotify
ControlClick, x=152 y=311 # Fragile coordinates
Now: Only 1 natural language command.
assistant("Like this song") # Language-first
Shift Metrics:

| | Traditional Tools | PyWinAssistant |
|---|---|---|
| Learning Curve | Days, even months | Minutes |
| Cross-App Workflows | Manual integration | Automatic |
| Maintenance Overhead | High | LLM auto-patching |
- From Scripts to Intent: Replaces brittle click(x, y) calls with human-like "understand → act" cycles.
- From Silos to OS as API: Treats the entire Windows environment as a programmable interface.
- From Fixed to Adaptive: Leverages LLMs to handle UI changes (e.g., Spotify’s 2023 UI overhaul).
- From Developers to Everyone: Makes advanced automation accessible through natural language, improving generality while minimizing the overall data usage of LLM and vision models. Built-in assistance options improve human use of a computer, taking a new technical approach to UI/UX assistance and testing through spatial visualization of thought; the system generalizes any natural-language prompt correctly and plans the corresponding OS actions with security in mind.
By interfacing directly with Windows' underlying UI hierarchy, PyWinAssistant achieves real-time spatial perception at the OS level while eliminating traditional computer-vision pipelines, enabling:
- 100x Efficiency Gains: Native API access.
- Blind Operation: Can function on headless systems, virtual machines, or minimized windows.
- Precision Abstraction: Mathematical modeling of GUI relationships rather than visual pattern matching (a toy illustration follows).
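As a toy illustration of that precision abstraction, the snippet below reasons about two controls purely from UIA-style bounding rectangles, with no pixels involved; the rectangle values are made up.
# Toy geometric reasoning over UIA bounding boxes (made-up coordinates).
def center(rect):
    return (rect["x"] + rect["width"] / 2, rect["y"] + rect["height"] / 2)

play = {"x": 120, "y": 240, "width": 80, "height": 30}
like = {"x": 220, "y": 240, "width": 30, "height": 30}
px, _ = center(play)
lx, _ = center(like)
print("Like button is", "right of" if lx > px else "left of", "Play")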
Image-Free by Design (Core Architecture)
While some projects require visual processing for fundamental operation, PyWinAssistant achieves complete GUI interaction capability without an imaging pipeline through:
- Native OS Semantic Access: Direct Windows UIA API integration provides full control metadata:

# Example of element properties via UIA - no screenshots needed
button = uia.Element.find(Name="Submit", ControlType="Button")
print(button.BoundingRectangle)  # {x: 120, y: 240, width: 80, height: 30}
- Imaging Module (optional): Pixel-level visualization can be enabled as real-time spatial perception, with memorization of visual cues and tracking of on-screen changes over time.
  + Capable of planning successful sets of highly technical steps to operate a computer at the OS level from a single screenshot.
  + Pixel-level visualization.
  + Visual hash matching can be enabled for dynamic elements (a hedged sketch follows).
  - OCR fallback / object detection for non-UIA legacy apps: experimental OCR features were added but not fully developed, as the assistant currently works well without them.
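A hedged sketch of the visual-hash matching mentioned above, using Pillow and hashlib as assumptions rather than the project's exact routine: hash a screen region once, then diff later frames cheaply.
# Hash a screen region so later frames can be diffed cheaply (illustrative).
import hashlib
from PIL import ImageGrab

def region_hash(bbox):
    img = ImageGrab.grab(bbox=bbox)  # the single permitted screenshot
    return hashlib.md5(img.tobytes()).hexdigest()

before = region_hash((120, 240, 200, 270))
# ...perform an action, then compare region_hash(...) against `before`...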
| Key Differentiation | PyWinAssistant | Traditional Automation |
|---|---|---|
| Primary Perception | UIA metadata | Screenshots/OCR |
| Vision Dependency | Optional add-on | Required core |
| Headless Ready | ✅ Native | ❌ Requires virtual display |
PyWinAssistant is limited only by the underlying model's intelligence and inference time. New advancements in LLMs are required to reach a complete Artificial General Intelligence system managed by Artificial Narrow Intelligences. The system's autonomous task decomposition leverages native semantic differentials rather than visual changes; visual-change detection can optionally be activated for real-time image-corruption analysis of the GUI/screen. Long-term memory and self-learning mechanisms were designed to evolve symbolic state representations, which can also be rendered as visual patterns, aligning with AGI development.
Related paper: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models (April 4, 2024):
https://arxiv.org/abs/2404.03622
PyWinAssistant includes built-in assistant features designed to enhance human-computer interaction for all users: real-time voice recognition, customizable assistant personalities, subtitles, and chat functionality. Talk to your computer naturally to perform any user-interface activity, and use natural language to operate your Windows operating system freely. It generates and plans test cases for your UI applications, enabling continuous testing of any Win32 API-supported application through natural language alone. It is your own open and secure personal assistant: it responds the way you want, and you control how your computer assists you. It is engineered to be modular and to understand and execute a wide range of tasks, automating interactions with any desktop application.
Voice 1 - Input Human (English Female Australian TTS)
Voice 2 - Output Assistant (English Female US Google TTS)
Use your computer by natural language - Real-time usage of VoT, an example of a Computer-Using-Agent; Single Action Model.
Does not use any vision; only LLM API calls. Demonstrates flawless execution of multiple prompted actions.
Live.Usage.Example.mp4
Use your computer as an assistant - Real-time usage of planning VoT, an example of a Computer-Using-Agent; Large-Action-Model.
Takes only one screenshot: the assistant learns what the user is doing and what they want to achieve, then plans how to perform it.
Voice Recognized Prompt: Make a new post on twitter saying hello world and a brief greeting explaining you're an artificial intelligence.
Assistant.Live.Example.mp4
The assistant can do anything for you - Real-time usage of planning VoT, an example of a Computer-Using-Agent; Large-Action-Model.
Inference time is the only constraint on speed.
Voice Recognized Prompt: Create a new comment explaining why it is so important.
Context.Live.Example.mp4
November 16th 2023 live demo: (Firefox, Spotify, Notepad, Calculator, Mail)
assistant(goal=f"Open a new tab the song \'Wall Of Eyes - The Smile\', from google search results filter by videos then play it on Firefox") # Working 100%
assistant(goal=f"Pause the music on Spotify") # Working 100%
assistant(goal=f"Create a short greet text for the user using AI Automated Windows in notepad.exe") # Working 100%
assistant(goal=f"Open calc.exe and press 4 x 4 =") # Working 100%
November_2023_Demo.mp4
December 1st 2023 live demo: (Chrome, Spotify, Firefox) Example of programmable methods.
assistant(goal=f"Play the song \'Robot Rock - Daft Punk\' on Spotify", keep_in_mind=f"To start playback double click the song.") # Working 100%
assistant(goal=f"Open 3 new tabs on google chrome and in each of them search for 3 different types of funny AI Memes", keep_in_mind=" Filter the results by images.") # Working 100%
assistant(goal=f"Open a new tab the song \'Windows 95 but it's a PHAT hip hop beat\', from google search results filter by videos then play it by clicking on the text on Firefox.") # Working 100%
December.2023.Demo.Original.mp4
Currently supports all generalized Win32 API apps, e.g. Chrome, Firefox, OperaGX, Discord, Telegram, Spotify...
- Dynamic Case Generator: The assistant() function accepts a goal parameter, which is a natural language command, and intelligently maps it to a series of executable actions. This allows for a seamless translation of user intentions into effective actions on the computer.
- Single Action Execution: The act() function is a streamlined method for executing actions, enhancing the tool's efficiency and responsiveness.
- Advanced Context Handling: The framework is adept at understanding context by analyzing the screen and the application, ensuring that actions are carried out with an awareness of the necessary prerequisites or steps.
- Semantic Router Map: The framework maintains a database of semantic router maps to successfully execute generated test cases. These semantic maps can be created by other AIs (an illustrative plan shape follows this list).
- Wide Application Range: From multimedia control (like playing songs or pausing playback on Spotify and YouTube) to complex actions (like creating AI-generated text, sending emails, or managing applications like Telegram or Firefox), the framework covers a broad spectrum of tasks.
- Customizable AI Identity: The write_action() function allows for a customizable assistant identity, enabling personalized interactions and responses that align with the user's preferences or the nature of the task.
- Robust Error Handling and Feedback: The framework is designed to handle unexpected scenarios gracefully, providing clear feedback and ensuring reliability. (In Overview)
- Projects for Mood and Personality: Now and then generates or suggests useful scenarios based on your mood and personality. (In Overview)
- Natural Language Processing (NLP): Employs advanced NLP techniques to parse and understand user commands in a natural, conversational manner.
- Task Automation Algorithms: Utilizes sophisticated algorithms to break down complex tasks into executable steps.
- Context-Aware Execution: Integrates contextual awareness for more nuanced and effective task execution.
- Cross-Application Functionality: Seamlessly interfaces with various applications and web services, demonstrating extensive compatibility and integration capabilities.
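For the semantic router map and dynamic case generation above, the literal below shows one plausible shape of a generated JSON plan; the field names are hypothetical, not the project's schema.
# Illustrative shape of a generated JSON test plan (hypothetical fields).
plan = [
    {"step": 1, "app": "Spotify", "action": "focus_window"},
    {"step": 2, "app": "Spotify", "action": "click", "target": "Search"},
    {"step": 3, "app": "Spotify", "action": "type", "text": "Daft Punk"},
]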
Use Cases:
- Automating repetitive tasks in a Windows environment.
- Streamlining workflows for professionals and casual users alike.
- Enhancing accessibility for users with different needs, enabling voice or simple text commands to control complex actions.
- Assisting in learning and exploration by providing AI-driven guidance and execution of tasks.
This Artificially Assisted User Interface Testing framework is a pioneering tool in the realm of desktop automation. Its ability to understand and execute a wide range of commands in a natural, intuitive manner makes it an invaluable asset for anyone looking to enhance their productivity and interaction with their Windows environment. It's not just a tool; it's a step towards a future where AI seamlessly integrates into our daily computing tasks, making technology more accessible and user-friendly.
# Add your ChatGPT API keys to the project:
Add your API key in /core/core_api.py -> line 3: client = OpenAI(api_key='insert_your_api_key_here')
Add your API key in /core/core_imaging.py -> line 12: api_key = 'insert_your_api_key_here'
# Install requirements:
cd pywinassistant
pip install -r .\requirements.txt
# Execute the assistant:
cd .\core
python ./assistant.py
Run "Assistant.py", say "Ok computer" to enable the assistant by voice commands or click to it or enable the chat to do a fast action. Use Right click above the Assistant to see the available options for the assistant.
For debugging mode execute "Driver.py". Inside it, you can debug and try easily the functions of "act" which is used alongside the assistant, "fast_act" and "assistant" by using the examples. To run a JSON test case, modify the JSON path from the "assistant" function.
assistant(goal=f"Play the song \'One More Time - Daft Punk\' on Spotify") # Working 100%
assistant(goal=f"Open a new tab the song \'Wall Of Eyes - The Smile\', from google search results filter by videos then play it on Firefox") # Working 100%
assistant(goal=f"Open a new tab the song \'Windows XP Error beat\', from google search results filter by videos then play it by clicking on the text on Firefox.") # Working 100%
fast_act(goal=f"Click on the Like button") # Working 100%
assistant(goal=f"Pause the music on Spotify") # Working 100%
write_action(goal="Comment about why IA is great for the current playing song", assistant_identity="You\'re an advanced music AI agent that specializes on music") # Working 100%
assistant(f"Create a long AI essay about an AI Starting to control a Windows computer on Notepad") # Working 100%
fast_act(goal="Click on the button at the bottom in HueSync app") # Working 100%
write_action(goal="Weird Fishes - Radiohead") # Working 100%
assistant(f"Open Calc and press 4 x 4 - 4 * 4 + 1 =") # Working 100%
assistant(goal=f"Open 3 new tabs on google chrome and in each of them search for 3 different types of funny dogs", keep_in_mind=" Filter the results by images.") # Working 100%
assistant(goal=f"Stop the playback from Firefox app") # Working 100%
assistant(f"Send a list of steps to make a joke about engineers whilist making it an essay to my friend Diana in Telegram") # Working 100%
assistant(f"Send a list of steps to make a chocolate cake to my saved messages in Telegram") # Working 100%
assistant(f"Create three new tabs on Firefox, in each of them search 3 different types of funny youtube bad tutorial videos, generate the titles to search.") # Working 100%
assistant(f"Write an essay about an AI that a person created to use freely the computer, like you. Write it in notepad.exe") # Working 100%
assistant(f"Send an AI joke and say it's generated by an AI to my friend Diana on Discord") # Working 100%
assistant(goal=f"Create a short greet text for the user using AI Automated Windows in notepad.exe") # Working 100%
assistant(goal=f"Open calc.exe and press 4 x 4 =") # Working 100%
assistant(goal=f"Send a mail to \'[email protected]\' with the subject \'Hello\' and generate the message \'Generate a message about how an AI is helping everyone as an users\' on the Mail app",
keep_in_mind="Press \'Tab\' tree times to navigate to the subject area. Do not combine steps.") # Need to update the app semantic map to get it working 100%.
assistant(goal=f"Play the song \'The Smile - Wall Of Eyes\' on Spotify") # Working 100%
assistant(goal=f"Play the song \'Panda Bear - Tropic of cancer\' on Spotify") # Working 100%
assistant(goal="Pause the music on the Spotify app") # Working 100%
assistant(goal=f"Open 3 new tabs with different Daft Punk songs on each of them on Firefox") # Working 100%
fast_act("Open spotify and Search the album \'Grimes - Visions\'") # Working 100%
write_action("Open spotify and Search the album \'Grimes - Visions\'") # Working 100%
fast_act("Click on the first result on spotify") # Working 100%
fast_act("Skip to the next song on Spotify") # Working 100%
fast_act("Add the album to the library") # Working 100%
fast_act("Go to Home on Spotify") # Working 100%
fast_act("Save the song to my library on Spotify") # Working 100%
- Manual Testing:
In manual testing, a human tester performs a set of operations to check whether the application functions correctly and whether its graphical elements conform to the documented requirements. Manual testing has notable downsides: it can be time-consuming, and test coverage is extremely low. Additionally, the quality of testing in this approach depends on the knowledge and capabilities of the testing team.
- Record-and-Playback Testing:
Also known as record-and-replay testing, it is executed using automation tools. The automated UI testing tool records all tasks, actions, and interactions with the application. The recorded steps are then reproduced, executed, and compared with the expected behavior. For further testing, the replay phase can be repeated with various data sets.
- Model-Based Testing:
In this testing approach, we focus on building graphical models that describe the behavior of a system. This provides a deeper understanding of the system, which allows the tester to generate highly efficient test cases. In the models, we determine the inputs and outputs of the system, which are, in turn, used to run the tests. Model-based testing works as follows:
Create a model for the system
Determine system inputs
Verify the expected output
Execute tests
Check and validate system output vs. the expected output
The model-based approach is valuable because it allows a higher level of automation and covers a larger number of system states, thereby improving test coverage. A toy sketch of these steps follows.
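The sketch below walks the five steps above under stated assumptions: the calculator model and the execute callback are hypothetical stand-ins, not part of PyWinAssistant.
# Toy model-based test: model the system, define inputs, validate outputs.
model = {"inputs": [("4", "x", "4", "=")], "expected": "16"}

def run_model_based_test(execute):
    for keys in model["inputs"]:
        actual = execute(keys)              # step 4: execute the test
        assert actual == model["expected"]  # step 5: validate vs. expected output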
- Artificially Assisted User Interface Testing:
Artificially Assisted User Interface Testing harnesses the power of artificial intelligence to revolutionize the process of testing graphical user interfaces. Unlike traditional methods, Artificially Assisted User Interface Testing integrates machine learning algorithms and intelligent decision-making processes to autonomously identify, analyze, and interact with UI elements. This approach significantly enhances the depth and breadth of testing in several ways:
Dynamic Interaction with UI Elements: AI-driven tests can adapt to changes in the UI, such as modified button locations or altered element properties. This flexibility is achieved through the use of AI models trained to recognize and interact with various UI components, regardless of superficial changes.
Learning and Pattern Recognition: Utilizing machine learning, Artificially Assisted User Interface Testing systems can learn from previous interactions, test runs, and user feedback. This enables the AI to recognize patterns and predict potential issues, improving over time and offering more thorough testing with each iteration.
Automated Test Case Generation: The AI can generate test cases based on its understanding of the application's functionality and user behavior patterns. This not only saves time but also ensures that a wider range of scenarios is tested, including edge cases that might be overlooked in manual testing.
Natural Language Processing (NLP): AI Testing tools often incorporate NLP to interpret and execute tests written in plain language. This feature makes the testing process more accessible to non-technical stakeholders and facilitates better communication across the team.
Real-Time Feedback and Analytics: AI systems provide real-time insights into the testing process, identifying bugs, performance issues, and usability problems promptly. This immediate feedback loop enables quicker rectifications and enhances the overall quality of the product.
Predictive Analysis and Risk Assessment: By analyzing past data, Artificially Assisted User Interface Testing tools can predict potential problem areas and allocate testing resources more efficiently. This proactive approach to risk management ensures that critical issues are identified and addressed early in the development lifecycle.
In conclusion, Artificially Assisted User Interface Testing represents a significant leap forward in software quality assurance. By automating and enhancing the testing process, AI-driven tools offer improved accuracy, speed, and coverage, paving the way for more reliable and user-friendly applications.
This project is being updated as of early 2024. The list of requirements is also being updated.