A UI-Focused Agent for Windows OS Interaction.
UFO is a UI-Focused multi-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications.
UFO operates as a multi-agent framework, encompassing:
Both agents leverage the multi-modal capabilities of GPT-4V(o) to comprehend the application UI and fulfill the user’s request. For more details, please consult our technical report and documentation.
UFO sightings have garnered attention from various media outlets, including:
These sources provide insights into the evolving landscape of technology and the implications of UFO phenomena on various platforms.
UFO requires Python >= 3.10 running on Windows OS >= 10. It can be installed by running the following command:
# [optional to create conda environment]
# conda create -n ufo python=3.10
# conda activate ufo
# clone the repository
git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt
# If you want to use the Qwen as your LLMs, uncomment the related libs.
Before running UFO, you need to provide your LLM configurations individually for HostAgent and AppAgent. You can create your own config file ufo/config/config.yaml
, by copying the ufo/config/config.yaml.template
and editing config for HOST_AGENT and APP_AGENT as follows:
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "aoai" , # The API type, "aoai" for the Azure OpenAI.
API_BASE: "YOUR_ENDPOINT", # The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY", # The aoai API key
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API
You can also non-visial model (e.g., GPT-4) for each agent, by setting VISUAL_MODE: False
and proper API_MODEL
(openai) and API_DEPLOYMENT_ID
(aoai). You can also optionally set an backup LLM engine in the field of BACKUP_AGENT
if the above engines failed during the inference.
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml
file:
VISUAL_MODE: False # To enable non-visual mode.
API_MODEL
(OpenAI) and API_DEPLOYMENT_ID
(AOAI) for each agent.Optionally, you can set a backup language model (LLM) engine in the BACKUP_AGENT
field to handle cases where the primary engines fail during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.
UFO also supports other LLMs and advanced configurations, such as customize your own model, please check the documents for more details. Because of the limitations of model input, a lite version of the prompt is provided to allow users to experience it, which is configured in config_dev.yaml
.
If you want to enhance UFO’s ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the ufo/config/config.yaml
file.
We provide the following options for RAG to enhance UFO’s capabilities:
Consult their respective documentation for more information on how to configure these settings.
# assume you are in the cloned UFO folder
python -m ufo --task <your_task_name>
This will start the UFO process and you can interact with it through the command line interface.
If everything goes well, you will see the following message:
Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction.
_ _ _____ ___
| | | || ___| / _ \
| | | || |_ | | | |
| |_| || _| | |_| |
\___/ |_| \___/
Please enter your request to be completed🛸:
You can find the screenshots taken and request & response logs in the following folder:
./ufo/logs/<your_task_name>/
You may use them to debug, replay, or analyze the agent output.
Please consult the WindowsBench provided in Section A of the Appendix within our technical report. Here are some tips (and requirements) to aid in completing your request:
Our technical report paper can be found here. Note that previous AppAgent and ActAgent in the paper are renamed to HostAgent and AppAgent in the code base to better reflect their functions.
If you use UFO in your research, please cite our paper:
@article{ufo,
title={{UFO: A UI-Focused Agent for Windows OS Interaction}},
author={Zhang, Chaoyun and Li, Liqun and He, Shilin and Zhang, Xu and Qiao, Bo and Qin, Si and Ma, Minghua and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Zhang, Qi},
journal={arXiv preprint arXiv:2402.07939},
year={2024}
}
If you’re interested in data analytics agent frameworks, check out TaskWeaver, a code-first LLM agent framework designed for seamlessly planning and executing data analytics tasks.
For more information on GUI agents, refer to our survey paper: Large Language Model-Brained GUI Agents: A Survey. You can also explore the survey through:
By choosing to run the provided code, you acknowledge and agree to the following terms and conditions regarding the functionality and data handling practices in DISCLAIMER.md
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
Microsoft’s Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party’s policies.