An open-source framework for training tool-aware LLM agents. Train practical LLM agents by identifying tool failures and learning from effective actions through DW (Discrepancy-aware Workflow generation) and IM (Instruct-Masking tuning).
Compositional Visual Reasoning (CVR), vital for human-like visual understanding in diverse applications, faces significant challenges due to task complexity and potential tool failures in agent frameworks. Conventional approaches using frozen large language models (LLMs) lack tool awareness, leading to suboptimal performance, especially when tools produce errors. To address this, we propose DWIM, a novel framework that enhances tool-aware visual reasoning through discrepancy-aware workflow generation, which identifies and mitigates tool failures by refining actions (see §DW). Additionally, DWIM introduces an instruct-masking fine-tuning method that trains agentic LLMs on workflow-based data to improve tool utilization and reasoning efficiency (see §IM). Our experiments, detailed in §Experiments, demonstrate state-of-the-art performance across multiple visual reasoning datasets, highlighting DWIM's robust generalization and reduced reliance on manual prompt engineering.
In Discrepancy-aware Workflow Generation (DW), we address the challenge of inaccurate tool outputs in agent frameworks. When tool execution feedback is potentially unreliable, conditioning on the ground-truth label enables the LLM to identify discrepancies between environmental feedback and the expected outcome, such as factual errors caused by tool failures. Upon detecting such a conflict, the LLM generates a Rethink action, which includes a natural-language description of the discrepancy and proposes an alternative action as the next step. This approach enhances the robustness and adaptability of agent workflows, as demonstrated in our experiments (see §Experiments).
In the data collection process, standard methods struggle to produce workflows that are both logically feasible and practically executable: tools may return erroneous outputs, yielding workflows that are theoretically sound but fail in practice. Our approach instead has the LLM compare the actual feedback at each step against the ground-truth label and, upon detecting a discrepancy, generate a Rethink action that articulates the issue and proposes a revised action.
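To make this concrete, below is a minimal sketch of a discrepancy-aware rollout. The `llm` interface (`propose_action`, `check_discrepancy`, `revise_action`, `is_final`) and `execute_tool` are hypothetical names chosen for illustration, not DWIM's released API.

```python
# Minimal sketch of discrepancy-aware workflow generation.
# `llm` and `execute_tool` are hypothetical interfaces, not DWIM's actual API.

def generate_workflow(llm, execute_tool, task, ground_truth, max_steps=10):
    """Roll out a workflow, inserting a Rethink action whenever tool
    feedback conflicts with the expected (ground-truth) outcome."""
    workflow = []
    for _ in range(max_steps):
        action = llm.propose_action(task, workflow)
        feedback = execute_tool(action)
        # Conditioned on the label, the LLM flags discrepancies between
        # the environmental feedback and the expected outcome.
        discrepancy = llm.check_discrepancy(task, feedback, ground_truth)
        if discrepancy:
            # Rethink: describe the conflict in natural language and
            # propose an alternative action as the next step.
            workflow.append({
                "type": "Rethink",
                "description": discrepancy,
                "next_action": llm.revise_action(action, discrepancy),
            })
        else:
            workflow.append({"type": "Action",
                             "action": action,
                             "feedback": feedback})
            if llm.is_final(feedback, ground_truth):
                break
    return workflow
```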
Action Flagging employs a rule-based approach to evaluate environment feedback and detect erroneous tool usage, such as agent-generated Python code that fails to execute. It also identifies actions that are executable but ineffective for solving the task, as well as logically feasible actions that fail due to tool limitations. All of these are classified as ineffective actions, while the rest are labeled effective. This distinction facilitates subsequent training by masking only effective actions.
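The following sketch illustrates this kind of rule-based flagging; the specific rules and the workflow data layout are simplified assumptions for exposition, not the paper's exact criteria.

```python
# Illustrative rule-based action flagging. The rule set and the
# workflow record layout are simplified assumptions, not DWIM's code.

def flag_actions(workflow):
    """Label each step 'effective' or 'ineffective' from its feedback."""
    flags = []
    for i, step in enumerate(workflow):
        feedback = step.get("feedback", {})
        # An action followed by a Rethink was logically feasible but
        # failed in practice (e.g., due to tool limitations).
        followed_by_rethink = (i + 1 < len(workflow)
                               and workflow[i + 1].get("type") == "Rethink")
        if feedback.get("error"):                        # code failed to execute
            flags.append("ineffective")
        elif feedback.get("result") in (None, "", []):   # ran, but output is unusable
            flags.append("ineffective")
        elif followed_by_rethink:
            flags.append("ineffective")
        else:
            flags.append("effective")
    return flags
```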
Subsequently, each effective action is masked and the LLM is instructed to reproduce it iteratively. The loss is computed between the original action and the LLM-generated one, allowing the model to focus on learning effective tool usage and planning strategies rather than memorizing noisy workflows. In contrast to BERT's token-level masking, we mask actions at the semantic level, treating each action as a whole unit. The instruct-style approach, inspired by instruction tuning, enhances the model's understanding of workflow dynamics and incorporates the concept of an end-of-sequence token.
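For intuition, here is a minimal sketch of how action-level masking could be realized in a Hugging Face-style causal-LM pipeline, assuming the token spans of effective actions have already been located; only those spans receive supervision.

```python
import torch

IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss

def build_labels(input_ids, effective_action_spans):
    """Supervise only the token spans of effective actions.

    input_ids: 1-D LongTensor holding the serialized workflow
        (instruction + actions + environment feedback).
    effective_action_spans: list of (start, end) token ranges covering
        the effective actions (end exclusive).
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in effective_action_spans:
        # The whole action span (a semantic unit, not individual tokens
        # as in BERT) is reproduced by the model and scored.
        labels[start:end] = input_ids[start:end]
    return labels
```

Built this way, ineffective actions and raw tool feedback remain visible in the context but contribute nothing to the gradient.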
We use * to highlight the second-highest score. For all E2E methods, we present results in grey: they serve as reference points but are not compositional visual reasoning methods and are not intended as direct comparison targets. "Shots" denotes the number of in-context learning examples provided.
We use the checkpoint trained on the GQA training set to test models on other datasets. Results for all E2E methods and frozen agentic LLMs are shown in grey as reference points; these models are neither compositional nor trained and are not intended for direct comparison.
In this experiment, we train and evaluate all compositional methods on two tasks using a complete tool library rather than a task-specific one, in order to examine their dependence on hand-designed tool libraries.
In this study, we investigate the impact of discrepancy-aware workflow generation on data utilization. Data utilization refers to the proportion of training data points whose generated workflows yield correct results and are therefore suitable for training.
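In symbols (a paraphrase of the definition above, not notation from the paper): data utilization = |D_usable| / |D_train|, where D_train is the full training set and D_usable is the subset of points whose generated workflow reaches the correct final result.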
We examine the effectiveness of the instruct-masking fine-tuning method compared to standard SFT. Instruct-masking exhibits a clear advantage over both Random-Masking and Masking-W-Rethink: Random-Masking indiscriminately masks actions rather than selectively targeting effective actions within the workflow, while Masking-W-Rethink masks both effective actions and the ineffective actions that trigger "Rethink".
We apply DWIM to various open-source models (e.g., Llama-3.1-8B, Mistral-v0.2/0.3-7B, and three Qwen2.5 variants) and compare their performance to frozen LLMs with 10-shot in-context learning.
@article{ke2025dwim,
title={{DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation \& Instruct-Masking Tuning}},
author={Ke, Fucai and Leng, Xingjian and Cai, Zhixi and Khan, Zaid and Wang, Weiqing and Haghighi, Pari Delir and Rezatofighi, Hamid and Chandraker, Manmohan and others},
year={2025},
journal={arXiv preprint arXiv:2503.19263},
}