An open-source framework for training tool-aware LLM agents. Train practical LLM agents by identifying tool failures and learning from effective actions through DW (Discrepancy-aware Workflow generation) and IM (Instruct-Masking tuning).
Compositional Visual Reasoning (CVR), vital for human-like visual understanding in diverse applications, faces significant challenges due to task complexity and potential tool failures in agent frameworks. Conventional approaches using frozen large language models (LLMs) lack tool awareness, leading to suboptimal performance, especially when tools produce errors. To address this, we propose DWIM, a novel framework that enhances tool-aware visual reasoning through discrepancy-aware workflow generation, which identifies and mitigates tool failures by refining actions (see §DW). Additionally, DWIM introduces an instruct-masking fine-tuning method that trains agentic LLMs on workflow-based data to improve tool utilization and reasoning efficiency (see §IM). Our experiments, detailed in §Experiments, demonstrate state-of-the-art performance across multiple visual reasoning datasets, highlighting DWIM's robust generalization and reduced reliance on manual prompt engineering.
In Discrepancy-aware Workflow Generation (DW), we address the challenge of inaccurate tool outputs in agent frameworks. When tool execution feedback is potentially unreliable, conditioning on the ground-truth label enables the LLM to identify discrepancies between environmental feedback and the expected outcome, such as factual errors caused by tool failures. Upon detecting such a conflict, the LLM generates a Rethink action, which includes a natural-language description of the discrepancy and proposes an alternative action as the next step. This approach enhances the robustness and adaptability of agent workflows, as demonstrated in our experiments (see §Experiments).
In the data collection process, standard methods struggle to produce workflows that are both logically feasible and practically executable: tools may return erroneous outputs, yielding workflows that are theoretically sound but fail in practice. Our approach instead has the LLM compare the actual feedback at each step against the ground-truth label and, upon detecting a discrepancy, generate a Rethink action that articulates the issue and proposes a revised action.
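To make this concrete, below is a minimal sketch of a discrepancy-aware rollout. The `llm` interface (`propose_action`, `check_discrepancy`, `revise_action`, `is_final`) and `execute_tool` are hypothetical names chosen for illustration, not DWIM's released API.

```python
# Minimal sketch of discrepancy-aware workflow generation.
# `llm` and `execute_tool` are hypothetical interfaces, not DWIM's actual API.

def generate_workflow(llm, execute_tool, task, ground_truth, max_steps=10):
    """Roll out a workflow, inserting a Rethink action whenever tool
    feedback conflicts with the expected (ground-truth) outcome."""
    workflow = []
    for _ in range(max_steps):
        action = llm.propose_action(task, workflow)
        feedback = execute_tool(action)
        # Conditioned on the label, the LLM flags discrepancies between
        # the environmental feedback and the expected outcome.
        discrepancy = llm.check_discrepancy(task, feedback, ground_truth)
        if discrepancy:
            # Rethink: describe the conflict in natural language and
            # propose an alternative action as the next step.
            workflow.append({
                "type": "Rethink",
                "description": discrepancy,
                "next_action": llm.revise_action(action, discrepancy),
            })
        else:
            workflow.append({"type": "Action",
                             "action": action,
                             "feedback": feedback})
            if llm.is_final(feedback, ground_truth):
                break
    return workflow
```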
Action Flagging employs a rule-based approach to evaluate environment feedback and detect erroneous tool usage, such as agent-generated Python code that fails to execute. It also identifies actions that are executable but ineffective for solving the task, as well as logically feasible actions that fail due to tool limitations. All of these are classified as ineffective actions, while the rest are labeled effective. This distinction facilitates subsequent training by masking only effective actions.
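The following sketch illustrates this kind of rule-based flagging; the specific rules and the workflow data layout are simplified assumptions for exposition, not the paper's exact criteria.

```python
# Illustrative rule-based action flagging. The rule set and the
# workflow record layout are simplified assumptions, not DWIM's code.

def flag_actions(workflow):
    """Label each step 'effective' or 'ineffective' from its feedback."""
    flags = []
    for i, step in enumerate(workflow):
        feedback = step.get("feedback", {})
        # An action followed by a Rethink was logically feasible but
        # failed in practice (e.g., due to tool limitations).
        followed_by_rethink = (i + 1 < len(workflow)
                               and workflow[i + 1].get("type") == "Rethink")
        if feedback.get("error"):                        # code failed to execute
            flags.append("ineffective")
        elif feedback.get("result") in (None, "", []):   # ran, but output is unusable
            flags.append("ineffective")
        elif followed_by_rethink:
            flags.append("ineffective")
        else:
            flags.append("effective")
    return flags
```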
Subsequently, each effective action is masked and the LLM is instructed to reproduce it iteratively. The loss is computed between the original action and the LLM-generated one, allowing the model to focus on learning effective tool usage and planning strategies rather than memorizing noisy workflows. In contrast to BERT's token-level masking, we mask actions at the semantic level, treating each action as a whole unit. The instruct-style approach, inspired by instruction tuning, enhances the model's understanding of workflow dynamics and incorporates the concept of an end-of-sequence token.
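For intuition, here is a minimal sketch of how action-level masking could be realized in a Hugging Face-style causal-LM pipeline, assuming the token spans of effective actions have already been located; only those spans receive supervision.

```python
import torch

IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss

def build_labels(input_ids, effective_action_spans):
    """Supervise only the token spans of effective actions.

    input_ids: 1-D LongTensor holding the serialized workflow
        (instruction + actions + environment feedback).
    effective_action_spans: list of (start, end) token ranges covering
        the effective actions (end exclusive).
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in effective_action_spans:
        # The whole action span (a semantic unit, not individual tokens
        # as in BERT) is reproduced by the model and scored.
        labels[start:end] = input_ids[start:end]
    return labels
```

Built this way, ineffective actions and raw tool feedback remain visible in the context but contribute nothing to the gradient.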
We use * to highlight the second-highest score. For all E2E methods, we present results in grey: they serve as reference points but are not compositional visual reasoning methods and are not intended as direct comparison targets. "Shots" denotes the number of in-context learning examples provided.
We use the checkpoint trained on the GQA training set to test models on other datasets. Results for all E2E methods and frozen agentic LLMs are shown in grey as reference points; these models are neither compositional nor trained and are not intended for direct comparison.
In this experiment, we train and evaluate all compositional methods on two tasks using a complete tool library rather than a task-specific one, in order to examine their dependence on hand-designed tool libraries.
In this study, we investigate the impact of discrepancy-aware workflow generation on data utilization. Data utilization refers to the proportion of training data points whose generated workflows yield correct results and are therefore suitable for training.
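In symbols (a paraphrase of the definition above, not notation from the paper): data utilization = |D_usable| / |D_train|, where D_train is the full training set and D_usable is the subset of points whose generated workflow reaches the correct final result.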
We examine the effectiveness of the instruct-masking fine-tuning method compared to standard SFT. Instruct-masking exhibits a clear advantage over both Random-Masking and Masking-W-Rethink: Random-Masking indiscriminately masks actions rather than selectively targeting effective actions within the workflow, while Masking-W-Rethink masks both effective actions and the ineffective actions that trigger "Rethink".
We apply DWIM to various open-source models (e.g., Llama-3.1-8B, Mistral-v0.2/0.3-7B, and three Qwen2.5 variants) and compare their performance to frozen LLMs with 10-shot in-context learning.
@article{ke2025dwim,
title={{DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation \& Instruct-Masking Tuning}},
author={Ke, Fucai and Leng, Xingjian and Cai, Zhixi and Khan, Zaid and Wang, Weiqing and Haghighi, Pari Delir and Rezatofighi, Hamid and Chandraker, Manmohan and others},
year={2025},
journal={arXiv preprint arXiv:2503.19263},
}