2404.03622 Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models In this paper, a prompting technique called Visualization-of-Thought (VoT) is proposed to improve the spatial reasoning capability of large language models (LLMs). The main points are as follows.
- VoT is a prompting method that makes an LLM imitate the human "mind's eye" for spatial reasoning. It elicits spatial reasoning by having the model visualize its intermediate state (as a text rendering of the scene) after each reasoning step, and this visualization guides the subsequent steps (a minimal prompt sketch follows the list below).
- The effectiveness of VoT was evaluated on three tasks: natural language navigation, visual navigation, and visual tiling. VoT significantly improved the spatial reasoning capability of LLMs on these tasks, outperforming existing multimodal LLMs.
- Two new tasks and datasets, "visual navigation" and "visual tiling" in a 2D grid world, were developed. They serve as testbeds for studying spatial reasoning at varying levels of complexity.
- Mental image generation by VoT resembles the human mind's eye, suggesting potential applications to multimodal LLMs.
- Experiments with GPT-4 demonstrated the effectiveness of VoT; quantitative and qualitative analyses of the LLM's ability to generate mental images were conducted, and its limitations were also discussed.
- How can an LLM emulate thinking with scenery in its mind's eye?
- If this can be done, then current LLMs could perform the KJ method.
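A minimal sketch of what a VoT-style prompt could look like, assuming the OpenAI chat completions API. The grid task, the instruction wording, and the map symbols are illustrative assumptions, not the paper's exact prompts; the only detail taken from the summary above is that GPT-4 is prompted to visualize its state at each step.

```python
# Hedged sketch of a VoT-style prompt: after every reasoning step the model is
# asked to draw the current state of a 2D grid world as text before continuing.
# Task, wording, and symbols are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VOT_INSTRUCTION = (
    "Solve the navigation task step by step. "
    "After each step, visualize the current state of the grid as a text map "
    "(use '.' for empty cells, '#' for walls, 'A' for your position), "
    "then decide the next move based on that visualization."
)

task = (
    "You are at the top-left corner of a 3x3 grid. The center cell is a wall. "
    "Reach the bottom-right corner using the moves: up, down, left, right."
)

response = client.chat.completions.create(
    model="gpt-4",  # the paper's experiments use GPT-4
    messages=[
        {"role": "system", "content": VOT_INSTRUCTION},
        {"role": "user", "content": task},
    ],
)
print(response.choices[0].message.content)
```

The point of the visualization instruction is that the model keeps an explicit, regenerated spatial state between reasoning steps instead of tracking it implicitly, mirroring the "mind's eye" idea described above.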
This page is auto-translated from [/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models](https://scrapbox.io/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.