2404.03622 Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models In this paper, a prompting technique called Visualization-of-Thought (VoT) is proposed to improve the spatial reasoning capability of large language models (LLMs). The main points are as follows.
- VoT is a prompting method that makes an LLM imitate the human "mind's eye" for spatial reasoning. It elicits spatial reasoning by having the model visualize its intermediate state (as a text rendering of the scene) after each reasoning step, and this visualization guides the subsequent steps (a minimal prompt sketch follows the list below).
- The effectiveness of VoT was evaluated on three tasks: natural language navigation, visual navigation, and visual tiling. VoT significantly improved the spatial reasoning capability of LLMs on these tasks, outperforming existing multimodal LLMs.
- Two new tasks and datasets, "visual navigation" and "visual tiling" in a 2D grid world, were developed. They serve as testbeds for studying spatial reasoning at varying levels of complexity.
- Mental image generation by VoT resembles the human mind's eye, suggesting potential applications to multimodal LLMs.
- Experiments with GPT-4 demonstrated the effectiveness of VoT; quantitative and qualitative analyses of the LLM's ability to generate mental images were conducted, and its limitations were also discussed.
- How can an LLM emulate thinking with scenery in its mind's eye?
- If this can be done, then current LLMs could perform the KJ method.
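A minimal sketch of what a VoT-style prompt could look like, assuming the OpenAI chat completions API. The grid task, the instruction wording, and the map symbols are illustrative assumptions, not the paper's exact prompts; the only detail taken from the summary above is that GPT-4 is prompted to visualize its state at each step.

```python
# Hedged sketch of a VoT-style prompt: after every reasoning step the model is
# asked to draw the current state of a 2D grid world as text before continuing.
# Task, wording, and symbols are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VOT_INSTRUCTION = (
    "Solve the navigation task step by step. "
    "After each step, visualize the current state of the grid as a text map "
    "(use '.' for empty cells, '#' for walls, 'A' for your position), "
    "then decide the next move based on that visualization."
)

task = (
    "You are at the top-left corner of a 3x3 grid. The center cell is a wall. "
    "Reach the bottom-right corner using the moves: up, down, left, right."
)

response = client.chat.completions.create(
    model="gpt-4",  # the paper's experiments use GPT-4
    messages=[
        {"role": "system", "content": VOT_INSTRUCTION},
        {"role": "user", "content": task},
    ],
)
print(response.choices[0].message.content)
```

The point of the visualization instruction is that the model keeps an explicit, regenerated spatial state between reasoning steps instead of tracking it implicitly, mirroring the "mind's eye" idea described above.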
This page is auto-translated from [/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models](https://scrapbox.io/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.