2404.03622 Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
Claude.icon This paper proposes a prompting technique called Visualization-of-Thought (VoT) to improve the spatial reasoning capability of large language models (LLMs). The main contributions are as follows.

  1. Evaluated the effectiveness of VoT on three tasks: natural language navigation, visual navigation, and visual tiling. VoT significantly improved the spatial reasoning capability of LLMs on these tasks, outperforming existing multimodal LLMs.
  2. Developed two new tasks and datasets, "visual navigation" and "visual tiling", in a 2D grid world. These serve as testbeds for studying spatial reasoning at varying levels of complexity.
  3. Experiments with GPT-4 demonstrated the effectiveness of VoT; quantitative and qualitative analyses of the LLM's ability to generate mental images were conducted, and their limitations were discussed.
  4. Mental image generation by VoT resembles the human "mind's eye", suggesting its potential applicability to multimodal LLMs.
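Concretely, the core of VoT is a short instruction appended to the task prompt that asks the model to draw its intermediate state (e.g., as an ASCII-art grid) after every reasoning step, mimicking a mind's eye. A minimal Python sketch is given below; the grid encoding, symbols, and the `build_vot_prompt` helper are illustrative assumptions rather than the paper's exact prompt format.

```python
# Minimal sketch of a Visualization-of-Thought (VoT) style prompt for the
# 2D grid-world visual-navigation task. The grid encoding and helper
# function are illustrative assumptions, not the paper's exact prompt.

def build_vot_prompt(task_description: str) -> str:
    """Append the VoT instruction to a spatial reasoning task so the model
    interleaves ASCII-art 'mental images' of the intermediate state with
    its step-by-step reasoning."""
    vot_instruction = "Visualize the state after each reasoning step."
    return f"{task_description}\n\n{vot_instruction}"

# Hypothetical navigation instance: 'S' = start, 'D' = destination,
# '#' = obstacle, '.' = empty cell.
task = (
    "You are navigating a 3x3 grid world:\n"
    "S . .\n"
    ". # .\n"
    ". . D\n"
    "Give a sequence of moves (up/down/left/right) from S to D that avoids #."
)

print(build_vot_prompt(task))
```

The expected completion interleaves textual reasoning with redrawn grid states; these generated "mental images" are what the paper's quantitative and qualitative analyses examine.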

This page is auto-translated from [/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models](https://scrapbox.io/nishio/Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.