Drawing on Conceptual Metaphor Theory and Structure-Mapping Theory, this paper introduces two exploratory works on metaphorical and visual reasoning with vision models and multimodal large language models. (i) The Multimodal Chain-of-Thought Prompting for Metaphor Generation task aimed to generate metaphorical linguistic expressions from non-metaphorical images, using the multimodal LLaVA 1.5 model with a two-step multimodal chain-of-thought prompting approach. The model proved able to generate metaphorical expressions: 92% of its outputs were classified as metaphors by human evaluators. The evaluation also revealed notable patterns in the metaphoricity, familiarity, and appeal scores of the generated metaphors.
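The paper does not include code, but a minimal sketch of what such a two-step pipeline could look like is given below, using the Hugging Face llava-hf/llava-1.5-7b-hf checkpoint. The prompt wording, the image path, and the split into a literal-description step followed by a metaphor-generation step are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of two-step multimodal chain-of-thought prompting
# with LLaVA 1.5 (assumed checkpoint; prompts are illustrative).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask(image, question):
    # LLaVA 1.5 expects the "USER: <image>\n... ASSISTANT:" template.
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

image = Image.open("scene.jpg")  # a non-metaphorical image

# Step 1: elicit an intermediate rationale (a literal scene description).
description = ask(image, "Describe what is happening in this image.")

# Step 2: condition on the rationale to produce the metaphorical expression.
metaphor = ask(
    image,
    f"The image shows: {description} "
    "Write a metaphorical expression inspired by this scene.",
)
print(metaphor)
```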
kind "source_domain : target_domain :: source_element : ?"
by choosing the correct target element among three difficult
distractors, varying in semantic domains and roles. The results showed that all six models and humans performed higher
than chance level, with only GPT-4o and ConvNeXt achieving higher than humans. Moreover, the error analysis showed
that, in solving the analogies, the most frequent error was the
selection of distractor 1. These works showed encouraging results for future research in the field of metaphorical and visual
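The abstract does not specify how model choices were scored. One common way to let a pure vision encoder such as ConvNeXt answer an "a : b :: c : ?" analogy is the parallelogram rule over image embeddings, sketched below; the checkpoint name, the placeholder file paths, and the rule b − a + c itself are assumptions for illustration, not necessarily the MeVA procedure.

```python
# Illustrative sketch: scoring an "a : b :: c : ?" visual analogy with
# ConvNeXt embeddings and the parallelogram rule (an assumed scheme,
# not necessarily the paper's MeVA procedure).
import torch
from PIL import Image
from transformers import AutoImageProcessor, ConvNextModel

processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224")
model = ConvNextModel.from_pretrained("facebook/convnext-base-224").eval()

def embed(path):
    # Global average-pooled feature vector for one image.
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).pooler_output.squeeze(0)

a = embed("source_domain.jpg")   # placeholder file names
b = embed("target_domain.jpg")
c = embed("source_element.jpg")
query = b - a + c  # predicted target element under the parallelogram rule

candidates = ["target_element.jpg", "distractor_1.jpg",
              "distractor_2.jpg", "distractor_3.jpg"]
scores = [torch.cosine_similarity(query, embed(p), dim=0).item()
          for p in candidates]
print(candidates[scores.index(max(scores))])  # highest-similarity choice
```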
These works showed encouraging results for future research in metaphorical and visual reasoning, contributing to the broader question of whether AI models can serve as empirical tests of existing cognitive theories.