Exploring the Synergy Between Vision-Language Pretraining and ChatGPT for Artwork Captioning: A Preliminary Study

While AI techniques have enabled automated analysis and interpretation of visual content, generating meaningful captions for artworks presents unique challenges, including understanding artistic intent, historical context, and complex visual elements. Despite recent developments in multi-modal techniques, gaps remain in generating complete and accurate captions. This paper introduces a new dataset for artwork captioning, generated using prompt engineering techniques and ChatGPT. We filtered noisy captions out of the dataset using CLIPScore and then fine-tuned GIT-Base on the refined data, obtaining visually accurate captions that surpass the ground-truth captions. Enriching the descriptions with predicted metadata further improves their informativeness. Artwork captioning has implications for art appreciation, inclusivity, education, and cultural exchange, particularly for people with visual impairments or limited knowledge of art.
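
As a rough illustration of the CLIPScore filtering step, the sketch below scores each caption against its image and keeps only captions above a cutoff. It uses the standard CLIPScore definition, 2.5 times the cosine similarity between image and caption embeddings, clipped at zero. The abstract does not state the threshold, the CLIP variant, or how embeddings were computed, so the `threshold` value, the `filter_captions` helper, and the assumption of precomputed embeddings are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Rescaling factor w from the standard CLIPScore definition.
CLIPSCORE_WEIGHT = 2.5


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb: Sequence[float], caption_emb: Sequence[float]) -> float:
    """CLIPScore = w * max(cos(image, caption), 0)."""
    return CLIPSCORE_WEIGHT * max(cosine(image_emb, caption_emb), 0.0)


def filter_captions(
    pairs: List[Tuple[str, Sequence[float], Sequence[float]]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Keep captions whose CLIPScore against their image meets the threshold.

    Each item in `pairs` is (caption, image_embedding, caption_embedding);
    the embeddings are assumed to come from a CLIP model run beforehand.
    """
    return [cap for cap, img, txt in pairs if clip_score(img, txt) >= threshold]
```

In practice the embeddings would come from a CLIP image encoder and text encoder, and the cutoff would be tuned on a sample of manually inspected captions.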
