The evolution of Vision Language Models (VLMs) towards general-purpose models relies on their ability to understand images and perform tasks via natural language instructions. However, it must be clarified if current VLMs truly grasp detailed object information in images. The analysis shows that their image understanding correlates strongly with zero-shot performance on vision language tasks.…
		 
						
									
			 
				 
								 
						
									 
						
									 
						
									 
						
									 
						
									 
						
									 
						
									 
						
									