The Tow Center for Digital Journalism systematically tested seven AI models (ChatGPT o4-mini-high, ChatGPT 5 Thinking, Perplexity, Grok, Gemini, Claude, and Copilot) on a basic fact-checker’s task: confirming that an image is real, and identifying its location, date, and source. The goal was to evaluate how well these tools function as fact-checkers, observe patterns in how they reason through photographs, and situate their role in the fact-checking process.
[Read the piece to learn about our findings] ()
Models tested:
- ChatGPT 5 Thinking
- ChatGPT o4-mini-high
- Perplexity Pro Search
- Grok 3
- Copilot
- Claude Opus 4.1
- Gemini 2.5 Flash
Each LLM was tested in both standard and deep research modes, using both downloaded image files and screenshots, for a total of 280 queries per prompt.
Two prompts were tested:
- Provenance prompt: “Where was this taken? When? By whom?”
- Authenticity prompt: “Is this image real or AI-generated? Why?”
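As a rough illustration of the test matrix, the sketch below enumerates the combinations described above (7 models × 2 modes × 10 images × 2 file formats), which yields the 280 queries per prompt. The model, mode, format, and prompt names are taken from this write-up; the image identifiers are placeholders standing in for the 10 benchmark photographs.

```python
from itertools import product

# Models, modes, formats, and prompts as listed above; image IDs are placeholders.
models = [
    "ChatGPT 5 Thinking", "ChatGPT o4-mini-high", "Perplexity Pro Search",
    "Grok 3", "Copilot", "Claude Opus 4.1", "Gemini 2.5 Flash",
]
modes = ["standard", "deep research"]
formats = ["downloaded file", "screenshot"]
images = [f"image_{i:02d}" for i in range(1, 11)]  # 10 benchmark photographs
prompts = {
    "provenance": "Where was this taken? When? By whom?",
    "authenticity": "Is this image real or AI-generated? Why?",
}

queries_per_prompt = list(product(models, modes, images, formats))
print(len(queries_per_prompt))                 # 7 x 2 x 10 x 2 = 280
print(len(queries_per_prompt) * len(prompts))  # 560 queries in total
```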
[Access the data here](https://docs.google.com/spreadsheets/d/e/2PACX-1vQObLR46X3MqUplYiL2VRcYHmFCVbu1vVSHoKONY8q5fCApviHrLfmnzi5UojmVcSpEd9akySHj_sTa/pubhtml)
There are three tabs:
- Information about each photograph we tested
- Results from the query to test provenance
- Results from the query to test authenticity
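For readers who want to pull the published sheet into a script, a minimal loading sketch follows. It assumes the pubhtml page can be parsed with pandas.read_html (which returns one DataFrame per HTML table and requires lxml or html5lib); the table order and header rows may need adjustment to line up with the three tabs.

```python
import pandas as pd

# Published Google Sheet from the link above.
SHEET_URL = (
    "https://docs.google.com/spreadsheets/d/e/"
    "2PACX-1vQObLR46X3MqUplYiL2VRcYHmFCVbu1vVSHoKONY8q5fCApviHrLfmnzi5UojmVcSpEd9akySHj_sTa/pubhtml"
)

# read_html parses every <table> on the page; a published sheet typically
# renders one table per tab. Header rows and column names may need cleanup.
tables = pd.read_html(SHEET_URL)
photos, provenance, authenticity = tables[:3]  # tab order assumed
print(provenance.columns.tolist())
```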
As a benchmark, all 10 images were also run through Google Reverse Image Search, a primary tool used by visual investigators. Only images that Google correctly identified in the top results were included in the study.
Rows: Unique Model × Mode × Image × Format × Prompt
| Column | Description |
|---|---|
| Event | Photo identifier |
| Platform | The LLM platform (e.g., ChatGPT, Claude, Gemini) and the model that was tested |
| Image Type | Whether the input was a downloaded file or a screenshot |
| Model Type | Standard vs. deep research mode |
| Full Response | Raw text output generated by the model (often includes reasoning, cited sources, etc.) |
| Correctness | Overall evaluation of the response (Completely Correct, Partially Correct, Partially Wrong, Completely Wrong, or No Answer) |
| Location of Photograph | Location identified by the model |
| Confidence_Location | Confidence level in the location answer (self-reported or inferred) |
| Correctness_Location | Correctness of the location answer. An answer was marked Correct if the stated location was factually right, even if it was not precise; in most cases this meant accuracy at the city/town level. No credit was given for an answer containing a factual error: if the model named the correct city but also guessed an incorrect region, the answer was marked Incorrect. |
| Location Accuracy | How precise the location identified by the model was relative to the correct answer, allowing partial credit when the model was directionally correct but not exact |
| Date of Photograph | Date identified by the model |
| Confidence_Date | Confidence level in the date answer |
| Correctness_Date | Correctness of the date answer; the date had to be exact to be marked Correct, with no partial credit (Correct/Incorrect/No Answer) |
| Photographer Name | Photographer identified by the model |
| Confidence_Photographer | Confidence level in photographer attribution |
| Correctness_Photographer | Correctness of photographer attribution (Correct/Incorrect/No Answer) |
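As an example of how the row-level fields combine, the sketch below tallies the overall Correctness labels by Platform and Model Type for the provenance tab. It reuses the published-sheet URL from the loading sketch above; the tab order and the exact header parsing are assumptions and may need adjusting before the documented column names line up.

```python
import pandas as pd

SHEET_URL = (
    "https://docs.google.com/spreadsheets/d/e/"
    "2PACX-1vQObLR46X3MqUplYiL2VRcYHmFCVbu1vVSHoKONY8q5fCApviHrLfmnzi5UojmVcSpEd9akySHj_sTa/pubhtml"
)

# The second table on the published page is assumed to be the provenance-results tab.
provenance = pd.read_html(SHEET_URL)[1]

# Share of responses in each overall Correctness category, per platform and mode.
breakdown = (
    provenance
    .groupby(["Platform", "Model Type"])["Correctness"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    .round(2)
)
print(breakdown)
```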