Why AI models are bad at verifying photographs

The Tow Center for Digital Journalism systematically tested seven AI models (ChatGPT o4-mini-high, ChatGPT 5 Thinking, Perplexity, Grok, Gemini, Claude, and Copilot) on a basic fact-checker’s task: confirming that an image is real and identifying its location, date, and source. The goal was to evaluate how well these tools function as fact-checkers, observe patterns in how they reason through photographs, and assess their role in the fact-checking process.

[Read the piece to learn about our findings]()

Models Tested

  • ChatGPT 5 Thinking
  • ChatGPT o4-mini-high
  • Perplexity Pro Search
  • Grok 3
  • Copilot
  • Claude Opus 4.1
  • Gemini 2.5 Flash

Each LLM was tested in both standard and deep research modes, using both downloaded image files and screenshots, for a total of 280 queries per prompt (7 models × 2 modes × 2 input formats × 10 images).
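
For reference, the 280-query figure comes from crossing every model with both modes, both input formats, and the ten images in the dataset. A minimal sketch of that test matrix (labels are illustrative placeholders, not the exact values used in the spreadsheet):

```python
from itertools import product

# Illustrative test matrix; labels are placeholders, not the exact
# strings used in the published spreadsheet.
models = [
    "ChatGPT 5 Thinking", "ChatGPT o4-mini-high", "Perplexity Pro Search",
    "Grok 3", "Copilot", "Claude Opus 4.1", "Gemini 2.5 Flash",
]
modes = ["standard", "deep research"]
formats = ["downloaded file", "screenshot"]
images = [f"image_{i:02d}" for i in range(1, 11)]  # 10 images in the study

queries = list(product(models, modes, formats, images))
print(len(queries))  # 7 * 2 * 2 * 10 = 280 queries per prompt
```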

Research Questions

  1. Provenance
    Prompt: “Where was this taken? When? By whom?”

  2. Authenticity
    Prompt: “Is this image real or AI-generated? Why?”

📂 Dataset

[Access the data here](https://docs.google.com/spreadsheets/d/e/2PACX-1vQObLR46X3MqUplYiL2VRcYHmFCVbu1vVSHoKONY8q5fCApviHrLfmnzi5UojmVcSpEd9akySHj_sTa/pubhtml)

The spreadsheet has three tabs:

  1. Information about each photograph we tested with
  2. Results from the query to test provenance
  3. Results from the query to test authenticity
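
If you export the tabs as CSV files, they can be loaded for analysis along these lines. This is only a sketch: the filenames below are hypothetical, so substitute whatever names you give the exported tabs.

```python
import pandas as pd

# Hypothetical filenames for locally exported copies of the three tabs.
photos       = pd.read_csv("photographs.csv")            # tab 1: image metadata
provenance   = pd.read_csv("provenance_results.csv")     # tab 2: provenance query results
authenticity = pd.read_csv("authenticity_results.csv")   # tab 3: authenticity query results

print(provenance.shape, authenticity.shape)
```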

Baseline Comparison

As a benchmark, all 10 images were also run through Google Reverse Image Search, a primary tool used by visual investigators. Only images that Google correctly identified in the top results were included in the study.

Data Collection & Structure

Rows: each row is a unique Model × Mode × Image × Format × Prompt combination

Provenance Data Table

| Column | Description |
| --- | --- |
| Event | Photo identifier |
| Platform | The LLM platform (e.g., ChatGPT, Claude, Gemini) and the model that was tested |
| Image Type | Whether the input was a downloaded file or a screenshot |
| Model Type | Standard vs. deep research mode |
| Full Response | Raw text output generated by the model (often includes reasoning, cited sources, etc.) |
| Correctness | Overall evaluation of the response (Completely Correct, Partially Correct, Partially Wrong, Completely Wrong, or No Answer) |
| Location of Photograph | Location identified by the model |
| Confidence_Location | Confidence level in the location answer (self-reported or inferred) |
| Correctness_Location | Correctness of the location answer. An answer was marked Correct if the location was factually right, even without precise detail; in most cases answers were judged at the city/town level. If the model also named a region and that region was wrong, the answer was marked Incorrect even when the city was correct. |
| Location Accuracy | How precise the model's location was relative to the correct answer, allowing partial credit when the model was directionally correct but not exact |
| Date of Photograph | Date identified by the model |
| Confidence_Date | Confidence level in the date answer |
| Correctness_Date | Correctness of the date answer. The date had to be exact to be marked Correct; no partial credit was granted (Correct/Incorrect/No Answer) |
| Photographer Name | Photographer identified by the model |
| Confidence_Photographer | Confidence level in the photographer attribution |
| Correctness_Photographer | Correctness of the photographer attribution (Correct/Incorrect/No Answer) |
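
As an example of how the table above can be used, here is a short sketch that tallies the overall Correctness rating per platform and mode. It assumes the provenance tab has been loaded into a `provenance` DataFrame as in the earlier sketch, and that the exported CSV keeps the column names listed above.

```python
# Tally the overall Correctness rating for each Platform x Model Type
# combination in the provenance results.
summary = (
    provenance
    .groupby(["Platform", "Model Type", "Correctness"])
    .size()
    .unstack(fill_value=0)  # one column per Correctness category
)
print(summary)
```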
