---
date: "2024-07-13T20:45:48.000Z"
title: "2024-07-13"
tags: ["claude-3.5-sonnet"]
draft: false
---

While I didn't have much success getting `gpt-4o` to perform [Task 1 - Counting Line Intersection](https://github.com/anguyen8/vision-llms-are-blind/tree/main?tab=readme-ov-file#task-1---counting-line-intersection) from the Vision Language Models Are Blind paper, I pulled down some code and did a bit of testing with Claude 3.5 Sonnet.
The paper reports the following success rate for Sonnet for this line intersection task:

| Thickness | Sonnet 3.5 |
| --------- | ---------- |
| 2         | 80.00      |
| 3         | 79.00      |
| 4         | 73.00      |
| Average   | 77.33      |

I used the [code from the paper](https://github.com/anguyen8/vision-llms-are-blind/blob/main/src/LineIntersection/GenerateSamples.ipynb) to generate 30 similar images with line thickness 4 of intersecting (or not) lines.
I chose a thickness of 4 because this was the worst performing thickness according to the paper.
With these classified manually (I didn't realize the `configurations.json` file already had ground truths in it), I ran the prompt from the paper against these images using Sonnet.

> How many times do the blue and red lines intersect?

My results were much better than those reported by the paper.

```
Results summary:
Expected count 0: 9/9 correct (Accuracy: 100.00%)
Expected count 1: 10/11 correct (Accuracy: 90.91%)
Failed images for count 1:
  - ./images/1/image_14_thickness_4.png
Expected count 2: 10/10 correct (Accuracy: 100.00%)
Overall accuracy: 96.67%
```

Sonnet only made a single error out of 30 images.
I need to dig deeper into [the paper](https://arxiv.org/pdf/2407.06581) to ensure my methodology is the same as the authors', but I was surprised to see such a divergence here since the implementation seemed straightforward.
Hopefully, I can find a bit more time to dig into these discrepancies.