---
date: "2024-08-12T19:59:57.000Z"
title: "2024-08-12"
tags: ["huggingface"]
draft: false
---

I tried to run `florence-2` and `colpali` using the [Hugging Face serverless inference API](https://huggingface.co/inference-api/serverless).
Searching around, there seems to be pretty sparse support for `image-text-to-text` models.
On GitHub, I only found a [few projects](https://github.com/search?type=code&q=image-text-to-text) that even reference these types of models.

I didn't really know what I was doing, so I copied the [example code](https://huggingface.co/docs/api-inference/quicktour), then tried to use a model to adapt it to call `florence-2`.
Initially, it seemed like it was working:

```sh
❯ python run.py
{'error': 'Model microsoft/Florence-2-large is currently loading', 'estimated_time': 61.724300384521484}
```
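A minimal sketch of how that "currently loading" response could be handled: wait for the reported `estimated_time`, then retry. This is not the code I ran. The names `post_json` and `query_with_wait` are made up for illustration, the URL pattern is an assumption based on the quicktour, and the payload shape for `florence-2` is left to the caller because I don't know what it should be.

```python
import json
import time
import urllib.error
import urllib.request

# Assumption: endpoint pattern taken from the serverless inference API quicktour.
API_URL = "https://api-inference.huggingface.co/models/microsoft/Florence-2-large"


def post_json(payload, token):
    """Send one POST request and return the decoded JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    try:
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as err:
        # Assumption: error responses still carry a JSON body like the ones in this post.
        return json.loads(err.read())


def query_with_wait(payload, send, max_attempts=3, sleep=time.sleep):
    """Call send(payload); while the reply contains 'estimated_time', wait and retry."""
    result = None
    for _ in range(max_attempts):
        result = send(payload)
        if not (isinstance(result, dict) and "estimated_time" in result):
            return result  # Either a real result or a non-loading error.
        sleep(result["estimated_time"])
    return result  # Still loading after max_attempts tries.
```

With a token in hand, this could be called as `query_with_wait(payload, lambda p: post_json(p, token))`. Passing `send` and `sleep` as arguments keeps the retry logic separate from the network call, which makes it easy to try out without hitting the API.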

After the time passed, there seemed to be a problem:

```sh
❯ python run.py
{'error': 'image-text-to-text is not a valid pipeline'}
```

When trying out `vidore/colpali`, there seems to be an architectural incompatibility that prevents the serverless inference API from even loading the model. Judging by the error, `colpali` is not one of the variants the API knows how to deserialize:

```sh
❯ python run.py
{'error': 'HfApiJson(Deserialize(Error("unknown variant `colpali`, expected one of `text-generation-inference`, `transformers`, `allennlp`, `flair`, `espnet`, `asteroid`, `speechbrain`, `timm`, `sentence-transformers`, `spacy`, `sklearn`, `stanza`, `adapter-transformers`, `fasttext`, `fairseq`, `pyannote-audio`, `doctr`, `nemo`, `fastai`, `k2`, `diffusers`, `paddlenlp`, `mindspore`, `open_clip`, `span-marker`, `bertopic`, `peft`, `setfit`", line: 1, column: 411)))'}
```
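That error is hard to read as one long line. A small sketch for pulling it apart, assuming the message keeps the same "unknown variant ..., expected one of ..." shape shown above (the function name `parse_unknown_variant` is made up for illustration):

```python
import re


def parse_unknown_variant(message):
    """Return (rejected, accepted_list) from an 'unknown variant' error, or None."""
    match = re.search(
        r"unknown variant `([^`]+)`, expected one of (.+?), line", message
    )
    if match is None:
        return None
    rejected = match.group(1)
    # Every accepted name is wrapped in backticks inside the captured list.
    accepted = re.findall(r"`([^`]+)`", match.group(2))
    return rejected, accepted
```

Checking whether a name like `transformers` is in the returned list is then a quick way to see which variants the API says it accepts.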

Overall, not a great success.
Despite being more actively developed and more available than ever, models continue to be challenging to deploy for a layperson like me.
Without an API, I've generally not made a ton of progress using large, state-of-the-art models.