Not all AI models are made equal. When choosing a model, you need to consider the following factors:
What is it trained to do?
How many parameters does it have?
What is its quantisation?
What kind of inputs and outputs does it support?
Choosing the wrong model will lead to low quality or even unusable results. You may wrongly perceive this as a failure of the software, but it is, in fact, a failure of the AI model you're using. TinyAI is merely the "conduit" between your site and the AI model. The actual processing is done by the AI model you choose. TinyAI shows you the results of the AI model's processing; it has minimal say in how that processing is done, and what little control it does have is exposed to you as options.
The limitations below apply equally to AI services with closed-source models, AI services with open-source models, and locally hosted open-source models. You are, however, less likely to run into them with hosted services, for two reasons.
First, all hosted services use commercial-grade AI accelerator server racks with vast amounts of memory and processing power. This allows them to offer you AI models with more parameters, at a better quantisation level, with usable performance. When you are hosting an AI model locally, you are bound by your hardware. Even a multi-thousand-dollar workstation AI accelerator (or a beefed-up Mac Studio) will perform at a fraction of the speed of a commercial AI accelerator server rack. You will also be forced to use models with fewer parameters, at an inferior quantisation level, simply because you cannot possibly have as much memory.
The second reason, which only applies to AI services using proprietary models, is that they are constantly fine-tuning their models and platform for quality and performance. Even the cheaper models, such as ChatGPT-mini and Claude Haiku, tend to yield better results than the majority of models you can run locally; at least, unless you can spend tens of thousands of Euros or dollars on a beefy AI server.
So, let's discuss the factors which can make or break a model for your use case.
Some models are trained for specific tasks, e.g. coding, solving math problems, or following steps (usually marked as "coder", "instruct", etc.). These models, by definition, will NOT work properly with AITiny for generating and processing text; they are not trained for this use case. Likewise, if a model's card says it's trained, for example, primarily on English and German text, you cannot ask it a question in Spanish, or ask it to translate text to French; it will fail, the same way a person who only speaks English and German would fail at those language-specific tasks. If you come across these limitations, you will have to use a different model. Example: using Llama 3.1 Instruct with 8 billion parameters yielded far worse results when writing an article summary than using the generic Llama 3.2 (not an "instruct" model) with a much smaller 3 billion parameters.
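If you want to sanity-check a model's capabilities before committing to it, a quick smoke test is usually enough. The snippet below is a minimal sketch, assuming a locally hosted model exposed through LM Studio's OpenAI-compatible endpoint on localhost:1234; the model identifier is a placeholder for whatever you have loaded.

```python
# Quick capability check: ask a locally hosted model to translate a short
# sentence and eyeball the result. Assumes LM Studio's OpenAI-compatible
# server is running on localhost:1234 (adjust the base_url for your setup).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-3b",  # hypothetical identifier; use the model you loaded
    messages=[
        {"role": "system", "content": "You are a translator. Reply with the translation only."},
        {"role": "user", "content": "Translate to French: The meeting has been moved to Thursday."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
# A model not trained on French will typically reply in English, mistranslate,
# or produce gibberish; that is your cue to pick a different model.
```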
The model's ability to follow the instructions laid out in the prompt and do useful work is intrinsically tied to the number of parameters. Models with a fairly small number of parameters (under roughly 8 billion) tend to perform poorly. We have noted that smaller models tend to ignore the instructions about formatting the text as HTML, instead returning a random mix of HTML and Markdown, stripping images and HTML attributes, or otherwise making a mess of the input text. The only way to address this is to use a model with a higher number of parameters. Example: using Phi-3.5-3b (3 billion parameters) at 4-bit (Q4_K_M) quantisation to rewrite an article in a professional tone resulted in strangely formatted text with a lot of junk commentary and internal tokens, but using Phi-4-14b with the same quantisation worked fine.
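You can catch this kind of formatting drift early with a simple heuristic check on the model's output before trusting it with real content. The sketch below is my own illustration, not part of AITiny; the patterns are indicative rather than exhaustive.

```python
import re

# Heuristic check: did the model actually return HTML, or did Markdown leak in?
MARKDOWN_ARTIFACTS = [
    r"^#{1,6}\s",            # Markdown headings, e.g. "## Title"
    r"\*\*[^*]+\*\*",        # bold text written with asterisks
    r"^\s*[-*]\s+\S",        # bulleted lists starting with - or *
    r"\[[^\]]+\]\([^)]+\)",  # Markdown links [text](url)
    r"```",                  # fenced code blocks
]

def looks_like_clean_html(text: str) -> bool:
    """Return True if the text contains HTML tags and no obvious Markdown."""
    has_tags = bool(re.search(r"</?(p|div|ul|ol|li|h[1-6]|img|a)\b", text, re.I))
    has_markdown = any(re.search(p, text, re.M) for p in MARKDOWN_ARTIFACTS)
    return has_tags and not has_markdown

print(looks_like_clean_html("<p>Hello <strong>world</strong></p>"))  # True
print(looks_like_clean_html("## Hello\n**world**"))                  # False
```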
Likewise, the quantisation of the model matters. Anything below an 8-bit quantisation level (e.g. Q8_0) will run faster, but it comes at the expense of quality. As with a limited number of parameters, you may experience erratic output. This is not a failure of our software; it's a failure of the AI model you use. Example: using Aya 8b at 4-bit (Q4_K_M) quantisation to translate an article failed to follow our instructions on output formatting, but using the same model with an 8-bit (Q8_0) quantisation worked perfectly, even keeping the nested DIVs and their CSS classes, albeit more slowly.
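The flip side is memory: lower quantisation exists because it shrinks the model. A back-of-the-envelope calculation shows why; the bits-per-weight figures below are rough averages for common GGUF quantisation types, and real files vary because some layers are kept at higher precision.

```python
# Back-of-the-envelope sizing: an n-parameter model at b bits per weight needs
# roughly n * b / 8 bytes just for the weights, before context and overhead.
def approx_model_size_gb(parameters_billions: float, bits_per_weight: float) -> float:
    bytes_total = parameters_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

for label, bits in [("Q4_K_M (~4.8 bpw)", 4.8), ("Q6_K (~6.6 bpw)", 6.6), ("Q8_0 (~8.5 bpw)", 8.5)]:
    print(f"8B model at {label}: ~{approx_model_size_gb(8, bits):.1f} GB")
# Roughly: ~4.5 GB at Q4_K_M, ~6.1 GB at Q6_K, ~7.9 GB at Q8_0; that is why the
# 8-bit version of the same model is both bigger and slower, but more accurate.
```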
Generating ALT text for images requires a model which supports both text and image input. If you choose a model which does not support image input, you will get either an error or hallucinated text. Example: trying to use Aya 8b to generate ALT text results in an error saying the model does not support image input, but using Gemma 3 12b (a model explicitly marked as supporting image input) works fine.
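For reference, this is roughly what an ALT text request to a vision-capable model looks like over an OpenAI-compatible API. It is a minimal sketch, assuming LM Studio's endpoint on localhost:1234; the model name and image path are placeholders.

```python
# Minimal sketch of an ALT text request against a vision-capable model served
# through an OpenAI-compatible endpoint. A text-only model will reject this
# request or hallucinate a description.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

with open("photo.jpg", "rb") as f:  # hypothetical image file
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="gemma-3-12b",  # must be a model marked as supporting image input
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Write concise ALT text for this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```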
I have come up with the following rules of thumb for picking an AI model. They lack nuance and they are not gospel, but they will get you a fairly good initial pick.
When you are using a locally hosted AI model, try to find a model which fits into your GPU memory. Use an 8-bit (preferred) or 6-bit quantisation; do NOT use the 4-bit quantised models LM Studio proposes by default (they are fast, but their accuracy is virtually non-existent). With those two factors as a given, go with a model that has as many parameters as you can fit in your GPU memory (VRAM); a rough sizing sketch follows these rules of thumb. This won't be a fast AI model, but it will be a fairly accurate one, even on systems with 8GB of VRAM.
If you are using an online service, start with the cheapest available model. Most likely, that's more than enough for your use case. A more expensive, bigger model is not necessarily better in any tangible way. Do not spend more money than you absolutely have to.
If you're on a Mac, you need an Apple Silicon (M-series) processor with at least 16GB of memory. Don't even think about running an AI model on 8GB of RAM. More is always better, be it CPU specs or RAM. For decent performance you need at least an M3 processor with 16GB of RAM. In our tests, even the passively cooled MacBook Air M3 16GB model was perfectly usable; we observed no thermal throttling, even after fairly heavy use of AITiny.
Desktops (and, to a lesser extent, laptops) with Nvidia and AMD graphics cards will usually perform very well, as long as your model fits in your GPU memory (VRAM) and the AI server supports acceleration on your graphics card. Please note that laptops tend to get thermally throttled very quickly.
If you have an Intel GPU, or even an Intel CPU with at least Intel Xe graphics, you can use the GPU cores for AI acceleration, even on a laptop! It's not going to be particularly fast due to the low memory bandwidth, but it's usable. The Xe graphics on a mobile 12th gen i7 coupled with 5300 MT/s DDR5 SODIMMs (i.e. a configuration without a lot of memory bandwidth to begin with) was about half as fast as our MacBook Air M3 16GB, which, given the restrictions of that configuration and the inevitable thermal throttling, is pretty impressive.
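To put the first rule of thumb into numbers, here is the rough sizing sketch mentioned above: given your VRAM and a target quantisation, it estimates which parameter counts are realistic. The bits-per-weight value and the headroom left for the context window are assumptions, not exact figures.

```python
# Rough VRAM fit check: estimate whether a model's weights, plus some headroom
# for the context window and runtime overhead, fit in a given VRAM budget.
def fits_in_vram(parameters_billions: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 2.0) -> bool:
    weights_gb = parameters_billions * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gb + headroom_gb <= vram_gb

vram = 8  # e.g. a GPU with 8 GB of VRAM
for params in (3, 4, 7, 8, 12, 14):
    ok = fits_in_vram(params, bits_per_weight=8.5, vram_gb=vram)  # ~Q8_0
    print(f"{params}B at 8-bit on {vram} GB VRAM: {'fits' if ok else 'too big'}")
# On 8 GB you would typically land on a 3-4B model at 8-bit; drop to 6-bit
# before you drop to 4-bit if you need a larger model.
```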