Model compatibility
LocalAI is compatible with the models supported by llama.cpp. It also supports GPT4ALL-J and cerebras-GPT with ggml.
Note
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
Hardware requirements
Depending on the model you are attempting to run, you might need more RAM or CPU resources. See also here for gguf based backends. rwkv is less expensive on resources.
Model compatibility table
Besides llama based models, LocalAI is also compatible with other architectures. The table below lists all the compatible model families and the associated binding repositories.
| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
| --- | --- | --- | --- | --- | --- | --- |
| llama.cpp | Vicuna, Alpaca, LLaMa | yes | GPT and Functions | yes** | yes | CUDA, openCL, cuBLAS, Metal |
| gpt4all-llama | Vicuna, Alpaca, LLaMa | yes | GPT | no | yes | N/A |
| gpt4all-mpt | MPT | yes | GPT | no | yes | N/A |
| gpt4all-j | GPT4ALL-J | yes | GPT | no | yes | N/A |
| falcon-ggml (binding) | Falcon (*) | yes | GPT | no | no | N/A |
| gpt2 (binding) | GPT2, Cerebras | yes | GPT | no | no | N/A |
| dolly (binding) | Dolly | yes | GPT | no | no | N/A |
| gptj (binding) | GPTJ | yes | GPT | no | no | N/A |
| mpt (binding) | MPT | yes | GPT | no | no | N/A |
| replit (binding) | Replit | yes | GPT | no | no | N/A |
| gptneox (binding) | GPT NeoX, RedPajama, StableLM | yes | GPT | no | no | N/A |
| starcoder (binding) | Starcoder | yes | GPT | no | no | N/A |
| bloomz (binding) | Bloom | yes | GPT | no | no | N/A |
| rwkv (binding) | rwkv | yes | GPT | no | yes | N/A |
| bert (binding) | bert | no | Embeddings only | yes | no | N/A |
| whisper | whisper | no | Audio | no | no | N/A |
| stablediffusion (binding) | stablediffusion | no | Image | no | no | N/A |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | N/A |
| falcon (binding) | Falcon *** | yes | GPT | no | yes | CUDA |
| huggingface-embeddings sentence-transformers | BERT | no | Embeddings only | yes | no | N/A |
| bark | bark | no | Audio generation | no | no | yes |
| AutoGPTQ | GPTQ | yes | GPT | yes | no | N/A |
| exllama | GPTQ | yes | GPT only | no | no | N/A |
| diffusers | SD,… | no | Image generation | no | no | N/A |
| vall-e-x | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| vllm | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
Note: any backend name listed above can be used in the backend field of the model configuration file (see the advanced section).
Tested with:
Note: You might need to convert some models from older formats to the new format. For instructions, see the README in llama.cpp, for instance to run gpt4all.
Subsections of Model compatibility
RWKV
A full example on how to run a rwkv model is in the examples.
Note: rwkv models need to specify the backend rwkv in the YAML config file, and an associated tokenizer file must be provided alongside the model:
36464540 -rw-r--r-- 1 mudler mudler 1.2G May 3 10:51 rwkv_small
36464543 -rw-r--r-- 1 mudler mudler 2.4M May 3 10:51 rwkv_small.tokenizer.json
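A quick way to catch a missing tokenizer before starting LocalAI is to check the models directory. The sketch below assumes the naming convention suggested by the listing above (`<model>.tokenizer.json` next to the model file); the helper name `missing_tokenizers` is hypothetical and not part of LocalAI.

```python
from pathlib import Path


def missing_tokenizers(models_dir, model_names):
    """Return the model names that lack a '<name>.tokenizer.json' file.

    Assumption: the tokenizer sits next to the model file and is named
    '<model>.tokenizer.json', as in the directory listing above.
    """
    root = Path(models_dir)
    return [
        name
        for name in model_names
        if not (root / f"{name}.tokenizer.json").is_file()
    ]
```

For example, `missing_tokenizers("models", ["rwkv_small"])` returns an empty list when `models/rwkv_small.tokenizer.json` exists.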
🦙 llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models and configuring them with a YAML file, use the llama-stable backend instead. If you rely on automatic detection of the model, no change is needed. For gguf models, use the llama backend.
Features
The llama.cpp model supports the following features:
Setup
LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.
Manual setup
It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model by its file name in the model parameter of the API calls.
You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries, which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
For instance, if you have the galleries enabled, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
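The chat request can also be issued from a script. The following is a minimal sketch using only the Python standard library. It mirrors the curl call (same endpoint, headers and JSON fields); the function names `build_chat_request` and `send` are hypothetical helpers, and the base URL assumes LocalAI listens on localhost:8080 as in the examples on this page.

```python
import json
import urllib.request


def build_chat_request(model, user_message, temperature=0.1,
                       base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running LocalAI instance and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running instance, `send(build_chat_request("<model name>", "Say this is a test!"))` returns the decoded response body.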
YAML configuration
To use the llama.cpp backend, specify llama as the backend in the YAML file:
name: llama
backend: llama
parameters:
  # Relative to the models path
  model: file.gguf.bin
In the example above we specify llama as the backend to restrict loading to gguf models only.
For instance, to use the llama-stable backend for ggml models:
name: llama
backend: llama-stable
parameters:
  # Relative to the models path
  model: file.ggml.bin
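When several model files need configs, the choice between the two backends can be scripted. The sketch below is only an illustration: it assumes the file name contains "gguf" or "ggml" (as in the example file names on this page), which is a naming convention and not something LocalAI enforces. Both helper names are hypothetical.

```python
def backend_for(model_file):
    """Pick a backend name from the model file name.

    Assumption: the file name contains 'gguf' or 'ggml'; the actual
    file format is not inspected.
    """
    lowered = model_file.lower()
    if "gguf" in lowered:
        return "llama"
    if "ggml" in lowered:
        return "llama-stable"
    raise ValueError(f"cannot infer backend for {model_file!r}")


def render_config(name, model_file):
    """Render a minimal YAML model config as text."""
    return (
        f"name: {name}\n"
        f"backend: {backend_for(model_file)}\n"
        "parameters:\n"
        "  # Relative to the models path\n"
        f"  model: {model_file}\n"
    )
```

The returned text can be written to a `.yaml` file in the models directory.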
Reference
🦙 Exllama
Exllama is described as “A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”.
Prerequisites
This is an extra backend. It is already available in the container images, so no setup is needed.
If you are building LocalAI locally, you need to install exllama manually first.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance, with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
$ git lfs install
$ (cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ)
$ ls models/
.keep WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "exllama",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
🦙 AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Prerequisites
This is an extra backend. It is already available in the container images, so no setup is needed.
If you are building LocalAI locally, you need to install AutoGPTQ manually.
Model setup
The models are automatically downloaded from huggingface the first time they are used, if not already present. You can define models via a YAML config file, or just query the endpoint with the huggingface repository model name. For example, create a YAML config file in models/:
name: orca
backend: autogptq
model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
parameters:
  model: "TheBloke/orca_mini_v2_13b-GPTQ"
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "orca",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
🐶 Bark
Bark generates audio from text prompts.
Setup
This is an extra backend. It is already available in the container images, so no setup is needed.
Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
Usage
Use the tts endpoint by specifying the bark backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
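When no audio player is available to pipe into, the returned audio can be saved to a file instead. The sketch below builds the same JSON body as the curl calls above and writes the raw response bytes to disk. The helper names are hypothetical, the base URL assumes a local instance on port 8080, and the response is treated as opaque audio bytes (the curl examples pipe it straight to aplay).

```python
import json
import urllib.request


def build_tts_request(text, backend="bark", voice=None,
                      base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /tts endpoint."""
    payload = {"backend": backend, "input": text}
    if voice is not None:
        # The voice preset goes in the "model" field, e.g. "v2/en_speaker_4".
        payload["model"] = voice
    return urllib.request.Request(
        f"{base_url}/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as out:
        out.write(audio)
```

For instance, `save_speech(build_tts_request("Hello!", voice="v2/en_speaker_4"), "hello.wav")` stores the generated audio in `hello.wav`.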
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend. It is already available in the container images, so no setup is needed. If you are building LocalAI manually, you need to install Vall-E-X first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
Voice cloning
To use the voice cloning capabilities, create a YAML configuration file to set up a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
vall-e:
  # The path to the audio file to be cloned
  # relative to the models directory
  audio_path: "path-to-wav-source.wav"
Then you can specify the model name in the requests:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
vLLM
vLLM is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.
Setup
Create a YAML file for the model you want to use with vllm.
To set up a model, you only need to specify the model name in the YAML config file:
name: vllm
backend: vllm
parameters:
  model: "facebook/opt-125m"
# Uncomment to specify a quantization method (optional)
# quantization: "awq"
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
🧨 Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
(Generated with AnimagineXL)
Note: currently only image generation is supported. The backend is experimental, so you might encounter issues with models that have not been tested yet.
Setup
This is an extra backend. It is already available in the container images, so no setup is needed.
Model setup
The models will be downloaded automatically from huggingface the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
Local models
You can also use local models, or modify some parameters like clip_skip and scheduler_type, for instance:
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  cfg_scale: 8
  clip_skip: 11
Configuration parameters
The following parameters are available in the configuration file:
| Parameter | Description | Default |
| --- | --- | --- |
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpmpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | StableDiffusionPipeline |
Several scheduler types are available:
| Scheduler | Description |
| --- | --- |
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
| --- | --- |
| StableDiffusionPipeline | Stable diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable diffusion image to image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable diffusion depth to image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable diffusion XL pipeline |
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "<positive prompt>|<negative prompt>",
"model": "animagine-xl",
"step": 51,
"size": "1024x1024"
}'
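The prompt field packs the positive and negative prompt into one string separated by `|`. A small helper can keep that format consistent when requests are built programmatically. This is a sketch with a hypothetical function name; treating a prompt without `|` as positive-only is an assumption and is not confirmed on this page.

```python
def build_image_payload(model, positive, negative="", step=None,
                        size="512x512"):
    """Build the JSON body for /v1/images/generations as a dict.

    The prompt follows the '<positive prompt>|<negative prompt>' format.
    Assumption: when no negative prompt is given, the separator is omitted.
    """
    if "|" in positive or "|" in negative:
        # A literal '|' would be ambiguous with the separator.
        raise ValueError("prompts must not contain '|'")
    prompt = f"{positive}|{negative}" if negative else positive
    payload = {"prompt": prompt, "model": model, "size": size}
    if step is not None:
        payload["step"] = step
    return payload
```

The resulting dict can be serialized with `json.dumps` and posted with the same headers as the curl call.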
Image to Image
https://huggingface.co/docs/diffusers/using-diffusers/img2img
An example model (GPU):
name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,image"
IMAGE_PATH=/path/to/your/image
# Strip newlines: some base64 implementations wrap their output, which would break the JSON string
(echo -n '{"image": "'; base64 "$IMAGE_PATH" | tr -d '\n'; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
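The same request body can be assembled in Python, which sidesteps shell quoting around the base64 data. This is a minimal sketch with a hypothetical helper name; the field names (`image`, `prompt`, `size`, `model`) are the ones used in the shell example above.

```python
import base64
import json


def build_img2img_body(image_bytes, prompt, model, size="512x512"):
    """Return the JSON request body with the image embedded as base64."""
    # b64encode produces a single line, so the value is a valid JSON string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "image": encoded,
        "prompt": prompt,
        "size": size,
        "model": model,
    })
```

Read the source image with `open(path, "rb").read()` and post the returned string to the image generation endpoint with a `Content-Type: application/json` header.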
Depth to Image
https://huggingface.co/docs/diffusers/using-diffusers/depth2img
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# Set to false to force CPU usage
f16: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6
# Strip newlines: some base64 implementations wrap their output, which would break the JSON string
(echo -n '{"image": "'; base64 ~/path/to/image.jpeg | tr -d '\n'; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations