Features
This section contains the documentation for the features supported by LocalAI.
Section under construction
This section contains instructions on how to use LocalAI with GPU acceleration.
For acceleration on AMD or Metal hardware there are no specific container images; see the build section.
Requirement: nvidia-container-toolkit (see the installation instructions).
To use CUDA, use the images with the cublas tag.
The image list is on quay:
- CUDA 11 tags: master-cublas-cuda11, v1.40.0-cublas-cuda11, …
- CUDA 12 tags: master-cublas-cuda12, v1.40.0-cublas-cuda12, …
- CUDA 11 + FFmpeg tags: master-cublas-cuda11-ffmpeg, v1.40.0-cublas-cuda11-ffmpeg, …
- CUDA 12 + FFmpeg tags: master-cublas-cuda12-ffmpeg, v1.40.0-cublas-cuda12-ffmpeg, …

In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12
If GPU inferencing is working, you should see output similar to:
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. You need to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):
name: my-model-name
# Default model parameters
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin

context_size: 1024
threads: 1

f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
For diffusers, the configuration might look like this instead:
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
(Generated with AnimagineXL)
LocalAI supports generating images with Stable Diffusion, running on CPU using a C++ implementation, Stable-Diffusion-NCNN (binding) and 🧨 Diffusers.
OpenAI docs: https://platform.openai.com/docs/api-reference/images/create
To generate an image you can send a POST request to the /v1/images/generations
endpoint with the instruction as the request body:
# 512x512 is supported too
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "A cute baby sea otter",
"size": "256x256"
}'
Available additional parameters: mode, step.

Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
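The same request can also be issued through the OpenAI Python client. The following is a minimal, hedged sketch (not part of the original examples): it assumes LocalAI is listening on localhost:8080, that a default image model is configured, and uses the pre-1.0 openai package as in the functions example further down this page.

import openai

openai.api_key = "sk-xxx"                     # any non-empty string; LocalAI does not check it
openai.api_base = "http://localhost:8080/v1"  # point the client at LocalAI

# Negative prompts are passed by splitting the prompt with "|", as described above.
result = openai.Image.create(
    prompt="A cute baby sea otter|malformed",
    size="256x256",
    n=1,
)
print(result["data"][0])  # the entry references the generated image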
(Example output images comparing mode=0 and mode=1 (winograd/sgemm) are omitted here.)
Note: the image generator supports images up to 512x512. You can however use other tools to upscale the image, for instance https://github.com/upscayl/upscayl.
Note: In order to use the images/generations endpoint with the stablediffusion C++ backend, you need to build LocalAI with GO_TAGS=stablediffusion. If you are using the container images, it is already enabled.

While the API is running, you can install the model by using the /models/apply endpoint and pointing it to the stablediffusion model in the models-gallery:
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'
You can set the PRELOAD_MODELS environment variable:
PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]
or as arg:
local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
or in a YAML file:
local-ai --preload-models-config "/path/to/yaml"
YAML:
- url: github:go-skynet/model-gallery/stablediffusion.yaml
1. Create a stablediffusion.yaml model config file in the models folder:

name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets

2. Create a stablediffusion_assets directory inside your models directory.
3. Place the Stable Diffusion assets inside stablediffusion_assets.

The models directory should look like the following:
models
├── stablediffusion_assets
│   ├── AutoencoderKL-256-256-fp16-opt.param
│   ├── AutoencoderKL-512-512-fp16-opt.param
│   ├── AutoencoderKL-base-fp16.param
│   ├── AutoencoderKL-encoder-512-512-fp16.bin
│   ├── AutoencoderKL-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.param
│   ├── log_sigmas.bin
│   ├── tmp-AutoencoderKL-encoder-256-256-fp16.param
│   ├── UNetModel-256-256-MHA-fp16-opt.param
│   ├── UNetModel-512-512-MHA-fp16-opt.param
│   ├── UNetModel-base-MHA-fp16.param
│   ├── UNetModel-MHA-fp16.bin
│   └── vocab.txt
└── stablediffusion.yaml
This is an extra backend: it is already available in the container images and there is nothing to do for the setup. The models will be downloaded automatically from huggingface the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
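Once the file is in place, the model can be called through the images endpoint. This is a hedged sketch using the Python requests package: the prompt and size are illustrative, and it assumes the endpoint accepts a model field matching the name in the config above.

import requests

payload = {
    "model": "animagine-xl",                        # must match the "name" in the YAML above
    "prompt": "a cute robot painting a landscape",  # illustrative prompt
    "size": "1024x1024",
}
response = requests.post("http://localhost:8080/v1/images/generations", json=payload)
response.raise_for_status()
print(response.json())  # the reply should reference the generated image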
LocalAI supports generating text with GPT-style models using llama.cpp and other backends (such as rwkv.cpp). See also Model compatibility for an up-to-date list of the supported model families.

OpenAI docs: https://platform.openai.com/docs/api-reference/chat
For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions
endpoint with the instruction as the request body:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens.
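The same chat completion can be requested from Python with the pre-1.0 openai client. This is a minimal sketch; the base URL and model name are assumptions that must match your local setup.

import openai

openai.api_key = "sk-xxx"                     # not validated by LocalAI
openai.api_base = "http://localhost:8080/v1"  # point the client at LocalAI

response = openai.ChatCompletion.create(
    model="ggml-koala-7b-model-q4_0-r2.bin",  # same model as in the curl example
    messages=[{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])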
https://platform.openai.com/docs/api-reference/edits
To generate an edit completion you can send a POST request to the /v1/edits
endpoint with the instruction as the request body:
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"instruction": "rephrase",
"input": "Black cat jumped out of the window",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens.
https://platform.openai.com/docs/api-reference/completions
To generate a completion, you can send a POST request to the /v1/completions endpoint with the prompt in the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens.
You can list all the models available with:
curl http://localhost:8080/v1/models
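The same listing can be retrieved with the pre-1.0 openai Python client (a minimal sketch, assuming LocalAI on localhost:8080):

import openai

openai.api_key = "sk-xxx"                     # not validated by LocalAI
openai.api_base = "http://localhost:8080/v1"

for model in openai.Model.list()["data"]:
    print(model["id"])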
The transcription endpoint allows converting audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint supports the audio formats supported by ffmpeg.
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions
API endpoint.
For instance, with cURL:
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"
Download one of the models from here into the models folder, and create a YAML file for your model:
name: whisper-1
backend: whisper
parameters:
  model: whisper-en
The transcriptions endpoint then can be tested like so:
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"
## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
LocalAI supports running OpenAI functions with llama.cpp
compatible models.
To learn more about OpenAI functions, see the OpenAI API blog post.
๐ก Check out also LocalAGI for an example on how to use LocalAI functions.
OpenAI functions are available only with ggml or gguf models compatible with llama.cpp.

You don’t need to do anything specific - just use ggml or gguf models.
You can configure a model manually with a YAML config file in the models directory, for example:
name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
To use the functions with the OpenAI client in python:
import openai

# ...

# Send the conversation and available functions to GPT
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    functions=functions,
    function_call="auto",
)
# ...
When running the Python script, be sure to:
- set the OPENAI_API_KEY environment variable to a random string (the OpenAI API key is NOT required!)
- set OPENAI_API_BASE to point to your LocalAI service, for example OPENAI_API_BASE=http://localhost:8080
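Continuing the snippet above, the reply follows the standard OpenAI chat format, so the selected function (if any) can be read back roughly as follows. This is a hedged sketch: it only shows how to access the function_call fields of the response object created earlier.

import json

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model decided to call a function, e.g. get_current_weather
    name = message["function_call"]["name"]
    arguments = json.loads(message["function_call"]["arguments"])
    print(name, arguments)
else:
    # The model answered directly
    print(message["content"])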
It is also possible to specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the grammar_json_functions
additional parameter which takes a JSON schema object.
For example, with curl:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.1,
  "grammar_json_functions": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "function": {"const": "create_event"},
          "arguments": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "date": {"type": "string"},
              "time": {"type": "string"}
            }
          }
        }
      },
      {
        "type": "object",
        "properties": {
          "function": {"const": "search"},
          "arguments": {
            "type": "object",
            "properties": {
              "query": {"type": "string"}
            }
          }
        }
      }
    ]
  }
}'
A full e2e example with docker-compose
is available here.
Available only on master
builds
LocalAI supports understanding images by using LLaVA, and implements the GPT Vision API from OpenAI.
OpenAI docs: https://platform.openai.com/docs/guides/vision
To let LocalAI understand and reply with what it sees in the image, use the /v1/chat/completions endpoint, for example with curl:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava",
"messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
To setup the LLaVa models, follow the full example in the configuration examples.
The /tts
endpoint can be used to generate speech from text.
Input: input, model
For example, to generate an audio file, you can send a POST request to the /tts
endpoint with the instruction as the request body:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts"
}'
Returns an audio/wav
file.
LocalAI supports bark, piper and vall-e-x.
The piper backend is used for onnx models and requires the model files to be downloaded first.

To install the piper audio models manually, download the .tar.tgz archives and extract the files (.onnx, .json) inside the models directory.
To use the tts endpoint, run the following command. You can specify a backend with the backend
parameter. For example, to use the piper
backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model":"it-riccardo_fasol-x-low.onnx",
"backend": "piper",
"input": "Ciao, sono Ettore"
}' | aplay
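If you prefer to save the generated audio to a file instead of piping it to aplay, here is a minimal Python sketch with the requests package (same payload as the curl call above; the output filename is arbitrary):

import requests

payload = {
    "model": "it-riccardo_fasol-x-low.onnx",
    "backend": "piper",
    "input": "Ciao, sono Ettore",
}
response = requests.post("http://localhost:8080/tts", json=payload)
response.raise_for_status()

# The endpoint returns an audio/wav stream; write it to disk.
with open("output.wav", "wb") as f:
    f.write(response.content)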
Note:
- aplay is a Linux command. You can use other tools to play the audio file.
- When building LocalAI from source, it must be built with the GO_TAGS=tts flag.

LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
curl --request POST \
--url http://localhost:8080/tts \
--header 'Content-Type: application/json' \
--data '{
"backend": "transformers-musicgen",
"model": "facebook/musicgen-medium",
"input": "Cello Rave"
}' | aplay
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
#### Configuration
Audio models can be configured via `YAML` files. This allows configuring specific settings for each backend. For instance, a backend might require a voice to be specified, or support voice cloning, which must be set in the configuration file. For example:
name: tts
backend: vall-e-x
parameters: ...
LocalAI supports generating embeddings for text or lists of tokens.
For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings
The embedding endpoint is compatible with llama.cpp
models, bert.cpp
models and sentence-transformers models available in huggingface.
Create a YAML
config file in the models
directory. Specify the backend
and the model file.
name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true
# .. other parameters
To use bert.cpp
models you can use the bert
embedding backend.
An example model config file:
name: text-embedding-ada-002
parameters:
  model: bert
backend: bert-embeddings
embeddings: true
# .. other parameters
The bert backend uses bert.cpp and ggml models.
For instance you can download the ggml
quantized version of all-MiniLM-L6-v2
from https://huggingface.co/skeskinen/ggml:
wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
To test locally (LocalAI server running on localhost
),
you can use curl
(and jq
at the end to prettify):
curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}' | jq "."
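The equivalent request with the pre-1.0 openai Python client (a minimal sketch; the model name must match the YAML config above):

import openai

openai.api_key = "sk-xxx"                     # not validated by LocalAI
openai.api_base = "http://localhost:8080/v1"

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="Your text string goes here",
)
embedding = response["data"][0]["embedding"]
print(len(embedding), embedding[:5])          # vector length and first few values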
To use sentence-transformers
and models in huggingface
you can use the sentencetransformers
embedding backend.
name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2
The sentencetransformers
backend uses Python sentence-transformers. For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models
- The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the containers you are good to go and should be already configured for use.
- If you are running LocalAI manually you must install the python dependencies (make prepare-extra-conda-environments). This requires conda to be installed.
- When running LocalAI manually, the backend must also be registered via the EXTERNAL_GRPC_BACKENDS environment variable, for example:
  EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"
- The sentencetransformers backend does support only embeddings of text, and not of tokens. If you need to embed tokens you can use the bert backend or llama.cpp.
- No models need to be installed beforehand to use the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.

Embeddings with llama.cpp are supported with the llama backend.
name: my-awesome-model
backend: llama
embeddings: true
parameters:
  model: ggml-file.bin
# ...
The chat endpoint accepts an additional grammar parameter which takes a BNF-defined grammar. This allows the LLM to constrain the output to a user-defined schema, making it possible to generate JSON, YAML, and anything else that can be defined with a BNF grammar.
This feature works only with models compatible with the llama.cpp backend (see also Model compatibility). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887
Follow the setup instructions from the LocalAI functions page.
For example, to constrain the output to either yes or no:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Do you like apples?"}],
"grammar": "root ::= (\"yes\" | \"no\")"
}'
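The same grammar-constrained request issued from Python with the requests package (a minimal sketch; the model name and grammar are taken from the curl example above):

import requests

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Do you like apples?"}],
    # BNF grammar restricting the reply to the terminals "yes" and "no"
    "grammar": 'root ::= ("yes" | "no")',
}
response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])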