Subsections of Features

⚡ GPU acceleration

Note

Section under construction

This section contains instructions on how to use LocalAI with GPU acceleration.

Note

For acceleration on AMD or Metal hardware there are no specific container images; see the build documentation.

CUDA

Requirement: nvidia-container-toolkit (see the NVIDIA installation instructions)

To use CUDA, use the images with the cublas tag.

The image list is on quay:

  • CUDA 11 tags: master-cublas-cuda11, v1.40.0-cublas-cuda11, …
  • CUDA 12 tags: master-cublas-cuda12, v1.40.0-cublas-cuda12, …
  • CUDA 11 + FFmpeg tags: master-cublas-cuda11-ffmpeg, v1.40.0-cublas-cuda11-ffmpeg, …
  • CUDA 12 + FFmpeg tags: master-cublas-cuda12-ffmpeg, v1.40.0-cublas-cuda12-ffmpeg, …

In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12

If the GPU inferencing is working, you should be able to see something like:

5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size  =  512.00 MB

Model configuration

Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):

name: my-model-name
# Default model parameters
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin

context_size: 1024
threads: 1

f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)

For diffusers, the configuration might look like this instead:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"

🎨 Image generation

(Example image generated with AnimagineXL)

LocalAI supports generating images with Stable Diffusion, running on CPU with a C++ implementation (via Stable-Diffusion-NCNN bindings) and with 🧨 Diffusers.

Usage

OpenAI docs: https://platform.openai.com/docs/api-reference/images/create

To generate an image you can send a POST request to the /v1/images/generations endpoint with the instruction as the request body:

# 512x512 is supported too
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "A cute baby sea otter",
  "size": "256x256"
}'

Available additional parameters: mode, step.

Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
  "size": "256x256"
}'
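
The same request can also be sent from a script. A minimal sketch in Python with the requests library (assuming a LocalAI instance on localhost:8080; the exact response shape depends on the backend, so the example simply prints it):

import requests

# Minimal sketch: generate an image via the OpenAI-compatible images endpoint.
payload = {
    "prompt": "A cute baby sea otter|malformed",  # prompt|negative prompt
    "size": "256x256",
}
resp = requests.post("http://localhost:8080/v1/images/generations", json=payload)
resp.raise_for_status()
print(resp.json())  # inspect the response to locate the generated image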

stablediffusion-cpp

Example outputs comparing mode=0 and mode=1 (winograd/sgemm).

Note: image generator supports images up to 512x512. You can use other tools however to upscale the image, for instance: https://github.com/upscayl/upscayl.

Setup

Note: In order to use the images/generations endpoint with the stablediffusion C++ backend, you need to build LocalAI with GO_TAGS=stablediffusion. If you are using the container images, it is already enabled.

While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]

or as arg:

local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:go-skynet/model-gallery/stablediffusion.yaml
Alternatively, you can set up the stablediffusion model manually:

  1. Create a model file stablediffusion.yaml in the models folder:

name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets

  2. Create a stablediffusion_assets directory inside your models directory.
  3. Download the ncnn assets from https://github.com/EdVince/Stable-Diffusion-NCNN#out-of-box and place them in stablediffusion_assets.

The models directory should look like the following:

models
├── stablediffusion_assets
│   ├── AutoencoderKL-256-256-fp16-opt.param
│   ├── AutoencoderKL-512-512-fp16-opt.param
│   ├── AutoencoderKL-base-fp16.param
│   ├── AutoencoderKL-encoder-512-512-fp16.bin
│   ├── AutoencoderKL-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.param
│   ├── log_sigmas.bin
│   ├── tmp-AutoencoderKL-encoder-256-256-fp16.param
│   ├── UNetModel-256-256-MHA-fp16-opt.param
│   ├── UNetModel-512-512-MHA-fp16-opt.param
│   ├── UNetModel-base-MHA-fp16.param
│   ├── UNetModel-MHA-fp16.bin
│   └── vocab.txt
└── stablediffusion.yaml

Diffusers

This is an extra backend: it is already available in the container images, and there is nothing to do for the setup.

Model setup

The models are downloaded automatically from Hugging Face the first time you use the backend.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a

📖 Text generation (GPT)

LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.

Note:

  • You can also specify the model name as part of the OpenAI token.
  • If only one model is available, the API will use it for all the requests.

Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions endpoint with the instruction as the request body:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens
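
The same chat completion can be requested from Python. A minimal sketch with the requests library (assuming a LocalAI instance on localhost:8080, that the model name matches a model installed on it, and the OpenAI-style response format):

import requests

# Minimal sketch: same request as the curl example above, from Python.
payload = {
    "model": "ggml-koala-7b-model-q4_0-r2.bin",  # must match a model available on your instance
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])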

Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion you can send a POST request to the /v1/edits endpoint with the instruction as the request body:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens.

Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

List models

You can list all the models available with:

curl http://localhost:8080/v1/models
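
Or, as a minimal sketch from Python with the requests library (assuming the OpenAI-style list response):

import requests

# Minimal sketch: list the models known to the LocalAI instance.
resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])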

🔈 Audio to text

The transcription endpoint allows converting audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription, and supports the audio formats supported by ffmpeg.

Usage

Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.

For instance, with cURL:

curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"

Example

Download one of the models from here into the models folder, and create a YAML file for your model:

name: whisper-1
backend: whisper
parameters:
  model: whisper-en

The transcription endpoint can then be tested like so:

## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}

🔥 OpenAI functions

LocalAI supports running OpenAI functions with llama.cpp compatible models.


To learn more about OpenAI functions, see the OpenAI API blog post.

💡 Also check out LocalAGI for an example of how to use LocalAI functions.

Setup

OpenAI functions are available only with ggml or gguf models compatible with llama.cpp.

You don’t need to do anything specific - just use ggml or gguf models.

Usage example

You can configure a model manually with a YAML config file in the models directory, for example:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

To use the functions with the OpenAI client in python:

import openai
# ...
# Send the conversation and available functions to GPT
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    functions=functions,
    function_call="auto",
)
# ...
Note

When running the python script, be sure to:

  • Set OPENAI_API_KEY environment variable to a random string (the OpenAI api key is NOT required!)
  • Set OPENAI_API_BASE to point to your LocalAI service, for example OPENAI_API_BASE=http://localhost:8080
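
If you prefer, you can configure the client in the script instead of using environment variables. A minimal sketch for the pre-1.0 openai Python library used in the example above (whether the /v1 suffix is needed may depend on your setup):

import openai

# The API key only needs to be a non-empty string: LocalAI does not validate it.
openai.api_key = "sk-local"

# Point the client at the LocalAI service instead of api.openai.com.
# Depending on your setup you may need to append /v1, e.g. "http://localhost:8080/v1".
openai.api_base = "http://localhost:8080"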

Advanced

It is possible to also specify the full function signature (for debugging, or to use with other clients).

The chat endpoint accepts the grammar_json_functions additional parameter which takes a JSON schema object.

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-4",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.1,
     "grammar_json_functions": {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "function": {"const": "create_event"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "date": {"type": "string"},
                            "time": {"type": "string"}
                        }
                    }
                }
            },
            {
                "type": "object",
                "properties": {
                    "function": {"const": "search"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        }
                    }
                }
            }
        ]
    }
   }'
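
The same kind of request can be sent from Python. A minimal sketch with the requests library, using only the "search" branch of the schema above and parsing the constrained reply (the model name and prompt are just examples):

import json
import requests

# Minimal sketch: constrain the reply to a JSON schema via grammar_json_functions.
schema = {
    "oneOf": [
        {
            "type": "object",
            "properties": {
                "function": {"const": "search"},
                "arguments": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                },
            },
        }
    ]
}
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Search for pizza restaurants nearby"}],
    "temperature": 0.1,
    "grammar_json_functions": schema,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
reply = resp.json()["choices"][0]["message"]["content"]
print(json.loads(reply))  # assuming the grammar produced valid JSON, e.g. {"function": "search", ...}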

💡 Examples

A full e2e example with docker-compose is available here.

🆕 GPT Vision

Note

Available only on master builds

LocalAI supports understanding images by using LLaVA, and implements the GPT Vision API from OpenAI.


Usage

OpenAI docs: https://platform.openai.com/docs/guides/vision

To let LocalAI understand and reply with what it sees in the image, use the /v1/chat/completions endpoint, for example with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Setup

To setup the LLaVa models, follow the full example in the configuration examples.

🗣 Text to audio (TTS)

The /tts endpoint can be used to generate speech from text.

Input: input, model

For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'

Returns an audio/wav file.

Setup

LocalAI supports bark, piper and vall-e-x:

Note

The piper backend is used for onnx models and requires the voice models to be downloaded first.

To install the piper audio models manually, place the onnx voice files in your models directory (the model name used in the request is the filename, including the extension).

To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model":"it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay

Note:

  • aplay is a Linux command. You can use other tools to play the audio file.
  • The model name is the filename with the extension.
  • The model name is case sensitive.
  • LocalAI must be compiled with the GO_TAGS=tts flag.
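
As an alternative to piping the output into aplay, you can save the returned audio to a file. A minimal sketch from Python with the requests library:

import requests

# Minimal sketch: request speech from the piper backend and save the wav file.
payload = {
    "backend": "piper",
    "model": "it-riccardo_fasol-x-low.onnx",  # filename of an installed piper voice
    "input": "Ciao, sono Ettore",
}
resp = requests.post("http://localhost:8080/tts", json=payload)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # the endpoint returns an audio/wav body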

LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
}' | aplay

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

Configuration

Audio models can be configured via YAML files. This allows configuring specific settings for each backend. For instance, a backend might require a voice or support voice cloning, which must be specified in the configuration file.

name: tts
backend: vall-e-x
parameters: ...

🧠 Embeddings

LocalAI supports generating embeddings for text or lists of tokens.

For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings

Model compatibility

The embedding endpoint is compatible with llama.cpp models, bert.cpp models and sentence-transformers models available on Hugging Face.

Manual Setup

Create a YAML config file in the models directory. Specify the backend and the model file.

name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true
# .. other parameters

Bert embeddings

To use bert.cpp models you can use the bert embedding backend.

An example model config file:

name: text-embedding-ada-002
parameters:
  model: bert
backend: bert-embeddings
embeddings: true
# .. other parameters

The bert backend uses bert.cpp and ggml models.

For instance you can download the ggml quantized version of all-MiniLM-L6-v2 from https://huggingface.co/skeskinen/ggml:

wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert

To test locally (LocalAI server running on localhost), you can use curl (and jq at the end to prettify):

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "Your text string goes here",
  "model": "text-embedding-ada-002"
}' | jq "."
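
Or from Python, as a minimal sketch with the requests library that extracts the embedding vector from the OpenAI-style response:

import requests

# Minimal sketch: request an embedding and inspect the returned vector.
payload = {
    "model": "text-embedding-ada-002",
    "input": "Your text string goes here",
}
resp = requests.post("http://localhost:8080/embeddings", json=payload)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding), embedding[:5])  # vector length and the first few values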

Huggingface embeddings

To use sentence-transformers models from Hugging Face, you can use the sentencetransformers embedding backend.

name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2

The sentencetransformers backend uses Python sentence-transformers. For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models

Note
  • The sentencetransformers backend is an optional LocalAI backend and uses Python. If you are running LocalAI from the container images, it is already set up and ready to use.
  • If you are running LocalAI manually you must install the python dependencies (make prepare-extra-conda-environments). This requires conda to be installed.
  • For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
    • Example: EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"
  • The sentencetransformers backend supports only embeddings of text, not of tokens. If you need to embed tokens you can use the bert backend or llama.cpp.
  • No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.

Llama.cpp embeddings

Embeddings with llama.cpp are supported with the llama backend.

name: my-awesome-model
backend: llama
embeddings: true
parameters:
  model: ggml-file.bin
# ...

💡 Examples

  • Example that uses LlamaIndex with LocalAI as the embedding backend: here.

โœ๏ธ Constrained grammars

The chat endpoint accepts an additional grammar parameter which takes a BNF-defined grammar.

This allows the LLM to constrain the output to a user-defined schema, allowing it to generate JSON, YAML, and anything else that can be defined with a BNF grammar.

Note

This feature works only with models compatible with the llama.cpp backend (see also Model compatibility). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887

Setup

Follow the setup instructions from the LocalAI functions page.

💡 Usage example

For example, to constrain the output to either yes or no:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Do you like apples?"}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'
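
The same request from Python, as a minimal sketch with the requests library (the reply is constrained by the grammar to one of the two strings):

import requests

# Minimal sketch: constrain the model's answer to "yes" or "no" with a BNF grammar.
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Do you like apples?"}],
    "grammar": 'root ::= ("yes" | "no")',
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])  # either "yes" or "no"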