Advanced

Advanced configuration with YAML files

To define default prompts and model parameters (such as a custom default top_p or top_k), LocalAI can be configured to serve user-defined models with a set of default parameters and templates.

You can either create multiple YAML files in the models path or specify a single YAML configuration file. Consider the following models folder in the chatbot-ui example:

base ❯ ls -liah examples/chatbot-ui/models 
36487587 drwxr-xr-x 2 mudler mudler 4.0K May  3 12:27 .
36487586 drwxr-xr-x 3 mudler mudler 4.0K May  3 10:42 ..
36465214 -rw-r--r-- 1 mudler mudler   10 Apr 27 07:46 completion.tmpl
36464855 -rw-r--r-- 1 mudler mudler   ?G Apr 27 00:08 luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
36464537 -rw-r--r-- 1 mudler mudler  245 May  3 10:42 gpt-3.5-turbo.yaml
36467388 -rw-r--r-- 1 mudler mudler  180 Apr 27 07:46 chat.tmpl

The gpt-3.5-turbo.yaml file defines the gpt-3.5-turbo model, which is an alias for luna-ai-llama2 with pre-defined options.

For instance, consider the following configuration, which declares gpt-3.5-turbo backed by the luna-ai-llama2 model:

name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..

# Default context size
context_size: 512
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj, rwkv

# Enable prompt caching
prompt_cache_path: "alpaca-cache"
prompt_cache_all: true

# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  assistant: '### Response:'
  system: '### System Instruction:'
  user: '### Instruction:'
template:
  # ".tmpl" template files with the prompt templates to use by default on the endpoint calls. Note the file names are given without the extension
  completion: completion
  chat: chat
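
Once this configuration is in place, the model can be referenced by its name through the OpenAI-compatible API. A minimal request might look like the following (a sketch, assuming LocalAI is listening on localhost:8080):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.9
}'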

Specifying a config file via the CLI allows you to declare models in a single file as a list, for instance:

- name: list1
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat
- name: list2
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat

See also the chatbot-ui example for how to use config files.
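
For example, assuming the list above is saved as models.yaml, LocalAI could be started with it as a single configuration file (a sketch; the exact flag name may vary between versions):

./local-ai --models-path ./models --config-file ./models.yaml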

Full model config file reference

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo

# Default model parameters.
# These options can also be specified in the API calls
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..
  top_k: 
  top_p: 
  max_tokens:
  ignore_eos: true
  n_keep: 10
  seed: 
  mode: 
  step:
  negative_prompt:
  typical_p:
  tfz:
  frequency_penalty:
  mirostat_eta:
  mirostat_tau:
  mirostat: 
  rope_freq_base:
  rope_freq_scale:
  negative_prompt_scale:

# Default context size
context_size: 512
# Default number of threads
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj, rwkv
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# strings to trim space from
trimspace:
- string
# Strings to cut from the response
cutstrings:
- "string"

# Directory used to store additional assets
asset_dir: ""

# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
  assistant: "ASSISTANT:"
template:
  # ".tmpl" template files with the prompt templates to use by default on the endpoint calls. Note the file names are given without the extension
  completion: completion
  chat: chat
  edit: edit_template
  function: function_template

function:
   disable_no_action: true
   no_action_function_name: "reply"
   no_action_description_name: "Reply to the AI assistant"

system_prompt:
rms_norm_eps:
# Set it to 8 for llama2 70b
ngqa: 1
## LLAMA specific options
# Enable F16 if backend supports it
f16: true
# Enable debugging
debug: true
# Enable embeddings
embeddings: true
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 22
# Enable memory lock
mmlock: true
# GPU settings to split the tensors across multiple GPUs and define a main GPU
# see llama.cpp for usage
tensor_split: ""
main_gpu: ""
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: true
# Set NUMA mode (CPU only)
numa: true
# Lora settings
lora_adapter: "/path/to/lora/adapter"
lora_base: "/path/to/lora/base"
# Disable mulmatq (CUDA)
no_mulmatq: true

# Diffusers/transformers
cuda: true

Prompt templates

The API doesn't inject a default prompt for talking to the model. You have to use a prompt similar to what's described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:
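
With such a template in place, a completion request only needs to carry the actual instruction, which replaces {{.Input}}. A sketch of such a request, assuming LocalAI is listening on localhost:8080 and the model file is called foo.bin:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "foo.bin",
  "prompt": "Write a poem about a tree.",
  "temperature": 0.7
}'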

See the prompt-templates directory in this repository for templates for some of the most popular models.

For the edit endpoint, an example template for alpaca-based models can be:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
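
As a sketch, assuming the edit template above is configured for the model, a request to the OpenAI-compatible /v1/edits endpoint could look like this:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "instruction": "Fix the spelling mistakes",
  "input": "What day of the wek is it?"
}'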

Install models using the API

Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.

A curated collection of model files is available in the model-gallery (work in progress!). The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.

To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) or gallery id (id) and, optionally, the name the model should have in LocalAI (name):

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
    "name": "lunademo"
}'
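
The apply call runs asynchronously and returns a job identifier. As a sketch (assuming the response carries a uuid field, as described in the model gallery documentation), progress can then be polled on the /models/jobs endpoint:

curl http://localhost:8080/models/jobs/<uuid-returned-by-apply>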

Preloading models during startup

In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai

PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls of the /models/apply endpoint.

Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):

- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j
# ...
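
For instance, assuming the list above is stored in preload-models.yaml, LocalAI could be started like this:

PRELOAD_MODELS_CONFIG=preload-models.yaml local-ai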

Automatic prompt caching

LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text before the input.

To enable prompt caching, you can control the settings in the model config YAML file:


# Enable prompt caching
prompt_cache_path: "cache"
prompt_cache_all: true

prompt_cache_path is relative to the models folder. You can specify a name for the cache file, which will be created automatically during the first load if prompt_cache_all is set to true.

Configuring a specific backend for the model

By default, LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.

The available backends are listed in the model compatibility table.

In order to specify a backend for your models, create a model config file in your models directory specifying the backend:

name: gpt-3.5-turbo

# Default model parameters
parameters:
  # Relative to the models path
  model: ...

backend: llama-stable
# ...

Connect external backends

LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.

The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.

So for instance, to register a new backend which is a local file:

./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"

Or a remote URI:

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"
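
Once registered, the external backend name can be referenced like any built-in backend, for example in a model config file (a sketch with hypothetical names):

name: my-model
parameters:
  model: my-model-file.bin
backend: my-awesome-backend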

Environment variables

When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:

  • REBUILD (default: false): rebuild LocalAI on startup
  • BUILD_TYPE: build type. Available: cublas, openblas, clblas
  • GO_TAGS: Go tags. Available: stablediffusion
  • HUGGINGFACEHUB_API_TOKEN: special token for interacting with the HuggingFace Inference API, required only when using the langchain-huggingface backend
  • EXTRA_BACKENDS: a space separated list of backends to prepare. For example, EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the conda environment on start

Here is how to configure these variables:

# Option 1: command line
docker run --env REBUILD=true localai
# Option 2: set within an env file
docker run --env-file .env localai
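
A minimal .env file for the second option could look like the following (the values are only examples):

REBUILD=true
BUILD_TYPE=cublas
GO_TAGS=stablediffusion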

Build only a single backend

You can control the backends that are built by setting the GRPC_BACKENDS environment variable. For instance, to build only the llama-cpp backend:

make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build

By default, all the backends are built.

Extra backends

LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. The container images that are built and published on quay.io come in two sets: core and extra. By default, images bring all the dependencies and backends supported by LocalAI (we call those extra images). The -core images instead bring only the strictly necessary dependencies to run LocalAI with a core set of backends.

If you wish to build a custom container image with extra backends, you can use the core images and build only the backends you are interested in, or prepare the environment on startup by using the EXTRA_BACKENDS environment variable. For instance, to use the diffusers backend:

FROM quay.io/go-skynet/local-ai:master-ffmpeg-core

RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers

Remember also to set the EXTERNAL_GRPC_BACKENDS environment variable (or --external-grpc-backends as CLI flag) to point to the backends you are using (EXTERNAL_GRPC_BACKENDS="backend_name:/path/to/backend"), for example with diffusers:

FROM quay.io/go-skynet/local-ai:master-ffmpeg-core

RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers

ENV EXTERNAL_GRPC_BACKENDS="diffusers:/build/backend/python/diffusers/run.sh"
Note

You can specify remote external backends or paths to local files. The syntax is backend-name:/path/to/backend or backend-name:host:port.

At runtime

When using the -core container image it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:

docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master-ffmpeg-core


Fine-tuning LLMs for text generation

Note

Section under construction

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

Open In Colab

Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn't support a fine-tuning endpoint, but there are plans to support that. For the time being, a guide is provided here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).

There is an e2e example of fine-tuning an LLM to use with LocalAI, written by @mudler, available here.

The steps involved are:

  • Preparing a dataset
  • Prepare the environment and install dependencies
  • Fine-tune the model
  • Merge the Lora base with the model
  • Convert the model to gguf
  • Use the model with LocalAI

Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we are aiming for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.

A dataset for an instruction-following model (like Alpaca) can look like the following:

[
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 },
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 }
]

Each "text" entry contains the whole text that is used for fine-tuning. For example, for an instruction-following model it follows this format (more or less):

<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>

The instruction format works as follows: when running inference with the model, we feed it only the first part, up to and including the ## Response header, and the model completes the text with the response.
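
For instance, at inference time the prompt sent to the model would stop right after the response header, leaving the model to fill in the rest (a sketch based on the dataset above):

As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible

## Instruction

Write a poem about a tree.

## Response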

Prepare a dataset, and upload it to your Google Drive if you are using the Google Colab. Otherwise place it next to the axolotl.yaml file as dataset.json.

Install dependencies

# Install axolotl and dependencies
git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

# https://github.com/oobabooga/text-generation-webui/issues/4238
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Configure accelerate:

accelerate config default

Fine-tuning

We will need to configure axolotl. In this example an axolotl.yaml file that uses openllama-3b for fine-tuning is provided. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.

If you have a big dataset, you can pre-tokenize it to speedup the fine-tuning process:

# Optional pre-tokenize (run only if big dataset)
python -m axolotl.cli.preprocess axolotl.yaml

Now we are ready to start the fine-tuning process:

# Fine-tune
accelerate launch -m axolotl.cli.train axolotl.yaml

After we have finished the fine-tuning, we merge the Lora base with the model:

# Merge lora
python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

And we convert it to the gguf format that LocalAI can consume:


# Convert to gguf
git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && make LLAMA_CUBLAS=1 && popd

# We need to convert the pytorch model into ggml for quantization
# It creates 'ggml-model-f16.gguf' in the 'merged' directory.
pushd llama.cpp && python convert.py --outtype f16 \
    ../qlora-out/merged/pytorch_model-00001-of-00002.bin && popd

# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
pushd llama.cpp &&  ./quantize ../qlora-out/merged/ggml-model-f16.gguf \
    ../custom-model-q4_0.bin q4_0

You should now have a custom-model-q4_0.bin file that you can copy into the LocalAI models directory and use with LocalAI.
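
As a final step, a minimal model config for the fine-tuned file might look like the following (a sketch; the name is arbitrary and the prompt template should match the format used during fine-tuning):

name: custom-model
parameters:
  model: custom-model-q4_0.bin
  temperature: 0.2
context_size: 1024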

Development documentation

Note

This section is for developers and contributors. If you are looking for the user documentation, this is not the right place!

This section will collect how-tos, notes, and development documentation.

Contributing

We use conventional commits and semantic versioning. Please follow the conventional commits specification when writing commit messages.

Creating a gRPC backend

LocalAI backends are gRPC servers.

In order to create a new backend you need:

  • If there are changes required to the protobuf code, modify the proto file and re-generate the code with make protogen.
  • Modify the Makefile to add your new backend and re-generate the client code with make protogen if necessary.
  • Create a new gRPC server in extra/grpc if it's not written in Go, and create the specific implementation.
    • Golang gRPC servers should be added in the pkg/backend directory according to their type. See piper as an example.
    • Golang servers need a corresponding cmd/grpc binary that must be created too; see cmd/grpc/piper as an example, and update the Makefile accordingly to build the binary at build time.
  • Update the Dockerfile: if the backend is written in another language, update the Dockerfile default EXTERNAL_GRPC_BACKENDS variable by adding the new binary to the list.

Once you are done, you can either re-build LocalAI with your backend, or try it out by running the gRPC server manually and pointing LocalAI to it with --external-grpc-backends (or the EXTERNAL_GRPC_BACKENDS environment variable, a comma-separated list of name:host:port tuples, e.g. my-awesome-backend:host:port):

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port" ...