LocalAI


πŸ’‘ Get help - ❓FAQ ❓How tos πŸ’­Discussions πŸ’­Discord

πŸ’» Quickstart πŸ“£ News πŸ›« Examples πŸ–ΌοΈ Models πŸš€ Roadmap

LocalAI is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that’s compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images and audio (and more) locally or on-prem with consumer-grade hardware, supporting multiple model families that are compatible with the ggml format. It does not require a GPU. It is maintained by mudler.



In a nutshell:

  • Local, OpenAI drop-in alternative REST API. You own your data.
  • NO GPU required. NO Internet access is required either
    • Optional: GPU acceleration is available for llama.cpp-compatible LLMs. See also the build section.
  • Supports multiple models
  • πŸƒ Once loaded the first time, it keep models loaded in memory for faster inference
  • ⚑ Doesn’t shell-out, but uses C++ bindings for a faster inference and better performance.

LocalAI was created by Ettore Di Giacinto and is a community-driven project, focused on making AI accessible to anyone. Contributions, feedback and PRs are welcome!

Note that this started just as a fun weekend project in order to try to create the necessary pieces for a full AI assistant like ChatGPT: the community is growing fast and we are working hard to make it better and more stable. If you want to help, please consider contributing (see below)!

πŸš€ Features

πŸ”₯πŸ”₯ Hot topics / Roadmap

Roadmap

πŸ†• New! LLM finetuning guide

Hot topics (looking for contributors):

If you want to help and contribute, issues up for grabs: https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3A%22up+for+grabs%22

How does it work?

LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers; you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is also possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.

LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.


Contribute and help

To help the project you can:

  • If you have technological skills and want to contribute to development, have a look at the open issues. If you are new you can have a look at the good-first-issue and help-wanted labels.

  • If you don’t have technological skills you can still help by improving documentation, adding examples, or sharing your user stories with our community; any help and contribution is welcome!

🌟 Star history

LocalAI Star history Chart

πŸ“– License

LocalAI is a community-driven project created by Ettore Di Giacinto.

MIT - Author Ettore Di Giacinto

πŸ™‡ Acknowledgements

LocalAI couldn’t have been built without the help of great software already available from the community. Thank you!

Backstory

As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted to have a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (or what was initially known as llama-cli) and added an API to it.

But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.

Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any Golang GC penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good for backends and APIs and is easy to maintain. And hey, don’t forget that I’m all about sharing the love. That’s why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.

As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!

Oh, and let’s not forget the real MVP hereβ€”llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!

πŸ€— Contributors

This is a community project, a special thanks to our contributors! πŸ€—

Subsections of LocalAI

Getting started

LocalAI is available as a container image and binary. It can be used with docker, podman, kubernetes and any container engine. You can check out all the available images with corresponding tags here.

See also our How to section for end-to-end guided examples curated by the community.

How to get started

The easiest way to run LocalAI is by using docker compose or with Docker (to build locally, see the build section).

Note

To run with GPU acceleration, see GPU acceleration.

# Prepare the `models` directory
mkdir models

# copy your models to it
cp your-model.gguf models/

# run the LocalAI container
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# You should see:
# 
# β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
# β”‚                   Fiber v2.42.0                   β”‚
# β”‚               http://127.0.0.1:8080               β”‚
# β”‚       (bound on host 0.0.0.0 and port 8080)       β”‚
# β”‚                                                   β”‚
# β”‚ Handlers ............. 1  Processes ........... 1 β”‚
# β”‚ Prefork ....... Disabled  PID ................. 1 β”‚
# β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

# Try the endpoint with curl
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
Note
  • If running on Apple Silicon (ARM), it is not suggested to run via Docker due to emulation. Follow the build instructions to use Metal acceleration for full GPU support.
  • If you are running on an Apple x86_64 machine you can use Docker; there is no additional gain in building it from source.
# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# copy your models to models/
cp your-model.gguf models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'

Note: If you are on Windows, please make sure the project is on the Linux Filesystem, otherwise loading models might be slow. For more Info: Microsoft Docs

For installing LocalAI in Kubernetes, you can use the following helm chart:

# Install the helm repository
helm repo add go-skynet https://go-skynet.github.io/helm-charts/
# Update the repositories
helm repo update
# Get the values
helm show values go-skynet/local-ai > values.yaml

# Edit the values if needed
# vim values.yaml ...

# Install the helm chart
helm install local-ai go-skynet/local-ai -f values.yaml

Container images

LocalAI has a set of images to support CUDA, ffmpeg and ‘vanilla’ (CPU-only). The image list is on quay:

  • master
  • latest
  • v2.0.0
  • v2.0.0-ffmpeg
  • v2.0.0-ffmpeg-core

Core images (tags ending in -core) are smaller images without pre-downloaded Python dependencies.

CUDA 11 images:

  • master-cublas-cuda11
  • master-cublas-cuda11-core
  • v2.0.0-cublas-cuda11
  • v2.0.0-cublas-cuda11-core
  • v2.0.0-cublas-cuda11-ffmpeg
  • v2.0.0-cublas-cuda11-ffmpeg-core

CUDA 12 images:

  • master-cublas-cuda12
  • master-cublas-cuda12-core
  • v2.0.0-cublas-cuda12
  • v2.0.0-cublas-cuda12-core
  • v2.0.0-cublas-cuda12-ffmpeg
  • v2.0.0-cublas-cuda12-ffmpeg-core


Example:

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:latest
  • FFmpeg: quay.io/go-skynet/local-ai:v2.0.0-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v2.0.0-cublas-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v2.0.0-cublas-cuda12-ffmpeg
Note

Note: the binary inside the image is pre-compiled and might not suit all CPUs. To enable CPU optimizations for the execution environment, the default behavior is to rebuild when starting the container. To disable this auto-rebuild behavior, set the environment variable REBUILD to false.

See docs on all environment variables for more info.

Example: Use luna-ai-llama2 model with docker

mkdir models

# Download luna-ai-llama2 to models/
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2

# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl

docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "luna-ai-llama2",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}

To see other model configurations, see also the example section here.

From binaries

LocalAI binary releases are available on GitHub.

You can control LocalAI with command line arguments, for example to specify a binding address or the number of threads.

CLI parameters

Parameter | Environment Variable | Default | Description
--f16 | $F16 | false | Enable f16 mode
--debug | $DEBUG | false | Enable debug mode
--cors | $CORS | false | Enable CORS support
--cors-allow-origins value | $CORS_ALLOW_ORIGINS | | Specify origins allowed for CORS
--threads value | $THREADS | 4 | Number of threads to use for parallel computation
--models-path value | $MODELS_PATH | ./models | Path to the directory containing models used for inferencing
--preload-models value | $PRELOAD_MODELS | | List of models to preload in JSON format at startup
--preload-models-config value | $PRELOAD_MODELS_CONFIG | | A config with a list of models to apply at startup. Specify the path to a YAML config file
--config-file value | $CONFIG_FILE | | Path to the config file
--address value | $ADDRESS | :8080 | Specify the bind address for the API server
--image-path value | $IMAGE_PATH | | Path to the directory used to store generated images
--context-size value | $CONTEXT_SIZE | 512 | Default context size of the model
--upload-limit value | $UPLOAD_LIMIT | 15 | Default upload limit in megabytes (audio file upload)
--galleries | $GALLERIES | | Allows to set galleries from the command line
--parallel-requests | $PARALLEL_REQUESTS | false | Enable backends to handle multiple requests in parallel. This is for backends that support multiple requests in parallel, like llama.cpp or vllm
--single-active-backend | $SINGLE_ACTIVE_BACKEND | false | Allow only one backend to be running
--api-keys value | $API_KEY | empty | List of API keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys
--enable-watchdog-idle | $WATCHDOG_IDLE | false | Enable watchdog for stopping idle backends. This will stop the backends if they are in idle state for too long
--enable-watchdog-busy | $WATCHDOG_BUSY | false | Enable watchdog for stopping busy backends that exceed a defined threshold
--watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Watchdog busy timeout: backends busy for longer than this are stopped
--watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Watchdog idle timeout: backends idle for longer than this are stopped
--preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the API is NOT launched and only the preloaded models/backends are started. This is intended for multi-node setups
--external-grpc-backends | $EXTERNAL_GRPC_BACKENDS | none | Comma separated list of external gRPC backends to use. Format: name:host:port or name:/path/to/file
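
When --api-keys (or the API_KEY environment variable) is set, every request must carry one of the configured keys. A minimal Python sketch, assuming the standard OpenAI-style Authorization: Bearer header is used and that LocalAI listens on localhost:8080 (the key below is a placeholder):

import requests

# Placeholder values: adjust to your LocalAI address and to a key you started LocalAI with.
LOCALAI_URL = "http://localhost:8080"
API_KEY = "sk-local-example"

# Assumption: LocalAI checks the OpenAI-style "Authorization: Bearer <key>" header
# when --api-keys / $API_KEY is configured.
response = requests.get(
    f"{LOCALAI_URL}/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())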

Run LocalAI in Kubernetes

LocalAI can be installed inside Kubernetes with helm.

Requirements:

  • SSD storage class, or disable mmap to load the whole model in memory

By default, the helm chart will install the LocalAI instance using the ggml-gpt4all-j model without persistent storage.
  1. Add the helm repo
    helm repo add go-skynet https://go-skynet.github.io/helm-charts/
    
  2. Install the helm chart:
    helm repo update
    helm install local-ai go-skynet/local-ai -f values.yaml
    

Note: For further configuration options, see the helm chart repository on GitHub.

Example values

Deploy a single LocalAI pod with 6GB of persistent storage serving up a ggml-gpt4all-j model with custom prompt.

### values.yaml

replicaCount: 1

deployment:
  image: quay.io/go-skynet/local-ai:latest # CPU only; to use a GPU, change it to an image that supports GPU (e.g. "v2.0.0-cublas-cuda12-core")
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
  - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}

Build from source

See the build section.

Other examples


To see other examples on how to integrate with other projects for instance for question answering or for using it with chatbot-ui, see: examples.

Clients

OpenAI clients are already compatible with LocalAI by overriding the basePath, or the target URL.

Javascript

https://github.com/openai/openai-node/

import { Configuration, OpenAIApi } from 'openai';

const configuration = new Configuration({
  basePath: `http://localhost:8080/v1`
});
const openai = new OpenAIApi(configuration);

Python

https://github.com/openai/openai-python

Set the OPENAI_API_BASE environment variable, or by code:

import openai

openai.api_base = "http://localhost:8080/v1"

# create a chat completion
chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])

# print the chat completion
print(chat_completion.choices[0].message.content)

πŸ†• What's New

04-12-2023: v2.0.0

This release brings a major overhaul in some backends.

Breaking/important changes:

  • Backend rename: llama-stable renamed to llama-ggml 1287
  • Prompt template changes: 1254 (extra space in roles)
  • Apple metal bugfixes: 1365

New:

  • Added support for LLaVa and OpenAI Vision API support ( 1254 )
  • Python based backends are now using conda to track env dependencies ( 1144 )
  • Support for parallel requests ( 1290 )
  • Support for transformers-embeddings ( 1308 )
  • Watchdog for backends ( 1341 ). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI’s llama-cpp implementation, we now have a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can now be enabled for all the backends.
  • Whisper.cpp updates ( 1302 )
  • Petals backend ( 1350 )
  • Full LLM fine-tuning example to use with LocalAI: https://localai.io/advanced/fine-tuning/

Due to the Python dependencies, the images grew in size. If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with -core.

Full changelog: https://github.com/mudler/LocalAI/releases/tag/v2.0.0

30-10-2023: v1.40.0

This release is a preparation before v2 - the efforts now will be to refactor, polish and add new backends. Follow up on: https://github.com/mudler/LocalAI/issues/1126

Hot topics

This release brings the llama-cpp backend, which is a C++ backend tied to llama.cpp. It follows and tracks recent versions of llama.cpp more closely. It is not feature compatible with the current llama backend, but the plan is to sunset the current llama backend in favor of this one. This will probably be the last release containing the older llama backend written in Go and C++. The major improvement with this change is that there are fewer layers that could expose potential bugs, and it also eases maintenance.

Support for ROCm/HIPBLAS

This release brings support for AMD thanks to @65a . See more details in 1100

More CLI commands

Thanks to @jespino, the local-ai binary now has more subcommands, allowing you to manage the gallery or try out inferencing directly. Check it out!

Release notes

25-09-2023: v1.30.0

This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!

Check out the documentation for vllm here and Vall-E-X here

Release notes

26-08-2023: v1.25.0

Hey everyone, Ettore here. I’m so happy to share this release - while this summer is hot, it apparently doesn’t stop LocalAI development :)

This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!

Attention 🚨

From this release the llama backend supports only gguf files (see 943 ). LocalAI however still supports ggml files. We ship a version of llama.cpp from before that change in a separate backend, named llama-stable, to still allow loading ggml files. If you were specifying the llama backend manually to load ggml files, from this release you should use llama-stable instead, or not specify a backend at all (LocalAI will handle this automatically).

Image generation enhancements

The Diffusers backend got various enhancements, including support for generating images from images, longer prompts, and support for more kernel schedulers. See the Diffusers documentation for more information.

Lora adapters

Now it’s possible to load lora adapters for llama.cpp. See 955 for more information.

Device management

It is now possible, for single-GPU devices, to specify --single-active-backend to allow only one backend to be active at a time 925 .

Community spotlight


Resources management

Thanks to the continuous community efforts (another cool contribution from dave-gray101 ), it’s now possible to shut down a backend programmatically via the API. There is an ongoing effort in the community to better handle resources. See also the πŸ”₯Roadmap.

New how-to section

Thanks to the community efforts, we now have a new how-to section with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to lunamidori5 from the community for the impressive efforts on this!

πŸ’‘ More examples!

LocalAGI in discord!

Did you know that we now have a few cool bots in our Discord? Come check them out! We also have an instance of LocalAGI ready to help you out!

Changelog summary

Breaking Changes πŸ› 

  • feat: bump llama.cpp, add gguf support by mudler in 943

Exciting New Features πŸŽ‰

  • feat(Makefile): allow to restrict backend builds by mudler in 890
  • feat(diffusers): various enhancements by mudler in 895
  • feat: make initializer accept gRPC delay times by mudler in 900
  • feat(diffusers): add DPMSolverMultistepScheduler++, DPMSolverMultistepSchedulerSDE++, guidance_scale by mudler in 903
  • feat(diffusers): overcome prompt limit by mudler in 904
  • feat(diffusers): add img2img and clip_skip, support more kernels schedulers by mudler in 906
  • Usage Features by dave-gray101 in 863
  • feat(diffusers): be consistent with pipelines, support also depthimg2img by mudler in 926
  • feat: add --single-active-backend to allow only one backend active at the time by mudler in 925
  • feat: add llama-stable backend by mudler in 932
  • feat: allow to customize rwkv tokenizer by dave-gray101 in 937
  • feat: backend monitor shutdown endpoint, process based by dave-gray101 in 938
  • feat: Allow to load lora adapters for llama.cpp by mudler in 955

Join our Discord community! Our vibrant community is growing fast, and we are always happy to help! https://discord.gg/uJAeKSAGDy

The full changelog is available here.


πŸ”₯πŸ”₯πŸ”₯πŸ”₯ 12-08-2023: v1.24.0 πŸ”₯πŸ”₯πŸ”₯πŸ”₯

This release brings four(!) new backends to LocalAI: 🐢 Bark, πŸ¦™ AutoGPTQ, 🧨 Diffusers, πŸ¦™ Exllama, and a lot of improvements!

Major improvements:

🐢 Bark

Bark is a text-prompted generative audio model - it uses GPT-style techniques to generate audio from text. It is a great addition to LocalAI, and it’s available in the container images by default.

It can also generate music, see the example: lion.webm

πŸ¦™ AutoGPTQ

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

It is targeted mainly at GPU usage. Check out the AutoGPTQ documentation for usage.

πŸ¦™ Exllama

Exllama is β€œa more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. It is a faster alternative for running LLaMA models on GPU. Check out the Exllama documentation for usage.

🧨 Diffusers

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. It is currently experimental and supports generating images only, so you might encounter issues with models which haven’t been tested yet. Check out the Diffusers documentation for usage.

πŸ”‘ API Keys

Thanks to community contributions, it’s now possible to specify a list of API keys that can be used to gate API requests.

API Keys can be specified with the API_KEY environment variable as a comma-separated list of keys.

πŸ–ΌοΈ Galleries

The model-gallery repositories are now configured by default in the container images.

πŸ’‘ New project

LocalAGI is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).

See it here in action planning a trip for San Francisco!

The full changelog is available here.


πŸ”₯πŸ”₯ 29-07-2023: v1.23.0 πŸš€

This release focuses mostly on bugfixing and updates, with just a couple of new features:

  • feat: add rope settings and negative prompt, drop grammar backend by mudler in 797
  • Added CPU information to entrypoint.sh by @finger42 in 794
  • feat: cancel stream generation if client disappears by @tmm1 in 792

Most notably, this release brings important fixes for CUDA (and not only):

  • fix: add rope settings during model load, fix CUDA by mudler in 821
  • fix: select function calls if ’name’ is set in the request by mudler in 827
  • fix: symlink libphonemize in the container by mudler in 831
Note

From this release OpenAI functions are available in the llama backend. The llama-grammar backend has been deprecated. See also OpenAI functions.

The full changelog is available here


πŸ”₯πŸ”₯πŸ”₯ 23-07-2023: v1.22.0 πŸš€

  • feat: add llama-master backend by mudler in 752
  • [build] pass build type to cmake on libtransformers.a build by @TonDar0n in 741
  • feat: resolve JSONSchema refs (planners) by mudler in 774
  • feat: backends improvements by mudler in 778
  • feat(llama2): add template for chat messages by dave-gray101 in 782
Note

From this release, to use the OpenAI functions you need to use the llama-grammar backend. A llama backend has been added for tracking llama.cpp master, and llama-grammar for the grammar functionality that has not been merged upstream yet. See also OpenAI functions. Until the feature is merged we will have two llama backends.

Huggingface embeddings

In this release it is now possible to specify external gRPC backends to LocalAI that can be used for inferencing 778 . It is now possible to write internal backends in any language, and a huggingface-embeddings backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also Embeddings.

LLaMa 2 has been released!

Thanks to the community effort, LocalAI now supports templating for LLaMa 2! More at 782 , until we update the model gallery with LLaMa 2 models!

Official langchain integration

Progress has been made to support LocalAI with langchain. See: https://github.com/langchain-ai/langchain/pull/8134


πŸ”₯πŸ”₯πŸ”₯ 17-07-2023: v1.21.0 πŸš€

  • [whisper] Partial support for verbose_json format in transcribe endpoint by @ldotlopez in 721
  • LocalAI functions by @mudler in 726
  • gRPC-based backends by @mudler in 743
  • falcon support (7b and 40b) with ggllm.cpp by @mudler in 743

LocalAI functions

This allows running OpenAI functions as described in the OpenAI blog post and documentation: https://openai.com/blog/function-calling-and-other-api-updates.

This is a video of running the same example locally with LocalAI: localai-functions-1

And here is when it actually chooses to reply to the user instead of using functions! functions-2

Note: functions are supported only with llama.cpp-compatible models.

A full example is available here: https://github.com/go-skynet/LocalAI/tree/master/examples/functions

gRPC backends

This is an internal refactor which is not user-facing; however, it eases maintenance and the addition of new backends to LocalAI!

falcon support

Now Falcon 7b and 40b models compatible with https://github.com/cmp-nct/ggllm.cpp are supported as well.

The former, ggml-based backend has been renamed to falcon-ggml.

Default pre-compiled binaries

From this release the default behavior of the images has changed. Compilation is no longer triggered automatically on start; to recompile local-ai from scratch on start and switch back to the old behavior, set REBUILD=true in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the build section for more information.

Full release changelog


πŸ”₯πŸ”₯πŸ”₯ 28-06-2023: v1.20.0 πŸš€

Exciting New Features πŸŽ‰

Container images

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.20.0
  • FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda12-ffmpeg

Updates

Updates to llama.cpp, go-transformers, gpt4all.cpp and rwkv.cpp.

The NUMA option was enabled by mudler in 684 , along with many new parameters (mmap, mmlock, ...). See advanced for the full list of parameters.

In this release there is support for gallery repositories. These are repositories that contain models and can be used to install models. The default gallery, which contains only freely licensed models, is on GitHub: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the GALLERIES environment variable. An automatic index of huggingface models is available as well.

For example, now you can start LocalAI with the following environment variable to use both galleries:

GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:ci-robbot/localai-huggingface-zoo/index.yaml","name":"huggingface"}]

And at runtime you can now install a model from huggingface with:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "huggingface@thebloke__open-llama-7b-open-instruct-ggml__open-llama-7b-open-instruct.ggmlv3.q4_0.bin" }'

or a tts voice with:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "model-gallery@voice-en-us-kathleen-low" }'

See also models for a complete documentation.

Text to Audio

Now LocalAI uses piper and go-piper to generate audio from text. This is an experimental feature, and it requires GO_TAGS=tts to be set during build. It is enabled by default in the pre-built container images.

To setup audio models, you can use the new galleries, or setup the models manually as described in the API section of the documentation.

You can check the full changelog in Github


πŸ”₯πŸ”₯πŸ”₯ 19-06-2023: v1.19.0 πŸš€

  • Full CUDA GPU offload support ( PR by mudler. Thanks to chnyda for handing over the GPU access, and lu-zero to help in debugging )
  • Full GPU Metal Support is now fully functional. Thanks to Soleblaze for ironing out the Metal Apple Silicon support!

Container images:

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
  • FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda12-ffmpeg

πŸ”₯πŸ”₯πŸ”₯ 06-06-2023: v1.18.0 πŸš€

This LocalAI release is full of new features, bugfixes and updates! Thanks to the community for the help, this was a great community release!

We now support a vast variety of models while staying backward compatible with prior quantization formats: this new release still allows loading older formats as well as the new k-quants!

New features

  • ✨ Added support for falcon-based model families (7b) ( mudler )
  • ✨ Experimental support for Metal Apple Silicon GPU - ( mudler and thanks to Soleblaze for testing! ). See the build section.
  • ✨ Support for token stream in the /v1/completions endpoint ( samm81 )
  • ✨ Added huggingface backend ( Evilfreelancer )
  • πŸ“· Stablediffusion can now output 2048x2048 images with esrgan! ( mudler )

Container images

  • πŸ‹ CUDA container images (arm64, x86_64) ( sebastien-prudhomme )
  • πŸ‹ FFmpeg container images (arm64, x86_64) ( mudler )

Dependencies updates

  • πŸ†™ Bloomz has been updated to the latest ggml changes, including new quantization format ( mudler )
  • πŸ†™ RWKV has been updated to the new quantization format( mudler )
  • πŸ†™ k-quants format support for the llama models ( mudler )
  • πŸ†™ gpt4all has been updated, incorporating upstream changes allowing to load older models, and with different CPU instruction set (AVX only, AVX2) from the same binary! ( mudler )

Generic

  • 🐧 Fully Linux static binary releases ( mudler )
  • πŸ“· Stablediffusion has been enabled on container images by default ( mudler ) Note: You can disable container image rebuilds with REBUILD=false

Examples

Two new projects now offer direct integration with LocalAI!

Full release changelog


29-05-2023: v1.17.0

Support for OpenCL has been added while building from sources.

You can now build LocalAI from source with BUILD_TYPE=clblas to have an OpenCL build. See also the build section.

For instructions on how to install OpenCL/CLBlast see here.

rwkv.cpp has been updated to the new ggml format commit.


27-05-2023: v1.16.0

Now it’s possible to automatically download pre-configured models before starting the API.

Start local-ai with the PRELOAD_MODELS environment variable containing a list of models from the gallery, for instance to install gpt4all-j as gpt-3.5-turbo:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]

llama.cpp models can now also automatically save the prompt cache state by specifying it in the model YAML configuration file:

# Enable prompt caching

# This is a file that will be used to save/load the cache. relative to the models directory.
prompt_cache_path: "alpaca-cache"

# Always enable prompt cache
prompt_cache_all: true

See also the advanced section.

Media, Blogs, Social

Previous

  • 23-05-2023: v1.15.0 released. The go-gpt2.cpp backend was renamed to go-ggml-transformers.cpp and updated, including https://github.com/ggerganov/llama.cpp/pull/1508, which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not gpt4all-mpt), Dolly, GPT2 and Starcoder based models. Binary releases available, various fixes, including 341 .
  • 21-05-2023: v1.14.0 released. Minor updates to the /models/apply endpoint, llama.cpp backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. gpt4all is still compatible with the old format.
  • 19-05-2023: v1.13.0 released! πŸ”₯πŸ”₯ updates to the gpt4all and llama backend, consolidated CUDA support ( 310 thanks to @bubthegreat and @Thireus ), preliminary support for installing models via API.
  • 17-05-2023: v1.12.0 released! πŸ”₯πŸ”₯ Minor fixes, plus CUDA ( 258 ) support for llama.cpp-compatible models and image generation ( 272 ).
  • 16-05-2023: πŸ”₯πŸ”₯πŸ”₯ Experimental support for CUDA ( 258 ) in the llama.cpp backend and Stable diffusion CPU image generation ( 272 ) in master.

Now LocalAI can generate images too:

Example images generated with mode=0 and mode=1 (winograd/sgemm).
  • 14-05-2023: v1.11.1 released! rwkv backend patch release
  • 13-05-2023: v1.11.0 released! πŸ”₯ Updated llama.cpp bindings: This update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ) - old models should still work with the gpt4all-llama backend.
  • 12-05-2023: v1.10.0 released! πŸ”₯πŸ”₯ Updated gpt4all bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. Also now embeddings endpoint supports tokens arrays. See the langchain-chroma example! Note - this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405 which makes models incompatible.
  • 11-05-2023: v1.9.0 released! πŸ”₯ Important whisper updates ( 233 229 ) and extended gpt4all model families support ( 232 ). Redpajama/dolly experimental ( 214 )
  • 10-05-2023: v1.8.0 released! πŸ”₯ Added support for fast and accurate embeddings with bert.cpp ( 222 )
  • 09-05-2023: Added experimental support for transcriptions endpoint ( 211 )
  • 08-05-2023: Support for embeddings with models using the llama.cpp backend ( 207 )
  • 02-05-2023: Support for rwkv.cpp models ( 158 ) and for /edits endpoint
  • 01-05-2023: Support for SSE stream of tokens in llama.cpp backends ( 152 )

Subsections of Features

⚑ GPU acceleration

Note

Section under construction

This section contains instructions on how to use LocalAI with GPU acceleration.

Note

For acceleration on AMD or Metal hardware there are no specific container images; see the build section.

CUDA

Requirement: nvidia-container-toolkit (installation instructions 1 2)

To use CUDA, use the images with the cublas tag.

The image list is on quay:

  • CUDA 11 tags: master-cublas-cuda11, v1.40.0-cublas-cuda11, …
  • CUDA 12 tags: master-cublas-cuda12, v1.40.0-cublas-cuda12, …
  • CUDA 11 + FFmpeg tags: master-cublas-cuda11-ffmpeg, v1.40.0-cublas-cuda11-ffmpeg, …
  • CUDA 12 + FFmpeg tags: master-cublas-cuda12-ffmpeg, v1.40.0-cublas-cuda12-ffmpeg, …

In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12

If the GPU inferencing is working, you should be able to see something like:

5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size  =  512.00 MB

Model configuration

Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):

name: my-model-name
# Default model parameters
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin

context_size: 1024
threads: 1

f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)

For diffusers, the configuration might look like this instead:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"

🎨 Image generation

anime_girl (Generated with AnimagineXL)

LocalAI supports generating images with Stable Diffusion, running on CPU with a C++ implementation (through the Stable-Diffusion-NCNN binding) and with 🧨 Diffusers.

Usage

OpenAI docs: https://platform.openai.com/docs/api-reference/images/create

To generate an image you can send a POST request to the /v1/images/generations endpoint with the instruction as the request body:

# 512x512 is supported too
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "A cute baby sea otter",
  "size": "256x256"
}'

Available additional parameters: mode, step.

Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
  "size": "256x256"
}'
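
The same request can be issued programmatically; a minimal Python sketch with the requests library, assuming LocalAI runs on localhost:8080 and that the response follows the OpenAI images schema with a url field per generated image:

import requests

# Generate an image and print where it can be downloaded from.
resp = requests.post(
    "http://localhost:8080/v1/images/generations",
    json={"prompt": "A cute baby sea otter", "size": "256x256"},
)
resp.raise_for_status()
# The OpenAI images schema returns the generated images under "data".
for image in resp.json().get("data", []):
    print(image.get("url"))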

stablediffusion-cpp

Example outputs with mode=0 and mode=1 (winograd/sgemm).

Note: image generator supports images up to 512x512. You can use other tools however to upscale the image, for instance: https://github.com/upscayl/upscayl.

Setup

Note: In order to use the images/generation endpoint with the stablediffusion C++ backend, you need to build LocalAI with GO_TAGS=stablediffusion. If you are using the container images, it is already enabled.

While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]

or as arg:

local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:go-skynet/model-gallery/stablediffusion.yaml

Alternatively, you can set up the model manually:

  1. Create a model file stablediffusion.yaml in the models folder:

name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets

  2. Create a stablediffusion_assets directory inside your models directory
  3. Download the ncnn assets from https://github.com/EdVince/Stable-Diffusion-NCNN#out-of-box and place them in stablediffusion_assets.

The models directory should look like the following:

models
β”œβ”€β”€ stablediffusion_assets
β”‚Β Β  β”œβ”€β”€ AutoencoderKL-256-256-fp16-opt.param
β”‚Β Β  β”œβ”€β”€ AutoencoderKL-512-512-fp16-opt.param
β”‚Β Β  β”œβ”€β”€ AutoencoderKL-base-fp16.param
β”‚Β Β  β”œβ”€β”€ AutoencoderKL-encoder-512-512-fp16.bin
β”‚Β Β  β”œβ”€β”€ AutoencoderKL-fp16.bin
β”‚Β Β  β”œβ”€β”€ FrozenCLIPEmbedder-fp16.bin
β”‚Β Β  β”œβ”€β”€ FrozenCLIPEmbedder-fp16.param
β”‚Β Β  β”œβ”€β”€ log_sigmas.bin
β”‚Β Β  β”œβ”€β”€ tmp-AutoencoderKL-encoder-256-256-fp16.param
β”‚Β Β  β”œβ”€β”€ UNetModel-256-256-MHA-fp16-opt.param
β”‚Β Β  β”œβ”€β”€ UNetModel-512-512-MHA-fp16-opt.param
β”‚Β Β  β”œβ”€β”€ UNetModel-base-MHA-fp16.param
β”‚Β Β  β”œβ”€β”€ UNetModel-MHA-fp16.bin
β”‚Β Β  └── vocab.txt
└── stablediffusion.yaml

Diffusers

This is an extra backend - it is already available in the container images and there is nothing to do for the setup.

Model setup

The models will be downloaded automatically from huggingface the first time you use the backend.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a

πŸ“– Text generation (GPT)

LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also the Model compatibility table for an up-to-date list of the supported model families.

Note:

  • You can also specify the model name as part of the OpenAI token.
  • If only one model is available, the API will use it for all the requests.

Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions endpoint with the instruction as the request body:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion you can send a POST request to the /v1/edits endpoint with the instruction as the request body:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens.

Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as the request body:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

List models

You can list all the models available with:

curl http://localhost:8080/v1/models

πŸ”ˆ Audio to text

The transcription endpoint allows converting audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint supports the audio formats supported by ffmpeg.

Usage

Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.

For instance, with cURL:

curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"

Example

Download one of the models from here in the models folder, and create a YAML file for your model:

name: whisper-1
backend: whisper
parameters:
  model: whisper-en

The transcriptions endpoint then can be tested like so:

## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}

πŸ”₯ OpenAI functions

LocalAI supports running OpenAI functions with llama.cpp compatible models.


To learn more about OpenAI functions, see the OpenAI API blog post.

πŸ’‘ Check out also LocalAGI for an example on how to use LocalAI functions.

Setup

OpenAI functions are available only with ggml or gguf models compatible with llama.cpp.

You don’t need to do anything specific - just use ggml or gguf models.

Usage example

You can configure a model manually with a YAML config file in the models directory, for example:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

To use the functions with the OpenAI client in python:

import openai
# ...
# Send the conversation and available functions to GPT
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    functions=functions,
    function_call="auto",
)
# ...
Note

When running the python script, be sure to:

  • Set OPENAI_API_KEY environment variable to a random string (the OpenAI api key is NOT required!)
  • Set OPENAI_API_BASE to point to your LocalAI service, for example OPENAI_API_BASE=http://localhost:8080
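
If the model decides to call a function, the reply follows the OpenAI schema: the returned message carries a function_call object with the function name and a JSON-encoded arguments string. A minimal sketch continuing the script above (dispatching to a real get_current_weather implementation is left as a hypothetical):

import json

message = response["choices"][0]["message"]

if message.get("function_call"):
    # The model picked a function: extract its name and decode the JSON arguments.
    name = message["function_call"]["name"]
    arguments = json.loads(message["function_call"]["arguments"])
    print(f"Model wants to call {name} with {arguments}")
    # Here you would dispatch to your own implementation of get_current_weather(...)
    # and send the result back in a follow-up "function" role message.
else:
    # Plain text reply, no function call.
    print(message["content"])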

Advanced

It is possible to also specify the full function signature (for debugging, or to use with other clients).

The chat endpoint accepts the grammar_json_functions additional parameter which takes a JSON schema object.

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-4",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.1,
     "grammar_json_functions": {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "function": {"const": "create_event"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "date": {"type": "string"},
                            "time": {"type": "string"}
                        }
                    }
                }
            },
            {
                "type": "object",
                "properties": {
                    "function": {"const": "search"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        }
                    }
                }
            }
        ]
    }
   }'

πŸ’‘ Examples

A full e2e example with docker-compose is available here.

πŸ†• GPT Vision

Note

Available only on master builds

LocalAI supports understanding images by using LLaVA, and implements the GPT Vision API from OpenAI.


Usage

OpenAI docs: https://platform.openai.com/docs/guides/vision

To let LocalAI understand and reply with what it sees in the image, use the /v1/chat/completions endpoint, for example with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Setup

To setup the LLaVa models, follow the full example in the configuration examples.

πŸ—£ Text to audio (TTS)

The /tts endpoint can be used to generate speech from text.

Input: input, model

For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'

Returns an audio/wav file.

Setup

LocalAI supports bark, piper and vall-e-x:

Note

The piper backend is used for onnx models and requires the modules to be downloaded first.

To install the piper audio models manually:

To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model":"it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay

Note:

  • aplay is a Linux command. You can use other tools to play the audio file.
  • The model name is the filename with the extension.
  • The model name is case sensitive.
  • LocalAI must be compiled with the GO_TAGS=tts flag.
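
On platforms without aplay you can simply save the returned audio to a file; a minimal Python sketch reusing the piper model from the example above:

import requests

# Same request as the cURL example, but writing the returned audio/wav to disk.
resp = requests.post(
    "http://localhost:8080/tts",
    json={
        "backend": "piper",
        "model": "it-riccardo_fasol-x-low.onnx",
        "input": "Ciao, sono Ettore",
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)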

LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
}' | aplay

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

#### Configuration

Audio models can be configured via `YAML` files. This allows to configure specific setting for each backend. For instance, backends might be specifying a voice or supports voice cloning which must be specified in the configuration file.

name: tts
backend: vall-e-x
parameters: ...
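
Once a model is configured this way, requests to the /tts endpoint can reference it by name. A sketch, reusing the example name above:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "tts",
  "input": "Hello world"
}' | aplay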

🧠 Embeddings

LocalAI supports generating embeddings for text or lists of tokens.

For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings

Model compatibility

The embedding endpoint is compatible with llama.cpp models, bert.cpp models and sentence-transformers models available in huggingface.

Manual Setup

Create a YAML config file in the models directory. Specify the backend and the model file.

name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true
# .. other parameters

Bert embeddings

To use bert.cpp models you can use the bert embedding backend.

An example model config file:

name: text-embedding-ada-002
parameters:
  model: bert
backend: bert-embeddings
embeddings: true
# .. other parameters

The bert backend uses bert.cpp and uses ggml models.

For instance you can download the ggml quantized version of all-MiniLM-L6-v2 from https://huggingface.co/skeskinen/ggml:

wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert

To test locally (LocalAI server running on localhost), you can use curl (and jq at the end to prettify):

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "Your text string goes here",
  "model": "text-embedding-ada-002"
}' | jq "."

Huggingface embeddings

To use sentence-transformers and models in huggingface you can use the sentencetransformers embedding backend.

name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2

The sentencetransformers backend uses Python sentence-transformers. For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models

Note
  • The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the container images, it is already configured and ready to use.
  • If you are running LocalAI manually you must install the Python dependencies (make prepare-extra-conda-environments). This requires conda to be installed.
  • For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable (see the combined example after this list).
    • Example: EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"
  • The sentencetransformers backend supports only text embeddings, not token embeddings. If you need to embed tokens you can use the bert backend or llama.cpp.
  • No models need to be downloaded before using the sentencetransformers backend: they are downloaded automatically the first time the API is used.
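
For instance, a manual (non-container) run could look like the following sketch, combining the steps above (paths are placeholders):

# Prepare the conda environments for the Python backends (requires conda)
make prepare-extra-conda-environments

# Start LocalAI and register the sentencetransformers external backend
EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py" \
  ./local-ai --models-path=./models/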

Llama.cpp embeddings

Embeddings with llama.cpp are supported with the llama backend.

name: my-awesome-model
backend: llama
embeddings: true
parameters:
  model: ggml-file.bin
# ...
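
You can then query the embeddings endpoint with the configured model name, as in the bert example above:

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "Your text string goes here",
  "model": "my-awesome-model"
}' | jq "."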

πŸ’‘ Examples

  • Example that uses LLamaIndex and LocalAI as embedding: here.

✍️ Constrained grammars

The chat endpoint accepts an additional grammar parameter which takes a BNF-defined grammar.

This constrains the LLM output to a user-defined schema, allowing it to generate JSON, YAML, or anything else that can be defined with a BNF grammar.

Note

This feature works only with models compatible with the llama.cpp backend (see also Model compatibility). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887

Setup

Follow the setup instructions from the LocalAI functions page.

πŸ’‘ Usage example

For example, to constrain the output to either yes, no:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Do you like apples?"}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'

Model compatibility

LocalAI is compatible with the models supported by llama.cpp, and also supports GPT4ALL-J and cerebras-GPT models in the ggml format.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.

Hardware requirements

Depending on the model you are attempting to run, you might need more RAM or CPU resources. Check out also here for gguf-based backends. rwkv is less expensive on resources.

Model compatibility table

Besides llama based models, LocalAI is compatible also with other architectures. The table below lists all the compatible models families and the associated binding repository.

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | Vicuna, Alpaca, LLaMa | yes | GPT and Functions | yes** | yes | CUDA, openCL, cuBLAS, Metal |
| gpt4all-llama | Vicuna, Alpaca, LLaMa | yes | GPT | no | yes | N/A |
| gpt4all-mpt | MPT | yes | GPT | no | yes | N/A |
| gpt4all-j | GPT4ALL-J | yes | GPT | no | yes | N/A |
| falcon-ggml (binding) | Falcon (*) | yes | GPT | no | no | N/A |
| gpt2 (binding) | GPT2, Cerebras | yes | GPT | no | no | N/A |
| dolly (binding) | Dolly | yes | GPT | no | no | N/A |
| gptj (binding) | GPTJ | yes | GPT | no | no | N/A |
| mpt (binding) | MPT | yes | GPT | no | no | N/A |
| replit (binding) | Replit | yes | GPT | no | no | N/A |
| gptneox (binding) | GPT NeoX, RedPajama, StableLM | yes | GPT | no | no | N/A |
| starcoder (binding) | Starcoder | yes | GPT | no | no | N/A |
| bloomz (binding) | Bloom | yes | GPT | no | no | N/A |
| rwkv (binding) | rwkv | yes | GPT | no | yes | N/A |
| bert (binding) | bert | no | Embeddings only | yes | no | N/A |
| whisper | whisper | no | Audio | no | no | N/A |
| stablediffusion (binding) | stablediffusion | no | Image | no | no | N/A |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | N/A |
| falcon (binding) | Falcon *** | yes | GPT | no | yes | CUDA |
| huggingface-embeddings sentence-transformers | BERT | no | Embeddings only | yes | no | N/A |
| bark | bark | no | Audio generation | no | no | yes |
| AutoGPTQ | GPTQ | yes | GPT | yes | no | N/A |
| exllama | GPTQ | yes | GPT only | no | no | N/A |
| diffusers | SD,… | no | Image generation | no | no | N/A |
| vall-e-x | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| vllm | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |

Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).

Tested with:

Note: You might need to convert some models from older formats to the new format; for instructions, see the README in llama.cpp, for instance to run gpt4all.

Subsections of Model compatibility

RWKV

A full example on how to run a rwkv model is in the examples.

Note: rwkv models need the rwkv backend specified in their YAML config file, and an associated tokenizer must be provided alongside the model:

36464540 -rw-r--r--  1 mudler mudler 1.2G May  3 10:51 rwkv_small
36464543 -rw-r--r--  1 mudler mudler 2.4M May  3 10:51 rwkv_small.tokenizer.json
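
A minimal config sketch for the files above could look like the following (assuming they live in the models directory):

# The tokenizer (rwkv_small.tokenizer.json) must sit next to the model file, as shown above
cat > models/rwkv_small.yaml <<EOF
name: rwkv_small
backend: rwkv
parameters:
  model: rwkv_small
EOF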

πŸ¦™ llama.cpp

llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.

Note

The ggml file format has been deprecated. If you are using ggml models and you are configuring your model with a YAML file, use the llama-stable backend instead. If you are relying on automatic detection of the model, you should be fine. For gguf models, use the llama backend.

Features

The llama.cpp model supports the following features:

Setup

LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.

Manual setup

It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model via the model parameter in the API calls.

You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt.

Automatic setup

LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.

For instance, if you have the galleries enabled, you can just start chatting with models in huggingface by running:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

LocalAI will automatically download and configure the model in the model directory.

Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.

YAML configuration

To use the llama.cpp backend, specify llama as the backend in the YAML file:

name: llama
backend: llama
parameters:
  # Relative to the models path
  model: file.gguf.bin

In the example above we specify llama as the backend to restrict loading to gguf models only.

For instance, to use the llama-stable backend for ggml models:

name: llama
backend: llama-stable
parameters:
  # Relative to the models path
  model: file.ggml.bin
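
Either configuration can then be queried through the chat endpoint using the configured name (llama in the examples above):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
   "model": "llama",
   "messages": [{"role": "user", "content": "Say this is a test!"}],
   "temperature": 0.1
 }'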

Reference

πŸ¦™ Exllama

Exllama is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights".

Prerequisites

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

If you are building LocalAI locally, you need to install exllama manually first.

Model setup

Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:

$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/                                                                 
.keep                        WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml                                                     
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# ...

Test with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{                                                                                                         
   "model": "exllama",
   "messages": [{"role": "user", "content": "How are you?"}],
   "temperature": 0.1
 }'

πŸ¦™ AutoGPTQ

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Prerequisites

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

If you are building LocalAI locally, you need to install AutoGPTQ manually.

Model setup

The models are automatically downloaded from huggingface the first time they are used, if not already present. It is possible to define models via a YAML config file, or just by querying the endpoint with the huggingface repository model name. For example, create a YAML config file in models/:

name: orca
backend: autogptq
model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
parameters:
  model: "TheBloke/orca_mini_v2_13b-GPTQ"
# ...

Test with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{                                                                                                         
   "model": "orca",
   "messages": [{"role": "user", "content": "How are you?"}],
   "temperature": 0.1
 }'

🐢 Bark

Bark allows you to generate audio from text prompts.

Setup

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

Model setup

There is nothing to do for the model setup: you can start using bark right away. The models will be downloaded automatically the first time you use the backend.

Usage

Use the tts endpoint by specifying the bark backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!"
   }' | aplay

To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!",
     "model": "v2/en_speaker_4"
   }' | aplay

Vall-E-X

VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.

Setup

The backend will automatically download the required files in order to run the model.

This is an extra backend: it is already available in the container images and there is nothing to do for the setup. If you are building LocalAI manually, you need to install Vall-E-X first.

Usage

Use the tts endpoint by specifying the vall-e-x backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "vall-e-x",
     "input":"Hello!"
   }' | aplay

Voice cloning

In order to use voice cloning capabilities you must create a YAML configuration file to set up a model:

name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
vall-e:
  # The path to the audio file to be cloned
  # relative to the models directory 
  audio_path: "path-to-wav-source.wav"

Then you can specify the model name in the requests:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "vall-e-x",
     "model": "cloned-voice",
     "input":"Hello!"
   }' | aplay

vLLM

vLLM is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.

Setup

Create a YAML file for the model you want to use with vllm.

To set up a model, you just need to specify the model name in the YAML config file:

name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"

# Uncomment to specify a quantization method (optional)
# quantization: "awq"

The backend will automatically download the required files in order to run the model.

Usage

Use the completions endpoint by specifying the vllm backend:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

🧨 Diffusers

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.

anime_girl anime_girl (Generated with AnimagineXL)

Note: currently only image generation is supported. It is experimental, so you might encounter issues with models that haven't been tested yet.

Setup

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

Model setup

The models will be downloaded automatically from huggingface the first time you use the backend.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a

Local models

You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  cfg_scale: 8
  clip_skip: 11

Configuration parameters

The following parameters are available in the configuration file:

| Parameter | Description | Default |
|---|---|---|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | StableDiffusionPipeline |

Several scheduler types are available:

Scheduler Description
ddim DDIM
pndm PNDM
heun Heun
unipc UniPC
euler Euler
euler_a Euler a
lms LMS
k_lms LMS Karras
dpm_2 DPM2
k_dpm_2 DPM2 Karras
dpm_2_a DPM2 a
k_dpm_2_a DPM2 a Karras
dpmpp_2m DPM++ 2M
k_dpmpp_2m DPM++ 2M Karras
dpmpp_sde DPM++ SDE
k_dpmpp_sde DPM++ SDE Karras
dpmpp_2m_sde DPM++ 2M SDE
k_dpmpp_2m_sde DPM++ 2M SDE Karras

Available pipeline types:

Pipeline type Description
StableDiffusionPipeline Stable diffusion pipeline
StableDiffusionImg2ImgPipeline Stable diffusion image to image pipeline
StableDiffusionDepth2ImgPipeline Stable diffusion depth to image pipeline
DiffusionPipeline Diffusion pipeline
StableDiffusionXLPipeline Stable diffusion XL pipeline

Usage

Text to Image

Use the image generation endpoint with the model name from the configuration file:

curl http://localhost:8080/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "<positive prompt>|<negative prompt>", 
      "model": "animagine-xl", 
      "step": 51,
      "size": "1024x1024" 
    }'

Image to Image

https://huggingface.co/docs/diffusers/using-diffusers/img2img

An example model (GPU):

name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25

f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,image"

To use the model, send the input image (base64-encoded) along with the prompt to the image generation endpoint:

IMAGE_PATH=/path/to/your/image
(echo -n '{"image": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations

Depth to Image

https://huggingface.co/docs/diffusers/using-diffusers/depth2img

name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# Force CPU usage
f16: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6

Then send the request, passing the input image base64-encoded:

(echo -n '{"image": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations

Build

Build locally

Requirements:

Either Docker/podman, or

  • Golang >= 1.21
  • Cmake/make
  • GCC

In order to build the LocalAI container image locally you can use docker:

# build the image
docker build -t localai .
docker run localai

Or you can build the binary manually with make:

git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build

To run: ./local-ai

Note

CPU flagset compatibility

LocalAI uses different backends based on ggml and llama.cpp to run models. If your CPU doesn’t support common instruction sets, you can disable them during build:

CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" make build

To have effect on the container image, you need to set REBUILD=true:

docker run --rm -ti -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -e REBUILD=true -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" -v $PWD/models:/models quay.io/go-skynet/local-ai:latest

Build on mac

Building on Mac (M1 or M2) works, but you may need to install some prerequisites using brew.

The below has been tested by one mac user and found to work. Note that this doesn’t use Docker to run the server:

# install build dependencies
brew install abseil cmake go grpc protobuf wget

# clone the repo
git clone https://github.com/go-skynet/LocalAI.git

cd LocalAI

# build the binary
make build

# Download gpt4all-j to models/
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j

# Use a template from the examples
cp -rf prompt-templates/ggml-gpt4all-j.tmpl models/

# Run LocalAI
./local-ai --models-path=./models/ --debug=true

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

Build with Image generation support

Requirements: OpenCV, Gomp

Image generation is experimental and requires GO_TAGS=stablediffusion to be set during build:

make GO_TAGS=stablediffusion build

Build with Text to audio support

Requirements: piper-phonemize

Text to audio support is experimental and requires GO_TAGS=tts to be set during build:

make GO_TAGS=tts build
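
Multiple tags can be combined if you need both features; for instance (a sketch, matching the default GO_TAGS listed in the table below):

make GO_TAGS="stablediffusion tts" build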

Acceleration

List of the variables available to customize the build:

| Variable | Default | Description |
|---|---|---|
| BUILD_TYPE | None | Build type. Available: cublas, openblas, clblas, metal, hipblas |
| GO_TAGS | tts stablediffusion | Go tags. Available: stablediffusion, tts |
| CLBLAST_DIR | | Specify a CLBlast directory |
| CUDA_LIBPATH | | Specify a CUDA library path |

OpenBLAS

Software acceleration.

Requirements: OpenBLAS

make BUILD_TYPE=openblas build

CuBLAS

Nvidia Acceleration.

Requirement: Nvidia CUDA toolkit

Note: CuBLAS support is experimental and has not been tested on real hardware. Please report any issues you find!

make BUILD_TYPE=cublas build

More information is available in the upstream PR: https://github.com/ggerganov/llama.cpp/pull/1412

Hipblas (AMD GPU with ROCm on Arch Linux)

Packages:

pacman -S base-devel git rocm-hip-sdk rocm-opencl-sdk opencv clblast grpc

Library links:

export CGO_CFLAGS="-I/usr/include/opencv4"
export CGO_CXXFLAGS="-I/usr/include/opencv4"
export CGO_LDFLAGS="-L/opt/rocm/hip/lib -lamdhip64 -L/opt/rocm/lib -lOpenCL -L/usr/lib -lclblast -lrocblas -lhipblas -lrocrand -lomp -O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link"

Build:

make BUILD_TYPE=hipblas GPU_TARGETS=gfx1030

ClBLAS

AMD/Intel GPU acceleration.

Requirement: OpenCL, CLBlast

make BUILD_TYPE=clblas build

To specify a CLBlast directory, set CLBLAST_DIR.

Metal (Apple Silicon)

make BUILD_TYPE=metal build

# Set `gpu_layers: 1` to your YAML model config file and `f16: true`
# Note: only models quantized with q4_0 are supported!
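
As a sketch, a model config enabling Metal offload could look like the following (the model filename is a placeholder; only q4_0-quantized models are supported):

cat > models/my-metal-model.yaml <<EOF
name: my-metal-model
backend: llama
f16: true
gpu_layers: 1
parameters:
  model: my-model-q4_0.gguf
EOF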

Windows compatibility

Make sure to give enough resources to the running container. See https://github.com/go-skynet/LocalAI/issues/2

Advanced

Advanced configuration with YAML files

In order to define default prompts and model parameters (such as a custom default top_p or top_k), LocalAI can be configured to serve user-defined models with a set of default parameters and templates.

You can either create multiple YAML files in the models path or specify a single YAML configuration file. Consider the following models folder in the example/chatbot-ui:

base ❯ ls -liah examples/chatbot-ui/models 
36487587 drwxr-xr-x 2 mudler mudler 4.0K May  3 12:27 .
36487586 drwxr-xr-x 3 mudler mudler 4.0K May  3 10:42 ..
36465214 -rw-r--r-- 1 mudler mudler   10 Apr 27 07:46 completion.tmpl
36464855 -rw-r--r-- 1 mudler mudler   ?G Apr 27 00:08 luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
36464537 -rw-r--r-- 1 mudler mudler  245 May  3 10:42 gpt-3.5-turbo.yaml
36467388 -rw-r--r-- 1 mudler mudler  180 Apr 27 07:46 chat.tmpl

The gpt-3.5-turbo.yaml file defines the gpt-3.5-turbo model, which is an alias for luna-ai-llama2 with pre-defined options.

For instance, consider the following that declares gpt-3.5-turbo backed by the luna-ai-llama2 model:

name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..

# Default context size
context_size: 512
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv

# Enable prompt caching
prompt_cache_path: "alpaca-cache"
prompt_cache_all: true

# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  assistant: '### Response:'
  system: '### System Instruction:'
  user: '### Instruction:'
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat

Specifying a config file via the CLI allows you to declare models in a single file as a list, for instance:

- name: list1
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat
- name: list2
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat

See also chatbot-ui as an example on how to use config files.

Full config model file reference

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo

# Default model parameters.
# These options can also be specified in the API calls
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..
  top_k: 
  top_p: 
  max_tokens:
  ignore_eos: true
  n_keep: 10
  seed: 
  mode: 
  step:
  negative_prompt:
  typical_p:
  tfz:
  frequency_penalty:
  mirostat_eta:
  mirostat_tau:
  mirostat: 
  rope_freq_base:
  rope_freq_scale:
  negative_prompt_scale:

# Default context size
context_size: 512
# Default number of threads
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# string to trim space to
trimspace:
- string
# Strings to cut from the response
cutstrings:
- "string"

# Directory used to store additional assets
asset_dir: ""

# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
  assistant: "ASSISTANT:"
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat
  edit: edit_template
  function: function_template

function:
   disable_no_action: true
   no_action_function_name: "reply"
   no_action_description_name: "Reply to the AI assistant"

system_prompt:
rms_norm_eps:
# Set it to 8 for llama2 70b
ngqa: 1
## LLAMA specific options
# Enable F16 if backend supports it
f16: true
# Enable debugging
debug: true
# Enable embeddings
embeddings: true
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 22
# Enable memory lock
mmlock: true
# GPU setting to split the tensor in multiple parts and define a main GPU
# see llama.cpp for usage
tensor_split: ""
main_gpu: ""
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: true
# Set NUMA mode (CPU only)
numa: true
# Lora settings
lora_adapter: "/path/to/lora/adapter"
lora_base: "/path/to/lora/base"
# Disable mulmatq (CUDA)
no_mulmatq: true

# Diffusers/transformers
cuda: true

Prompt templates

The API doesn’t inject a default prompt for talking to the model. You have to use a prompt similar to what’s described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:

See the prompt-templates directory in this repository for templates for some of the most popular models.

For the edit endpoint, an example template for alpaca-based models can be:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:

Install models using the API

Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.

A curated collection of model files is in the model-gallery (work in progress!). The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.

To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition (id or url) and, optionally, the name the model should have in LocalAI (name):

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
    "name": "lunademo"
}'

Preloading models during startup

In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai

PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.

Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):

- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j
# ...

Automatic prompt caching

LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text before the input.

To enable prompt caching, you can control the settings in the model config YAML file:


# Enable prompt caching
prompt_cache_path: "cache"
prompt_cache_all: true

prompt_cache_path is relative to the models folder. You can specify a name for the file that will be automatically created during the first load if prompt_cache_all is set to true.

Configuring a specific backend for the model

By default LocalAI will try to autoload the model by trying all the backends. This works for most models, but some backends are NOT configured to autoload.

The available backends are listed in the model compatibility table.

In order to specify a backend for your models, create a model config file in your models directory specifying the backend:

name: gpt-3.5-turbo

# Default model parameters
parameters:
  # Relative to the models path
  model: ...

backend: llama-stable
# ...

Connect external backends

LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.

The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.

So for instance, to register a new backend which is a local file:

./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"

Or a remote URI:

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"

Environment variables

When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:

| Environment variable | Default | Description |
|---|---|---|
| REBUILD | false | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the conda environment on start |

Here is how to configure these variables:

# Option 1: command line
docker run --env REBUILD=true localai
# Option 2: set within an env file
docker run --env-file .env localai

Build only a single backend

You can control the backends that are built by setting the GRPC_BACKENDS environment variable. For instance, to build only the llama-cpp backend:

make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build

By default, all the backends are built.

Extra backends

LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. The container images built and published on quay.io are split into core and extra images. By default, images bring all the dependencies and backends supported by LocalAI (we call those extra images). The -core images instead bring only the strictly necessary dependencies to run LocalAI, with only a core set of backends.

If you wish to build a custom container image with extra backends, you can use the core images and build only the backends you are interested in, or prepare the environment on startup by using the EXTRA_BACKENDS environment variable. For instance, to use the diffusers backend:

FROM quay.io/go-skynet/local-ai:master-ffmpeg-core

RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers

Remember also to set the EXTERNAL_GRPC_BACKENDS environment variable (or --external-grpc-backends as CLI flag) to point to the backends you are using (EXTERNAL_GRPC_BACKENDS="backend_name:/path/to/backend"), for example with diffusers:

FROM quay.io/go-skynet/local-ai:master-ffmpeg-core

RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers

ENV EXTERNAL_GRPC_BACKENDS="diffusers:/build/backend/python/diffusers/run.sh"
Note

You can specify remote external backends or path to local files. The syntax is backend-name:/path/to/backend or backend-name:host:port.

In runtime

When using the -core container image it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:

docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master-ffmpeg-core

Subsections of Advanced

Fine-tuning LLMs for text generation

Note

Section under construction

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

Open In Colab Open In Colab

Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn’t support the fine-tuning endpoint, but there are plans to support it. For the time being, this guide provides a simple starting point on how to fine-tune a model and use it with LocalAI (and also with llama.cpp).

There is an e2e example of fine-tuning a LLM model to use with LocalAI written by @mudler available here.

The steps involved are:

  • Preparing a dataset
  • Prepare the environment and install dependencies
  • Fine-tune the model
  • Merge the Lora base with the model
  • Convert the model to gguf
  • Use the model with LocalAI

Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we aim for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.

A dataset for an instructor model (like Alpaca) can look like the following:

[
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ...",
 },
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ...",
 }
]

Each text field contains the whole text used for fine-tuning. For example, an instructor model follows this format (more or less):

<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>

The instruction format works as follows: at inference time we feed the model only the first part, up to the ## Instruction block, and the model completes the text with the ## Response block.

Prepare a dataset, and upload it to your Google Drive if you are using Google Colab. Otherwise place it next to the axolotl.yaml file as dataset.json.

Install dependencies

# Install axolotl and dependencies
git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

# https://github.com/oobabooga/text-generation-webui/issues/4238
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Configure accelerate:

accelerate config default

Fine-tuning

We will need to configure axolotl. In this example a configuration file, axolotl.yaml, is provided that uses openllama-3b for fine-tuning. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.

If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:

# Optional pre-tokenize (run only if big dataset)
python -m axolotl.cli.preprocess axolotl.yaml

Now we are ready to start the fine-tuning process:

# Fine-tune
accelerate launch -m axolotl.cli.train axolotl.yaml

After we have finished the fine-tuning, we merge the Lora base with the model:

# Merge lora
python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

And we convert it to the gguf format that LocalAI can consume:


# Convert to gguf
git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && make LLAMA_CUBLAS=1 && popd

# We need to convert the pytorch model into ggml for quantization
# It creates 'ggml-model-f16.gguf' in the 'merged' directory.
pushd llama.cpp && python convert.py --outtype f16 \
    ../qlora-out/merged/pytorch_model-00001-of-00002.bin && popd

# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
pushd llama.cpp &&  ./quantize ../qlora-out/merged/ggml-model-f16.gguf \
    ../custom-model-q4_0.bin q4_0

Now you should have a custom-model-q4_0.bin file that you can copy into the LocalAI models directory and use with LocalAI.
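
As a final sketch, you can install the quantized file and query it through the completions endpoint (the model is referenced by its filename, and the prompt follows the dataset format used above):

cp custom-model-q4_0.bin models/

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
   "model": "custom-model-q4_0.bin",
   "prompt": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\n"
 }'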

Development documentation

Note

This section is for developers and contributors. If you are looking for the user documentation, this is not the right place!

This section will collect how-to, notes and development documentation

Contributing

We use conventional commits and semantic versioning. Please follow the conventional commits specification when writing commit messages.

Creating a gRPC backend

LocalAI backends are gRPC servers.

In order to create a new backend you need:

  • If there are changes required to the protobuf code, modify the proto file and re-generate the code with make protogen.
  • Modify the Makefile to add your new backend and re-generate the client code with make protogen if necessary.
  • If the backend is not written in Go, create a new gRPC server in extra/grpc with the specific implementation.
    • Golang gRPC servers should be added to the pkg/backend directory according to their type. See piper as an example.
    • Golang servers need a respective cmd/grpc binary, which must be created too; see cmd/grpc/piper as an example, and update the Makefile accordingly to build the binary at build time.
  • Update the Dockerfile: if the backend is written in another language, update the Dockerfile default EXTERNAL_GRPC_BACKENDS variable by listing the new binary link.

Once you are done, you can either re-build LocalAI with your backend, or try it out by running the gRPC server manually and pointing LocalAI at it with --external-grpc-backends (or the EXTERNAL_GRPC_BACKENDS environment variable, a comma-separated list of name:host:port tuples, e.g. my-awesome-backend:host:port):

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port" ...

πŸ–ΌοΈ Model gallery




The model gallery is an (experimental!) collection of model configurations for LocalAI.

To ease model installation, LocalAI provides a way to preload models on start, and to download and install them at runtime. You can install models manually by copying them into the models directory, or use the API to configure, download and verify the model assets for you. As the UI is still a work in progress, you will find here the documentation about the API endpoints.

Note

The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.

Note

GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.

  • Open LLM Leaderboard - here you can find a list of the most performing models on the Open LLM benchmark. Keep in mind models compatible with LocalAI must be quantized in the gguf format.

Model repositories

You can install a model in runtime, while the API is running and it is started already, or before starting the API by preloading the models.

To install a model in runtime you will need to use the /models/apply LocalAI API endpoint.

To enable the model-gallery repository you need to start local-ai with the GALLERIES environment variable:

GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL>"}]

For example, to enable the model-gallery repository, start local-ai with:

GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

where github:go-skynet/model-gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/go-skynet/model-gallery/main/index.yaml.
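
For instance, when running the container image, the variable can be passed directly to docker (a sketch, reusing the run command shown in the build section):

docker run --rm -ti -p 8080:8080 \
  -e GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]' \
  -v $PWD/models:/models quay.io/go-skynet/local-ai:latest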

Note

As this feature is experimental, you need to run local-ai with a list of GALLERIES. Currently there are two galleries:

  • An official one, containing only definitions and models with a clear LICENSE to avoid any DMCA infringement. As I’m not sure what’s the best action to take in this case, I’m not going to include any model that is not clearly licensed in this repository, which is officially linked to LocalAI.
  • A “community” one that contains an index of huggingface models that are compatible with the ggml format and lives in the localai-huggingface-zoo repository.

To enable the two repositories, start LocalAI with the GALLERIES environment variable:

GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

If running with docker-compose, simply edit the .env file and uncomment the GALLERIES variable, and add the one you want to use.

Note

You might not find all the models in this gallery. Automated CI updates the gallery automatically. However, you can find most of the models on huggingface (https://huggingface.co/); generally a model should be available ~24h after upload.

Under no circumstances are LocalAI and its developers responsible for the models in this gallery, as the CI is just indexing them and providing a convenient way to install them with an automatic configuration and a consistent API. Don’t install models from authors you don’t trust, and check the appropriate license for your use case. Models are automatically indexed and hosted on huggingface (https://huggingface.co/). For any issue with the models, please open an issue on the model gallery repository if it’s a LocalAI misconfiguration, otherwise refer to the huggingface repository. If you think a model should not be listed, please reach out to us and we will remove it from the gallery.

Note

There is no documentation yet on how to build a gallery or a repository - but you can find an example in the model-gallery repository.

List Models

To list all the available models, use the /models/available endpoint:

curl http://localhost:8080/models/available

To search for a model, you can use jq:

# Get all information about models with a name that contains "replit"
curl http://localhost:8080/models/available | jq '.[] | select(.name | contains("replit"))'

# Get the binary name of all local models (not hosted on Hugging Face)
curl http://localhost:8080/models/available | jq '.[] | .name | select(contains("localmodels"))'

# Get all of the model URLs that contains "orca"
curl http://localhost:8080/models/available | jq '.[] | .urls | select(. != null) | add | select(contains("orca"))'

How to install a model from the repositories

Models can be installed by passing either the full URL of the YAML config file, or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.

To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@bert-embeddings"
   }'  

where:

  • model-gallery is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in both galleries, the first match wins.
  • bert-embeddings is the model name in the gallery (read its config here).
Note

If the huggingface model gallery is enabled (it’s enabled by default), and the model has an entry in the model gallery’s associated YAML config (for huggingface, see model-gallery/huggingface.yaml), you can install models by specifying directly the model’s id. For example, to install wizardlm superhot:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "huggingface@TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GGML/wizardlm-13b-v1.0-superhot-8k.ggmlv3.q4_K_M.bin"
   }'  

Note that the id can be used similarly when pre-loading models at start.
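
For example, a sketch combining the id above with the PRELOAD_MODELS variable described earlier:

PRELOAD_MODELS='[{"id": "model-gallery@bert-embeddings"}]' local-ai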

If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.

In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the models files and write the configuration to the folder used to store models.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>"
   }' 
# or if from a repository
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "<GALLERY>@<MODEL_NAME>"
   }' 

An example that installs openllama can be:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "https://github.com/go-skynet/model-gallery/blob/main/openllama_3b.yaml"
   }'  

The API will return a job uuid that you can use to track the job progress:

{"uuid":"1059474d-f4f9-11ed-8d99-c4cbe106d571","status":"http://localhost:8080/models/jobs/1059474d-f4f9-11ed-8d99-c4cbe106d571"}

For instance, a small example bash script that waits for a job to complete can be (requires jq):

# Set this to the model definition URL you want to install
model_url="<MODEL_CONFIG_FILE>"
response=$(curl -s http://localhost:8080/models/apply -H "Content-Type: application/json" -d "{\"url\": \"$model_url\"}")

job_id=$(echo "$response" | jq -r '.uuid')

while [ "$(curl -s http://localhost:8080/models/jobs/"$job_id" | jq -r '.processed')" != "true" ]; do 
  sleep 1
done

echo "Job completed"

To preload models on start instead, you can use the PRELOAD_MODELS environment variable, setting it to a JSON array of model URIs:

PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'

Note: url or id must be specified. url points to a model gallery configuration file, while id refers to a model inside a repository. If both are specified, the id will be used.

For example:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]

or as arg:

local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:go-skynet/model-gallery/stablediffusion.yaml
Note

You can find already some open licensed models in the model gallery.

If you don’t find the model in the gallery you can try to use the “base” model and provide a URL to LocalAI:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "github:go-skynet/model-gallery/base.yaml",
     "name": "model-name",
     "files": [
        {
            "uri": "<URL>",
            "sha256": "<SHA>",
            "filename": "model"
        }
     ]
   }'

Installing a model with a different name

To install a model with a different name, specify a name parameter in the request body.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>"
   }'  

For example, to install a model as gpt-3.5-turbo:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
      "url": "github:go-skynet/model-gallery/gpt4all-j.yaml",
      "name": "gpt-3.5-turbo"
   }'  

Additional Files

To download additional files with the model, use the files parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file_url>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ]
   }'  

Overriding configuration files

To override portions of the configuration file, such as the backend or the model file, use the overrides parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "overrides": {
        "backend": "llama",
        "f16": true,
        ...
     }
   }'  

Examples

Embeddings: Bert

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "github:go-skynet/model-gallery/bert-embeddings.yaml",
     "name": "text-embedding-ada-002"
   }'  

To test it:

LOCALAI=http://localhost:8080
curl $LOCALAI/v1/embeddings -H "Content-Type: application/json" -d '{
    "input": "Test",
    "model": "text-embedding-ada-002"
  }'

Image generation: Stable diffusion

URL: https://github.com/EdVince/Stable-Diffusion-NCNN

While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:go-skynet/model-gallery/stablediffusion.yaml"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]

or as arg:

local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:go-skynet/model-gallery/stablediffusion.yaml

Test it:

curl $LOCALAI/v1/images/generations -H "Content-Type: application/json" -d '{
            "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
            "mode": 2,  "seed":9000,
            "size": "256x256", "n":2
}'

Audio transcription: Whisper

URL: https://github.com/ggerganov/whisper.cpp

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:go-skynet/model-gallery/whisper-base.yaml",
     "name": "whisper-1"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/whisper-base.yaml", "name": "whisper-1"}]

or as arg:

local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/whisper-base.yaml", "name": "whisper-1"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:go-skynet/model-gallery/whisper-base.yaml
  name: whisper-1

GPTs

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "github:go-skynet/model-gallery/gpt4all-j.yaml",
     "name": "gpt4all-j"
   }'  

To test it:

curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt4all-j", 
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.1 
   }'

Note

LocalAI will create a batch process that downloads the required files from a model definition and automatically reloads itself to include the new model.

Input: url or id (required), name (optional), files (optional)

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_DEFINITION_URL>",
     "id": "<GALLERY>@<MODEL_NAME>",
     "name": "<INSTALLED_MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ],
     "overrides": { "backend": "...", "f16": true }
   }'

An optional list of additional files to download can be specified within files. The name allows you to override the model name. Finally, it is possible to override the model config file with overrides.

The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml). The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.

Returns a uuid and a url to follow up on the state of the process:

{ "uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}

To see a collection example of curated models definition files, see the model-gallery.

Get model job state /models/jobs/<uid>

This endpoint returns the state of the batch job associated with a model installation.

curl http://localhost:8080/models/jobs/<JOB_ID>

Returns a JSON object containing the error (if any), whether the job has been processed, and a status message:

{"error":null,"processed":true,"message":"completed"}

Integrations

The following software has out-of-the-box integrations with LocalAI.

LocalAI can be used as a drop-in replacement; the projects in this section, however, provide specific integrations with LocalAI:

Feel free to open an issue to get a page made for your project, or if you spot an error on one of these pages!

Subsections of Integrations

AIKit

GitHub Link - https://github.com/sozercan/aikit

AIKit is a quick, easy, local or cloud-agnostic way to get started hosting and deploying large language models (LLMs) for inference. No GPU, internet access, or additional tools are needed to get started, except for Docker!

AIKit uses LocalAI under-the-hood to run inference. LocalAI provides a drop-in replacement REST API that is OpenAI API compatible, so you can use any OpenAI API compatible client, such as Kubectl AI, Chatbot-UI and many more, to send requests to open-source LLMs powered by AIKit!

At this time, AIKit is tested with LocalAI llama backend. Other backends may work but are not tested. Please open an issue if you’d like to see support for other backends.

Features

  • 🐳 No GPU, Internet access or additional tools needed except for Docker!
  • 🀏 Minimal image size, resulting in fewer vulnerabilities and a smaller attack surface thanks to a custom distroless-based image
  • πŸš€ Easy to use declarative configuration
  • ✨ OpenAI API compatible to use with any OpenAI API compatible client
  • 🚒 Kubernetes deployment ready
  • πŸ“¦ Supports multiple models with a single image
  • πŸ–₯️ Supports GPU-accelerated inferencing with NVIDIA GPUs
  • πŸ” Signed images for aikit and pre-made models

Pre-made Models

AIKit comes with pre-made models that you can use out-of-the-box!

CPU

  • πŸ¦™ Llama 2 7B Chat: ghcr.io/sozercan/llama2:7b
  • πŸ¦™ Llama 2 13B Chat: ghcr.io/sozercan/llama2:13b
  • 🐬 Orca 2 13B: ghcr.io/sozercan/orca2:13b

NVIDIA CUDA

  • πŸ¦™ Llama 2 7B Chat (CUDA): ghcr.io/sozercan/llama2:7b-cuda
  • πŸ¦™ Llama 2 13B Chat (CUDA): ghcr.io/sozercan/llama2:13b-cuda
  • 🐬 Orca 2 13B (CUDA): ghcr.io/sozercan/orca2:13b-cuda

CUDA models include CUDA v12 and are meant to be used with NVIDIA GPU acceleration.
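
For example, one of the pre-made CPU images above can be started directly (a sketch of the generic run command shown in the Quick Start below; CUDA images additionally need the --gpus all flag described in the GPU acceleration section):

docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama2:7b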

Quick Start

Creating an image

This section shows how to create a custom image with models of your choosing. If you want to use one of the pre-made models, skip to running models.

Please see models folder for pre-made model definitions. You can find more model examples at go-skynet/model-gallery.

Create an aikitfile.yaml with the following structure:

#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

This is the simplest way to get started building an image. For the full aikitfile specification, see specs.

First, create a buildx buildkit instance. Alternatively, if you are using Docker v24 with containerd image store enabled, you can skip this step.

docker buildx create --use --name aikit-builder

Then build your image with:

docker buildx build . -t my-model -f aikitfile.yaml --load

This will build a local container image with your model(s). You can see the image with:

docker images
REPOSITORY    TAG       IMAGE ID       CREATED             SIZE
my-model      latest    e7b7c5a4a2cb   About an hour ago   5.51GB

Running models

You can start the inferencing server for your models with:

# for pre-made models, replace "my-model" with the image name
docker run -d --rm -p 8080:8080 my-model

You can then send requests to localhost:8080 to run inference from your models. For example:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-7b-chat",
     "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
   }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

Kubernetes Deployment

It is easy to get started deploying your models to Kubernetes!

Make sure you have a Kubernetes cluster running and kubectl is configured to talk to it, and your model images are accessible from the cluster.

You can use kind to create a local Kubernetes cluster for testing purposes.

# create a deployment
# for pre-made models, replace "my-model" with the image name
kubectl create deployment my-llm-deployment --image=my-model

# expose it as a service
kubectl expose deployment my-llm-deployment --port=8080 --target-port=8080 --name=my-llm-service

# easy to scale up and down as needed
kubectl scale deployment my-llm-deployment --replicas=3

# port-forward for testing locally
kubectl port-forward service/my-llm-service 8080:8080

# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-7b-chat",
     "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
   }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

For an example Kubernetes deployment and service YAML, see the kubernetes folder. Please note that these are examples; you may need to customize them (for example, with properly configured resource requests and limits) based on your needs.

GPU Acceleration Support

At this time, only NVIDIA GPU acceleration is supported. Please open an issue if you’d like to see support for other GPU vendors.

NVIDIA

AIKit supports GPU accelerated inferencing with NVIDIA Container Toolkit. You must also have NVIDIA Drivers installed on your host machine.
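
Before building, you may want to sanity-check that Docker can see your GPU by running nvidia-smi inside a CUDA base container (a sketch; the image tag is illustrative):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi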

For Kubernetes, NVIDIA GPU Operator provides a streamlined way to install the NVIDIA drivers and container toolkit to configure your cluster to use GPUs.

To get started with GPU-accelerated inferencing, make sure to set the following in your aikitfile and build your model.

runtime: cuda         # use NVIDIA CUDA runtime
f16: true             # use float16 precision
gpu_layers: 35        # number of layers to offload to GPU
low_vram: true        # for devices with low VRAM

Make sure to customize these values based on your model and GPU specs.

After building the model, you can run it with --gpus all flag to enable GPU support:

# for pre-made models, replace "my-model" with the image name
docker run --rm --gpus all -p 8080:8080 my-model

If GPU acceleration is working, you’ll see output similar to the following in the debug logs:

5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr ggml_init_cublas: found 1 CUDA devices:
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr   Device 0: Tesla T4, compute capability 7.5
...
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: using CUDA for GPU acceleration
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: mem required  =   70.41 MB (+ 2048.00 MB per state)
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading 32 repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading non-repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading v cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading k cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloaded 35/35 layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: VRAM used: 5869 MB

AnythingLLM

AnythingLLM is an open source ChatGPT equivalent tool for chatting with documents and more in a secure environment by Mintplex Labs Inc.

image image

⭐ Star on Github - https://github.com/Mintplex-Labs/anything-llm

  • Chat with your LocalAI models (or hosted models like OpenAI, Anthropic, and Azure)
  • Embed documents (txt, pdf, json, and more) using your LocalAI Sentence Transformers
  • Select any vector database you want (Chroma, Pinecone, Qdrant, Weaviate) or use the embedded on-instance vector database (LanceDB)
  • Supports single or multi-user tenancy with built-in permissions
  • Full developer API
  • Locally running SQLite db for minimal setup.

AnythingLLM is a fully transparent tool to deliver a customized, white-label ChatGPT equivalent experience using only the models and services you or your organization are comfortable using.

Why AnythingLLM?

AnythingLLM aims to enable you to quickly and comfortably get a ChatGPT equivalent experience using your proprietary documents for your organization with zero compromise on security or comfort.

What does AnythingLLM include?

  • Full UI
  • Full admin console and panel for managing users, chats, model selection, vector db connection, and embedder selection
  • Multi-user support and logins
  • Supports both desktop and mobile view ports
  • Built in vector database where no data leaves your instance at all
  • Docker support

Install

Local via docker

Running via docker and integrating with your LocalAI instance is a breeze.

First, pull in the latest AnythingLLM Docker image: docker pull mintplexlabs/anythingllm:master

Next, run the image in a container exposing port 3001: docker run -d -p 3001:3001 mintplexlabs/anythingllm:master

Now open http://localhost:3001 and you will be taken through onboarding to set up your AnythingLLM instance to your comfort level.

Integration with your LocalAI instance

There are two areas where you can leverage the models loaded into LocalAI: LLM and Embedding. Any LLM model should be ready to run a chat completion.
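
Before wiring up AnythingLLM, it can help to confirm that your LocalAI instance is reachable and has models loaded, for example by listing them via the OpenAI-compatible endpoint (a sketch; adjust the base URL to wherever your LocalAI instance is served from):

curl http://localhost:8080/v1/models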

LLM model selection

During onboarding, and from the sidebar settings, you can select LocalAI as your LLM. Here you can set both the model and the token limit of the specific model. The dropdown will automatically populate once your URL is set.

The URL should look like http://localhost:8000/v1 or wherever your LocalAI instance is being served from. Non-localhost URLs are permitted if hosting LocalAI on cloud services.

localai-setup localai-setup

LLM embedding model selection

During onboarding, and from the sidebar settings, you can select LocalAI as your preferred embedding engine. This is the model that will be used when you upload any kind of document via AnythingLLM. Here you can choose from the models available via the LocalAI API. The dropdown will automatically populate once your URL is set.

The URL should look like http://localhost:8000/v1 or wherever your LocalAI instance is being served from. Non-localhost URLs are permitted if hosting LocalAI on cloud services.

localai-setup localai-setup

AutoGPT4all

AutoGPT4All provides you with both bash and python scripts to set up and configure AutoGPT running with the GPT4All model on the LocalAI server. This setup allows you to run queries against an open-source licensed model without any limits, completely free and offline.

photo photo

Github Link - https://github.com/aorumbayev/autogpt4all

πŸš€ Quickstart

Using Bash Script:

git clone https://github.com/aorumbayev/autogpt4all.git
cd autogpt4all
chmod +x autogpt4all.sh
./autogpt4all.sh

Using Python Script:

Make sure you have Python installed on your machine.

git clone https://github.com/aorumbayev/autogpt4all.git
cd autogpt4all
python autogpt4all.py

❗️ Please note this script has been primarily tested on macOS with an M1 processor. It should work on Linux and Windows, but it has not been thoroughly tested on these platforms. If you are not on macOS, install git, go, and make before running the script.

πŸŽ›οΈ Script Options

For the bash script:

--custom_model_url - Specify a custom URL for the model download step. By default, the script will use https://gpt4all.io/models/ggml-gpt4all-l13b-snoozy.bin.

Example:

./autogpt4all.sh --custom_model_url "https://example.com/path/to/model.bin"

--uninstall - Uninstall the projects from your local machine by deleting the LocalAI and Auto-GPT directories.

Example:

./autogpt4all.sh --uninstall

To recap the available commands, a --help flag is also available for the bash script.

For the Python Script:

You can use similar options as the bash script:

--custom_model_url - Specify a custom URL for the model download step.

Example:

python autogpt4all.py --custom_model_url "https://example.com/path/to/model.bin"

--uninstall - Uninstall the projects from your local machine.

Example:

python autogpt4all.py --uninstall

BionicGPT

BionicGPT is an on-premise replacement for ChatGPT, offering the advantages of generative AI while maintaining strict data confidentiality. It can run on your laptop or scale into the data center.

BionicGPT Homepage - https://bionic-gpt.com Github link - https://github.com/purton-tech/bionicgpt

Try it out

Cut and paste the following into a docker-compose.yaml file and run docker-compose up -d, then access the user interface at http://localhost:7800/auth/sign_up. This has been tested on an AMD 2700x with 16GB of RAM. The included ggml-gpt4all-j model runs on CPU only. Warning: the images in this docker-compose file are large due to having the model weights pre-loaded for convenience.

services:

  # LocalAI with pre-loaded ggml-gpt4all-j
  local-ai:
    image: ghcr.io/purton-tech/bionicgpt-model-api:llama-2-7b-chat

  # Handles parsing of multiple documents types.
  unstructured:
    image: downloads.unstructured.io/unstructured-io/unstructured-api:db264d8
    ports:
      - "8000:8000"

  # Handles routing between the application, barricade and the LLM API
  envoy:
    image: ghcr.io/purton-tech/bionicgpt-envoy:1.1.10
    ports:
      - "7800:7700"

  # Postgres pre-loaded with pgVector
  db:
    image: ankane/pgvector
    environment:
      POSTGRES_PASSWORD: testpassword
      POSTGRES_USER: postgres
      POSTGRES_DB: finetuna
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Sets up our database tables
  migrations:
    image: ghcr.io/purton-tech/bionicgpt-db-migrations:1.1.10
    environment:
      DATABASE_URL: postgresql://postgres:testpassword@db:5432/postgres?sslmode=disable
    depends_on:
      db:
        condition: service_healthy

  # Barricade handles all /auth routes for user sign up and sign in.
  barricade:
    image: purtontech/barricade
    environment:
        # This secret key is used to encrypt cookies.
        SECRET_KEY: 190a5bf4b3cbb6c0991967ab1c48ab30790af876720f1835cbbf3820f4f5d949
        DATABASE_URL: postgresql://postgres:testpassword@db:5432/postgres?sslmode=disable
        FORWARD_URL: app
        FORWARD_PORT: 7703
        REDIRECT_URL: /app/post_registration
    depends_on:
      db:
        condition: service_healthy
      migrations:
        condition: service_completed_successfully
  
  # Our axum server delivering our user interface
  embeddings-job:
    image: ghcr.io/purton-tech/bionicgpt-embeddings-job:1.1.10
    environment:
      APP_DATABASE_URL: postgresql://ft_application:testpassword@db:5432/postgres?sslmode=disable
    depends_on:
      db:
        condition: service_healthy
      migrations:
        condition: service_completed_successfully
  
  # Our axum server delivering our user interface
  app:
    image: ghcr.io/purton-tech/bionicgpt:1.1.10
    environment:
      APP_DATABASE_URL: postgresql://ft_application:testpassword@db:5432/postgres?sslmode=disable
    depends_on:
      db:
        condition: service_healthy
      migrations:
        condition: service_completed_successfully

Kubernetes Ready

BionicGPT is optimized to run on Kubernetes and implements the full pipeline of LLM fine tuning from data acquisition to user interface.

BMO Chatbot

Generate and brainstorm ideas while creating your notes using Large Language Models (LLMs) such as OpenAI’s “gpt-3.5-turbo” and “gpt-4” for Obsidian.

Github Link - https://github.com/longy2k/obsidian-bmo-chatbot

Features

  • Chat from anywhere in Obsidian: Chat with your bot from anywhere within Obsidian.
  • Chat with current note: Use your chatbot to reference and engage within your current note.
  • Chatbot responds in Markdown: Receive formatted responses in Markdown for consistency.
  • Customizable bot name: Personalize the chatbot’s name.
  • System role prompt: Configure the chatbot to prompt for user roles before responding to messages.
  • Set Max Tokens and Temperature: Customize the length and randomness of the chatbot’s responses with Max Tokens and Temperature settings.
  • System theme color accents: Seamlessly matches the chatbot’s interface with your system’s color scheme.
  • Interact with self-hosted Large Language Models (LLMs): Use the REST API URL provided to interact with self-hosted Large Language Models (LLMs) using LocalAI.

Requirements

To use this plugin with LocalAI, you will need to have the self-hosted API set up and running. You can follow the instructions provided by the self-hosted API provider to get it up and running. Once you have the REST API URL for your self-hosted API, you can use it with this plugin to interact with your models. Explore some GGUF models at TheBloke.

How to activate the plugin

Two methods:

Obsidian Community plugins (Recommended):

  1. Search for “BMO Chatbot” in the Obsidian Community plugins.
  2. Enable “BMO Chatbot” in the settings.

To activate the plugin from this repo:

  1. Navigate to the plugin’s folder in your terminal.
  2. Run npm install to install any necessary dependencies for the plugin.
  3. Once the dependencies have been installed, run npm run build to build the plugin.
  4. Once the plugin has been built, it should be ready to activate.

Getting Started

To start using the plugin, enable it in your settings menu and enter your OpenAI API key. After completing these steps, you can access the bot panel by clicking on the bot icon in the left sidebar. If you want to clear the chat history, simply click on the bot icon again in the left ribbon bar.

Supported Models

  • OpenAI
    • gpt-3.5-turbo
    • gpt-3.5-turbo-16k
    • gpt-4
  • Anthropic
    • claude-instant-1.2
    • claude-2.0
  • Any self-hosted models using LocalAI

Other Notes

“BMO” is a tag name for this project, inspired by the character BMO from the animated TV show “Adventure Time.”

Flowise

Build LLM Apps Easily

Flowise Flowise

Github Link - https://github.com/FlowiseAI/Flowise

⚑Local Install

Download and Install NodeJS >= 18.15.0

  1. Install Flowise

    npm install -g flowise
    
  2. Start Flowise

    npx flowise start
    
  3. Open http://localhost:3000

🐳 Docker

Docker Compose

  1. Go to docker folder at the root of the project
  2. Copy .env.example file, paste it into the same location, and rename to .env
  3. docker-compose up -d
  4. Open http://localhost:3000
  5. You can bring the containers down by docker-compose stop --rmi all

Docker Compose (Flowise + LocalAI)

  1. In a command line Run git clone https://github.com/go-skynet/LocalAI
  2. Then run cd LocalAI/examples/flowise
  3. Then run docker-compose up -d --pull always
  4. Open http://localhost:3000
  5. You can bring the containers down by docker-compose stop --rmi all

🌱 Env Variables

Flowise supports different environment variables to configure your instance. You can specify these variables in the .env file inside the packages/server folder. Read more

πŸ“– Documentation

Flowise Docs

k8sgpt

k8sgpt is a tool for scanning your Kubernetes clusters and diagnosing and triaging issues in simple English.

It has SRE experience codified into its analyzers and helps pull out the most relevant information to enrich it with AI.

Github Link - https://github.com/k8sgpt-ai/k8sgpt

CLI Installation

Linux/Mac via brew

brew tap k8sgpt-ai/k8sgpt
brew install k8sgpt

RPM-based installation (RedHat/CentOS/Fedora)

32 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_386.rpm
sudo rpm -ivh k8sgpt_386.rpm

64 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_amd64.rpm
sudo rpm -ivh k8sgpt_amd64.rpm

DEB-based installation (Ubuntu/Debian)

32 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_386.deb
sudo dpkg -i k8sgpt_386.deb

64 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_amd64.deb
sudo dpkg -i k8sgpt_amd64.deb

APK-based installation (Alpine)

32 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_386.apk
apk add k8sgpt_386.apk

64 bit:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.18/k8sgpt_amd64.apk
apk add k8sgpt_amd64.apk

Failing Installation on WSL or Linux (missing gcc)

When installing Homebrew on WSL or Linux, you may encounter the following error:

==> Installing k8sgpt from k8sgpt-ai/k8sgpt
Error: The following formula cannot be installed from a bottle and must be built from the source. k8sgpt
Install Clang or run brew install gcc.

If you install gcc as suggested, the problem will persist. Therefore, you need to install the build-essential package.

   sudo apt-get update
   sudo apt-get install build-essential

Windows

  • Download the latest Windows binaries of k8sgpt from the Release tab based on your system architecture.
  • Extract the downloaded package to your desired location and configure the system path variable with the binary location.

Operator Installation

To install within a Kubernetes cluster, please use our k8sgpt-operator; installation instructions are available here.

This mode of operation is ideal for continuous monitoring of your cluster and can integrate with your existing monitoring such as Prometheus and Alertmanager.

Quick Start

  • Currently, the default AI provider is OpenAI; you will need to generate an API key from OpenAI
    • You can do this by running k8sgpt generate to open a browser link to generate it
  • Run k8sgpt auth add to set it in k8sgpt.
    • You can provide the password directly using the --password flag.
  • Run k8sgpt filters to manage the active filters used by the analyzer. By default, all filters are executed during analysis.
  • Run k8sgpt analyze to run a scan.
  • And use k8sgpt analyze --explain to get a more detailed explanation of the issues.
  • You can also run k8sgpt analyze --with-doc (with or without the explain flag) to get the official documentation from Kubernetes.

Analyzers

K8sGPT uses analyzers to triage and diagnose issues in your cluster. It ships with a set of built-in analyzers, and you will also be able to write your own.

Built in analyzers

Enabled by default

  • podAnalyzer
  • pvcAnalyzer
  • rsAnalyzer
  • serviceAnalyzer
  • eventAnalyzer
  • ingressAnalyzer
  • statefulSetAnalyzer
  • deploymentAnalyzer
  • cronJobAnalyzer
  • nodeAnalyzer
  • mutatingWebhookAnalyzer
  • validatingWebhookAnalyzer

Optional

  • hpaAnalyzer
  • pdbAnalyzer
  • networkPolicyAnalyzer

Examples

Run a scan with the default analyzers

k8sgpt generate
k8sgpt auth add
k8sgpt analyze --explain
k8sgpt analyze --explain --with-doc

Filter on resource

k8sgpt analyze --explain --filter=Service

Filter by namespace

k8sgpt analyze --explain --filter=Pod --namespace=default

Output to JSON

k8sgpt analyze --explain --filter=Service --output=json

Anonymize during explain

k8sgpt analyze --explain --filter=Service --output=json --anonymize

Using filters

List filters

k8sgpt filters list

Add default filters

k8sgpt filters add [filter(s)]

Examples :

  • Simple filter : k8sgpt filters add Service
  • Multiple filters : k8sgpt filters add Ingress,Pod

Remove default filters

k8sgpt filters remove [filter(s)]

Examples :

  • Simple filter : k8sgpt filters remove Service
  • Multiple filters : k8sgpt filters remove Ingress,Pod

Additional commands

List configured backends

k8sgpt auth list

Update configured backends

k8sgpt auth update $MY_BACKEND1,$MY_BACKEND2..

Remove configured backends

k8sgpt auth remove $MY_BACKEND1,$MY_BACKEND2..

List integrations

k8sgpt integrations list

Activate integrations

k8sgpt integrations activate [integration(s)]

Use integration

k8sgpt analyze --filter=[integration(s)]

Deactivate integrations

k8sgpt integrations deactivate [integration(s)]

Serve mode

k8sgpt serve

Analysis with serve mode

curl -X GET "http://localhost:8080/analyze?namespace=k8sgpt&explain=false"

Key Features

LocalAI provider

To run local models, it is possible to use OpenAI-compatible APIs, for instance LocalAI, which uses llama.cpp to run inference on consumer-grade hardware. Models supported by LocalAI include, for instance, Vicuna, Alpaca, LLaMA, Cerebras, GPT4ALL, GPT4ALL-J, Llama2 and Koala.

To run local inference, you need to download the models first. For instance, you can find gguf-compatible models on Hugging Face (for example Vicuna, Alpaca and Koala).
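
As a sketch of that step, reusing the GGUF URL from the AIKit example earlier on this page and assuming models/ is the directory LocalAI is pointed at with --models-path:

mkdir -p models
wget -O models/llama-2-7b-chat.Q4_K_M.gguf \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf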

Start the API server

To start the API server, follow the instructions in the LocalAI documentation.
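
A minimal sketch of starting LocalAI with Docker, assuming the quay.io/go-skynet/local-ai image and the models directory from the previous step (see the LocalAI documentation for the authoritative instructions):

docker run -p 8080:8080 -v $PWD/models:/models -ti --rm \
  quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4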

Run k8sgpt

To run k8sgpt, run k8sgpt auth add with the localai backend:

k8sgpt auth add --backend localai --model <model_name> --baseurl http://localhost:8080/v1 --temperature 0.7

Now you can analyze with the localai backend:

k8sgpt analyze --explain --backend localai

Setting a new default AI provider

There may be scenarios where you wish to have K8sGPT plugged into several AI providers. In this case, you may wish to use one of them as the new default, other than OpenAI, which is the project default.

To view available providers

k8sgpt auth list
Default:
> openai
Active:
> openai
> azureopenai
Unused:
> localai
> noopai

To set a new default provider

k8sgpt auth default -p azureopenai
Default provider set to azureopenai

Anonymization

With the --anonymize option, the data is anonymized before being sent to the AI backend. During the analysis execution, k8sgpt retrieves sensitive data (Kubernetes object names, labels, etc.). This data is masked when sent to the AI backend and replaced by a key that can be used to de-anonymize the data when the solution is returned to the user.

  1. Error reported during analysis:
Error: HorizontalPodAutoscaler uses StatefulSet/fake-deployment as ScaleTargetRef which does not exist.
  2. Payload sent to the AI backend:
Error: HorizontalPodAutoscaler uses StatefulSet/tGLcCRcHa1Ce5Rs as ScaleTargetRef which does not exist.
  3. Payload returned by the AI:
The Kubernetes system is trying to scale a StatefulSet named tGLcCRcHa1Ce5Rs using the HorizontalPodAutoscaler, but it cannot find the StatefulSet. The solution is to verify that the StatefulSet name is spelled correctly and exists in the same namespace as the HorizontalPodAutoscaler.
  4. Payload returned to the user:
The Kubernetes system is trying to scale a StatefulSet named fake-deployment using the HorizontalPodAutoscaler, but it cannot find the StatefulSet. The solution is to verify that the StatefulSet name is spelled correctly and exists in the same namespace as the HorizontalPodAutoscaler.

Note: Anonymization does not currently apply to events.

Further Details

Anonymization does not currently apply to events: in a few analyzers, such as Pod, the event messages fed to the AI backend are not known beforehand, so they are not masked for the time being.

  • The following is the list of analyzers in which data is being masked:

    • Statefulset
    • Service
    • PodDisruptionBudget
    • Node
    • NetworkPolicy
    • Ingress
    • HPA
    • Deployment
    • Cronjob
  • The following is the list of analyzers in which data is not being masked:

    • ReplicaSet
    • PersistentVolumeClaim
    • Pod
    • *Events

*Note:

  • k8sgpt will not mask the above analyzers because they do not send any identifying information, except for the Events analyzer.

  • Masking for the Events analyzer is scheduled for the near future, as seen in this issue. Further research is needed to understand the patterns and mask the sensitive parts of an event, such as pod name, namespace, etc.

  • The following is the list of fields which are not being masked:

    • Describe
    • ObjectStatus
    • Replicas
    • ContainerStatus
    • *Event Message
    • ReplicaStatus
    • Count (Pod)

*Note:

  • It is quite possible the payload of the event message might have something like “super-secret-project-pod-X crashed” which we don’t currently redact (scheduled in the near future as seen in this issue).

Proceed with care

  • The k8sgpt team recommends using an entirely different backend (a local model) in critical production environments. By using a local model, you can rest assured that everything stays within your DMZ, and nothing is leaked.
  • If there is any uncertainty about sending data to a public LLM (OpenAI, Azure AI) and it poses a risk to business-critical operations, then the use of a public LLM should be avoided based on your own assessment of the risks involved.

Configuration management

k8sgpt stores config data in the $XDG_CONFIG_HOME/k8sgpt/k8sgpt.yaml file. The data is stored in plain text, including your OpenAI key.

Config file locations:

OS       Path
macOS    ~/Library/Application Support/k8sgpt/k8sgpt.yaml
Linux    ~/.config/k8sgpt/k8sgpt.yaml
Windows  %LOCALAPPDATA%/k8sgpt/k8sgpt.yaml

Remote caching

There may be scenarios where caching remotely is preferred. In these scenarios, K8sGPT supports AWS S3 integration.

As a prerequisite, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are required as environment variables.

Adding a remote cache

Note: this will create the bucket if it does not exist

k8sgpt cache add --region <aws region> --bucket <name>

Listing cache items

k8sgpt cache list

Removing the remote cache

Note: this will not delete the bucket

k8sgpt cache remove --bucket <name>

Documentation

Find our official documentation here.

Kairos

Kairos Logo Kairos Logo

Kairos - Kubernetes-focused, Cloud Native Linux meta-distribution

The immutable Linux meta-distribution for edge Kubernetes.

Github Link - https://github.com/kairos-io/kairos

Intro

With Kairos you can build immutable, bootable Kubernetes and OS images for your edge devices as easily as writing a Dockerfile. Optional P2P mesh with distributed ledger automates node bootstrapping and coordination. Updating nodes is as easy as CI/CD: push a new image to your container registry and let secure, risk-free A/B atomic upgrades do the rest. Kairos is part of the Secure Edge-Native Architecture (SENA) to securely run workloads at the Edge (whitepaper).

Kairos (formerly c3os) is an open-source project which brings Edge, cloud, and bare metal lifecycle OS management into the same design principles with a unified Cloud Native API.

At-a-glance:

  • :bowtie: Community Driven
  • :octocat: Open Source
  • :lock: Linux immutable, meta-distribution
  • :key: Secure
  • :whale: Container-based
  • :penguin: Distribution agnostic

Kairos can be used to:

  • Easily spin-up a Kubernetes cluster, with the Linux distribution of your choice :penguin:
  • Create your Immutable infrastructure, no more infrastructure drift! :lock:
  • Manage the cluster lifecycle with Kubernetes, from building to provisioning and upgrading :rocket:
  • Create a multi-node, single cluster that spans across regions :earth_africa:

For comprehensive docs, tutorials, and examples see our documentation.

LinGoose

LinGoose (Lingo + Go + Goose πŸͺΏ) aims to be a complete Go framework for creating LLM apps. πŸ€– βš™οΈ

lin lin

Github Link - https://github.com/henomis/lingoose

Overview

LinGoose is a powerful Go framework for developing Large Language Model (LLM) based applications using pipelines. It is designed to be a complete solution and provides multiple components, including Prompts, Templates, Chat, Output Decoders, LLM, Pipelines, and Memory.

With LinGoose, you can interact with LLM AI through prompts and generate complex templates. Additionally, it includes a chat feature, allowing you to create chatbots. The Output Decoders component enables you to extract specific information from the output of the LLM, while the LLM interface allows you to send prompts to various AI, such as the ones provided by OpenAI. You can chain multiple LLM steps together using Pipelines and store the output of each step in Memory for later retrieval.

LinGoose also includes a Document component, which is used to store text, and a Loader component, which is used to load Documents from various sources. Finally, it includes TextSplitters, which are used to split text or Documents into multiple parts, Embedders, which are used to embed text or Documents into embeddings, and Indexes, which are used to store embeddings and documents and to perform searches.

Components

LinGoose is composed of multiple components, each one with its own purpose.

  • Prompt (package prompt): Prompts are the way to interact with LLM AI. They can be simple text, or more complex templates. Supports Prompt Templates and Whisper prompt.
  • Chat Prompt (package chat): Chat is the way to interact with the chat LLM AI. It can be a simple text prompt, or a more complex chatbot.
  • Decoders (package decoder): Output decoders are used to decode the output of the LLM. They can be used to extract specific information from the output. Supports JSONDecoder and RegExDecoder.
  • LLMs (package llm): LLM is an interface to various AI such as the ones provided by OpenAI. It is responsible for sending the prompt to the AI and retrieving the output. Supports LocalAI, HuggingFace and Llama.cpp.
  • Pipelines (package pipeline): Pipelines are used to chain multiple LLM steps together.
  • Memory (package memory): Memory is used to store the output of each step. It can be used to retrieve the output of a previous step. Supports memory in RAM.
  • Document (package document): Document is used to store a text.
  • Loaders (package loader): Loaders are used to load Documents from various sources. Supports TextLoader, DirectoryLoader, PDFToTextLoader and PubMedLoader.
  • TextSplitters (package textsplitter): TextSplitters are used to split text or Documents into multiple parts. Supports RecursiveTextSplitter.
  • Embedders (package embedder): Embedders are used to embed text or Documents into embeddings. Supports OpenAI.
  • Indexes (package index): Indexes are used to store embeddings and documents and to perform searches. Supports SimpleVectorIndex, Pinecone and Qdrant.

Usage

Please refer to the documentation at lingoose.io to understand how to use LinGoose. If you prefer, the πŸ‘‰ examples directory contains a lot of examples πŸš€. However, here is a powerful example of what LinGoose is capable of:

Talk is cheap. Show me the code. - Linus Torvalds

package main

import (
	"context"

	openaiembedder "github.com/henomis/lingoose/embedder/openai"
	"github.com/henomis/lingoose/index/option"
	simplevectorindex "github.com/henomis/lingoose/index/simpleVectorIndex"
	"github.com/henomis/lingoose/llm/openai"
	"github.com/henomis/lingoose/loader"
	qapipeline "github.com/henomis/lingoose/pipeline/qa"
	"github.com/henomis/lingoose/textsplitter"
)

func main() {
	docs, _ := loader.NewPDFToTextLoader("./kb").WithPDFToTextPath("/opt/homebrew/bin/pdftotext").WithTextSplitter(textsplitter.NewRecursiveCharacterTextSplitter(2000, 200)).Load(context.Background())
	index := simplevectorindex.New("db", ".", openaiembedder.New(openaiembedder.AdaEmbeddingV2))
	index.LoadFromDocuments(context.Background(), docs)
	qapipeline.New(openai.NewChat().WithVerbose(true)).WithIndex(index).Query(context.Background(), "What is the NATO purpose?", option.WithTopK(1))
}

This is the famous 4-line LinGoose knowledge-base chatbot. πŸ€–

Installation

Be sure to have a working Go environment, then run the following command:

go get github.com/henomis/lingoose

LLMStack

LLMStack LLMStack

LLMStack - LLMStack is a no-code platform for building generative AI applications, chatbots, agents and connecting them to your data and business processes.

Github Link - https://github.com/trypromptly/LLMStack

Overview

Build tailor-made generative AI applications, chatbots and agents that cater to your unique needs by chaining multiple LLMs. Seamlessly integrate your own data and GPT-powered models without any coding experience using LLMStack’s no-code builder. Trigger your AI chains from Slack or Discord. Deploy to the cloud or on-premise.

llmstack-quickstart llmstack-quickstart

Getting Started

An LLMStack deployment comes with a default admin account whose credentials are admin and promptly. Be sure to change the password from the admin panel after logging in.

Features

πŸ”— Chain multiple models: LLMStack allows you to chain multiple LLMs together to build complex generative AI applications.

πŸ“Š Use generative AI on your data: Import your data into your accounts and use it in AI chains. LLMStack allows importing various types of data (CSV, TXT, PDF, DOCX, PPTX, etc.) from a variety of sources (gdrive, notion, websites, direct uploads, etc.). The platform will take care of preprocessing and vectorization of your data and store it in the vector database that is provided out of the box.

πŸ› οΈ No-code builder: LLMStack comes with a no-code builder that allows you to build AI chains without any coding experience. You can chain multiple LLMs together and connect them to your data and business processes.

☁️ Deploy to the cloud or on-premise: LLMStack can be deployed to the cloud or on-premise. You can deploy it to your own infrastructure or use our cloud offering at Promptly.

πŸš€ API access: Apps or chatbots built with LLMStack can be accessed via HTTP API. You can also trigger your AI chains from Slack or Discord.

🏒 Multi-tenant: LLMStack is multi-tenant. You can create multiple organizations and add users to them. Users can only access the data and AI chains that belong to their organization.

What can you build with LLMStack?

Using LLMStack you can build a variety of generative AI applications, chatbots and agents. Here are some examples:

πŸ“ Text generation: You can build apps that generate product descriptions, blog posts, news articles, tweets, emails, chat messages, etc., by using text generation models and optionally connecting your data. Check out this marketing content generator for example

πŸ€– Chatbots: You can build chatbots trained on your data powered by ChatGPT like Promptly Help that is embedded on Promptly website

🎨 Multimedia generation: Build complex applications that can generate text, images, videos, audio, etc. from a prompt. This story generator is an example

πŸ—£οΈ Conversational AI: Build conversational AI systems that can have a conversation with a user. Check out this Harry Potter character chatbot

πŸ” Search augmentation: Build search augmentation systems that can augment search results with additional information using APIs. Sharebird uses LLMStack to augment search results with AI generated answer from their content similar to Bing’s chatbot

πŸ’¬ Discord and Slack bots: Apps built on LLMStack can be triggered from Slack or Discord. You can easily connect your AI chains to Slack or Discord from LLMStack’s no-code app editor. Check out our Discord server to interact with one such bot.

Administration

Login to http://localhost:3000/admin using the admin account. You can add users and assign them to organizations in the admin panel.

Documentation

Check out our documentation at llmstack.ai/docs to learn more about LLMStack.

LocalAGI

LocalAGI is a small πŸ€– virtual assistant that you can run locally, made by the LocalAI author and powered by it.

localagi localagi

AutoGPT, babyAGI, … and now LocalAGI!

Github Link - https://github.com/mudler/LocalAGI

Info

The goal is:

  • Keep it simple, hackable and easy to understand
  • No API keys needed, No cloud services needed, 100% Local. Tailored for Local use, however still compatible with OpenAI.
  • Smart-agent/virtual assistant that can do tasks
  • Small set of dependencies
  • Run with Docker/Podman/Containers
  • Rather than trying to do everything, provide a good starting point for other projects

Note: Be warned! It was hacked in a weekend, and it’s just an experiment to see what can be done with local LLMs.

Screenshot from 2023-08-05 22-40-40 Screenshot from 2023-08-05 22-40-40

πŸš€ Features

  • 🧠 LLM for intent detection
  • 🧠 Uses functions for actions
    • πŸ“ Write to long-term memory
    • πŸ“– Read from long-term memory
    • 🌐 Internet access for search
    • :card_file_box: Write files
    • πŸ”Œ Plan steps to achieve a goal
  • πŸ€– Avatar creation with Stable Diffusion
  • πŸ—¨οΈ Conversational
  • πŸ—£οΈ Voice synthesis with TTS

:book: Quick start

No frills, just run docker-compose and start chatting with your virtual assistant:

# Modify the configuration
# nano .env
docker-compose run -i --rm localagi

How to use it

By default, localagi starts in interactive mode.

Examples

Road trip planner, limiting internet search to 3 results only:

docker-compose run -i --rm localagi \
  --skip-avatar \
  --subtask-context \
  --postprocess \
  --search-results 3 \
  --prompt "prepare a plan for my roadtrip to san francisco"

Limit the planning to 3 steps:

docker-compose run -i --rm localagi \
  --skip-avatar \
  --subtask-context \
  --postprocess \
  --search-results 1 \
  --prompt "do a plan for my roadtrip to san francisco" \
  --plan-message "The assistant replies with a plan of 3 steps to answer the request with a list of subtasks with logical steps. The reasoning includes a self-contained, detailed and descriptive instruction to fullfill the task."

Advanced

localagi has several options in the CLI to tweak the experience:

  • --system-prompt is the system prompt to use. If not specified, it will use none.
  • --prompt is the prompt to use for batch mode. If not specified, it will default to interactive mode.
  • --interactive is the interactive mode. When used with --prompt will drop you in an interactive session after the first prompt is evaluated.
  • --skip-avatar will skip avatar creation. Useful if you want to run it in a headless environment.
  • --re-evaluate will re-evaluate if another action is needed or we have completed the user request.
  • --postprocess will postprocess the reasoning for analysis.
  • --subtask-context will include context in subtasks.
  • --search-results is the number of search results to use.
  • --plan-message is the message to use during planning. You can override the message for example to force a plan to have a different message.
  • --tts-api-base is the TTS API base. Defaults to http://api:8080.
  • --localai-api-base is the LocalAI API base. Defaults to http://api:8080.
  • --images-api-base is the Images API base. Defaults to http://api:8080.
  • --embeddings-api-base is the Embeddings API base. Defaults to http://api:8080.
  • --functions-model is the functions model to use. Defaults to functions.
  • --embeddings-model is the embeddings model to use. Defaults to all-MiniLM-L6-v2.
  • --llm-model is the LLM model to use. Defaults to gpt-4.
  • --tts-model is the TTS model to use. Defaults to en-us-kathleen-low.onnx.
  • --stablediffusion-model is the Stable Diffusion model to use. Defaults to stablediffusion.
  • --stablediffusion-prompt is the Stable Diffusion prompt to use. Defaults to DEFAULT_PROMPT.
  • --force-action will force a specific action.
  • --debug will enable debug mode.

Customize

To use a different model, you can see the examples in the config folder. To select a model, modify the .env file and change the PRELOAD_MODELS_CONFIG variable to use a different configuration file.

Caveats

The “goodness” of a model has a big impact on how LocalAGI works. Currently 13b models are powerful enough to actually able to perform multi-step tasks or do more actions. However, it is quite slow when running on CPU (no big surprise here).

The context size is a limitation - you can find in the config examples to run with superhot 8k context size, but the quality is not good enough to perform complex tasks.

What is LocalAGI?

It is a dead simple experiment to show how to tie the various LocalAI functionalities to create a virtual assistant that can do tasks. It is simple on purpose, trying to be minimalistic and easy to understand and customize for everyone.

It is different from babyAGI or AutoGPT as it uses LocalAI functions: it is a from-scratch attempt built on purpose to run locally with LocalAI (no API keys needed!) instead of expensive cloud services. It sets itself apart from other projects by striving to be small and easy to fork.

How it works?

LocalAGI does just the minimum around LocalAI functions to create a virtual assistant that can do generic tasks. It works in an endless loop of intent detection, function invocation, self-evaluation and reply generation (if it decides to reply! :)). The agent is capable of planning complex tasks by invoking multiple functions, and it can remember things from the conversation.

In a nutshell, it goes like this:

  • It decides, based on the conversation history, if it needs to take an action by using functions. It uses the LLM to detect the intent from the conversation.
  • If it needs to take an action (e.g. β€œremember something from the conversation”) or generate complex tasks (executing a chain of functions to achieve a goal), it invokes the functions.
  • It re-evaluates if it needs to do any other action.
  • It returns the result back to the LLM to generate a reply for the user.

Under the hood, LocalAI converts functions to llama.cpp BNF grammars. While OpenAI fine-tuned a model to reply to functions, LocalAI constrains the LLM to follow grammars. This is a much more efficient way to do it, and it is also more flexible, as you can define your own functions and grammars. To learn more about this, check out the LocalAI documentation and my tweet that explains how it works under the hood: https://twitter.com/mudler_it/status/1675524071457533953.
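
As an illustration of the kind of OpenAI-style function-calling request that gets constrained by a grammar, here is a hedged sketch against a local LocalAI instance (the save_memory function schema is hypothetical, and gpt-4 is just the local model alias used by --llm-model above):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Remember that my favorite color is blue"}],
  "functions": [{
    "name": "save_memory",
    "description": "Store a fact in long-term memory",
    "parameters": {
      "type": "object",
      "properties": { "fact": { "type": "string" } },
      "required": ["fact"]
    }
  }],
  "function_call": "auto"
}'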

Agent functions

The intention of this project is to keep the agent minimal, so it can be built on top of or forked. The agent is capable of the following functions:

  • remember something from the conversation
  • recall something from the conversation
  • search something from the internet
  • plan a complex task by invoking multiple functions
  • write files to disk

Roadmap

  • 100% Local, with Local AI. NO API KEYS NEEDED!
  • Create a simple virtual assistant
  • Make the virtual assistant do functions like store long-term memory and autonomously search between them when needed
  • Create the assistant avatar with Stable Diffusion
  • Give it a voice
  • Use weaviate instead of Chroma
  • Get voice input (push to talk or wakeword)
  • Make a REST API (OpenAI compliant?) so can be plugged by e.g. a third party service
  • Take a system prompt so can act with a “character” (e.g. “answer in rick and morty style”)

Development

Run docker-compose with main.py checked-out:

docker-compose run -v main.py:/app/main.py -i --rm localagi

Notes

  • A 13b model is enough for doing contextualized research and for searching/retrieving memory.
  • A 30b model is enough to generate a road-trip plan (so cool!).
  • With superhot models it loses its magic, but they may still be suitable for search.
  • Context size is your enemy. --postprocess sometimes helps, but not always.
  • It can be silly!
  • It is slow on CPU; don’t expect 7b models to perform well, and 13b models perform better but are still quite slow on CPU.

Mattermost-OpenOps

OpenOps is an open source platform for applying generative AI to workflows in secure environments.

image image

Github Link - https://github.com/mattermost/openops

  • Enables AI exploration with full data control in a multi-user pilot.
  • Supports broad ecosystem of AI models from OpenAI and Microsoft to open source LLMs from Hugging Face.
  • Speeds development of custom security, compliance and data custody policy from early evaluation to future scale.

Unlike closed-source, vendor-controlled environments where data controls cannot be audited, OpenOps provides a transparent, open source, customer-controlled platform for developing, securing and auditing AI-accelerated workflows.

Why Open Ops?

Everyone is in a race to deploy generative AI solutions, but everyone also needs to do so in a responsible and safe way. OpenOps lets you run powerful models in a safe sandbox to establish the right safety protocols before rolling out to users. Here’s an example of an evaluation, implementation, and iterative rollout process:

  • Phase 1: Set up the OpenOps collaboration sandbox, a self-hosted service providing multi-user chat and integration with GenAI. (this repository)

  • Phase 2: Evaluate different GenAI providers, whether from public SaaS services like OpenAI or local open source models, based on your security and privacy requirements.

  • Phase 3: Invite select early adopters (especially colleagues focusing on trust and safety) to explore and evaluate the GenAI based on their workflows. Observe behavior, record user feedback, and identify issues. Iterate on workflows and usage policies together in the sandbox. Consider issues such as data leakage, legal/copyright, privacy, response correctness and appropriateness as you apply AI at scale.

  • Phase 4: Set and implement policies as availability is incrementally rolled out to your wider organization.

What does OpenOps include?

Deploying the OpenOps sandbox includes the following components:

  • 🏰 Mattermost Server - Open source, self-hosted alternative to Discord and Slack for strict security environments with playbooks/workflow automation, tools integration, real time 1-1 and group messaging, audio calling and screenshare.
  • πŸ“™ PostgreSQL - Database for storing private data from multi-user, chat collaboration discussions and audit history.
  • πŸ€– Mattermost AI plugin - Extension of Mattermost platform for AI bot and generative AI integration.
  • πŸ¦™ Open Source, Self-Hosted LLM models - Models for evaluation and use case development from Hugging Face and other sources, including GPT4All (runs on a laptop in 4.2 GB) and Falcon LLM (example of leading scaled self-hosted models). Uses LocalAI.
  • πŸ”ŒπŸ§  (Configurable) Closed Source, Vendor-Hosted AI models - SaaS-based GenAI models from Azure AI, OpenAI, & Anthropic.
  • πŸ”ŒπŸ“± (Configurable) Mattermost Mobile and Desktop Apps - End-user apps for future production deployment.

Install

Local

Rather watch a video? πŸ“½οΈ Check out our YouTube tutorial video for getting started with OpenOps: https://www.youtube.com/watch?v=20KSKBzZmik

Rather read a blog post? πŸ“ Check out our Mattermost blog post for getting started with OpenOps: https://mattermost.com/blog/open-source-ai-framework/

  1. Clone the repository: git clone https://github.com/mattermost/openops && cd openops
  2. Start docker services and configure plugin
    • If using OpenAI:
      • Run env backend=openai ./init.sh
      • Run ./configure_openai.sh sk-<your openai key> to add your API credentials or use the Mattermost system console to configure the plugin
    • If using LocalAI:
      • Run env backend=localai ./init.sh
      • Run env backend=localai ./download_model.sh to download one or supply your own gguf formatted model in the models directory.
  3. Access Mattermost and log in with the credentials provided in the terminal.

When you log in, you will start out in a direct message with your AI Assistant bot. Now you can start exploring AI usages.

Gitpod

Open in Gitpod Open in Gitpod

  1. Click the above badge and start your Gitpod workspace
  2. You will see VSCode interface and the workspace will configure itself automatically. Wait for the services to start and for your root login for Mattermost to be generated in the terminal
  3. Run ./configure_openai.sh sk-<your openai key> to add your API credentials or use the Mattermost system console to configure the plugin
  4. Access Mattermost and log in with the credentials supplied in the terminal.

When you log in, you will start out in a direct message with your AI Assistant bot. Now you can start exploring AI usages.

Usage

There are many ways to integrate generative AI into confidential, self-hosted workplace discussions. To help you get started, here are some examples provided in OpenOps:

  • Streaming Conversation: The OpenOps platform reproduces streamed replies from popular GenAI chatbots, creating a sense of responsiveness and conversational engagement while masking actual wait times.
  • Thread Summarization: Use the β€œSummarize Thread” menu option or the /summarize command to get a summary of the thread in a Direct Message from an AI bot. AI-generated summaries can be created from private, chat-based discussions to speed information flows and decision-making while reducing the time and cost required for organizations to stay up-to-date.
  • Contextual Interrogation: Users can ask follow-up questions to discussion summaries generated by AI bots to learn more about the underlying information without reviewing the raw input.
  • Meeting Summarization: Create meeting summaries! Designed to work with the Mattermost Calls plugin recording feature.
  • Chat with AI Bots: End users can interact with the AI bot in any discussion thread by mentioning the AI bot with an @ prefix, as they would get the attention of a human user. The bot will receive the thread information as context for replying.
  • Sentiment Analysis (β€œReact for me”): Use the β€œReact for me” menu option to have the AI bot analyze the sentiment of messages and use its conclusion to deliver an emoji reaction on the user’s behalf.
  • Reinforcement Learning from Human Feedback (RLHF): Bot posts are distinguished from human posts by having πŸ‘ πŸ‘Ž icons available for human end users to signal whether the AI response was positive or problematic. The history of responses can be used in the future to fine-tune the underlying AI models, as well as to potentially evaluate the responses of new models based on their correlation to positive and negative user ratings for past model responses.

Mods

Mods product art and type treatment

AI for the command line, built for pipelines.

a GIF of mods running

LLM-based AI is really good at interpreting the output of commands and returning the results in CLI-friendly text formats like Markdown. Mods is a simple tool that makes it super easy to use AI on the command line and in your pipelines. Mods works with OpenAI and LocalAI.

To get started, install Mods and check out some of the examples below. Since Mods has built-in Markdown formatting, you may also want to grab Glow to give the output some pizzazz.

Github Link - https://github.com/charmbracelet/mods

What Can It Do?

Mods works by reading standard input and prefacing it with a prompt supplied in the mods arguments. It sends the input text to an LLM and prints out the result, optionally asking the LLM to format the response as Markdown. This gives you a way to β€œquestion” the output of a command. Mods will also work on standard input or an argument-supplied prompt individually.
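
For example, piping a command’s output into mods and asking a question about it (a sketch; -f asks for Markdown formatting, and piping into glow, as mentioned above, is optional):

ls -l | mods -f "what can you tell me about these files?" | glow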

Installation

Mods works with OpenAI compatible endpoints. By default, Mods is configured to support OpenAI’s official API and a LocalAI installation running on port 8080. You can configure additional endpoints in your settings file by running mods --settings.

LocalAI

LocalAI allows you to run a multitude of models locally. Mods works with the GPT4ALL-J model as set up in this tutorial. You can define more LocalAI models and endpoints with mods --settings.

Install Mods

# macOS or Linux
brew install charmbracelet/tap/mods

# Arch Linux (btw)
yay -S mods

# Debian/Ubuntu
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://repo.charm.sh/apt/gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/charm.gpg
echo "deb [signed-by=/etc/apt/keyrings/charm.gpg] https://repo.charm.sh/apt/ * *" | sudo tee /etc/apt/sources.list.d/charm.list
sudo apt update && sudo apt install mods

# Fedora/RHEL
echo '[charm]
name=Charm
baseurl=https://repo.charm.sh/yum/
enabled=1
gpgcheck=1
gpgkey=https://repo.charm.sh/yum/gpg.key' | sudo tee /etc/yum.repos.d/charm.repo
sudo yum install mods

Or, download it:

  • Packages are available in Debian and RPM formats
  • Binaries are available for Linux, macOS, and Windows

Or, just install it with go:

go install github.com/charmbracelet/mods@latest

Saving conversations

Conversations save automatically. They are identified by their latest prompt. Similar to Git, conversations have a SHA-1 identifier and a title. Conversations can be updated, maintaining their SHA-1 identifier but changing their title.
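A quick sketch of working with saved conversations, using the flags documented below (the titles and follow-up prompts are illustrative):

# List saved conversations (title and SHA-1)
mods -l

# Continue the most recent conversation with a follow-up prompt
mods -C "now summarize that in one sentence"

# Show a saved conversation by title or SHA-1
mods -s "my-title"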


Settings

--settings

Mods lets you tune your query with a variety of settings. You can configure Mods with mods --settings or pass the settings as environment variables and flags.

Model

-m, --model, MODS_MODEL

Mods uses gpt-4 with OpenAI by default, but you can specify any model as long as your account has access to it or you have it installed locally with LocalAI.

You can add new models to the settings with mods --settings. You can also specify a model and an API endpoint with -m and -a to use models not in the settings file.
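For instance, to target a LocalAI endpoint you might combine the two flags like this (a rough sketch: the API name "localai" and the model name must match entries that exist in your own settings and LocalAI installation):

# Use a model served by LocalAI instead of OpenAI
mods -a localai -m gpt4all-j "explain what a gguf file is"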

Title

-t, --title

Set a custom save title for the conversation.

Continue last

-C, --continue-last

Continues the previous conversation.

Continue

-c, --continue

Continue from the last response or a given title or SHA1.

List

-l, --list

Lists all saved conversations.

Show

-s, --show

Show the saved conversation with the given title or SHA1.

Delete

--delete

Deletes the saved conversation with the given title or SHA1.

Format As Markdown

-f, --format, MODS_FORMAT

Ask the LLM to format the response as Markdown. You can edit the text passed to the LLM by running mods --settings and changing the format-text value.

Raw

-r, --raw, MODS_RAW

Print the raw response without syntax highlighting, even when connected to a TTY.

Max Tokens

--max-tokens, MODS_MAX_TOKENS

Max tokens tells the LLM to respond in less than this number of tokens. LLMs are better at longer responses so values larger than 256 tend to work best.

Temperature

--temp, MODS_TEMP

Sampling temperature is a number between 0.0 and 2.0 and determines how confident the model is in its choices. Higher values make the output more random and lower values make it more deterministic.

TopP

--topp, MODS_TOPP

Top P is an alternative to sampling temperature. It’s a number between 0.0 and 1.0, with smaller numbers narrowing the domain from which the model will create its response.

No Limit

--no-limit, MODS_NO_LIMIT

By default, Mods attempts to size the input to the maximum size allowed by the model. You can potentially squeeze a few more tokens into the input by setting this, but you also risk getting a max-tokens-exceeded error from the OpenAI API.

Include Prompt

-P, --prompt, MODS_INCLUDE_PROMPT

Include prompt will preface the response with the entire prompt, both standard input and the prompt supplied by the arguments.

Include Prompt Args

-p, --prompt-args, MODS_INCLUDE_PROMPT_ARGS

Include prompt args will include only the prompt supplied by the arguments. This can be useful if your standard input content is long and you just want a summary before the response.

Max Retries

--max-retries, MODS_MAX_RETRIES

The maximum number of retries for failed API calls. The retries happen with an exponential backoff.

Fanciness

--fanciness, MODS_FANCINESS

Your desired level of fanciness.

Quiet

-q, --quiet, MODS_QUIET

Output nothing to standard err.

Reset Settings

--reset-settings

Backup your old settings file and reset everything to the defaults.

No Cache

--no-cache, MODS_NO_CACHE

Disables conversation saving.

HTTP Proxy

-x, --http-proxy, MODS_HTTP_PROXY

Use an HTTP proxy to connect to the API endpoints.

Spark

an LLM-powered autonomous agent platform


A framework for autonomous agents who can work together to accomplish tasks using LocalAI.

Github Link - https://github.com/cedriking/spark

Setup

You will need at least Node 10.

Download the repository, then install dependencies: yarn or npm install.

Rename the .env.template file at the root of the project to .env and add your secrets to it:

# the following are needed for the agent to be able to search the web:
GOOGLE_SEARCH_ENGINE_ID=... # create a custom search engine at https://cse.google.com/cse/all
GOOGLE_API_KEY=... # obtain from https://console.cloud.google.com/apis/credentials
AGENT_DELAY=... # optionally, a delay in milliseconds following every agent action
MODEL=... # any Llama.cpp LLM model
SERVER=... # optionally, a server to connect to (default http://localhost:8080)

You’ll also need to enable the Google Custom Search API for your Google Cloud account, e.g. https://console.cloud.google.com/apis/library/customsearch.googleapis.com

Running

Start the program:

yarn dev [# of agents]

or:

npm run dev [# of agents]

Interact with the agents through the console. Anything you type will currently be sent as a message to all agents.

Action errors

After spinning up a new agent, you will often see them make some mistakes which generate errors:

  • Trying to use an action before they’ve asked for help on it to know what its parameters are
  • Trying to just use a raw text response instead of a correctly-formatted action (or raw text wrapping a code block which contains a valid action)
  • Trying to use a multi-line parameter value without wrapping it in the multiline delimiter (% ff9d7713-0bb0-40d4-823c-5a66de48761b)

This is a normal period of adjustment as they learn to operate themselves. They generally will learn from these mistakes and recover, although agents sometimes devolve into endless error loops and can’t figure out what the problem is. It’s highly advised to never leave an agent unattended.

Agent state

Each agent stores its state under the .store directory. Agent 1, for example, has

.store/1/memory
.store/1/goals
.store/1/notes

You can simply delete any of these things, or the whole agent folder (or the whole .store directory), to selectively wipe whatever state you want between runs. Otherwise, agents will pick up where they left off on restart.
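For example, to selectively reset state between runs (using the paths shown above):

# Wipe only agent 1's notes
rm -rf .store/1/notes

# Reset agent 1 entirely
rm -rf .store/1

# Start every agent from scratch
rm -rf .store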

A nice aspect of this is that when you want to debug a problem you ran into with a particular agent, you can delete the events in their memory subsequent to the point where the problem occurred, make changes to the code, and restart them to effectively replay that moment until you’ve fixed the bug. You can also ask an agent to implement a feature, and once they’ve done so you can restart, tell them that you’ve loaded the feature, and ask them to try it out.

Code based on ai-legion.

FAQ

Frequently asked questions

Here are answers to some of the most common questions.

How do I get models?

Most gguf-based models should work, but newer models may require additions to the API. If a model doesn’t work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, or models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
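As an illustration only, a GGUF file can be fetched straight into LocalAI's models folder with a plain HTTP download (the repository and file name below are just an example from later in this guide; always verify the source you download from):

# Example: fetch a quantized GGUF model from Hugging Face into ./models
wget -P models/ \
  https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_0.gguf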

What’s the difference with Serge, or XXX?

LocalAI is a multi-model solution that doesn’t focus on a specific model type (e.g., llama.cpp or alpaca.cpp): it handles all of these internally for faster inference, and it is easy to set up locally and deploy to Kubernetes.

Everything is slow, how come?

There are a few situations in which this could occur. Some tips are:

  • Don’t use an HDD to store your models; prefer an SSD over HDD. If you are stuck with an HDD, disable mmap in the model config file so it loads everything into memory.
  • Watch out for CPU overbooking. Ideally --threads should match the number of physical cores. For instance, if your CPU has 4 cores, you would ideally allocate <= 4 threads to a model.
  • Run LocalAI with DEBUG=true. This gives more information, including stats on the token inference speed.
  • Check that you are actually getting an output: run a simple curl request with "stream": true to see how fast the model is responding (see the example right after this list).
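For example, a minimal streaming check against a model named lunademo (swap in a model you actually have installed) looks like this:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "How are you?"}],
     "stream": true
   }'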

Can I use it with a Discord bot, or XXX?

Yes! If the client uses OpenAI and supports setting a different base URL for requests, you can use the LocalAI endpoint. This lets you use LocalAI with any application that was built to work with OpenAI, without changing the application!
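As a rough sketch, for clients that read the standard OpenAI environment variables you can simply point the base URL at LocalAI (variable names differ between SDK versions and clients, so check your client's documentation):

# Older OpenAI SDKs (and many bots built on them) read this variable
export OPENAI_API_BASE=http://localhost:8080/v1
# Newer OpenAI SDKs read this one instead
export OPENAI_BASE_URL=http://localhost:8080/v1
# LocalAI does not validate the key by default, but most clients require one to be set
export OPENAI_API_KEY=sk-xxx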

Can this leverage GPUs?

There is partial GPU support, see build instructions above.

Where is the webUI?

localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several already on GitHub, and they should be compatible with LocalAI already (as it mimics the OpenAI API).

Does it work with AutoGPT?

Yes, see the examples!

How can I troubleshoot when something is wrong?

Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what’s going on. You can also specify --debug in the command line.
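A minimal sketch of the two options, assuming you run the local-ai binary directly (with Docker Compose, set DEBUG=true in your .env file instead):

# Via environment variable
DEBUG=true ./local-ai

# Or via the command-line flag
./local-ai --debug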

I’m getting ‘invalid pitch’ error when running with CUDA, what’s wrong?

This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.

I’m getting a ‘SIGILL’ error, what’s wrong?

Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting REBUILD=true and disable the CPU instructions that are not compatible with your CPU. For instance: CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make build
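A hedged sketch of doing this with the Docker image (the image tag, port, and volume path here are illustrative; REBUILD is the same variable shown in the .env examples later in this document):

# Rebuild inside the container with unsupported instruction sets disabled
docker run -p 8080:8080 \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" \
  -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:latest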

Subsections of How-tos

Easy Demo - Full Chat Python AI

  • You will need about 10GB of free RAM
  • You will need about 15GB of free disk space (on your C drive, if on Windows) for Docker Compose

This works on Linux, macOS, or Windows hosts. Requirements: Docker Desktop, Python 3.11, Git.

Linux Hosts:

There are Full_Auto installers compatible with some Linux distributions; feel free to use them, but note that they may not fully work. If you need to install something, please use the links at the top.

git clone https://github.com/lunamidori5/localai-lunademo.git

cd localai-lunademo

#Pick your type of Linux for the Full Auto scripts. If you already have Python, Docker, and Docker Compose installed, skip this chmod, but make sure you chmod the Setup_Linux file.

chmod +x Full_Auto_setup_Debian.sh or chmod +x Full_Auto_setup_Ubutnu.sh

chmod +x Setup_Linux.sh

#Make sure to install cuda to your host OS and to Docker if you plan on using GPU

./(the setupfile you wish to run)

Windows Hosts:

REM Make sure you have git, docker-desktop, and python 3.11 installed

git clone https://github.com/lunamidori5/localai-lunademo.git

cd localai-lunademo

call Setup.bat

MacOS Hosts:

  • I need some help working on a macOS setup file. If you are willing to help out, please contact Luna Midori on Discord or put in a PR on Luna Midori’s GitHub.

Video How Tos

  • Ubuntu - COMING SOON
  • Debian - COMING SOON
  • Windows - COMING SOON
  • MacOS - PLANNED - NEED HELP

Enjoy localai! (If you need help contact Luna Midori on Discord)

Known issues:

  • Trying to run Setup.bat or Setup_Linux.sh from Git Bash on Windows does not work. (Somewhat fixed)
  • Running over SSH or other remote command-line-based apps may bug out, load slowly, or crash.

Easy Model Setup

Let's learn how to set up a model. For this How To we are going to use the Dolphin 2.2.1 Mistral 7B model.

To download the model to your models folder, run this command in a command line of your choosing.

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/dolphin-2.2.1-mistral-7B-GGUF/dolphin-2.2.1-mistral-7b.Q4_0.gguf"
}'

Each model needs at least 5 files. Without these files the model will run “raw”, which means you cannot change the settings of the model.

File 1 - The model's GGUF file
File 2 - The model's .yaml file
File 3 - The Chat API .tmpl file
File 4 - The Chat API helper .tmpl file
File 5 - The Completion API .tmpl file

So let's fix that! We are using the lunademo name for this How To, but you can name the files whatever you want! Let's make blank files to start with:

touch lunademo-chat.tmpl
touch lunademo-chat-block.tmpl
touch lunademo-completion.tmpl
touch lunademo.yaml

Now let's edit "lunademo-chat.tmpl". This is the chat template the model was trained with, adjusted for LocalAI:

<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
{{if .Content}}{{.Content}}{{end}}
<|im_end|>

For the "lunademo-chat-block.tmpl", Looking at the huggingface repo, this model uses the <|im_start|>assistant tag for when the AI replys, so lets make sure to add that to this file. Do not add the user as we will be doing that in our yaml file!

{{.Input}}
<|im_start|>assistant

Now in the "lunademo-completion.tmpl" file lets add this. (This is a hold over from OpenAI V0)

{{.Input}}

For the "lunademo.yaml" file. Lets set it up for your computer or hardware. (If you want to see advanced yaml configs - Link)

First, we are going to set up the backend and context size.

backend: llama
context_size: 2000

This tells LocalAI how to load the model. Next we add the model's name and its settings. The name: is what you will put into your request when sending an OpenAI request to LocalAI.

name: lunademo
parameters:
  model: dolphin-2.2.1-mistral-7b.Q4_0.gguf

Now that LocalAI knows which file to load with our request, let's add the template files to our model's yaml file.

template:
  chat: lunademo-chat-block
  chat_message: lunademo-chat
  completion: lunademo-completion

If you are running on GPU or want to tune the model, you can add settings like the following (the higher gpu_layers is, the more the GPU is used):

f16: true
gpu_layers: 4

This lets you tune the model to your liking. But be warned: you must restart LocalAI after changing a yaml file.

docker compose restart

If you want to check your model's yaml, here is a full copy!

backend: llama
context_size: 2000
##Put settings right here for tuning!! Before name but after backend!
name: lunademo
parameters:
  model: dolphin-2.2.1-mistral-7b.Q4_0.gguf
template:
  chat: lunademo-chat-block
  chat_message: lunademo-chat
  completion: lunademo-completion

Now that we have that set up, let's test it out by sending a request to LocalAI, as shown below!
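For example (the same request appears again in the Easy Request section later on):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'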

----- Advanced Stuff -----

(Please do not run these steps if you have already done the setup.) Now that we have learned how to set up our own models, here is how to use the gallery to do a lot of this for us. This command will download and set up the model (mostly; we will always need to edit our yaml file to fit our computer / hardware):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@lunademo"
   }'  

This will set up the model, the model's yaml, and both template files (you will see it only did one, as the completions template is out of date and no longer supported by OpenAI; if you need one, just follow the steps from before to make one). If you would like to download a raw model using the gallery API, you can run this command instead. You will need to set up the 3 files needed to run the model yourself, though!

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "NAME_OFF_HUGGINGFACE/REPO_NAME/MODENAME.gguf",
    "name": "REQUSTNAME"
}'

Easy Request - All

Curl Request

Curl Chat API -

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "lunademo",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

This is for Python, OpenAI >= v1.0.0

OpenAI Chat API Python -

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-xxx")

messages = [
{"role": "system", "content": "You are LocalAI, a helpful, but really confused ai, you will only reply with confused emotes"},
{"role": "user", "content": "Hello How are you today LocalAI"}
]
completion = client.chat.completions.create(
  model="lunademo",
  messages=messages,
)

print(completion.choices[0].message)

See OpenAI API for more info!

This is for Python, OpenAI == 0.28.1

OpenAI Chat API Python -

import os
import openai
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "sx-xxx"
OPENAI_API_KEY = "sx-xxx"
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

completion = openai.ChatCompletion.create(
  model="lunademo",
  messages=[
    {"role": "system", "content": "You are LocalAI, a helpful, but really confused ai, you will only reply with confused emotes"},
    {"role": "user", "content": "How are you?"}
  ]
)

print(completion.choices[0].message.content)

OpenAI Completion API Python -

import os
import openai
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "sx-xxx"
OPENAI_API_KEY = "sx-xxx"
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

completion = openai.Completion.create(
  model="lunademo",
  prompt="function downloadFile(string url, string outputPath) ",
  max_tokens=256,
  temperature=0.5)

print(completion.choices[0].text)

Easy Setup - CPU Docker

  • You will need about 10GB of free RAM
  • You will need about 15GB of free disk space (on your C drive, if on Windows) for Docker Compose

We are going to run LocalAI with Docker Compose for this setup.

Let's set up our folders for LocalAI:

mkdir "LocalAI"
cd LocalAI
mkdir "models"
mkdir "images"
mkdir -p "LocalAI"
cd LocalAI
mkdir -p "models"
mkdir -p "images"

At this point we want to set up our .env file. Here is a copy for you to use if you wish; make sure this file is in the LocalAI folder.

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=2

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

## Default path for models
MODELS_PATH=/models

## Enable debug mode
# DEBUG=true

## Disables COMPEL (lets Stable Diffusion work; uncomment if you plan on using it)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
# REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
#GO_TAGS=tts

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

# HUGGINGFACEHUB_API_TOKEN=Token here

Now that we have the .env set, let's set up our docker-compose file. It will use a container from quay.io. Also note this docker-compose file is for CPU only.

version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:v2.0.0
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]

Make sure to save that in the root of the LocalAI folder. Then let's spin up the Docker container. Run this in CMD or Bash:

docker compose up -d --pull always

Now we let that finish setting up. Once it is done, let's check that the huggingface / localai galleries are working: wait until you see the startup screen below, then run the curl command that follows it.

You should see:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Fiber v2.42.0                   β”‚
β”‚               http://127.0.0.1:8080               β”‚
β”‚       (bound on host 0.0.0.0 and port 8080)       β”‚
β”‚                                                   β”‚
β”‚ Handlers ............. 1  Processes ........... 1 β”‚
β”‚ Prefork ....... Disabled  PID ................. 1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
curl http://localhost:8080/models/available

The output will be a JSON list of the models available in the galleries.

Now that we have that set up, let's go set up a model.

Easy Setup - Embeddings

To install an embedding model, run the following command

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@bert-embeddings"
   }'  

Now we need to make a bert.yaml in the models folder

backend: bert-embeddings
embeddings: true
name: text-embedding-ada-002
parameters:
  model: bert

Restart LocalAI after you change a yaml file

When you would like to request the model from CLI you can do

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002"
  }'

See OpenAI Embedding for more info!

Easy Setup - GPU Docker

  • You will need about 10GB of free RAM
  • You will need about 15GB of free disk space (on your C drive, if on Windows) for Docker Compose

We are going to run LocalAI with Docker Compose for this setup.

Let's set up our folders for LocalAI:

mkdir "LocalAI"
cd LocalAI
mkdir "models"
mkdir "images"
mkdir -p "LocalAI"
cd LocalAI
mkdir -p "models"
mkdir -p "images"

At this point we want to set up our .env file. Here is a copy for you to use if you wish; make sure this file is in the LocalAI folder.

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=2

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

## Default path for models
MODELS_PATH=/models

## Enable debug mode
# DEBUG=true

## Disables COMPEL (lets Stable Diffusion work; uncomment if you plan on using it)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
# REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
#GO_TAGS=tts

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

# HUGGINGFACEHUB_API_TOKEN=Token here

Now that we have the .env set, let's set up our docker-compose file. It will use a container from quay.io. Also note this docker-compose file is for CUDA only.

Please change the image to the one you need from the lists below.

CUDA 11 images:

  • master-cublas-cuda11
  • master-cublas-cuda11-core
  • v2.0.0-cublas-cuda11
  • v2.0.0-cublas-cuda11-core
  • v2.0.0-cublas-cuda11-ffmpeg
  • v2.0.0-cublas-cuda11-ffmpeg-core

CUDA 12 images:

  • master-cublas-cuda12
  • master-cublas-cuda12-core
  • v2.0.0-cublas-cuda12
  • v2.0.0-cublas-cuda12-core
  • v2.0.0-cublas-cuda12-ffmpeg
  • v2.0.0-cublas-cuda12-ffmpeg-core

Core images are smaller and ship without pre-downloaded Python dependencies.

version: '3.6'

services:
  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:[CHANGEMETOIMAGENEEDED]
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]

Make sure to save that in the root of the LocalAI folder. Then let's spin up the Docker container. Run this in CMD or Bash:

docker compose up -d --pull always

Now we let that finish setting up. Once it is done, let's check that the huggingface / localai galleries are working: wait until you see the startup screen below, then run the curl command that follows it.

You should see:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Fiber v2.42.0                   β”‚
β”‚               http://127.0.0.1:8080               β”‚
β”‚       (bound on host 0.0.0.0 and port 8080)       β”‚
β”‚                                                   β”‚
β”‚ Handlers ............. 1  Processes ........... 1 β”‚
β”‚ Prefork ....... Disabled  PID ................. 1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
curl http://localhost:8080/models/available

The output will be a JSON list of the models available in the galleries.

Now that we have that set up, let's go set up a model.

Easy Setup - Stable Diffusion

Setting up a Stable Diffusion model is super easy. In your models folder make a file called stablediffusion.yaml, then edit that file with the following. (You can swap Linaqruf/animagine-xl for whatever SDXL model you would like.)

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: dpm_2_a

If you are using Docker, you will need to run the following in the LocalAI folder that contains the docker-compose.yaml file:

docker-compose down #windows
docker compose down #linux/mac

Then in your .env file uncomment this line.

COMPEL=0

After that we can recreate the LocalAI container by running the following in the same folder with the docker-compose.yaml file in it:

docker-compose up #windows
docker compose up #linux/mac

Then, to download and set up the model, just send a normal OpenAI image-generation request; LocalAI will do the rest!

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "Two Boxes, 1blue, 1red",
  "size": "256x256"
}'