App: Local large language model (llm2)
The llm2 app is one of the apps that provide text processing functionality using large language models in Nextcloud. It acts as a text processing backend for the Nextcloud Assistant app, the Mail app and other apps that use the core Text Processing API. The llm2 app runs only open source models and does so entirely on-premises. Nextcloud can provide customer support upon request; please talk to your account manager about the possibilities.
This app uses llama.cpp under the hood and is thus compatible with any model in gguf format.
However, we only test with Llama 3.1. Output quality will differ depending on which model you use, and downstream tasks like summarization or Context Chat may not work with other models. We thus recommend the following models:
- Llama 3.1 8B Instruct (reasonable quality; fast; good acclaim; comes shipped with the app)
- Llama 3.1 70B Instruct (good quality; good acclaim)
Multilinguality
This app supports input and output in languages other than English if the underlying model supports the language.
Llama 3.1 supports the following languages:
- English 
- Portuguese 
- Spanish 
- Italian 
- German 
- French 
- Hindi 
- Thai 
Note that other languages may work as well, but only the languages listed above are guaranteed to work.
Requirements
- This app is built as an External App and thus depends on AppAPI v3.1.0 or higher 
- Nextcloud AIO is supported 
- We currently support NVIDIA GPUs and x86_64 CPUs 
- A CPU that supports the AVX and AVX2 instructions
- CUDA >= v12.4 on your host system
- GPU Sizing
  - An NVIDIA GPU with at least 8GB VRAM
  - At least 12GB of system RAM
- CPU Sizing
  - At least 12GB of system RAM
  - The more cores you have and the more powerful the CPU, the better; we recommend 10-20 cores
  - The app will use all available cores by default, so it is usually better to run it on a separate machine
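The AVX/AVX2 requirement can be checked before installing. The following is a minimal sketch, assuming a Linux x86_64 host where `/proc/cpuinfo` lists the CPU feature flags; the function name `missing_cpu_flags` is illustrative and not part of the app.

```python
from pathlib import Path

# Flags required according to the requirements list above.
REQUIRED_FLAGS = {"avx", "avx2"}


def missing_cpu_flags(cpuinfo_text: str, required=REQUIRED_FLAGS) -> set:
    """Return the required flags that do not appear in a /proc/cpuinfo dump."""
    present = set()
    for line in cpuinfo_text.splitlines():
        # On x86_64 Linux each logical CPU has a line like "flags : fpu ... avx avx2".
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present.update(value.split())
    return set(required) - present


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_cpu_flags(cpuinfo.read_text())
        print("CPU flags OK" if not missing else f"missing flags: {sorted(missing)}")
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

An empty result means both instruction sets are reported by the kernel. This does not check RAM, VRAM or the CUDA version.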
 
Installation
- Make sure the Nextcloud Assistant app is installed 
- Install the “Local large language model” ExApp via the “Apps” page in the Nextcloud web admin user interface 
Supplying alternate models
This app allows supplying alternate LLM models as gguf files in the /nc_app_llm2_data directory of the docker container.
- Download a gguf model e.g. from huggingface 
- Copy the gguf file to /nc_app_llm2_data inside the docker container
- Restart the llm2 ExApp 
- Select the new model in the Nextcloud AI admin settings 
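The copy step above could be scripted roughly as follows. This is a sketch only: the container name `nc_app_llm2` is an assumption (check `docker ps` for the actual name on your host), and the model filename is a placeholder. The script only builds and prints the `docker cp` command rather than running it.

```python
import shlex
from pathlib import Path

DATA_DIR = "/nc_app_llm2_data"  # target directory named in the text above
CONTAINER = "nc_app_llm2"       # ASSUMPTION: verify the real name with `docker ps`


def build_copy_command(model_file: str, container: str = CONTAINER) -> list:
    """Build a `docker cp` command that copies a gguf file into the data directory."""
    path = Path(model_file)
    if path.suffix != ".gguf":
        raise ValueError(f"expected a .gguf file, got {path.name!r}")
    return ["docker", "cp", model_file, f"{container}:{DATA_DIR}/{path.name}"]


# Placeholder filename; substitute the model you downloaded.
command = build_copy_command("models/example-model.gguf")
print(shlex.join(command))
```

To execute the command instead of printing it, the list can be passed to `subprocess.run(command, check=True)`. Restarting the ExApp and selecting the model are still done as described in the steps above.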
Configuring alternate models
Since every model requires slightly different inference parameters, you can pass along a configuration file for the alternate model files you supply.
The configuration file for a model file must have the same name as the model file but must end in .json instead of .gguf.
The strings {system_prompt} and {user_prompt} are variables that will be filled in by the app, so they must be part of your prompt template.
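To illustrate what "filled in" means, here is a minimal sketch that assumes plain textual substitution of the two placeholders. How the app performs the substitution internally is not described here, and both `render_prompt` and the toy template are hypothetical.

```python
import re


def render_prompt(template: str, system_prompt: str, user_prompt: str) -> str:
    """Substitute both placeholders in a single pass over the template."""
    values = {"system_prompt": system_prompt, "user_prompt": user_prompt}
    missing = [name for name in values if "{" + name + "}" not in template]
    if missing:
        raise ValueError(f"template is missing placeholders: {missing}")
    # A function replacement avoids backslash processing and never re-scans
    # text that was just inserted.
    return re.sub(r"\{(system_prompt|user_prompt)\}",
                  lambda match: values[match.group(1)],
                  template)


# Toy template for illustration only; real templates depend on the model.
toy_template = "SYSTEM: {system_prompt}\nUSER: {user_prompt}\nASSISTANT: "
print(render_prompt(toy_template, "Be brief.", "Summarize this text."))
```

A template that lacks either placeholder raises an error in this sketch, which mirrors the rule that both must be part of the prompt template.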
Here is an example config file for Llama 2:
{
  "prompt": "<|im_start|> system\n{system_prompt}\n<|im_end|>\n<|im_start|> user\n{user_prompt}\n<|im_end|>\n<|im_start|> assistant\n",
  "loader_config": {
     "n_ctx": 4096,
     "max_tokens": 2048,
     "stop": ["<|im_end|>"]
  }
}
Here is an example configuration for Llama 3:
{
  "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{user_prompt}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n",
  "loader_config": {
      "n_ctx": 8000,
      "max_tokens": 4000,
      "stop": ["<|eot_id|>"],
      "temperature": 0.3
  }
}
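Before restarting the app, a configuration file can be sanity-checked against the rules stated above. The following sketch checks only what this page states (matching file name, both placeholders in `prompt`); the helper names are illustrative, and treating `loader_config` as optional is an assumption.

```python
import json
from pathlib import Path

PLACEHOLDERS = ("{system_prompt}", "{user_prompt}")


def config_path_for(model_file: str) -> Path:
    """Same name as the model file, but ending in .json instead of .gguf."""
    return Path(model_file).with_suffix(".json")


def check_config(config: dict) -> list:
    """Return a list of problems found in a parsed config (empty if none)."""
    problems = []
    prompt = config.get("prompt")
    if not isinstance(prompt, str):
        problems.append('"prompt" must be a string')
    else:
        for placeholder in PLACEHOLDERS:
            if placeholder not in prompt:
                problems.append(f'"prompt" is missing {placeholder}')
    # ASSUMPTION: loader_config is optional, but must be an object if present.
    if "loader_config" in config and not isinstance(config["loader_config"], dict):
        problems.append('"loader_config" must be an object')
    return problems


def check_model_config(model_file: str) -> list:
    """Locate and check the config file that belongs to a model file."""
    path = config_path_for(model_file)
    if not path.is_file():
        return [f"no config file found at {path}"]
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return [f"{path} must contain a JSON object"]
    return check_config(config)
```

Calling `check_model_config` with the path to a gguf file returns an empty list when the neighbouring JSON file passes these checks. It does not verify that the prompt template is the right one for the model.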
Scaling
It is currently not possible to scale this app; we are working on this. Based on our calculations, an instance has a rough capacity of 1000 user requests per hour. However, this number is theoretical, and we appreciate real-world feedback on it. If you would like to scale up your language model usage, we recommend either using an AI as a Service provider or hosting a scalable service that is compatible with the OpenAI API yourself, and connecting Nextcloud to it via the integration_openai app.
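As a rough way to relate the theoretical figure of 1000 requests per hour to your own usage, consider the following back-of-the-envelope sketch. The user counts and request rates in the example are hypothetical, and real throughput depends on hardware, model and prompt length.

```python
CAPACITY_PER_HOUR = 1000  # theoretical figure quoted above


def seconds_per_request(capacity_per_hour: int = CAPACITY_PER_HOUR) -> float:
    """Average time budget per request if requests are processed one after another."""
    return 3600 / capacity_per_hour


def utilization(active_users: int, requests_per_user_per_hour: float,
                capacity_per_hour: int = CAPACITY_PER_HOUR) -> float:
    """Fraction of the theoretical capacity used; values above 1.0 exceed it."""
    return active_users * requests_per_user_per_hour / capacity_per_hour


print(seconds_per_request())   # 3.6 seconds per request
print(utilization(200, 2))     # hypothetical: 200 users, 2 requests each per hour
print(utilization(400, 5))     # hypothetical: 400 users, 5 requests each per hour
```

In the second hypothetical case the expected load is twice the theoretical capacity, which is the situation where connecting to a scalable OpenAI-compatible backend becomes relevant.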
App store
You can also find the app in our app store, where you can write a review: https://apps.nextcloud.com/apps/llm2
Repository
You can find the app’s code repository on GitHub where you can report bugs and contribute fixes and features: https://github.com/nextcloud/llm2
Nextcloud customers should file bugs directly with our Support system.
Known Limitations
- We currently only support languages that the underlying model supports; correctness of language use in languages other than English may be poor depending on the language’s coverage in the model’s training data (we recommend Llama 3 or other models explicitly trained on multiple languages)
- Language models can be bad at reasoning tasks 
- Language models can be bad at math 
- Language models are likely to generate false information and should thus only be used in situations that are not critical. It’s recommended to only use AI at the beginning of a creation process and not at the end, so that outputs of AI serve as a draft for example and not as final product. Always check the output of language models before using it. 
- Make sure to test whether the language model you are using meets the quality requirements of your use case
- Language models notoriously have a high energy consumption. If you want to reduce load on your server, you can choose smaller or quantized models in exchange for lower accuracy
- Customer support is available upon request; however, we can’t solve false or problematic output, most performance issues, or other problems caused by the underlying model. Support is thus limited to bugs directly caused by the implementation of the app (connectors, API, front-end, AppAPI)
Addendum: Running with a fully open model
If you would like to use a fully open model that achieves a green score on our Ethical AI rating, we recommend the following model:
- OLMo 2 (either in 7B or 13B): https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct-GGUF 
What makes OLMo a fully open model?
- The code for training, fine-tuning and inference of the model is publicly available and fully open source 
- The training data with which the model is pretrained is publicly available 
- The model itself is publicly available and fully open source 
- The instruction tuning data is publicly available 
- The reinforcement learning model is publicly available and fully open source
Limitations
- OLMo currently only works well with English language input 
- In our tests it sometimes produced hallucinated or garbled output; make sure to thoroughly test the model for your use case 
- It cannot use tools, so it cannot be used in conjunction with Context Agent