Today I installed some AI tools, specifically llama.cpp and privateGPT. This process took several turns, and on my aging(?) laptop (i7-9750H with an Nvidia 2070 Max-Q, 32GB RAM, and SSDs) the queries are painfully slow. I’m using the llama-65b.ggmlv3.q2_K.bin language model from huggingface.co, which by itself wants a considerable amount of memory, but I did configure llama.cpp to use CUDA and am offloading about 6GB of the model to my GPU, which helps.
Excitingly, one of the first things I did after getting llama.cpp compiled with CUDA support and fetching the model was ask it what the capital of Spain was. It answered “Madrid” – terse, accurate, and to the point. Admittedly that answer took about two minutes to retrieve, which is less than ideal, but it was an arbitrary question and it didn’t go to the internet for an answer, so win!
Llama.cpp Installation
Setting up Llama.cpp to run on its own (no private data) was pretty simple. First clone the git repository from:
https://github.com/ggerganov/llama.cpp
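For reference, that ends up being something like:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp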
Then install cuda-tools for your Nvidia card:
sudo pacman -S cuda-tools cuda
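If the build later complains that it can’t find nvcc, note that the Arch cuda package installs the toolkit under /opt/cuda, so you may need to log back in or add /opt/cuda/bin to your PATH. A quick check that the compiler is visible:
nvcc --version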
Then just build with CUBLAS enabled:
make LLAMA_CUBLAS=1
You can then run llama.cpp using “./main” from this directory. Swap in whichever ggml model you happen to download and change other settings to your liking (‘./main --help’ for info). My current command is:
./main -t 6 -ngl 16 -m llama-65b.ggmlv3.q2_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
At this point, running this command, you should be able to chat with the model. Note that it is slow, at least on my hardware, and I haven’t tested the extent of its knowledge, but it seems like a good start to me.
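If you just want a one-shot answer instead of an interactive session, something like this (reusing the same model and GPU offload settings) should also work:
./main -m llama-65b.ggmlv3.q2_K.bin -t 6 -ngl 16 -p "What is the capital of Spain?" -n 64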
privateGPT Installation
Installing privateGPT was more concerning to me as it is built in Python and requires a number of Python dependencies. While there is nothing inherently wrong with this, I don’t like it as much as the llama.cpp approach. It does have the distinct advantage of offering a very easy way to parse your own files as part of the dataset. I use many specifications for my day job and loaded them in. After doing so I spent some time asking it questions. It’s still quite slow (1-2 minutes per query) when using a relatively small model (1B), but it did reasonably well. I’m in the process of loading the data using the 65B model from above and am hoping for more accurate results. To set up privateGPT using llama.cpp I followed this guide: https://hackernoon.com/how-to-install-privategpt-a-local-chatgpt-like-instance-with-no-internet-required/
The primary difference from the guide was that I installed llama-cpp-python with GPU support:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
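A quick way to sanity-check that the CUDA build actually took is to load the model directly from Python and watch the startup output for layers being offloaded to the GPU (the model path here is just where I keep my local copy):
python3 -c "from llama_cpp import Llama; Llama(model_path='models/llama-65b.ggmlv3.q2_K.bin', n_gpu_layers=16)"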
Otherwise, I followed the guide exactly. I’m currently experimenting with the 65b model from above using the all-mpnet-base-v2 embeddings for import parsing. We’ll see how it goes. My .env file is:
PERSIST_DIRECTORY=db
# MODEL_TYPE=GPT4All
MODEL_TYPE=LlamaCpp
#MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
#MODEL_PATH=models/ggml-alpaca-7b-q4.bin
MODEL_PATH=models/llama-65b.ggmlv3.q2_K.bin
# EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
MODEL_N_CTX=1000
#MODEL_N_BATCH=8
MODEL_N_BATCH=1024
TARGET_SOURCE_CHUNKS=4
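With the .env in place, the day-to-day workflow from the guide boils down to dropping your documents into the source_documents/ directory and running the two scripts that ship with privateGPT (the specs path below is just a placeholder for wherever your files live):
cp /path/to/your/specs/*.pdf source_documents/
python ingest.py
python privateGPT.py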