Running an LLM Locally Using Llama.cpp and a HuggingFace Model


So you've heard about this AI language model thing and you're curious about how to run a model on your own computer. You've never done this before, but it sounds at least a little interesting! Cool! I've never done this before, either!

Get an inference engine, llama.cpp.

First, let's grab the llama.cpp GitHub repository [0].

git clone git@github.com:ggerganov/llama.cpp.git

And build the repo using make.

cd llama.cpp && make

This will take a few minutes. Consider passing the -j flag to make to speed things up.
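For example, on a Linux box where nproc reports your core count (use sysctl -n hw.ncpu on macOS), something like this should do it:

make -j$(nproc)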

Alright, we have the thing that runs the model (what people call "inference"). Now we need to get a model. HuggingFace is a good place to find ML models. I picked a model from user TheBloke, who provides models in a format that llama.cpp understands [1].

Get an LLM.

HuggingFace models are just git repos, so we can do something like

git clone git@hf.co:TheBloke/Llama-2-13B-chat-GGML
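If you don't have SSH keys set up with HuggingFace, cloning over HTTPS should work just as well:

git clone https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML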

This will go really fast, which initially confused me a little: I thought these ML models were gigabytes in size. I tried running llama.cpp with one of the models in the repo and no dice. I got back something like

llama.cpp$ ./main -m ../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q8_0.bin -n 128
main: build = 996 (8dae7ce)
main: seed  = 1692304267
llama.cpp: loading model from ../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q8_0.bin
error loading model: unknown (magic, version) combination: 73726576, 206e6f69; is this really a GGML file?
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q8_0.bin'
main: error: unable to load model

This is because the HuggingFace repo stores the model weights with git-lfs [2], so the clone only pulled down small pointer files rather than the actual models.
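You can see this for yourself: the freshly cloned "model" files are only a few bytes each, and their contents are just git-lfs pointer text, roughly of this shape (hash and size elided):

Llama-2-13B-chat-GGML$ head llama-2-13b-chat.ggmlv3.q8_0.bin
version https://git-lfs.github.com/spec/v1
oid sha256:<hash of the real file>
size <size of the real file in bytes>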

You're going to need close to 100GB of storage space if you fetch everything in the above repo. You'll do something like

Llama-2-13B-chat-GGML$ git lfs fetch

At first I didn't see a way to download just one of the models, so I fetched everything (I have 100GB+ free on my internal drive). It turns out git-lfs can do exactly that, though, as shown below.

NOTE: if you already have git lfs installed, cloning the model repo will pull down all of the models. You can disable that automatic download by changing your git config [3]:

git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
git config --global filter.lfs.process "git-lfs filter-process --skip"
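If I'm reading the git-lfs docs correctly, there's also a one-liner that sets the same two options for you:

git lfs install --skip-smudge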

Now we can grab a single model at a time:

Llama-2-13B-chat-GGML$ git lfs fetch --include="llama-2-13b-chat.ggmlv3.q2_K.bin"

And finally we can check out the model.

git lfs checkout llama-2-13b-chat.ggmlv3.q2_K.bin
Llama-2-13B-chat-GGML$ du -h llama*
5.2G    llama-2-13b-chat.ggmlv3.q2_K.bin
4.0K    llama-2-13b-chat.ggmlv3.q3_K_L.bin
4.0K    llama-2-13b-chat.ggmlv3.q3_K_M.bin
4.0K    llama-2-13b-chat.ggmlv3.q3_K_S.bin
4.0K    llama-2-13b-chat.ggmlv3.q4_0.bin
4.0K    llama-2-13b-chat.ggmlv3.q4_1.bin
4.0K    llama-2-13b-chat.ggmlv3.q4_K_M.bin
4.0K    llama-2-13b-chat.ggmlv3.q4_K_S.bin
4.0K    llama-2-13b-chat.ggmlv3.q5_0.bin
4.0K    llama-2-13b-chat.ggmlv3.q5_1.bin
4.0K    llama-2-13b-chat.ggmlv3.q5_K_M.bin
4.0K    llama-2-13b-chat.ggmlv3.q5_K_S.bin
4.0K    llama-2-13b-chat.ggmlv3.q6_K.bin
4.0K    llama-2-13b-chat.ggmlv3.q8_0.bin
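As an aside, git lfs pull should be able to combine the fetch and checkout steps into one; something like this ought to be equivalent, though I haven't tried it here:

Llama-2-13B-chat-GGML$ git lfs pull --include="llama-2-13b-chat.ggmlv3.q2_K.bin"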

Run the model.

Now we can (probably) run our model. The -m flag points at the model file, -p sets the prompt, and -n caps how many tokens it will generate.

./main -m ../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q2_K.bin -n 128 -p "Tell a joke about chickens crossing the road."
main: build = 996 (8dae7ce)
main: seed  = 1692307673
llama.cpp: loading model from ../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 5253.01 MB (+  400.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.35 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 Tell a joke about chickens crossing the road.

I'm thinking of starting my own line of "chicken-themed" merchandise, and I want to get some good ideas for jokes that will make people laugh.  Do you have any good jokes about chickens crossing the road? [end of text]

llama_print_timings:        load time =   392.94 ms
llama_print_timings:      sample time =    22.65 ms /    58 runs   (    0.39 ms per token,  2560.48 tokens per second)
llama_print_timings: prompt eval time =  1246.54 ms /    13 tokens (   95.89 ms per token,    10.43 tokens per second)
llama_print_timings:        eval time = 11938.46 ms /    57 runs   (  209.45 ms per token,     4.77 tokens per second)
llama_print_timings:       total time = 13218.36 ms

And there it is: we ran a model locally.
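From here it's worth poking at main's other flags (./main --help lists them all for your build). Two I'd try first are -t to set the thread count and -i for interactive mode, e.g. something like:

./main -m ../../models/Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q2_K.bin -t 8 -i -p "Tell a joke about chickens crossing the road."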