Hey there, world! Paul McWhorter here. You know me—I don’t just want to use technology; I want to understand exactly how it works under the hood. Today, we’re taking the NVIDIA Jetson Orin Nano and making it “think” right here on our own hardware.
We are bypassing the heavy, automated installers to build llama.cpp from source. This is the gold standard for high-performance AI on edge devices. Let’s get to work!
Part 1: Standalone Llama.cpp Build
First, we need to prepare our engine. This step takes the raw source code from GitHub and compiles it specifically for the Orin Nano’s GPU using the CUDA toolkit.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# ===================================================================== # PART 1: STANDALONE LLAMA.CPP BUILD # && means only go to next command if this one works # \ is like hitting enter # this allows us to make a script. # ===================================================================== cd ~ && \ git clone https://github.com/ggerganov/llama.cpp && \ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc && \ cmake --build build --config Release --parallel $(nproc) && \ mkdir -p ~/models && \ wget -O ~/models/qwen.gguf https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_k_m.gguf && \ echo "=== Part 1 Complete: Standalone High-Performance Backend Built ===" |
-
What’s happening here? We are cloning the project, creating a “build” blueprint that tells the compiler to use your GPU (CUDA), and then using your Orin’s full processing power (
nproc) to assemble the program. We also create a folder to keep our “brain” files neat and tidy.
Part 2: Run in Web Interface
Now that the engine is ready, let’s launch the server and start chatting with our first model, Qwen.
|
1 2 3 4 5 6 7 8 |
# ===================================================================== # PART 2 RUN IN WEB INTERFACE # ===================================================================== cd ~/llama.cpp ./build/bin/llama-server \ -m ~/models/qwen.gguf \ --n-gpu-layers 99 \ --port 8080 |
-
What’s happening here? We move into the folder where we built our engine and launch the server. The
--n-gpu-layers 99flag is the magic! It tells the system to push as many model layers as possible into the GPU memory. The--port 8080defines the digital “door” our web browser will use to chat with the AI athttp://localhost:8080.
Part 3: Download and Run a New Model
One of the best things about llama.cpp is how easy it is to swap out “brains.” Let’s download a more advanced model, clear our network port, and fire it up!
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# ===================================================================== # PART 3 DOWNLOAD AND RUN A NEW MODEL # ===================================================================== #Now try different model (RUN ONE COMMAND AT A TIME) wget -O ~/models/phi-4-mini.gguf https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/resolve/main/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf #between runs it is good to kill the port 8080 fuser -k 8080/tcp # Now Lets Run the phi-4-mini.gguf cd ~/llama.cpp ./build/bin/llama-server \ -m ~/models/phi-4-mini.gguf \ --n-gpu-layers 99 \ --port 8080 # |
-
What’s happening here? We download the
phi-4-minimodel, and then we usefuser -k 8080/tcp. This acts like a “master key”—if the previous server process didn’t close properly, this forces the port open so we don’t get any “address in use” errors. Then, we launch the server again, pointing it to our new model!
You’re in Control
Once that server is live, you’re not just watching AI happen—you’re running it! Keep an eye on your terminal logs, watch that GPU utilization jump, and remember: you are working on the absolute bleeding edge of local AI performance.
Buckle up, let’s do some exciting projects together. Drop those tokens-per-second scores in the comments!
A Long Way to Go
Guys our eventual goal is to get nemoclaw operating as an agent on the Jetson Orin Nano on Jetpack 7.2. Our first effort was to run Llama and Ollama on the Jetson Orin. We were successful with that but the challenge way, using the canned install commands, we ended up running on the CPU not the Cuda GPU. Today we have a major step forward as we are now running on GPU, with the core models. Next up, we will try to get it running under Olama, while still staying on the GPU.
WHAT HAPPENS ON YOUR DESKTOP STAYS ON YOUR DESKTOP!
OK, here is your homework. Download all the models we looked at last week using the method above. When complete, you should have these models:
| Model | Model Family | Size / Parameter Count | Best Used For |
gemma3:1b |
Google Gemma 3 | 1 Billion | Ultra-fast responses, light footprint |
llama3.2:1b |
Meta Llama 3.2 | 1 Billion | High-efficiency conversational loops |
phi4-mini:3.8b |
Microsoft Phi-4 | 3.8 Billion | Heavy reasoning and coding logic |
qwen3:4b |
Alibaba Qwen 3 | 4 Billion | Structured data and multilingual logic |
qwen3.5:4b |
Alibaba Qwen 3.5 | 4 Billion | Advanced context processing |
gemma3:4b |
Google Gemma 3 | 4 Billion | Maximum analytical depth on Orin Nano |
