AI on the Bleeding Edge: Run Llama LLM Locally on GPU CUDA with NVIDIA Jetson Orin Nano on Jetpack 7.2

Hey there, world! Paul McWhorter here. You know me—I don’t just want to use technology; I want to understand exactly how it works under the hood. Today, we’re taking the NVIDIA Jetson Orin Nano and making it “think” right here on our own hardware.

We are bypassing the heavy, automated installers to build llama.cpp from source. This is the gold standard for high-performance AI on edge devices. Let’s get to work!

Part 1: Standalone Llama.cpp Build

First, we need to prepare our engine. This step takes the raw source code from GitHub and compiles it specifically for the Orin Nano’s GPU using the CUDA toolkit.

# =====================================================================
# PART 1: STANDALONE LLAMA.CPP BUILD
# && means only go to next command if this one works
# \ is like hitting enter
# this allows us to make a script.
# =====================================================================
cd ~ && \
git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc && \
cmake --build build --config Release --parallel $(nproc) && \
mkdir -p ~/models && \
wget -O ~/models/qwen.gguf https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_k_m.gguf && \
echo "=== Part 1 Complete: Standalone High-Performance Backend Built ==="

# =====================================================================

# PART 1: STANDALONE LLAMA.CPP BUILD

# && means only go to next command if this one works

# \ is like hitting enter

# this allows us to make a script.

# =====================================================================

cd ~ && \

git clone https://github.com/ggerganov/llama.cpp && \

cd llama.cpp && \

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc && \

cmake --build build --config Release --parallel $(nproc) && \

mkdir -p ~/models && \

wget -O ~/models/qwen.gguf https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_k_m.gguf && \

echo "=== Part 1 Complete: Standalone High-Performance Backend Built ==="

What’s happening here? We are cloning the project, creating a “build” blueprint that tells the compiler to use your GPU (CUDA), and then using your Orin’s full processing power (nproc) to assemble the program. We also create a folder to keep our “brain” files neat and tidy.

Part 2: Run in Web Interface

Now that the engine is ready, let’s launch the server and start chatting with our first model, Qwen.

# =====================================================================
# PART 2 RUN IN WEB INTERFACE
# =====================================================================
cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/qwen.gguf \
  --n-gpu-layers 99 \
  --port 8080

# =====================================================================

# PART 2 RUN IN WEB INTERFACE

# =====================================================================

cd ~/llama.cpp

./build/bin/llama-server \

-m ~/models/qwen.gguf \

--n-gpu-layers 99 \

--port 8080

What’s happening here? We move into the folder where we built our engine and launch the server. The --n-gpu-layers 99 flag is the magic! It tells the system to push as many model layers as possible into the GPU memory. The --port 8080 defines the digital “door” our web browser will use to chat with the AI at http://localhost:8080.

Part 3: Download and Run a New Model

One of the best things about llama.cpp is how easy it is to swap out “brains.” Let’s download a more advanced model, clear our network port, and fire it up!

# =====================================================================
# PART 3 DOWNLOAD AND RUN A NEW MODEL
# =====================================================================
#Now try different model (RUN ONE COMMAND AT A TIME)
wget -O ~/models/phi-4-mini.gguf https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/resolve/main/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf
#between runs it is good to kill the port 8080
fuser -k 8080/tcp
# Now Lets Run the phi-4-mini.gguf
cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/phi-4-mini.gguf \
  --n-gpu-layers 99 \
  --port 8080

#

# =====================================================================

# PART 3 DOWNLOAD AND RUN A NEW MODEL

# =====================================================================

#Now try different model (RUN ONE COMMAND AT A TIME)

wget -O ~/models/phi-4-mini.gguf https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/resolve/main/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf

#between runs it is good to kill the port 8080

fuser -k 8080/tcp

# Now Lets Run the phi-4-mini.gguf

cd ~/llama.cpp

./build/bin/llama-server \

-m ~/models/phi-4-mini.gguf \

--n-gpu-layers 99 \

--port 8080

What’s happening here? We download the phi-4-mini model, and then we use fuser -k 8080/tcp. This acts like a “master key”—if the previous server process didn’t close properly, this forces the port open so we don’t get any “address in use” errors. Then, we launch the server again, pointing it to our new model!

You’re in Control

Once that server is live, you’re not just watching AI happen—you’re running it! Keep an eye on your terminal logs, watch that GPU utilization jump, and remember: you are working on the absolute bleeding edge of local AI performance.

Buckle up, let’s do some exciting projects together. Drop those tokens-per-second scores in the comments!

A Long Way to Go

Guys our eventual goal is to get nemoclaw operating as an agent on the Jetson Orin Nano on Jetpack 7.2. Our first effort was to run Llama and Ollama on the Jetson Orin. We were successful with that but the challenge way, using the canned install commands, we ended up running on the CPU not the Cuda GPU. Today we have a major step forward as we are now running on GPU, with the core models. Next up, we will try to get it running under Olama, while still staying on the GPU.

WHAT HAPPENS ON YOUR DESKTOP STAYS ON YOUR DESKTOP!

OK, here is your homework. Download all the models we looked at last week using the method above. When complete, you should have these models:

Model	Model Family	Size / Parameter Count	Best Used For
`gemma3:1b`	Google Gemma 3	1 Billion	Ultra-fast responses, light footprint
`llama3.2:1b`	Meta Llama 3.2	1 Billion	High-efficiency conversational loops
`phi4-mini:3.8b`	Microsoft Phi-4	3.8 Billion	Heavy reasoning and coding logic
`qwen3:4b`	Alibaba Qwen 3	4 Billion	Structured data and multilingual logic
`qwen3.5:4b`	Alibaba Qwen 3.5	4 Billion	Advanced context processing
`gemma3:4b`	Google Gemma 3	4 Billion	Maximum analytical depth on Orin Nano

Technology Tutorials

AI on the Bleeding Edge: Run Llama LLM Locally on GPU CUDA with NVIDIA Jetson Orin Nano on Jetpack 7.2

Part 1: Standalone Llama.cpp Build

Part 2: Run in Web Interface

Part 3: Download and Run a New Model

You’re in Control

Making The World a Better Place One High Tech Project at a Time. Enjoy!