Tag Archives: LLM

Hey guys, Paul McWhorter here from toptechboy.com. Today, we are going to look at how to stop fighting your hardware and start running local AI like an absolute boss.

The NVIDIA Jetson Orin Nano is an absolute masterpiece of edge-compute hardware. But it has one major design constraint that catches almost every beginner off guard: Unified Memory. On the Orin, your CPU and your GPU share the exact same physical pool of 8GB of LPDDR5 RAM. When you boot into that pretty Ubuntu GNOME desktop, the system instantly steals over 1.5 GB of your precious VRAM just to draw a GUI you aren’t even looking at while your code is running.

In this lesson, we are going to reclaim that stolen memory, optimize our storage, and run a massive 8-billion parameter model (LLaMA 3.1 8B) smoothly on the Orin Nano by learning how to properly run a clean, headless configuration.

Step 1: Find Your Jetson’s IP Address

Before we turn off the monitor, we need to know how to talk to the Orin over the network. If you don’t know its IP address, you can’t SSH in once the screen goes dark. While you are still in the graphical terminal, run this command:

ifconfig

ifconfig

Look under your active connection interface (usually eth0 for Ethernet or wlan0 for Wi-Fi) for the inet address. It will look something like 192.168.1.15. Write this down! You will need it to remote in later.

Step 2: Disable and Remove the Default Swap File

By default, JetPack configures a slow, disk-based swap file on your NVMe drive. While swap space is great for general computing, it is an absolute performance killer for LLMs. If your model spillover starts paging to a disk-based swap file, your tokens-per-second will drop to a crawl, and the high-frequency writes will prematurely wear out your SSD.

We want our models running purely in ultra-fast LPDDR5 RAM. Let’s cleanly turn off and remove the swap file:

# 1. Turn off the active swap space
sudo swapoff -a

# 2. Delete the physical swap file from your drive
sudo rm /swapfile

# 3. Prevent it from mounting on next boot
# Open your fstab file:
sudo nano /etc/fstab

# Find the line containing '/swapfile' and add a '#' at the beginning to comment it out.
# Save and exit (Ctrl+O, Enter, Ctrl+X).

# 1. Turn off the active swap space

sudo swapoff -a

# 2. Delete the physical swap file from your drive

sudo rm /swapfile

# 3. Prevent it from mounting on next boot

# Open your fstab file:

sudo nano /etc/fstab

# Find the line containing '/swapfile' and add a '#' at the beginning to comment it out.

# Save and exit (Ctrl+O, Enter, Ctrl+X).

Step 3: Configure a Clean Boot into the Terminal

Many people will tell you to run sudo systemctl isolate multi-user.target to turn off the GUI. Do not do this! That command aggressively tears down active background services (including Ollama, network managers, and local development scripts) because it forces a state isolation.

Instead, we want to tell the Orin’s bootloader to cleanly start up in command-line mode from a fresh boot. This allows all your network drivers, background scripts, and Ollama to initialize perfectly without a display manager eating your memory:

sudo systemctl set-default multi-user.target

1	sudo systemctl set-default multi-user.target

Once you run this, restart your Orin to let the changes take effect cleanly:

sudo reboot

1	sudo reboot

How to Boot Back to the GUI (If Needed)

We are developers, which means we want to write and debug our scripts comfortably under the graphical desktop, and then deploy them headlessly. If you ever need to turn your monitor back on and return to the GNOME desktop, simply run this command over SSH:

sudo systemctl set-default graphical.target

1	sudo systemctl set-default graphical.target

Followed by a quick reboot (sudo reboot), and your desktop interface will return exactly as it was.

Step 5: The Test — Running LLaMA 3.1 8B in the GUI

To prove why this matters, let’s look at what happens when you try to force a large model to run while your monitor is plugged in and the graphical desktop is active. Open your terminal in the GUI and run:

ollama run llama3.1:8b --verbose

1	ollama run llama3.1:8b --verbose

The Result: The model will either completely crash with an “Out of Memory” (OOM) error, or it will run painfully slow, chugging out less than 2 tokens per second.

The “Why”: Where Did Your Memory Go?

An 8-billion parameter model quantized to 4-bits requires roughly 4.7 GB of static memory just to fit its weights. When you add the Context Window (KV Cache), that memory requirement quickly balloons to over 5.5 GB.

Here is exactly how your 8GB Orin Nano’s memory is divided when you run a GUI:

System State	Memory Allocation (Approximate)
OS Kernel & System Daemons	~1.2 GB
GNOME Desktop GUI (Monitor Active)	~1.6 GB
Available VRAM for AI	~5.2 GB (Not enough for 8B models + Context!)

Because the GUI steals 1.6 GB, your available memory drops below the critical threshold required to run LLaMA 3.1 8B. The moment your context grows, the system runs out of room, hits a bottleneck, or crashes.

Step 6: Reclaiming the Hardware (Headless Memory Profile)

Now let’s look at the memory profile when we boot the Orin Nano cleanly into the terminal without GDM3 starting up. If you SSH in and run free -h or check jtop, this is what you get:

System State	Memory Allocation (Approximate)
OS Kernel & System Daemons	~1.2 GB
GNOME Desktop GUI	0.0 GB (COMPLETELY RECLAIMED!)
Available VRAM for AI	~6.8 GB (Plenty of headroom for 8B models!)

By going headless, we instantly reclaimed **1.6 GB of ultra-fast VRAM**. That is the difference between night and day when deploying edge AI models.

Step 7: Connect from Windows PowerShell

Now that your Orin is booted headlessly, unplug the monitor, keyboard, and mouse. Walk back to your main Windows development machine, open up **PowerShell**, and SSH directly into the Orin over your local network using the IP address you saved in Step 1:

ssh pjm@192.168.1.15

1	ssh pjm@192.168.1.15

(Be sure to replace “pjm” with your actual Orin username and use your specific IP address!)

Step 8: Run LLaMA 3.1 8B Like a Boss

With your GUI safely dead and your memory completely optimized, run the exact same model command inside your PowerShell session:

ollama run llama3.1:8b --verbose

1	ollama run llama3.1:8b --verbose

The Payoff: Because the system now has a massive 6.8 GB of free, continuous VRAM, the model loads entirely into the Orin’s hardware engines. You will see prompt evaluations complete instantly, and the text will output at an extremely usable speed without a single memory warning or system hiccup.

That is how you cleanly manage your hardware resources, develop efficiently, and run large local LLMs on the edge like an absolute boss.

If you enjoyed this write-up, leave a comment below, subscribe to the channel, and I will see you guys in the next lesson!

🎓 Homework: Show Your Work!

Alright guys, no excuses! If you want to truly master this hardware, you cannot just sit there and watch me do it—you have to get your hands dirty. For your homework today, I want to see you running your own LLaMA 3.1 8B model headlessly on your Orin Nano. Show what tokens per second you are getting on this big modal. Create your own favorite query to show how well the model works. Show me that terminal proof and the memory savings!

Here is the plan:

Record a video of your setup successfully running the model headlessly.
Upload your video to YouTube.
In the description of your YouTube video, you must include a link back to this main tutorial video at the very top of your description.
Post a link to your homework video in the comments section on the video above, running your models like a boss.

Now, get to work! I am looking forward to seeing what you guys build.

AI On the Edge, NVIDIA

AI on the Bleeding Edge: Run Llama LLM Locally on GPU CUDA with NVIDIA Jetson Orin Nano on Jetpack 7.2

June 10, 2026 admin

Hey there, world! Paul McWhorter here. You know me—I don’t just want to use technology; I want to understand exactly how it works under the hood. Today, we’re taking the NVIDIA Jetson Orin Nano and making it “think” right here on our own hardware.

We are bypassing the heavy, automated installers to build llama.cpp from source. This is the gold standard for high-performance AI on edge devices. Let’s get to work!

Part 1: Standalone Llama.cpp Build

First, we need to prepare our engine. This step takes the raw source code from GitHub and compiles it specifically for the Orin Nano’s GPU using the CUDA toolkit.

# =====================================================================
# PART 1: STANDALONE LLAMA.CPP BUILD
# && means only go to next command if this one works
# \ is like hitting enter
# this allows us to make a script.
# =====================================================================
cd ~ && \
git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc && \
cmake --build build --config Release --parallel $(nproc) && \
mkdir -p ~/models && \
wget -O ~/models/qwen.gguf https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_k_m.gguf && \
echo "=== Part 1 Complete: Standalone High-Performance Backend Built ==="

# =====================================================================

# PART 1: STANDALONE LLAMA.CPP BUILD

# && means only go to next command if this one works

# \ is like hitting enter

# this allows us to make a script.

# =====================================================================

cd ~ && \

git clone https://github.com/ggerganov/llama.cpp && \

cd llama.cpp && \

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc && \

cmake --build build --config Release --parallel $(nproc) && \

mkdir -p ~/models && \

wget -O ~/models/qwen.gguf https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_k_m.gguf && \

echo "=== Part 1 Complete: Standalone High-Performance Backend Built ==="

What’s happening here? We are cloning the project, creating a “build” blueprint that tells the compiler to use your GPU (CUDA), and then using your Orin’s full processing power (nproc) to assemble the program. We also create a folder to keep our “brain” files neat and tidy.

Part 2: Run in Web Interface

Now that the engine is ready, let’s launch the server and start chatting with our first model, Qwen.

# =====================================================================
# PART 2 RUN IN WEB INTERFACE
# =====================================================================
cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/qwen.gguf \
  --n-gpu-layers 99 \
  --port 8080

# =====================================================================

# PART 2 RUN IN WEB INTERFACE

# =====================================================================

cd ~/llama.cpp

./build/bin/llama-server \

-m ~/models/qwen.gguf \

--n-gpu-layers 99 \

--port 8080

What’s happening here? We move into the folder where we built our engine and launch the server. The --n-gpu-layers 99 flag is the magic! It tells the system to push as many model layers as possible into the GPU memory. The --port 8080 defines the digital “door” our web browser will use to chat with the AI at http://localhost:8080.

Part 3: Download and Run a New Model

One of the best things about llama.cpp is how easy it is to swap out “brains.” Let’s download a more advanced model, clear our network port, and fire it up!

# =====================================================================
# PART 3 DOWNLOAD AND RUN A NEW MODEL
# =====================================================================
#Now try different model (RUN ONE COMMAND AT A TIME)
wget -O ~/models/phi-4-mini.gguf https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/resolve/main/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf
#between runs it is good to kill the port 8080
fuser -k 8080/tcp
# Now Lets Run the phi-4-mini.gguf
cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/phi-4-mini.gguf \
  --n-gpu-layers 99 \
  --port 8080

#

# =====================================================================

# PART 3 DOWNLOAD AND RUN A NEW MODEL

# =====================================================================

#Now try different model (RUN ONE COMMAND AT A TIME)

wget -O ~/models/phi-4-mini.gguf https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/resolve/main/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf

#between runs it is good to kill the port 8080

fuser -k 8080/tcp

# Now Lets Run the phi-4-mini.gguf

cd ~/llama.cpp

./build/bin/llama-server \

-m ~/models/phi-4-mini.gguf \

--n-gpu-layers 99 \

--port 8080

What’s happening here? We download the phi-4-mini model, and then we use fuser -k 8080/tcp. This acts like a “master key”—if the previous server process didn’t close properly, this forces the port open so we don’t get any “address in use” errors. Then, we launch the server again, pointing it to our new model!

You’re in Control

Once that server is live, you’re not just watching AI happen—you’re running it! Keep an eye on your terminal logs, watch that GPU utilization jump, and remember: you are working on the absolute bleeding edge of local AI performance.

Buckle up, let’s do some exciting projects together. Drop those tokens-per-second scores in the comments!

A Long Way to Go

Guys our eventual goal is to get nemoclaw operating as an agent on the Jetson Orin Nano on Jetpack 7.2. Our first effort was to run Llama and Ollama on the Jetson Orin. We were successful with that but the challenge way, using the canned install commands, we ended up running on the CPU not the Cuda GPU. Today we have a major step forward as we are now running on GPU, with the core models. Next up, we will try to get it running under Olama, while still staying on the GPU.

WHAT HAPPENS ON YOUR DESKTOP STAYS ON YOUR DESKTOP!

OK, here is your homework. Download all the models we looked at last week using the method above. When complete, you should have these models:

Model	Model Family	Size / Parameter Count	Best Used For
`gemma3:1b`	Google Gemma 3	1 Billion	Ultra-fast responses, light footprint
`llama3.2:1b`	Meta Llama 3.2	1 Billion	High-efficiency conversational loops
`phi4-mini:3.8b`	Microsoft Phi-4	3.8 Billion	Heavy reasoning and coding logic
`qwen3:4b`	Alibaba Qwen 3	4 Billion	Structured data and multilingual logic
`qwen3.5:4b`	Alibaba Qwen 3.5	4 Billion	Advanced context processing
`gemma3:4b`	Google Gemma 3	4 Billion	Maximum analytical depth on Orin Nano

NVIDIA

No Cloud. No Internet. No Problem. Two Commands for Local LLM on Jetson Orin Nano

June 7, 2026 admin

Hey guys, welcome back to the channel. Paul McWhorter here from TopTechBoy.com. Today, we aren’t just messing around with simple circuits or basic scripts—we are going to take that NVIDIA Jetson Orin Nano we rescued from the brink of destruction in the last video, and we are going to turn it into a completely sovereign, local thinking machine.

I don’t know about you, but I am tired of Big Tech telling me I need a credit card, a monthly subscription, and a constant high-speed internet connection just to make an AI model reply to a prompt. Today, we are going to do it completely naked. We are going to cut the cord, pull the ethernet, and run cutting-edge Large Language Models entirely on the local physical silicon of your Jetson Orin Nano.

And we are going to do it in exactly two commands. One to build the engine room, and one to fire up the mind.

Let’s get started.

The Hardware Architecture

Before we drop the code into the terminal, let’s understand exactly what we are building today. We are dealing with three core components working together in a unified system.

The Model (The Fuel): This is your raw neural network file (like Google Gemma or Meta Llama). It contains the weights, vocabulary, and potential intelligence. On its own, it’s just a massive, inert file sitting on your storage drive.
Ollama (The Engine Room): This is the heavy lifter. Ollama is a local execution framework that takes that raw model file and boots it directly into the Jetson’s unified RAM and CUDA cores. It handles the brutal mathematical calculations required to generate tokens.
The Terminal Chat (The Dashboard): This is your interface. It provides the clean command-line text box for you to type your prompts and prints the model’s responses back to you in real time.

The Two-Command Installation

Go ahead and fire up your Jetson Orin Nano, open a fresh terminal window, and get ready to type. Remember: copying and pasting makes you weak. Type these out like a real engineer so your hands learn the muscle memory.

Command 1: Install the Ollama Engine

This command fetches the official automated bootstrapper script from Ollama and executes it locally to configure the background system service on your host OS.

curl -fsSL https://ollama.com/install.sh | sh

1	curl -fsSL https://ollama.com/install.sh \| sh

Command 2: Fire Up the Local Model

Once the installation script finishes, your engine room is live. Now, tell Ollama to pull down the optimized 1-billion parameter Google Gemma model and launch an interactive local dialog loop instantly:

ollama run gemma3:1b

1	ollama run gemma3:1b

The moment you hit enter, your Jetson will download the model weights directly to your local drive, load them straight into the VRAM, and drop you into a clean prompt box. Type a question, hit enter, and watch your local silicon generate answers with zero cloud dependencies.

Choosing the Right Mind for Your Machine

The beautiful part about setting up Ollama is that you aren’t locked into just one model. Different models have different parameter sizes and strengths. On the 8GB Jetson Orin Nano, you want to balance model size against your available hardware headroom to keep your generation speeds crisp.

Here are the verified, hardware-accelerated local models you can experiment with right out of the box:

Launch Command	Model Family	Size / Parameter Count	Best Used For
`ollama run gemma3:1b`	Google Gemma 3	1 Billion	Ultra-fast responses, light footprint
`ollama run llama3.2:1b`	Meta Llama 3.2	1 Billion	High-efficiency conversational loops
`ollama run phi4-mini:3.8b`	Microsoft Phi-4	3.8 Billion	Heavy reasoning and coding logic
`ollama run qwen3:4b`	Alibaba Qwen 3	4 Billion	Structured data and multilingual logic
`ollama run qwen3.5:4b`	Alibaba Qwen 3.5	4 Billion	Advanced context processing
`ollama run gemma3:4b`	Google Gemma 3	4 Billion	Maximum analytical depth on Orin Nano

⚠️ Paul’s Engineering Note on Headroom

The 1B (1-Billion parameter) models are incredibly light and will run at lightning speed on the Orin Nano. If you want to push the machine harder for more complex reasoning, step up to the 3.8B or 4B models. Just keep an eye on your system resources—running a 4B model pushes close to the limits of the Orin Nano’s 8GB unified memory architecture, especially if you are running a heavy graphical desktop environment in the background!

To exit out of any active terminal chat session and return to your standard command prompt, simply type:

/exit

/exit

Homework Assignment

Alright, you have the hardware running, you have the engine installed, and you know how to switch out the minds of your machine. Now it’s time for your homework.

I want you to install both the gemma3:1b model and the heavier gemma3:4b model on your Jetson Orin Nano. Run them both through a test sequence: ask them to write a simple Python script, and then ask them a complex logic riddle.

I want you to observe the difference in quality of thought versus speed of generation. Is the 4-billion parameter model smart enough to justify the extra computation time on your hardware, or does the 1-billion parameter model give you the snappy responsiveness you need for a real-time edge application?

Leave a comment down under the video showing your results, tell me which model you prefer running natively on your bench, and I will see you guys in the next lesson!

Technology Tutorials

Tag Archives: LLM

Running Headless on the NVIDIA Jetson Orin Nano on Jetpack 7.2: Run Big Local LLM’s Like a Boss

Step 1: Find Your Jetson’s IP Address

Step 2: Disable and Remove the Default Swap File

Step 3: Configure a Clean Boot into the Terminal

How to Boot Back to the GUI (If Needed)

Step 5: The Test — Running LLaMA 3.1 8B in the GUI

The “Why”: Where Did Your Memory Go?

Step 6: Reclaiming the Hardware (Headless Memory Profile)

Step 7: Connect from Windows PowerShell

Step 8: Run LLaMA 3.1 8B Like a Boss

🎓 Homework: Show Your Work!

AI on the Bleeding Edge: Run Llama LLM Locally on GPU CUDA with NVIDIA Jetson Orin Nano on Jetpack 7.2

Part 1: Standalone Llama.cpp Build

Part 2: Run in Web Interface

Part 3: Download and Run a New Model

You’re in Control

No Cloud. No Internet. No Problem. Two Commands for Local LLM on Jetson Orin Nano

The Hardware Architecture

The Two-Command Installation

Command 1: Install the Ollama Engine

Command 2: Fire Up the Local Model

Choosing the Right Mind for Your Machine

⚠️ Paul’s Engineering Note on Headroom

Homework Assignment

Making The World a Better Place One High Tech Project at a Time. Enjoy!