NVIDIA Jetson Orin Nano: How to Create and Use Swap Space for Larger Local LLM Models

In our first LLM class, we learned that we could run several of the smaller LLMs on the Jetson Orin. In the last class, we saw that we could run those small models on the GPU, and we noticed faster performance as we moved the workload to the GPU. The next challenge was that the larger models could fail with EOF (End of File) errors, which usually means the process crashed because it ran out of memory. So we need to work through this more methodically, and we need to run real benchmarks. Remember, we are running with the Big Dogs, and we face a tradeoff between running more slowly on the CPU and switching to the GPU with unpredictable throttling.

Our approach today is to deal with the memory issue. We will begin by turning off the GPU modifications and operating on the CPU only. We will address the memory issue by creating swap space, then benchmark our models running on the CPU and record the results in a spreadsheet. We begin by removing the configuration file that pointed us to the GPU:

Now let's create some swap space by allowing the system to use the SD card or the NVMe drive as extra memory. Note that I am booting from an NVMe drive. If you are using an SD card, you will notice more performance degradation when swap space is in use.
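Since the storage type affects how much swap will slow things down, it can help to confirm which device the root filesystem lives on before going further. The following is a minimal sketch: the function name `storage_kind` is hypothetical, and the device-name patterns assume the usual Linux naming (`/dev/nvme...` for NVMe, `/dev/mmcblk...` for SD card or eMMC).

```shell
# Classify a block device path as NVMe, SD/eMMC, or other, based on its name.
# (Hypothetical helper; the patterns assume standard Linux device naming.)
storage_kind() {
    case "$1" in
        /dev/nvme*)   echo "NVMe" ;;
        /dev/mmcblk*) echo "SD card or eMMC" ;;
        *)            echo "other" ;;
    esac
}

# Device backing the root filesystem, where a swapfile under / would live.
root_dev=$(findmnt -n -o SOURCE / 2>/dev/null || true)
echo "root is on: ${root_dev:-unknown} ($(storage_kind "$root_dev"))"
```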

This command will create a swapfile:

This command will activate the swapfile:

This command will turn off use of the swap space:

This command shows you if the swapfile is being used:

Now let's activate the swapfile:
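The swap steps above can be sketched as four small shell functions, one per step. This is a minimal sketch rather than the exact commands used in class: the path `/swapfile`, the 8G size, and the function names are all assumptions chosen for illustration, so adjust them to your storage.

```shell
# Assumed values (not from the lesson): change to suit your SD card or NVMe.
SWAPFILE="${SWAPFILE:-/swapfile}"   # hypothetical swapfile path
SWAPSIZE="${SWAPSIZE:-8G}"          # hypothetical swapfile size

# Create the swapfile: reserve space, restrict permissions, format as swap.
create_swap() {
    sudo fallocate -l "$SWAPSIZE" "$SWAPFILE" &&
    sudo chmod 600 "$SWAPFILE" &&
    sudo mkswap "$SWAPFILE"
}

# Activate the swapfile (lasts until reboot or until swap_off is run).
swap_on() { sudo swapon "$SWAPFILE"; }

# Turn off use of the swapfile.
swap_off() { sudo swapoff "$SWAPFILE"; }

# Show whether swap is in use: active swap areas, then a memory summary.
swap_status() { swapon --show; free -h; }
```

After pasting these into a terminal, run `create_swap` once, then `swap_on` to activate the file and `swap_status` to confirm it is listed. Activation done this way does not survive a reboot unless the swapfile is also added to /etc/fstab.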

Now each of the models listed below should run on our system, as the swapfile should prevent the EOF error. Larger models are slower simply because they are bigger, and they take a further hit in speed whenever they spill into swap, because swap memory is slower than system RAM.
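To see how much a model actually spills into swap, the kernel's memory counters can be read directly. The sketch below is one way to do that, assuming the standard `SwapTotal` and `SwapFree` fields of /proc/meminfo (reported in kB); the function name `swap_used_mib` is hypothetical.

```shell
# Print swap currently in use, in MiB, from a meminfo-format file.
# $1 is optional and defaults to /proc/meminfo (values there are in kB).
swap_used_mib() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { printf "%d\n", (total - free) / 1024 }' "${1:-/proc/meminfo}"
}
```

Running `swap_used_mib` before loading a model and again while it is answering gives a rough number for the Swap column of the benchmark table.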

 

Model | Model Family | Size / Parameter Count | Best Used For
gemma3:1b | Google Gemma 3 | 1 Billion | Ultra-fast responses, light footprint
llama3.2:1b | Meta Llama 3.2 | 1 Billion | High-efficiency conversational loops
phi4-mini:3.8b | Microsoft Phi-4 | 3.8 Billion | Heavy reasoning and coding logic
qwen3:4b | Alibaba Qwen 3 | 4 Billion | Structured data and multilingual logic
qwen3.5:4b | Alibaba Qwen 3.5 | 4 Billion | Advanced context processing
gemma3:4b | Google Gemma 3 | 4 Billion | Maximum analytical depth on Orin Nano
nemotron-3-nano:4b | NVIDIA Nemotron 3 | 4 Billion | Edge-optimized reasoning and tool-use

For this lesson, we will use CPU computation only. This keeps the comparison between models simple.

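Benchmarking Local LLM on NVIDIA Jetson Orin on JetPack 7.2
Model | CPU/GPU | Power | Prompt Rate | Eval. Rate | Throttling | Correct Answer | Swap
gemma3:1b | CPU | MaxN | | | | |
gemma3:1b | GPU | MaxN | | | | |
gemma3:1b | GPU | 25 Watts | | | | |
gemma3:1b | GPU | 15 Watts | | | | |
llama3.2:1b | CPU | MaxN | | | | |
llama3.2:1b | GPU | MaxN | | | | |
llama3.2:1b | GPU | 25 Watts | | | | |
llama3.2:1b | GPU | 15 Watts | | | | |
phi4-mini:3.8b | CPU | MaxN | | | | |
phi4-mini:3.8b | GPU | MaxN | | | | |
phi4-mini:3.8b | GPU | 25 Watts | | | | |
phi4-mini:3.8b | GPU | 15 Watts | | | | |
qwen3:4b | CPU | MaxN | | | | |
qwen3:4b | GPU | MaxN | | | | |
qwen3:4b | GPU | 25 Watts | | | | |
qwen3:4b | GPU | 15 Watts | | | | |
qwen3.5:4b | CPU | MaxN | | | | |
qwen3.5:4b | GPU | MaxN | | | | |
qwen3.5:4b | GPU | 25 Watts | | | | |
qwen3.5:4b | GPU | 15 Watts | | | | |
gemma3:4b | CPU | MaxN | | | | |
gemma3:4b | GPU | MaxN | | | | |
gemma3:4b | GPU | 25 Watts | | | | |
gemma3:4b | GPU | 15 Watts | | | | |
nemotron-3-nano:4b | CPU | MaxN | | | | |
nemotron-3-nano:4b | GPU | MaxN | | | | |
nemotron-3-nano:4b | GPU | 25 Watts | | | | |
nemotron-3-nano:4b | GPU | 15 Watts | | | | |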
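One way to fill in the Prompt Rate and Eval. Rate columns is to capture them from the runner's timing output instead of copying them by hand. The sketch below rests on two assumptions: that the models are run with Ollama (the tags in the table follow its naming), and that `ollama run --verbose` prints lines such as `prompt eval rate:` and `eval rate:` in tokens/s. The function names and the prompt are hypothetical.

```shell
# Extract "prompt_rate,eval_rate" (tokens/s) from verbose timing text on stdin.
# Assumes lines shaped like "prompt eval rate:     123.45 tokens/s".
parse_rates() {
    awk -F': *' '/^[ \t]*prompt eval rate:/ { p = $2 + 0 }
                 /^[ \t]*eval rate:/        { e = $2 + 0 }
                 END                        { printf "%s,%s\n", p, e }'
}

# Run one model on one prompt and print a CSV row: model,prompt_rate,eval_rate.
# The timing stats may be written to stderr, so both streams are merged
# before parsing (assumption about Ollama's behavior).
bench_model() {
    model="$1"; prompt="$2"
    rates=$(ollama run --verbose "$model" "$prompt" 2>&1 | parse_rates)
    printf '%s,%s\n' "$model" "$rates"
}
```

For example, `bench_model gemma3:1b "Why is the sky blue?"` would print one row that can be pasted into the spreadsheet. Using the same prompt for every model keeps the rows comparable.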