
[User] GPU Memory problem on Apple M2 Max 64GB #1870

Closed
2 tasks done
CyborgArmy83 opened this issue Jun 15, 2023 · 22 comments

@CyborgArmy83

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [ ] I carefully followed the README.md.
  • [ ] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I cannot seem to load models larger than 30 billion parameters on my system; it gives some kind of memory error, even though I should have plenty of RAM. I have tried the new K-quants as well as the older quantization formats. They only seem to load on the CPU (which is extremely slow and almost unbearable to use for inference).

Current Behavior

When loading models, I can see with a monitoring utility that they will not load past roughly 27-29 GB of memory on my system.

Environment and Context

Apple M2 Max, 64GB RAM, macOS 13.4 (latest)

If this issue is related to something that is already known, I apologize. There seem to be so many different issues that I am no longer entirely sure which is related to which. I am just a web developer, so C++ is quite new to me.

@Schaltfehler
Contributor

Is the model supposed to work with llama.cpp?
How do you load the model? Do you use one of the example scripts?
Not sure if it helps in your case, but you could try loading the model with --mlock or --no-mmap.
Did you build with Metal support (LLAMA_METAL=1 make)? If so, you can only load q4_0 and q4_1 GGML models.
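
For reference, a minimal sketch of that workflow, assuming a q4_0 GGML file at models/65B/ggml-model-q4_0.bin (the path and prompt are just placeholders):

LLAMA_METAL=1 make                                                      # build with Metal support
./main -m models/65B/ggml-model-q4_0.bin -ngl 1 --mlock -p "Hello"      # -ngl 1 enables the Metal path; --mlock pins the weights in RAM
./main -m models/65B/ggml-model-q4_0.bin -ngl 1 --no-mmap -p "Hello"    # or skip mmap entirely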

@shouyiwang

This is a known Metal issue: it cannot allocate enough memory for the GPU. The current limit is approximately half of the available physical memory.

@CyborgArmy83
Author

> This is a known Metal issue: it cannot allocate enough memory for the GPU. The current limit is approximately half of the available physical memory.

Oh no, that really sucks. I've just invested almost 5K in a 64GB Max model because everyone pointed me in that direction for local LLMs. Now I can't use the extra unified memory for the GPU? Is there anything we can do to fix this? Please tell me there is a solution.

@shouyiwang

shouyiwang commented Jun 16, 2023

@CyborgArmy83
A fix may be possible in the future.
Actually, CPU inference is not significantly slower, and your M2 Max is not a waste of money. The memory bandwidth available to the M2 Max CPU is still much higher than on any PC, and that is crucial for LLM inference.

If you are eager to experiment with 65B on your GPU, 65B Q3_K_S is the recommended choice. However, it is computationally demanding and generally not optimal due to excessive compression. Nevertheless, it still outperforms any 33B model.

The 65B Q4_K_S model is one of the cheapest to compute and has significantly better quality than Q3_K_S. However, due to its larger size, it can only be run on the CPU, as it exceeds half of your RAM's capacity. It is worth comparing the speed and quality of the two options and choosing whichever you prefer.
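
If you need to produce those quantizations yourself, a rough sketch with llama.cpp's quantize tool (the f16 input path is just an example):

./quantize models/65B/ggml-model-f16.bin models/65B/ggml-model-q3_k_s.bin Q3_K_S
./quantize models/65B/ggml-model-f16.bin models/65B/ggml-model-q4_k_s.bin Q4_K_S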

@CyborgArmy83
Author

> A fix may be possible in the future. Actually, CPU inference is not significantly slower, and your M2 Max is not a waste of money. The memory bandwidth available to the M2 Max CPU is still much higher than on any PC, and that is crucial for LLM inference.
>
> If you are eager to experiment with 65B on your GPU, 65B Q3_K_S is the recommended choice. However, it is computationally demanding and generally not optimal due to excessive compression. Nevertheless, it still outperforms any 33B model.
>
> The 65B Q4_K_S model is one of the cheapest to compute and has significantly better quality than Q3_K_S. However, due to its larger size, it can only be run on the CPU, as it exceeds half of your RAM's capacity. It is worth comparing the speed and quality of the two options and choosing whichever you prefer.

Thanks! I did find the Metal performance difference night and day on the 7B, 13B, and some 30B models I could run!

So I am not sure why you say it's not that much of a difference? It could be twice as fast, right?

And how can you even know the true speed difference at this moment since it's not working for anyone? Not trying to be disrespectful here, just trying to understand your reasoning.

I will look into 65B Q4_K_S as you've suggested. Q3 seems too low quality for my needs.

I really hope we can fix this as a community of smart developers!

@shouyiwang

@CyborgArmy83
Yeah, for the M2 Max, the GPU (38-core) is almost 2 times faster.
But for the base M1/M2 and the M1/M2 Pro, GPU and CPU inference speeds are about the same.

Many people who own M2 Max 96GB and M1/M2 Ultra machines have reported 65B speeds when using the GPU. It is also relatively easy to estimate 65B speed by extrapolating from the performance of smaller models.

@shouyiwang

@CyborgArmy83
Hey, could you give the latest code in the master branch a try and see if it solves your problem? While you're at it, could you also check the running prints and let me know what the value of recommendedMaxWorkingSetSize is for your 64GB RAM? Thanks!
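
For example (a sketch; use whichever model file you already have), the value appears in the Metal init log and can be filtered out like this:

./main -m models/65B/ggml-model-q4_0.bin -ngl 1 -p "test" 2>&1 | grep recommendedMaxWorkingSetSize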

@CyborgArmy83
Author

> @CyborgArmy83 Hey, could you give the latest code in the master branch a try and see if it solves your problem? While you're at it, could you also check the running prints and let me know what the value of recommendedMaxWorkingSetSize is for your 64GB RAM? Thanks!

I followed that issue closely, but thanks for bringing it to my attention! I will check it out, but reading the conclusion in #1826 (comment), it seems like there are still some big obstacles to overcome. That's why I am also advocating for documenting the limitations/remaining issues in a new issue or discussion topic, so others can join in with any new ideas.

@shouyiwang

@CyborgArmy83
https://developer.apple.com/videos/play/tech-talks/10580/?time=546
Based on the video, it appears that 64GB Macs have 48GB (75%) of memory usable by the GPU. This should solve your problem. We still have issues because the limit for 16GB and 32GB machines is significantly lower, at only 65%.
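
For reference, 75% of 64 GiB works out to exactly the figure reported in the logs further down this thread: 0.75 × 65536 MiB = 49152 MiB (48 GiB). At 65%, a 32GB machine gets roughly 20.8 GiB and a 16GB machine roughly 10.4 GiB.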

@CyborgArmy83
Author

> @CyborgArmy83 https://developer.apple.com/videos/play/tech-talks/10580/?time=546 Based on the video, it appears that 64GB Macs have 48GB (75%) of memory usable by the GPU. This should solve your problem. We still have issues because the limit for 16GB and 32GB machines is significantly lower, at only 65%.

Do you mean that the new software changes from today should be better for 64GB machines?

I am left wondering if there is a way around these limitations. Maybe we can contact the person from the video; he apparently works on Apple's GPU engineering team. It might not be possible to reach him directly, but I am willing to give it a try.

@shouyiwang

@CyborgArmy83
Can you please try the latest software and tell me the output?
That helps a lot.
Thank you so much!

@CyborgArmy83
Author

CyborgArmy83 commented Jun 26, 2023

> @CyborgArmy83 Can you please try the latest software and tell me the output? That helps a lot. Thank you so much!

The new changes have resulted in some models now loading, but the limit is still rather annoying. At around 47GB or higher I get problems loading models on my 64GB M2 Max. I also had an M1 Max 64GB to test, and the behavior is exactly the same.

It seems like there is no current discussion on the matter, so I have reached out to Apple's engineering team. I hope some other people will also investigate this problem further! There might be a way to use more of our unified memory, which would be very welcome. Running 65B models on the CPU on Apple Silicon is a lot slower when you have a Max chip with double the amount of GPU power.

I've also read that the current Metal implementation for local LLM inference is capped at around 250 GB/s of effective memory bandwidth, while the memory can go up to 400 GB/s on the Max and 800 GB/s on the Ultra. A recent study showed that memory bandwidth might be more important than raw processing power. Hopefully it will get faster on current Apple Silicon in the (near) future. Anyone else care to share their experience?

@RafaAguilar

RafaAguilar commented Aug 6, 2023

Just to add to the list of broken ones: the 70B Stable Beluga (Q6_K), where the recommended allocation size is reported as around 49GB but the model required more (~54GB):

llama_model_load_internal: mem required  = 54749.41 MB
...
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00

Also breaks with the status 5 issue:

ggml_metal_graph_compute: command buffer 6 failed with status 5

Running on an M1 Max with 64GB RAM.

@RafaAguilar

RafaAguilar commented Aug 6, 2023

Btw, using @mbosc's fork, I could run the 70B with Q4_K_M quantization from TheBloke's repo on the M1 Max GPU; this time the mem required was ~40GB.

The speedup is very noticeable: over 5x.

CPU Stats

llama_print_timings:      sample time =  1043.88 ms /   241 runs   (    4.33 ms per token,   230.87 tokens per second)
llama_print_timings: prompt eval time = 49382.55 ms /    86 tokens (  574.22 ms per token,     1.74 tokens per second)
llama_print_timings:        eval time = 274492.32 ms /   240 runs   ( 1143.72 ms per token,     0.87 tokens per second)

GPU Stats

llama_print_timings:      sample time =   830.17 ms /   186 runs   (    4.46 ms per token,   224.05 tokens per second)
llama_print_timings: prompt eval time = 50223.16 ms /    86 tokens (  583.99 ms per token,     1.71 tokens per second)
llama_print_timings:        eval time = 37480.13 ms /   185 runs   (  202.60 ms per token,     4.94 tokens per second)
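
For reference, the eval timings above work out to 1143.72 ms ÷ 202.60 ms ≈ 5.6× faster per token on the GPU, consistent with the "over 5x" figure.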

@RDearnaley

What would be nice is if reducing --gpulayers fixed the problem, so that only the layers Metal was actually running counted towards the recommendedMaxWorkingSetSize. I have an M2 MacBook Pro with 64GB and recommendedMaxWorkingSetSize = 49152.00 MB, so I can run Llama 2 70B 6-bit quantized on the CPU (rather slowly), but not on the GPU. I guess I'll need to re-download one of the 5-bit quantized versions, rather than just shifting a few layers to CPU threads.

Also, it would be nice if hitting this situation gave a more comprehensible error message than:

GGML_ASSERT: ggml-metal.m:1221: false

without the end user having to first fiddle with the code to turn GGML Metal debug logging back on and then decipher the meaning of the rather opaque:

ggml_metal_graph_compute: command buffer 4 failed with status 5

Perhaps something like: "Current allocated size is greater than the recommended max working set size: Metal ran out of memory running your model on the GPU (try a smaller context length or a smaller model)".
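
For context, this is the kind of partial offload being described, as a sketch only (the filename and layer count are illustrative); as noted above, it does not currently help, because reducing the layer count does not shrink the Metal allocation:

./main -m llama-2-70b.Q6_K.gguf -ngl 40 -c 2048 -p "Hello"    # offload only 40 of the ~80 layers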

@gsgxnet

gsgxnet commented Oct 7, 2023

I wonder, is this an OS-set limit? Does Sonoma (macOS 14) change anything?

@CyborgArmy83
Author

CyborgArmy83 commented Oct 8, 2023

> I wonder, is this an OS-set limit? Does Sonoma (macOS 14) change anything?

It is enforced by the OS. There have been somewhat successful attempts to change the memory allocation with a custom kernel/module, but they come at a price: significantly reduced security.

Sonoma hasn’t improved or degraded the current situation.

See: #2182

@gsgxnet

gsgxnet commented Oct 8, 2023

> It is enforced by the OS. There have been somewhat successful attempts to change the memory allocation with a custom kernel/module, but they come at a price: significantly reduced security.
>
> Sonoma hasn’t improved or degraded the current situation.
>
> See: #2182

So we have to ask Apple to change that. I think it is just a parameter some developer set so that GPU-using apps do not eat up all main memory. On a 96 GB Mac it means 24 GB are not available for data to be processed on the Metal cores.
I very much doubt there is any technical reason for such a setup. I assume a formula like 2 or 4 GB plus 10% of main memory would be enough to reserve as non-Metal-usable main memory.
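Concretely, that formula on a 96 GB machine would reserve roughly 4 GB + 9.6 GB ≈ 14 GB instead of the current 24 GB, leaving about 82 GB usable by Metal.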

What might be the best way to influence Apple?

@GabrielThompson

GabrielThompson commented Nov 3, 2023

> What would be nice is if reducing --gpulayers fixed the problem, so that only the layers Metal was actually running counted towards the recommendedMaxWorkingSetSize. I have an M2 MacBook Pro with 64GB and recommendedMaxWorkingSetSize = 49152.00 MB, so I can run Llama 2 70B 6-bit quantized on the CPU (rather slowly), but not on the GPU. I guess I'll need to re-download one of the 5-bit quantized versions, rather than just shifting a few layers to CPU threads.
>
> Also, it would be nice if hitting this situation gave a more comprehensible error message than:
>
> GGML_ASSERT: ggml-metal.m:1221: false
>
> without the end user having to first fiddle with the code to turn GGML Metal debug logging back on and then decipher the meaning of the rather opaque:
>
> ggml_metal_graph_compute: command buffer 4 failed with status 5
>
> Perhaps something like: "Current allocated size is greater than the recommended max working set size: Metal ran out of memory running your model on the GPU (try a smaller context length or a smaller model)".

Does this imply that if a model exceeds the GPU's memory limit, it cannot offload the excess to CPU-managed memory for execution? For instance, for a 70B 6-bit Llama 2 requiring about 52.5GB of RAM, is it possible that 48GB is used for GPU processing with Metal acceleration, leaving the remaining 4.5GB for CPU execution?
I'm also curious about the inference speed of the 70B 6-bit Llama 2 model when run solely on the CPU of your M2 Max 64GB MBP.

@timothyallan

^ I have that model, and the 6-bit OOMs. I can only run the 5-bit at the default context size; increasing either causes the status 5 error. Q4_K_M is the go-to if you want to expand the context a bit, I've found.

@zestysoft

zestysoft commented Nov 25, 2023

So just to be clear, on an M1 Max with 64GB of RAM, even though the working set size is larger than the mem required, we should still expect it to fail?

export HOST=0.0.0.0 && python -m llama_cpp.server --model wizardcoder-python-34b-v1.0.Q4_K_M.gguf --n_gpu_layers 1 --n_ctx 1024 2>&1 |egrep "(mem required|recommended|failed to allocate)"
llm_load_tensors: mem required  = 19282.65 MiB
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MiB
ggml_metal_add_buffer: error: failed to allocate 'data            ' buffer, size = 19283.21 MiB

Edit: never mind; apparently llama_cpp.server doesn't handle this as well as using ./main, for some reason.
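
For comparison, the direct llama.cpp invocation that behaved better was along these lines (a sketch; the prompt is illustrative and the flags mirror the server arguments above):

./main -m wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 -c 1024 -p "Write a hello world in Python"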

github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
