
MadDoc

macrumors 6502
Original poster
Apr 25, 2005
331
5
UK
This question comes up a lot but I haven’t found a good answer for my use case and am looking for advice.

I want to run GPT OSS 120b locally. This is to privately process medical reports.

I can’t decide whether to get the M4 Max with 128GB RAM or stretch myself and get the M3 Ultra with 256GB RAM.

I’m not sure whether the tokens/sec difference between the binned M3 Ultra and the 80-core one is significant, and I’m not sure what kind of tokens/sec the Max would get at long contexts. My understanding is that I’d need 256GB rather than the base 96GB for OSS 120b, or else memory will be an issue.

The computer will only be used for AI inference, nothing else.
 
I can’t decide whether to get the M4 Max with 128GB RAM or stretch myself and get the M3 Ultra with 256GB RAM.
I think a 120B LLM is probably too large for 128GB. At 8-bit quantization the weights alone are around 120 GB, and macOS has its own overhead, so it’s a very tight fit. The M4 Max also has fewer GPU and Neural Engine cores than the M3 Ultra.

I’d say the M3 Ultra with 256GB is the better option over the M4.
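As a back-of-envelope check on the "tight fit" claim above, here's a minimal sketch. The flat 8 GB overhead figure for macOS, KV cache, and activations is an assumption, not a measured number:

```python
def model_memory_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 8.0) -> float:
    """Rough memory estimate: billions of parameters times bits per weight,
    converted to GB, plus a flat (assumed) allowance for KV cache,
    activations, and macOS overhead."""
    weights_gb = params_b * bits_per_weight / 8  # 1e9 params * bits / 8 bits-per-byte ~= GB
    return weights_gb + overhead_gb

# 120B parameters at 8-bit: ~120 GB of weights alone, over 128 GB with overhead
print(model_memory_gb(120, 8))  # 128.0
# The same model at 4-bit quantization fits comfortably in 128 GB
print(model_memory_gb(120, 4))  # 68.0
```

So the 8-bit version genuinely does not leave headroom on a 128GB machine, while an aggressive 4-bit quant would; the 256GB Ultra removes the question entirely.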
 
That’s what I’ve gone for. A little pricey, but I think it’s my best bet to experiment with the higher-intelligence models at home.
 

Tokens/sec is really personal. Some people need to sit there and read the output and are dissatisfied with 40 t/s. I can personally prompt it, do something else, and check back a few minutes later, so 18 t/s is acceptable to me.

As the Ultra is so much more expensive, could you prep all your models and inputs, get the M4 Max 128GB, and run it for a few days to see how it goes?
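Before buying either machine, you can also put a rough ceiling on tokens/sec from memory bandwidth alone, since decode is usually bandwidth-bound. The bandwidth figures (M4 Max ~546 GB/s, M3 Ultra ~819 GB/s) and the ~5B active parameters at ~4-bit for GPT-OSS-120B (it's a mixture-of-experts model) are assumptions here, and real throughput will land well below the ceiling:

```python
def decode_tps_upper_bound(active_params_b: float, bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Theoretical decode ceiling: each generated token re-reads the
    active weights, so t/s <= memory bandwidth / bytes read per token."""
    gb_per_token = active_params_b * bits_per_weight / 8  # active weights in GB
    return bandwidth_gb_s / gb_per_token

# Assumed: ~5B active params at ~4-bit, M4 Max ~546 GB/s, M3 Ultra ~819 GB/s
print(round(decode_tps_upper_bound(5, 4, 546)))  # ceiling on M4 Max
print(round(decode_tps_upper_bound(5, 4, 819)))  # ceiling on M3 Ultra
```

Both ceilings are far above the 18–40 t/s people actually care about, which is why the MoE design makes this model unusually pleasant on Apple Silicon; the real question is fitting the weights in RAM, not raw speed.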
 