Status symbols could also be called symbols of inequality

robber@lemmy.ml · 8 hours ago

Well, compared to the Strix, 400GB/s is not that bad. I think with fast system RAM and expert offloading you could squeeze quite a bit out of it when running models in the 100b-a10b range.

Your bigger problem is going to be future software support.

robber@lemmy.ml · 8 hours ago

In case you missed the Ornith 1.0 release (Qwen and Gemma RL finetunes for agentic / coding workloads), they look interesting to bridge the gap until we see larger 3.6 models or a 3.7 release. I didn’t test them yet but according to benchmarks, the 35b MoE seems to be more or less on par with Qwen3.6 27b dense, while ofc a lot faster.

robber@lemmy.ml · 9 hours ago

You can control how much context should be fitted with --fit-ctx and how much space the algorithm should leave unallocated (even on a per-GPU basis) with --fit-target.

robber@lemmy.ml · 1 day ago

I currently run Qwen3.6-27b on llama.cpp and use it via openwebui. Mostly, I use it for web research via tavily, to a lesser extent for coding and interactively learning about things that are new to me but common in training data (such as basic math or ML concepts).

robber@lemmy.ml · 1 day ago

Given the 27b is a dense model, I think the numbers are quite ok. Curious about the quant tho.

The cool thing about the Strix is its large unified memory, but it lacks memory bandwidth for compute-intensive workloads. Something like Qwen3.5-122b MoE with only like 12b active parameters might run at twice the speed if it fits the configuration.
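To put rough numbers on that "twice the speed" guess, here is a back-of-the-envelope sketch. It assumes generation is purely memory-bandwidth bound (every active weight is read once per token), a bandwidth of 256 GB/s and a roughly 4-bit quant at 0.5 bytes per parameter. Those figures and the function name are my assumptions for illustration, not measurements.

```python
def tokens_per_second(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Rough upper bound on generation speed if each token reads all active weights once."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 256.0   # assumption: theoretical peak, real throughput is lower
BYTES_PER_PARAM = 0.5    # assumption: ~4-bit quantization

dense_27b = tokens_per_second(BANDWIDTH_GB_S, 27.0, BYTES_PER_PARAM)
moe_12b_active = tokens_per_second(BANDWIDTH_GB_S, 12.0, BYTES_PER_PARAM)

print(f"27b dense:        ~{dense_27b:.1f} tok/s upper bound")
print(f"MoE, 12b active:  ~{moe_12b_active:.1f} tok/s upper bound")
print(f"speedup:          ~{moe_12b_active / dense_27b:.2f}x")
```

The speedup only depends on the ratio of active parameters (27 / 12), so it holds regardless of the bandwidth and quant you plug in. Real-world numbers will be lower because of KV cache reads, routing overhead and so on.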

robber@lemmy.ml · 1 day ago

Since implementation of the --fit parameter and its relatives, and --fit on becoming the default, llama.cpp intelligently decides what to offload. For me, it made --n-cpu-moe obsolete.

robber@lemmy.ml · 21 days ago

Status symbols could also be called symbols of inequality

robber@lemmy.ml · 28 days ago

Your biggest issue with 2010 cards will be software (inference engine) support, I assume.

robber@lemmy.ml · edited 28 days ago

To add some practical advice:

It depends on what you mean by more advanced models. I run Qwen3.6-27b on 48GB VRAM across 3 cards (RTX 2000e Ada), and with the recent software optimizations merged into llama.cpp (tensor parallelism & MTP) I get around 30 tokens per second in generation. I use the model through openwebui for (agentic) web research and simple Q&A mostly and I’m quite happy with what it can do.

If you want something similar, maybe look at one or two second-hand V100 PCIe 32GB cards. Or something from the Intel Arc Pro series, if you don’t mind the software support lagging behind a bit (as in less optimized).

Also, it might be worth reading up on the difference between dense and MoE models, if you’re new to that. For MoE models, if your system RAM is fast enough, it’s often viable to offload the “experts” (the largest parts of such models) to RAM, reducing VRAM capacity needs. Note that server motherboards with e.g. octa-channel RAM have a huge advantage over consumer boards (making DDR4 interesting despite the slower speed per module).
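A quick sketch of why channel count matters more than per-module speed. It uses the usual theoretical peak formula (channels × 8 bytes per transfer × transfer rate), and the two configurations (octa-channel DDR4-3200 and dual-channel DDR5-6000) are example values I picked, not anything specific to your hardware.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth in GB/s, assuming a 64-bit (8-byte) bus per channel."""
    bytes_per_transfer = 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


# Assumed example configurations
server_ddr4 = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)
consumer_ddr5 = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=6000)

print(f"8-channel DDR4-3200: {server_ddr4:.1f} GB/s")
print(f"2-channel DDR5-6000: {consumer_ddr5:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth is lower in practice, but the ratio between the two setups stays roughly the same.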

And to address your last question: while I have no direct experience, I’ve seen posts online about people connecting Strix Halo or DGX Spark devices, usually via a 10+ Gbit/s switch, as the interconnect is crucial (unless you just want to load balance).

Self-hosting LLMs is a very fun thing to do, but also a time- and money-consuming rabbit hole. You might wanna check out the LocalLlama community over at sh.itjust.works.

Edit: typos

robber@lemmy.ml · 3 months ago

Global sustainability rules???

robber@lemmy.ml · 2 years ago

Migrated my self-hosted Nextcloud to AIO and I absolutely love it