WEKA's 'Augmented Memory Grid': Real Pedigree, Wrong Architecture
WEKA is one of the few storage vendors that earns its performance claims through independently audited benchmarks. That makes it more frustrating, not less, that their ICMS response wraps existing software in a new brand name and skips every technical question that actually matters for G3.5 KV cache storage.
WEKA published a blog post called “Demystifying the BlueField-4 Inference Context Memory Storage Announcement.” The title implies they’re cutting through complexity. What they’re actually doing is using NVIDIA’s announcement as a launch vehicle for a new product name — “Augmented Memory Grid” — without providing a latency number, a throughput figure, a benchmark condition, or any technical detail that would let an engineer evaluate whether it works for their inference workload.
This stings more coming from WEKA than it would from most vendors, because WEKA has earned a reputation for exactly that kind of rigor.
What WEKA Gets Right (Usually)
In January 2025, WEKA and HPE submitted SPECstorage Solution 2020 results claiming the #1 ranking across all five workloads: AI_IMAGE, EDA_BLENDED, GENOMICS, SWBUILD, and VDA. Unlike most benchmark announcements in this industry, those results are independently audited by SPEC and published with full hardware configuration disclosure on spec.org. We recognized this at the time as the standard the entire industry should meet but rarely does.
SPECstorage is the kind of transparency that costs something. Vendors who submit to it are committing to methodology review, configuration disclosure, and competitor comparison on equal terms. VAST Data won’t do it. Pure Storage does it selectively and supplements with unverified claims. WEKA did it across all five workloads simultaneously, which requires balanced architecture rather than a single optimized profile. The results are verifiable: anyone can pull the HPE Alletra Storage Server 4110 disclosure report from spec.org and examine the performance curves.
That record makes WEKA’s ICMS response genuinely disappointing. A vendor that publishes audited throughput curves for five enterprise workloads is capable of publishing benchmark data for KV cache tiering. They chose not to.
The Real Problem, Explained Precisely
Before examining what WEKA announced, it’s worth establishing what G3.5 storage actually is and why it’s architecturally distinct from everything that came before it.
During LLM inference, the transformer model builds a KV cache containing the key-value attention state for every processed token. For a long-context request — say, 128K tokens against a model like Llama 3.1 70B — the KV cache occupies approximately 40 GB per concurrent request. An H100 has 80 GB of HBM, a budget the KV cache shares with model weights and activations. That leaves room for fewer than two full long-context requests before HBM fills and the runtime must start evicting cache entries, forcing expensive recomputation when those entries are needed again.
The math on whether flash tiering helps is straightforward. Recomputing 128K tokens of KV cache at roughly 10,000 tokens per second of prefill throughput takes about 12.8 seconds. Reading 40 GB from a high-performance NVMe flash tier at 90 GB/s takes about 0.44 seconds. Flash wins by almost 30x for this workload — which is exactly why NVIDIA formalized the G3.5 tier concept and why vendors rushed to announce support for it.
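The arithmetic above can be checked in a few lines. The per-token KV size is reconstructed here from Llama 3.1 70B's published attention geometry (80 layers, 8 grouped-query KV heads, head dimension 128, fp16), which is an assumption of this sketch rather than anything in WEKA's or NVIDIA's announcements:

```python
# Back-of-envelope check of the numbers in the text. The KV-cache
# geometry for Llama 3.1 70B (80 layers, 8 grouped-query KV heads,
# head dim 128, fp16) is taken from the model's published config and
# is an assumption here, not something either vendor states.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2
TOKENS = 128_000                      # 128K-token context

# Keys and values, for every layer and KV head, for every token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES  # 320 KiB
cache_gb = TOKENS * kv_bytes_per_token / 1e9  # ~41.9 GB, "approximately 40 GB"

recompute_s  = TOKENS / 10_000        # 10K tok/s prefill  -> 12.8 s
flash_read_s = 40 / 90                # 40 GB at 90 GB/s   -> ~0.44 s
speedup = recompute_s / flash_read_s  # ~29x

print(f"cache = {cache_gb:.1f} GB, recompute {recompute_s:.1f} s, "
      f"read {flash_read_s:.2f} s, flash wins by about {speedup:.0f}x")
```

The numbers land where the text says they do: roughly 40 GB of cache, 12.8 seconds to recompute, under half a second to read back from flash.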
But the math only holds when the cache is actually reused. For stateless one-off API calls — which describes a large fraction of enterprise inference workloads — cache entries are written once and never read again. Tiering to flash in this pattern adds write latency with no compensating benefit. The workload that benefits from KV cache tiering is specifically multi-turn conversation, repeated document queries, and agentic loops with persistent context state. Neither NVIDIA nor WEKA specifies the reuse-rate threshold at which tiering becomes net positive. That threshold is the central engineering question for anyone considering this infrastructure, and it goes unasked.
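A first-order way to frame that threshold: tiering pays off when the expected saving on reuses exceeds the one-time cost of writing the cache to flash. The write cost below is an assumed placeholder, since neither NVIDIA nor WEKA publishes one; the recompute and read times are the figures worked out above.

```python
# First-order break-even model for KV-cache tiering. recompute_s and
# read_s come from the text's arithmetic; write_s is an ASSUMED cost of
# persisting the cache to flash, chosen only for illustration.
recompute_s = 12.8   # cost paid on a miss (recompute from scratch)
read_s      = 0.44   # cost paid on a hit (restore from flash)
write_s     = 0.5    # one-time cost per cached context (assumption)

def net_benefit(reuse_rate: float) -> float:
    """Expected time saved per request by tiering, given the probability
    that a cached context is ever reused."""
    return reuse_rate * (recompute_s - read_s) - write_s

# Tiering is net positive once reuse savings cover the write cost:
breakeven = write_s / (recompute_s - read_s)
print(f"break-even reuse rate = {breakeven:.1%}")
```

Under these assumed costs the threshold sits around a 4% reuse rate, which is exactly the kind of number a vendor announcement should be publishing with measured inputs instead of leaving readers to guess.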
What G3.5 Actually Requires
The architecture that makes G3.5 work is fundamentally different from a general-purpose distributed file system. The interface NVIDIA designed through NIXL — the NVIDIA Inference Xfer Library — expects a storage backend that can participate in zero-copy RDMA transfers directly between GPU VRAM and NVMe storage, bypassing the CPU and host OS entirely for the data path.
The flow is specific: the inference framework identifies KV cache blocks for offload, NIXL calls the storage plugin to allocate space and return RDMA credentials (remote key, buffer address), and the GPU writes KV data directly to the registered buffer via RDMA WRITE. On a cache hit, the storage server reads from NVMe and RDMA WRITEs back to the GPU’s VRAM. No copies through host CPU memory. No POSIX syscalls. No NFS or S3 protocol overhead.
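The flow can be sketched as code, with the caveat that every name below (`KvStorePlugin`, `RdmaCredential`, the method signatures) is hypothetical shorthand for illustration; NIXL's actual plugin API lives in the ai-dynamo/nixl repository and differs in its details.

```python
# Schematic of the offload/hit flow, using a HYPOTHETICAL plugin
# interface. These names are illustrative only; they are not NIXL's
# real API.
from dataclasses import dataclass

@dataclass
class RdmaCredential:
    remote_key: int    # rkey the initiator presents on the RDMA WRITE
    buffer_addr: int   # registered buffer address on the storage server

class KvStorePlugin:
    """Toy in-memory stand-in for a storage backend."""
    def __init__(self):
        self._index = {}   # opaque cache key -> (credential, payload)

    def allocate(self, key: str, size: int) -> RdmaCredential:
        cred = RdmaCredential(remote_key=len(self._index), buffer_addr=0x1000)
        self._index[key] = (cred, None)
        return cred

    def commit(self, key: str, payload: bytes) -> None:
        # Stands in for the GPU's RDMA WRITE landing in the registered buffer.
        cred, _ = self._index[key]
        self._index[key] = (cred, payload)

    def lookup(self, key: str):
        # On a hit, a real backend reads NVMe and RDMA WRITEs to GPU VRAM.
        entry = self._index.get(key)
        return entry[1] if entry else None

# Offload: framework picks a block, plugin returns RDMA credentials,
# the "GPU" writes the KV payload directly into the registered buffer.
plugin = KvStorePlugin()
cred = plugin.allocate("req42:block0", size=128 * 1024)
plugin.commit("req42:block0", b"kv-block-bytes")
assert plugin.lookup("req42:block0") == b"kv-block-bytes"   # cache hit
assert plugin.lookup("req99:block0") is None                # cache miss
```

Note what is absent from the sketch on purpose: no file handle, no path, no POSIX open/read/write. The interface is a key, a size, and RDMA credentials.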
This is not what a distributed POSIX file system does. A filesystem like WEKA’s core product handles metadata operations, namespace management, POSIX semantics, replication, snapshots, quotas, ACLs, and multi-protocol access (NFS, S3, SMB). All of that is exactly wrong for KV cache storage. KV cache blocks have no filenames, no directories, no metadata beyond a key and a size, no durability requirements (they’re ephemeral — if lost, recompute), and no access pattern that resembles any POSIX workload.
What G3.5 needs instead is a purpose-built ephemeral block store: a KV index mapping opaque keys to RDMA-registered buffer locations, a block lifecycle that goes Allocated → Committed → Persisted → Evicted → Deleted, a batched TRIM pipeline to keep QLC flash healthy under high delete rates, and a NIXL plugin exposing the correct interface (libvendor_nixl.so) so the inference stack can call it without knowing anything about the underlying hardware. The latency target is sub-millisecond for the RDMA path. The throughput target is sufficient to saturate the GPU cluster’s context memory bandwidth — in practice, 45-90 GB/s per storage node using RDMA over 400GbE.
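The block lifecycle and TRIM batching can be sketched as a toy state machine. The five states come from the lifecycle named above; the transition table and the batch size are illustrative assumptions about how such a store might enforce them.

```python
# Sketch of the ephemeral block lifecycle. States are from the text;
# the transition rules and TRIM batch size are assumptions.
from enum import Enum, auto

class BlockState(Enum):
    ALLOCATED = auto()
    COMMITTED = auto()
    PERSISTED = auto()
    EVICTED   = auto()
    DELETED   = auto()

# Legal forward transitions:
# Allocated -> Committed -> Persisted -> Evicted -> Deleted.
NEXT = {
    BlockState.ALLOCATED: {BlockState.COMMITTED},
    BlockState.COMMITTED: {BlockState.PERSISTED},
    BlockState.PERSISTED: {BlockState.EVICTED},
    BlockState.EVICTED:   {BlockState.DELETED},
    BlockState.DELETED:   set(),
}

class EphemeralBlockStore:
    TRIM_BATCH = 64   # assumed; a real value would be tuned to the QLC device

    def __init__(self):
        self.state = {}
        self.trim_queue = []   # deleted blocks awaiting a batched TRIM

    def transition(self, key: str, new: BlockState) -> None:
        old = self.state.get(key)
        if old is None and new is BlockState.ALLOCATED:
            self.state[key] = new
            return
        if old is None or new not in NEXT[old]:
            raise ValueError(f"illegal transition {old} -> {new}")
        self.state[key] = new
        if new is BlockState.DELETED:
            self.trim_queue.append(key)
            if len(self.trim_queue) >= self.TRIM_BATCH:
                self.flush_trims()

    def flush_trims(self):
        # A real store would issue one ranged TRIM/deallocate command here,
        # amortizing the cost across the whole batch.
        self.trim_queue.clear()

store = EphemeralBlockStore()
for s in [BlockState.ALLOCATED, BlockState.COMMITTED,
          BlockState.PERSISTED, BlockState.EVICTED, BlockState.DELETED]:
    store.transition("blk0", s)
```

The batching matters because QLC flash degrades under high rates of small deallocations; collecting deletes into ranged TRIMs is the standard mitigation.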
What WEKA’s Announcement Actually Says
WEKA describes “Augmented Memory Grid” as extending KV cache beyond GPU HBM using “familiar interfaces and standard GPU servers,” providing “immediate relief from memory pressure with minimal disruption,” and operating across “x86, DPUs, GPUs, and whatever comes next.”
The post contains no latency numbers. No throughput figures. No NIXL plugin details. No RDMA architecture description. No benchmark conditions. No model specifications. No context window sizes. No discussion of whether the RDMA zero-copy path is implemented or whether the product relies on CPU-mediated transfers. No clarity on whether “Augmented Memory Grid” is a new software component or a new marketing name for WEKA’s existing distributed filesystem mounted on a GPU server.
“Available today” is the most specific claim in the announcement.
The phrase “familiar interfaces” is doing a lot of work here. G3.5 doesn’t use familiar interfaces — it uses NIXL, which is a purpose-built RDMA transfer library with a specific plugin API. If WEKA has implemented a NIXL plugin, that’s significant and they should say so. If they’re serving KV cache over NFS or their proprietary protocol, that’s a fundamentally different and slower path that sidesteps the GPU-direct data flow NVIDIA designed the tier around.
The Three-Layer Stack That Doesn’t Exist Yet
WEKA’s announcement, like every other ICMS partner announcement, sits on a dependency chain where each layer is incomplete. BlueField-4 hardware — the foundation of NVIDIA’s reference design — ships in H2 2026, at least six months from the CES announcement. The ICMS reference architecture exists as a specification that partners have committed to implement but haven’t shipped. WEKA’s Augmented Memory Grid is described without technical specifications and positioned as something new when it appears to be existing software with a new name.
NVIDIA’s own press release is explicit: “The storage infrastructure for the Vera Rubin pod is designed, built, and delivered by our storage partners based on the NVIDIA reference designs.” Twelve vendors — AIC, Cloudian, DDN, Dell, HPE, Hitachi, IBM, Nutanix, Pure, Supermicro, VAST, and WEKA — simultaneously announced “support” for architecture built on hardware that doesn’t exist. This is not a product ecosystem. It is a marketing ecosystem responding to an NVIDIA announcement.
The performance claims attached to this stack — “up to 5x higher tokens-per-second, 5x greater power efficiency” — come from NVIDIA and have no attached methodology. No model, no context length, no batch size, no baseline system, no benchmark framework. The claims are contingent on “proper orchestration through NVIDIA Dynamo, NIXL, and related frameworks” — software still maturing on hardware that hasn’t shipped. These are directional projections, not engineering specifications.
The Architecture Gap WEKA Should Address
WEKA has the engineering credibility to build real G3.5 storage. Their filesystem delivers genuinely high-throughput, low-latency I/O, and they have production deployments in GPU clusters. The question is whether “Augmented Memory Grid” is WEKA’s POSIX filesystem mounted on a GPU server — which will work but won’t achieve the latency or throughput of a NIXL-native zero-copy design — or whether they’ve built the ephemeral block store with RDMA-native NIXL integration that G3.5 actually requires.
These are not equivalent options. A POSIX filesystem path for KV cache offload means every read and write passes through the host OS, the VFS layer, and WEKA’s protocol stack before data reaches the NIC. A NIXL-native path with RDMA means data moves directly between GPU VRAM and NVMe. At 128KB block sizes — typical for large KV cache chunks — that difference is the gap between 10-20 µs RDMA latency and 100-200 µs OS-mediated latency. For an inference serving system where every decode step touches the KV cache, that latency difference accumulates across thousands of tokens.
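Using only the figures above, queue-depth-1 arithmetic makes the gap concrete: a single serialized stream of 128 KB blocks is latency-bound, so a roughly 10x latency difference becomes a roughly 10x throughput difference per stream. The midpoint latencies below are taken from the ranges in the text.

```python
# Latency-bound throughput of one serialized (queue depth 1) stream,
# using the text's figures: 128 KB blocks, ~15 us RDMA vs ~150 us
# OS-mediated (midpoints of the 10-20 us and 100-200 us ranges).
BLOCK = 128 * 1024            # bytes per KV cache chunk
RDMA_US, POSIX_US = 15, 150   # assumed midpoint latencies

def qd1_gbps(latency_us: float) -> float:
    """GB/s one stream can sustain when each block waits for the last."""
    return BLOCK / (latency_us * 1e-6) / 1e9

print(f"RDMA path:  {qd1_gbps(RDMA_US):.1f} GB/s per stream")
print(f"POSIX path: {qd1_gbps(POSIX_US):.2f} GB/s per stream")
```

Real systems pipeline at higher queue depths, so aggregate bandwidth recovers; but every serialization point in an inference serving loop pays the per-block latency, and those are the microseconds that accumulate across thousands of decode steps.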
WEKA knows this. Their SPECstorage results demonstrate they understand the distinction between theoretical bandwidth and measurable latency under load. The fact that their ICMS announcement omits these measurements entirely suggests either the NIXL-native implementation isn’t ready, or it is ready and they chose to publish a blog post instead of a benchmark. Either way, the omission is a choice — and it’s inconsistent with the technical standard WEKA set for themselves in their SPECstorage work.
The Honest Version
WEKA’s distributed filesystem has legitimate applicability to GPU workloads. Their SPECstorage results on the AI_IMAGE workload are publicly verifiable and represent one of the better data points available on real storage performance under AI training access patterns. They have the hardware relationships and customer base to be a genuine ICMS partner when BlueField-4 ships.
But G3.5 KV cache storage is not the same problem as AI training data access. Training reads large files sequentially and repeatedly; KV cache stores ephemeral blocks written once, read a small number of times, and deleted quickly. Training can tolerate POSIX semantics; KV cache offload benefits from a purpose-built interface that integrates directly with the inference stack’s memory management. Training workloads are measured in MB/s of sustained throughput; KV cache latency matters in microseconds.
If WEKA has built a NIXL plugin with RDMA zero-copy transfers and sub-millisecond latency for KV cache blocks, they should publish the benchmark. The SPECstorage precedent they set shows they know how. If the implementation relies on WEKA’s standard filesystem protocol path, they should say that clearly — it’s still useful for teams without RDMA hardware or BlueField-4 — but it’s not the product NVIDIA designed the G3.5 tier around.
“Augmented Memory Grid” describes neither option. It’s a name for a capability that WEKA hasn’t characterized, built on hardware that doesn’t exist yet, in support of a standard that twelve other vendors announced simultaneously. For a company with WEKA’s track record, that’s a step backward.
References
[1] WEKA, “Demystifying the BlueField-4 Inference Context Memory Storage Announcement.” https://www.weka.io/blog/ai-ml/demystifying-the-bluefield-4-inference-context-memory-storage-announcement/
[2] NVIDIA Developer Blog, “Introducing NVIDIA BlueField-4-Powered Inference Context Memory Storage Platform for the Next Frontier of AI,” January 6, 2026. https://developer.nvidia.com/blog/introducing-nvidia-bluefield-4-powered-inference-context-memory-storage-platform-for-the-next-frontier-of-ai/
[3] NVIDIA Newsroom, “NVIDIA BlueField-4 Powers New Class of AI-Native Storage Infrastructure for the Next Frontier of AI.” https://nvidianews.nvidia.com/news/nvidia-bluefield-4-powers-new-class-of-ai-native-storage-infrastructure-for-the-next-frontier-of-ai
[4] ServeTheHome, “NVIDIA BlueField-4 with 64 Arm Cores and 800G Networking Announced for 2026.” https://www.servethehome.com/nvidia-bluefield-4-with-64-arm-cores-and-800g-networking-announced-for-2026/
[5] Blocks and Files, “Nvidia pushes AI inference context out to NVMe SSDs,” January 6, 2026. https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gpu-cluster-kv-cache-offload-to-nvme-ssds/
[6] Blocks and Files, “Nvidia’s basic context memory extension infrastructure,” January 12, 2026. https://blocksandfiles.com/2026/01/12/nvidias-basic-context-memory-extension-infrastructure/4090541
[7] StorageMath, “Weka’s SPECstorage Records: How Benchmark Transparency Should Work.” /posts/weka-specstorage-benchmark-transparency/
[8] NVIDIA ai-dynamo, “NIXL — NVIDIA Inference Xfer Library.” https://github.com/ai-dynamo/nixl
[9] NVIDIA ai-dynamo, “Dynamo — NVIDIA Inference Framework.” https://github.com/ai-dynamo/dynamo