VAST Data's 'Classical HPC' Framing: When Marketing Rewrites Storage History
VAST Data positions legacy parallel file systems as obsolete for AI. The technical claims deserve scrutiny - metadata matters, but the 'classical vs. modern' narrative obscures more than it reveals.
VAST Data’s Jan Heichler argues that AI workloads require fundamentally different parallel file system designs than “classical HPC” systems [1]. The framing positions traditional parallel file systems - Lustre, GPFS (Spectrum Scale), BeeGFS - as architectural relics unsuited for modern AI. VAST’s disaggregated, shared-everything (DASE) architecture, naturally, represents the evolved alternative.
The technical claims contain genuine insights about metadata scalability. They also contain convenient omissions about trade-offs and a rewriting of storage history that benefits VAST’s competitive positioning. Let’s examine what’s accurate, what’s misleading, and what the article doesn’t mention.
The “Classical Striping” Strawman
Heichler states: “Classical striping parallel file systems were originally designed for traditional HPC workloads, where many processes write concurrently into the same large file” [1]. This positions striping as the defining characteristic of legacy systems and implies it’s the wrong approach for AI.
The framing is technically incomplete. Striping in parallel file systems serves multiple purposes: it aggregates bandwidth by spreading data across storage targets, balances load across servers, and provides parallelism for large I/O operations. These benefits apply to AI workloads as much as to traditional HPC. Large model checkpoints, training datasets, and intermediate tensors all benefit from striped access.
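To make the mechanics concrete, here is a minimal Python sketch of round-robin striping; the stripe size and count are hypothetical, not parameters from any system discussed in the article.

```python
# Minimal sketch of round-robin striping: which storage target (OST)
# serves which byte range of a file. Stripe size/count are illustrative.
STRIPE_SIZE = 1 << 20      # 1 MiB per stripe unit (hypothetical)
STRIPE_COUNT = 8           # file spread across 8 targets (hypothetical)

def target_for_offset(offset: int) -> int:
    """Return the index of the storage target holding this byte offset."""
    stripe_unit = offset // STRIPE_SIZE        # which 1 MiB unit
    return stripe_unit % STRIPE_COUNT          # round-robin across targets

# A 64 MiB sequential read touches all 8 targets, so bandwidth aggregates.
touched = {target_for_offset(off) for off in range(0, 64 << 20, STRIPE_SIZE)}
print(sorted(touched))  # [0, 1, 2, 3, 4, 5, 6, 7]
```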
Modern Lustre and GPFS implementations have evolved significantly from their original designs. Lustre’s Progressive File Layout (PFL) allows files to start with conservative layouts and expand striping as they grow [2]. GPFS supports file placement policies that can direct different workloads to appropriate storage pools [3]. Characterizing these systems by their original 1990s designs ignores two decades of development.
The actual limitation Heichler identifies - that legacy systems tied clients to specific controllers or storage targets - reflects older architectures. But Lustre’s Data-on-MDT (DoM) feature, GPFS’s native RAID, and various client-side caching improvements have addressed many of these constraints. The “classical” systems in the article are older versions of systems that have continued to evolve.
The Metadata Insight (and Its Limits)
Heichler’s strongest technical point: “the real distinction today is not striping itself, but the metadata design, which determines how much concurrency the system can sustain” [1]. This is correct. Metadata scalability is often the bottleneck for workloads with millions of small files or high create/delete rates.
Traditional Lustre deployments use a single Metadata Server (MDS), which becomes a bottleneck at scale; Lustre’s Distributed Namespace (DNE) feature spreads the namespace across multiple metadata targets, but many deployments still run a single MDS. GPFS distributes metadata but requires careful tuning. BeeGFS offers metadata distribution but with consistency trade-offs. VAST’s claim that their architecture handles metadata differently addresses a real problem.
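A back-of-the-envelope sketch (all figures hypothetical) of why metadata rates, not raw bandwidth, cap small-file training throughput:

```python
# Back-of-the-envelope: metadata operations per second implied by a
# small-file training workload. All inputs are hypothetical.
files_per_epoch = 14_000_000     # e.g., an ImageNet-scale dataset, augmented
opens_per_file = 1               # one open per sample read
stats_per_file = 1               # plus one stat per sample read
epoch_seconds = 30 * 60          # target: one epoch every 30 minutes

md_ops_per_sec = files_per_epoch * (opens_per_file + stats_per_file) / epoch_seconds
print(f"~{md_ops_per_sec:,.0f} metadata ops/s sustained")  # ~15,556 ops/s

# A metadata service that saturates below this rate caps epoch time
# regardless of how much data bandwidth the storage targets can deliver.
```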
However, the article doesn’t examine VAST’s metadata trade-offs. Scattering “data and metadata everywhere across everything” provides concurrency but creates consistency challenges. When metadata lives everywhere, maintaining coherent views during failures becomes complex. The DASE architecture must handle questions that single-MDS systems avoid: what happens when different nodes have different views of the namespace during a network partition? How are conflicting updates resolved?
VAST presumably has answers to these questions, but the article doesn’t explore them. The comparison presents legacy limitations without examining VAST’s corresponding trade-offs.
“Non-Isolated Failure Domains” as a Feature
The article lists “non-isolated failure domains” as a VAST advantage [1]. This deserves scrutiny because failure domain isolation is traditionally considered a reliability feature, not a limitation.
In traditional parallel file systems, a failed storage target affects only the data on that target. Other data remains accessible. This isolation limits blast radius - a disk failure doesn’t cascade into a cluster-wide event.
VAST’s architecture scatters data across all nodes, which means any node failure potentially affects access to any piece of data. The system handles this through redundancy and reconstruction, but the failure domain is inherently larger. Whether this is better depends on redundancy levels, reconstruction speed, and failure probability.
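A deliberately simplified sketch of that reliability math, assuming independent node failures, fixed erasure-code parity, and a hypothetical per-node outage probability during the rebuild window:

```python
# Idealized comparison of failure-domain width. We ask: during a rebuild
# window, what is the chance that MORE nodes fail than the erasure code
# can tolerate within one failure domain? All parameters are hypothetical.
from math import comb

def p_exceeds_parity(domain_width: int, parity: int, p_node_down: float) -> float:
    """P(more than `parity` of `domain_width` nodes are down at once)."""
    return sum(
        comb(domain_width, j) * p_node_down**j * (1 - p_node_down)**(domain_width - j)
        for j in range(parity + 1, domain_width + 1)
    )

p = 0.001          # chance a given node is down during the window (hypothetical)
parity = 2         # code tolerates 2 concurrent failures per stripe

# Isolated domain: stripes confined to a 10-node pool.
print(p_exceeds_parity(10, parity, p))    # ~1.2e-07
# Non-isolated domain: stripes may span all 100 nodes.
print(p_exceeds_parity(100, parity, p))   # ~1.6e-04
# Wider domains need more parity or faster rebuilds to reach the same risk.
```

The model ignores the counterargument in VAST’s favor: wide placement also parallelizes reconstruction, which shrinks the window during which a node counts as down. Quantifying both effects is exactly the analysis the article skips.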
Framing larger failure domains as an advantage inverts traditional reliability engineering. The article should explain why VAST’s approach provides equivalent or better availability despite the larger blast radius. Without that explanation, “non-isolated failure domains” sounds like marketing spin on an architectural trade-off.
AI Workload Characterization
Heichler describes AI workloads as generating “millions of small, high-fanout reads, extremely parallel random I/O, concurrent access from GPUs” [1]. This characterization is partially accurate and partially misleading.
AI training workloads are diverse. Large language model training typically involves sequential reads of tokenized text data - not small random I/O. Image classification training reads millions of small files (individual images), matching the description. Video training reads large sequential files. Inference workloads read model weights once and then process input data with patterns that vary by application.
The “millions of small reads” characterization fits certain AI workloads (ImageNet-style training) but not others (LLM training, video processing, large model checkpoint reads). Generalizing one workload pattern to “AI” creates a strawman where traditional parallel file systems appear unsuited for all AI work.
Lustre and GPFS handle large sequential I/O extremely well - that’s what they were optimized for. If a specific AI workload involves large sequential access patterns, traditional parallel file systems may outperform architectures optimized for small-file random I/O. The right answer depends on the specific workload, not a generalized “AI requires X” claim.
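The contrast is easy to see in code. The following Python sketch (paths, file sizes, and chunk sizes are illustrative) shows the two access patterns side by side:

```python
# Two AI read patterns that stress storage very differently.
# Paths and sizes are illustrative, not from the article.
import random

def imagenet_style_reads(image_paths: list[str], batch: int) -> bytes:
    """Millions of small, random whole-file reads: metadata- and IOPS-bound."""
    blob = b""
    for path in random.sample(image_paths, batch):
        with open(path, "rb") as f:        # open + read + close per ~100 KB image
            blob += f.read()
    return blob

def llm_shard_reads(shard_path: str, chunk: int = 64 << 20):
    """Large sequential reads of pre-tokenized shards: bandwidth-bound."""
    with open(shard_path, "rb") as f:      # one open, multi-GB streaming read
        while data := f.read(chunk):
            yield data

# The first pattern rewards metadata scalability; the second rewards the
# striped sequential throughput classical parallel file systems excel at.
```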
The Multi-Protocol Argument
VAST claims advantages for supporting “hybrid file + object protocols with shared namespaces” [1]. This addresses a real operational need - AI pipelines often involve data ingestion via S3 semantics and training access via POSIX semantics.
However, multi-protocol support isn’t unique to VAST. NetApp ONTAP supports NFS, SMB, and S3 on the same data. Dell PowerScale provides similar capabilities. Pure Storage FlashBlade offers unified file and object access. The feature is valuable but not differentiating.
The deeper question is whether unified namespaces across protocols introduce consistency complexity. S3 historically offered eventual consistency (AWS now provides strong read-after-write consistency), while POSIX expects strong consistency. Serving the same data through both protocols requires careful handling of concurrent modifications. How does VAST handle a POSIX write concurrent with an S3 PUT to the same object? The article doesn’t address protocol consistency, which is where multi-protocol implementations often have subtle issues.
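A sketch of the kind of probe an evaluator might run against any multi-protocol system, assuming a hypothetical POSIX mount and an S3 endpoint exporting the same namespace (boto3 for the S3 side; the mount path, bucket, and endpoint are placeholders):

```python
# Sketch of a cross-protocol consistency probe: write via POSIX, write the
# same object via S3, and see which version readers on each protocol observe.
# Endpoint, bucket, and mount path are hypothetical; requires boto3.
import boto3

MOUNT = "/mnt/sharedfs/dataset/sample.bin"     # POSIX view (hypothetical path)
BUCKET, KEY = "dataset", "sample.bin"          # S3 view of the same namespace

s3 = boto3.client("s3", endpoint_url="https://storage.example.com")  # hypothetical

def posix_write(payload: bytes) -> None:
    with open(MOUNT, "wb") as f:
        f.write(payload)
        f.flush()                              # POSIX write, no fsync yet

def s3_read() -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

posix_write(b"version-A")
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"version-B")  # near-concurrent S3 PUT
print(s3_read())   # Which write wins? Is the answer stable across readers?
# A real evaluation would race these from separate hosts and check both views.
```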
What’s Missing: Actual Performance Data
The article contains zero performance benchmarks. No IOPS numbers. No throughput comparisons. No latency measurements. No comparison between VAST and the “classical” systems being criticized.
If VAST’s architecture provides superior metadata performance, show the numbers. How many file creates per second? What’s the latency distribution for random reads at scale? How does throughput compare to a modern Lustre deployment on equivalent hardware?
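Even a single-client microbenchmark would be a start. The sketch below only illustrates the metric; serious comparisons would use a multi-client tool such as mdtest on equivalent hardware:

```python
# Minimal single-client sketch of a metadata microbenchmark: file creates
# per second in one directory. Real comparisons would run a tool like
# mdtest from many clients; this only illustrates the metric.
import os, time, tempfile

def creates_per_second(directory: str, n: int = 10_000) -> float:
    start = time.perf_counter()
    for i in range(n):
        # open + close creates an empty file: one metadata create operation
        with open(os.path.join(directory, f"f{i:06d}"), "w"):
            pass
    return n / (time.perf_counter() - start)

with tempfile.TemporaryDirectory() as d:   # point at the file system under test
    print(f"{creates_per_second(d):,.0f} creates/s")
```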
Without performance data, the article is architectural philosophy, not engineering evidence. VAST may well outperform traditional parallel file systems for specific workloads - but the article provides no data to evaluate that claim.
The Competitive Positioning
VAST Data is a storage vendor competing against established parallel file systems for AI infrastructure budgets. Their marketing naturally positions competitors as outdated and VAST as modern. This doesn’t mean their technical claims are wrong, but it means the framing serves business purposes.
The “classical HPC” framing is particularly strategic. By defining existing competitors as “classical” and “traditional,” VAST positions its architecture as the evolutionary next step. This framing discourages potential customers from evaluating modern Lustre or GPFS deployments on their actual current capabilities.
A more neutral framing would be: “Different parallel file system architectures have different trade-offs. Here’s how VAST’s trade-offs suit specific AI workload patterns, with benchmark data showing performance differences.” That’s harder to write and less compelling as marketing, but more useful for purchasing decisions.
The Journalism Gap
The Blocks and Files article presents VAST’s claims without examining trade-offs, requesting benchmark data, or including perspectives from Lustre, GPFS, or other parallel file system vendors [1]. It reads as vendor content republished rather than reported.
Organizations evaluating parallel file systems for AI need comparative information: performance benchmarks on representative workloads, failure behavior under realistic conditions, operational complexity, cost per performance unit. None of this appears in the article.
The coverage serves VAST’s marketing objectives while providing limited value for infrastructure decisions.
The Actual Question
For organizations choosing parallel file systems for AI workloads, the useful questions are:
What are your workload characteristics? Millions of small files with high metadata rates favor architectures optimized for metadata distribution. Large sequential access favors traditional striped architectures. Mixed workloads require evaluating trade-offs.
What’s your failure tolerance? Larger failure domains require more redundancy to maintain equivalent availability. Understand the reliability math for each architecture.
What protocols do you need? If you need seamless S3 and POSIX access to the same data, evaluate how each system handles cross-protocol consistency.
What’s the total cost? VAST’s architecture has specific hardware requirements. Compare TCO on equivalent performance, not just per-node costs (a back-of-the-envelope sketch of that comparison follows this list).
What’s your operational experience? Lustre and GPFS have decades of operational knowledge in the HPC community. VAST is newer with a smaller operational knowledge base. Consider support, community, and institutional knowledge.
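That TCO comparison is simple arithmetic once you insist on cost per unit of delivered performance. Every number below is a placeholder; the point is the shape of the calculation, not the values:

```python
# Back-of-the-envelope: cost per unit of delivered performance rather than
# per node. Every number here is a placeholder to show the arithmetic.
def cost_per_performance(capex_usd: float, opex_usd_per_year: float,
                         years: int, gbps: float, kiops: float):
    tco = capex_usd + opex_usd_per_year * years
    return tco / gbps, tco / kiops      # $ per GB/s, $ per thousand IOPS

# System A: fewer, denser nodes; System B: more, cheaper nodes (hypothetical).
print(cost_per_performance(3_000_000, 200_000, 5, gbps=400, kiops=2_000))
print(cost_per_performance(2_200_000, 350_000, 5, gbps=300, kiops=3_500))
# The cheaper-per-node system is not automatically cheaper per GB/s delivered
# on the workload you actually run; the ranking can flip between metrics.
```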
None of these questions are answered by “classical systems are obsolete, modern systems are better.” They require workload-specific benchmarks and trade-off analysis that vendor marketing doesn’t provide.
References
[1] Blocks and Files, “VAST Data says AI parallel file systems are not your classical HPC ones,” December 1, 2025. https://blocksandfiles.com/2025/12/01/vast-data-says-ai-parallel-file-systems-are-not-your-classical-hpc-ones/
[2] Lustre Documentation, “Progressive File Layouts,” https://doc.lustre.org/lustre_manual.xhtml#progressivefilelayouts
[3] IBM Spectrum Scale Documentation, “File Placement Optimizer,” https://www.ibm.com/docs/en/spectrum-scale
StorageMath applies equal scrutiny to all vendors. VAST Data builds capable storage systems. Their architectural claims about metadata scalability have merit. The “classical vs. modern” framing serves marketing purposes while obscuring trade-offs that matter for purchasing decisions. Show the benchmarks.