Scality's 'Pipelines Over Models' Argument: When Storage Vendors Discover AI
Scality's CTO argues that data pipelines matter more than AI models. Convenient positioning for an object storage vendor - and partially true, but the two-tier storage simplification obscures real infrastructure complexity.
Scality CTO Giorgio Regni argues that as AI foundation models become “broadly available and increasingly interchangeable,” competitive advantage shifts from models to data pipeline infrastructure [1]. The conclusion: organizations should focus on “how you collect, shape, govern, and deliver data to those models.” Conveniently, Scality sells the object storage that Regni positions as the foundation for these pipelines.
The argument contains genuine insight about data infrastructure importance. It also contains the predictable vendor positioning where the solution to AI challenges happens to be the product the vendor sells. Let’s separate the valid points from the self-serving framing.
The “Models Are Commoditized” Claim
Regni states: “As foundation models become broadly available and increasingly interchangeable, they stop being the differentiator” [1]. This claim is partially true and partially premature.
For many enterprise use cases, foundation models are indeed interchangeable. Summarizing documents, extracting entities, classifying text - these tasks work adequately across GPT-4, Claude, Llama, Gemini, and others. Switching between providers is straightforward when the application wraps model calls in an abstraction layer.
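As a rough illustration of what such an abstraction layer looks like, here is a minimal sketch. The adapter functions are stubs and every name in it is hypothetical rather than any particular product's API; the point is that swapping providers becomes a configuration change rather than an application rewrite.

```python
# Minimal sketch of a provider-agnostic completion wrapper. All names here
# are illustrative; real deployments often use a routing library or a thin
# in-house adapter layer with the same shape.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Completion:
    text: str
    provider: str

def call_openai(prompt: str) -> Completion:
    # Real code would call the OpenAI SDK here; stubbed for illustration.
    return Completion(text="(stub)", provider="openai")

def call_anthropic(prompt: str) -> Completion:
    # Real code would call the Anthropic SDK here; stubbed for illustration.
    return Completion(text="(stub)", provider="anthropic")

PROVIDERS: Dict[str, Callable[[str], Completion]] = {
    "openai": call_openai,
    "anthropic": call_anthropic,
}

def complete(prompt: str, provider: str = "openai") -> Completion:
    """Swapping providers is a one-string change, not an application rewrite."""
    return PROVIDERS[provider](prompt)
```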
However, the claim overstates commoditization. Frontier model capabilities differ significantly: Claude’s long context windows suit different use cases than GPT-4’s function calling, and Llama’s open weights allow self-hosted deployment and unrestricted fine-tuning that hosted proprietary APIs don’t. For applications pushing capability boundaries, model selection matters enormously.
More importantly, the “commoditization” framing serves infrastructure vendors. If models don’t matter, organizations should invest in infrastructure instead - infrastructure that companies like Scality sell. The framing redirects capital expenditure from model development and AI research toward storage and data management. This may be correct for some organizations, but it’s not a universal truth.
The Two-Tier Storage Architecture
Regni argues that the traditional five-tier storage pyramid is consolidating into two layers: fast local flash on GPU servers for active processing, and object storage for everything else [1]. “Flash gives you the bandwidth and latency to keep the GPUs busy. Object gives you the scale, durability, and metadata.”
This simplification has appeal but obscures important nuances.
The two-tier model assumes data moves cleanly between “active” and “everything else” states. Real AI pipelines have messier data lifecycle patterns. Training checkpoints need fast writes during training but become archival after training completes. Intermediate preprocessing results need temporary high-performance storage but aren’t permanent. Inference model weights need fast random access but are read-only after deployment.
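To make those lifecycle patterns concrete, here is a hedged sketch of how they map onto standard S3 lifecycle rules, which any S3-compatible store that implements the lifecycle API could accept. The bucket name, prefixes, retention periods, and endpoint are placeholders, and storage-class support varies by backend.

```python
# Illustrative lifecycle rules: checkpoints are written hot, then demoted to
# a colder storage class once training has moved on; intermediate
# preprocessing output simply expires. All values are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.internal")

s3.put_bucket_lifecycle_configuration(
    Bucket="training-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "demote-old-checkpoints",
                "Status": "Enabled",
                "Filter": {"Prefix": "checkpoints/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-intermediate-preprocessing",
                "Status": "Enabled",
                "Filter": {"Prefix": "intermediate/"},
                "Expiration": {"Days": 7},
            },
        ]
    },
)
```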
A strict two-tier model forces architectural choices: either keep intermediate data on expensive GPU-attached flash or accept the latency of fetching from object storage. Organizations often need a middle tier - fast shared storage that isn’t GPU-local but offers higher performance than object storage. This is exactly the tier served by parallel file systems such as Lustre and IBM Spectrum Scale (GPFS), and by high-performance scale-out platforms such as VAST.
Regni’s two-tier model conveniently eliminates the tier where Scality doesn’t compete. Object storage is Scality’s business. GPU-local flash isn’t a storage vendor play. The middle tier - fast shared parallel file systems - is competition. Simplifying to two tiers removes competitors from the architectural diagram.
The Hyperscaler Reference
Regni notes that cloud providers have already adopted this simplified stack approach, suggesting enterprises should follow suit [1]. This hyperscaler reference is strategic but requires context.
Hyperscalers operate at scales that change economic calculations. At exabyte scale, building custom object storage infrastructure like AWS S3 or Google Cloud Storage makes sense. At petabyte scale, enterprises face different trade-offs. The infrastructure that works for Google’s AI efforts may not be appropriate for a company with 100PB of data.
Furthermore, hyperscalers don’t actually run two-tier architectures internally. Google builds on Colossus, a distributed file system, with multiple storage tiers layered around it. Meta combines distributed file systems with object storage. AWS operates multiple storage tiers internally. The “hyperscalers use simple two-tier architectures” claim misrepresents how large-scale AI infrastructure actually works.
Enterprises following hyperscaler patterns should examine what hyperscalers actually do, not vendor simplifications of hyperscaler architectures.
RAG Integration Claims
The article mentions that Scality’s RING object storage “integrates with vector databases and LangChain frameworks to support Retrieval-Augmented Generation (RAG) workflows” [1]. This deserves unpacking.
RAG workflows have specific performance requirements. The retrieval step queries a vector database to find relevant documents, then those documents are fetched and passed to the language model. Vector database query performance dominates retrieval latency - this is where Pinecone, Weaviate, Milvus, and pgvector compete. Document fetch from storage is typically a smaller latency component because embedding search returns document IDs, and actual document fetch can be parallelized.
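A sketch of that retrieval path makes the latency split visible. The `vector_search` function below is a placeholder for whichever vector database is in use, and the bucket and endpoint are hypothetical; the object-store fetches for the returned keys run concurrently, which is why they rarely dominate end-to-end retrieval time.

```python
# Retrieval half of a RAG pipeline: ANN query first, then parallel document
# fetches from object storage for the returned keys.
from concurrent.futures import ThreadPoolExecutor
from typing import List

import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.internal")

def vector_search(query: str, k: int = 5) -> List[str]:
    """Placeholder: embed the query and run ANN search, returning object keys."""
    raise NotImplementedError  # Pinecone, Milvus, pgvector, etc. would go here.

def fetch_document(key: str) -> str:
    obj = s3.get_object(Bucket="rag-corpus", Key=key)
    return obj["Body"].read().decode("utf-8")

def retrieve(query: str, k: int = 5) -> List[str]:
    keys = vector_search(query, k)                   # latency dominated by the ANN query
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(fetch_document, keys))  # fetches overlap each other
```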
Scality’s “integration” with vector databases likely means providing S3-compatible storage that vector databases can use for document storage. This is standard object storage functionality, not differentiated RAG capability. Any S3-compatible storage can serve this role.
The LangChain integration similarly needs scrutiny. LangChain is a Python framework for building LLM applications. It has standard interfaces for document loaders and vector stores. “Integrating” with LangChain typically means providing an S3-compatible endpoint that LangChain’s existing S3 document loader can use. This isn’t specialized RAG optimization; it’s standard protocol compatibility.
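In code, that compatibility amounts to pointing LangChain’s existing community S3 loader at a non-AWS endpoint, as in the sketch below. The bucket, prefix, and endpoint URL are placeholders, and the exact class and argument names should be checked against the current langchain_community documentation rather than taken from here.

```python
# "LangChain integration" in practice: the stock community S3 loader aimed at
# an S3-compatible endpoint. Nothing here is RAG-specific optimization.
from langchain_community.document_loaders import S3DirectoryLoader

loader = S3DirectoryLoader(
    "rag-corpus",                                         # bucket (placeholder)
    prefix="docs/",
    endpoint_url="https://objectstore.example.internal",  # any S3-compatible store
)
documents = loader.load()
```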
If Scality has genuinely differentiated RAG capabilities - like co-located embedding computation, optimized document chunking, or specialized retrieval indexes - those would be worth discussing. The article doesn’t describe any such capabilities, suggesting the “RAG integration” is marketing positioning rather than technical differentiation.
Data Lifecycle as Differentiator
Regni’s argument that competitive advantage comes from data lifecycle management - “versioning, enriching, and maintaining governance” - has validity [1]. Organizations with better data practices will build better AI applications than organizations with the same models but messier data.
However, this insight isn’t new and doesn’t particularly favor object storage. Data governance, versioning, and enrichment are practices that span storage layers. Governance requires metadata management systems (data catalogs, lineage tracking). Versioning can happen at the application layer, storage layer, or both. Enrichment is a compute problem more than a storage problem.
Object storage provides some relevant primitives - S3 versioning, metadata headers, lifecycle policies. But these features exist across storage types. Presenting data lifecycle as an object storage differentiator conflates a general practice with a specific storage architecture.
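For reference, those primitives are exercised through the standard S3 API rather than anything vendor-specific; the sketch below uses placeholder bucket, key, and metadata values.

```python
# Generic S3 primitives that any compatible object store exposes. Values are
# placeholders for illustration.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.internal")

# Versioning at the storage layer: overwrites keep prior object versions.
s3.put_bucket_versioning(
    Bucket="training-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Governance-adjacent metadata travels as user-defined headers on the object.
s3.put_object(
    Bucket="training-data",
    Key="datasets/claims/2025-12.parquet",
    Body=b"...",
    Metadata={"source-system": "claims-db", "pii": "redacted", "lineage-run": "a1b2c3"},
)
```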
What’s Actually True
Stripping away the vendor positioning, several genuine insights remain:
Data quality and pipeline reliability do matter for AI outcomes. Models trained on garbage produce garbage. Organizations that can’t reliably move data from ingestion to training will struggle regardless of model quality.
Storage architecture decisions affect AI pipeline performance. GPU utilization depends on data feeding rate, and storage that can’t keep GPUs busy wastes expensive compute. This is real; a back-of-envelope illustration appears after these points.
Object storage provides useful primitives for large-scale data management. Scalability, durability, and metadata capabilities suit many AI data management needs. This is why object storage is widely used in AI infrastructure.
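The back-of-envelope illustration promised above: every number in this sketch is an assumed workload parameter, not a measurement. The point is only that required storage throughput scales with accelerator count and per-sample size, and falls out of simple arithmetic.

```python
# Back-of-envelope check on "storage must keep GPUs busy". All numbers are
# illustrative assumptions; substitute your own workload parameters.
gpus = 64                        # accelerators in the training job (assumed)
samples_per_gpu_per_sec = 2_000  # samples consumed per GPU per second (assumed)
bytes_per_sample = 150 * 1024    # average preprocessed sample size (assumed)

required_bytes_per_sec = gpus * samples_per_gpu_per_sec * bytes_per_sample
print(f"Sustained read throughput needed: {required_bytes_per_sec / 1e9:.1f} GB/s")
# With these assumptions the job needs roughly 20 GB/s sustained; a tier that
# delivers a fraction of that stalls the GPUs and wastes the compute spend.
```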
The issue isn’t that Regni’s claims are entirely wrong. It’s that they’re framed to position Scality’s product as the answer to AI infrastructure challenges, which overstates the role of object storage and understates the complexity of real AI infrastructure.
What’s Missing
The article contains no benchmark data, no customer case studies, no performance comparisons. How does Scality’s RING perform for AI data pipeline workloads? What throughput can it sustain? How does it compare to alternatives?
There’s no discussion of total cost. Object storage is typically cheaper per gigabyte than file storage or block storage, but cost comparisons require examining performance per dollar, not just capacity cost.
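One way to frame that comparison is cost per unit of sustained throughput rather than cost per gigabyte. The figures in the sketch below are hypothetical placeholders chosen only to show the shape of the calculation, not real prices or benchmark results for any product.

```python
# Performance-per-dollar framing. All prices and throughput figures are
# hypothetical placeholders, not quotes or measurements.
def monthly_cost_per_gbps(cost_per_tb_month: float, capacity_tb: float, sustained_gbps: float) -> float:
    """Monthly dollars per GB/s of sustained throughput for a deployment."""
    return (cost_per_tb_month * capacity_tb) / sustained_gbps

object_tier = monthly_cost_per_gbps(cost_per_tb_month=20.0, capacity_tb=1000, sustained_gbps=40)
file_tier   = monthly_cost_per_gbps(cost_per_tb_month=60.0, capacity_tb=1000, sustained_gbps=400)

print(f"object tier: ${object_tier:.0f}/GB/s/month, file tier: ${file_tier:.0f}/GB/s/month")
# Under these made-up numbers the "cheaper" object tier costs $500 per GB/s
# while the pricier file tier costs $150 per GB/s; which wins depends on
# whether the workload is capacity-bound or throughput-bound.
```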
There’s no acknowledgment of workloads where object storage isn’t ideal. Real-time inference with tight latency requirements, training workloads that need POSIX semantics, checkpoint writes that need transactional guarantees - these use cases may need more than object storage.
A complete analysis would address where object storage fits, where it doesn’t, and how it integrates with other storage tiers. The article presents object storage as the answer without examining the question.
The Core Issue
Storage vendors have discovered AI as a marketing opportunity. Every storage company now has an AI story: their product is essential for AI infrastructure, and their architecture is uniquely suited for AI workloads.
Scality’s argument follows this pattern. AI models are commoditized (so don’t invest there), data pipelines are the differentiator (which requires storage), and object storage is the foundation for data pipelines (which is what Scality sells). The syllogism is convenient.
Organizations building AI infrastructure should evaluate storage options based on workload requirements, performance benchmarks, and total cost - not vendor narratives about where competitive advantage lies. Scality may be the right choice for some AI data management needs. That determination requires analysis the article doesn’t provide.
“Pipelines matter more than models” is a useful corrective to model-obsessed AI discussions. It’s also a sales pitch from an infrastructure vendor. Both things are true simultaneously.
References
[1] Blocks and Files, “AI data pipelines matter more than models,” December 9, 2025. https://blocksandfiles.com/2025/12/09/ai-data-pipelines-matter-more-than-models/
[2] Scality, “RING Object Storage,” https://www.scality.com/products/ring/
[3] LangChain Documentation, “Document Loaders,” https://python.langchain.com/docs/modules/data_connection/document_loaders/
StorageMath applies equal scrutiny to all vendors. Scality builds capable object storage. The “pipelines over models” argument has merit as industry perspective. It also serves Scality’s business interests by redirecting infrastructure investment toward storage. Evaluate storage based on workload requirements and benchmarks, not vendor-supplied competitive framings.