Designing Data Centers for LLM Training

A practical checklist for choosing colo or private GPU racks for LLM training: power, cooling, telemetry, and networking.

If you are evaluating colocation or building private GPU racks for large-model training, the question is no longer “Can we get enough compute?” It is “Can the facility keep those GPUs fed, cooled, observed, and networked well enough to sustain training throughput without blowing up operating cost?” That is the practical lens behind this guide, and it is why modern data center design for AI now starts with immediate power, liquid cooling, rack-level telemetry, and network topology—not glossy promises about future capacity. In the same way that a strong community depends on reliable moderation and participation loops, a high-performing AI stack depends on reliable utility, thermal, and fabric decisions that do not collapse under scale.

This article is a developer- and DevOps-friendly checklist for choosing a colocation provider or designing a private deployment for GPU racks used in LLM training. We will compare immediate power options, explain where liquid cooling matters, unpack vendor contract details, and show how to estimate real throughput rather than just theoretical specs. If you are deciding between colocation selection and a private build, this guide is meant to help you avoid the most expensive mistakes before the first rack ever goes live.

1. The new AI infrastructure baseline

Why old data center assumptions fail for LLM training

Traditional enterprise data centers were optimized for relatively predictable server loads, moderate rack densities, and air-cooling designs that worked fine when a rack meant 5 to 15 kW. That model breaks down when a single AI rack can cross 80 kW or even 100 kW under heavy accelerator loads. Once you enter that range, power delivery, water loops, cable management, and maintenance access become part of performance engineering, not just facilities engineering. This is why many teams now treat capacity planning as a core MLOps discipline rather than a procurement afterthought.

The practical shift is simple: for LLM training, a “good enough” room can still be a bad training environment. Even if the GPUs are state-of-the-art, throttling from thermal saturation or network oversubscription will reduce token throughput and make training runs more expensive per useful step. It is similar to how a well-designed workflow can still fail if the data source is unreliable; you can see the same lesson in guides like Technical SEO Checklist for Product Documentation Sites, where infrastructure quality and discoverability are only useful when the underlying system is actually dependable. In AI infrastructure, hardware is only valuable when the facility can sustain it continuously.

Immediate power is the first gate, not the last

For developer-scale LLM training, “immediate power” means the ability to turn on meaningful compute now, not after a 9- to 18-month utility expansion. The best model training teams often work in bursts: a new dataset lands, a hyperparameter sweep is needed, or a prototype must be fine-tuned before a product launch. If your colo promise is “future megawatts,” you are effectively paying option value for capacity that may arrive after your research window closes. That is why providers with ready-now power frequently win even when their per-kW rate is slightly higher.

When you compare colocation selection options, look for evidence of energized capacity, breaker availability, and practical onboarding speed. Ask whether the facility has spare utility-fed capacity, whether the distribution architecture can support your target rack density, and whether your deployment will require an electrical construction project before the first GPU is installed. A provider that can quote a timeline is not enough; you need a provider that can show the path from contract signature to energized rack. That distinction matters more than almost any brochure claim.

LLM training changes the economics of every square foot

In conventional hosting, floor space is often the limiting factor. In AI, floor space becomes secondary to power and heat rejection. A modest number of racks can represent a multimillion-dollar training cluster, and the cost of downtime or performance throttling can dwarf real estate expenses. This is why modern AI facilities increasingly resemble industrial power systems with networking attached, rather than the other way around.

To keep this grounded, remember that the goal is not just to “host GPUs.” The goal is to maximize useful training output per dollar, per watt, and per minute of queue time. That framing forces teams to think in systems: power path, thermal path, networking path, observability path, and maintenance path. It is also why change control, runbooks, and evidence-based operations matter in the same way they do in risk-heavy services like those discussed in vendor diligence playbooks and trust-signal frameworks.

2. Power density: how much is enough?

Start with the rack, not the room

The fastest way to misjudge AI infrastructure is to ask what the building can hold before asking what each rack actually needs. For modern GPU training systems, you should model power at the rack level first, then work backward to upstream capacity. A compact training pod with dense accelerators, NVMe, switches, and redundant power conversion can easily approach 40 to 60 kW per rack, while frontier-style deployments may go much higher. If your provider cannot express capacity in terms of actual rack deliverables, they are not ready for serious LLM workloads.
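As a rough illustration of modeling from the rack upward, the sketch below sums component draws into a per-rack figure and then asks how many such racks fit in an energized feed. Everything in it is an assumption made for the example: the `RackSpec` name, the 10.2 kW node draw, the 5% conversion loss, and the 20% headroom are placeholders, not vendor figures.

```python
from dataclasses import dataclass
import math


@dataclass
class RackSpec:
    """Hypothetical rack bill of materials; all wattages are placeholders."""
    gpu_nodes: int
    node_peak_kw: float      # assumed peak draw per GPU node
    switch_kw: float         # assumed total for in-rack switches
    storage_kw: float        # assumed NVMe/storage draw
    conversion_loss: float   # simple fractional uplift for power conversion

    def peak_kw(self) -> float:
        it_load = self.gpu_nodes * self.node_peak_kw + self.switch_kw + self.storage_kw
        return it_load * (1.0 + self.conversion_loss)


def racks_supported(energized_kw: float, rack: RackSpec, headroom: float = 0.2) -> int:
    """Number of racks that fit in energized capacity while reserving headroom."""
    usable_kw = energized_kw * (1.0 - headroom)
    return math.floor(usable_kw / rack.peak_kw())


rack = RackSpec(gpu_nodes=4, node_peak_kw=10.2, switch_kw=1.5,
                storage_kw=1.0, conversion_loss=0.05)
print(f"Peak per rack: {rack.peak_kw():.1f} kW")
print(f"Racks in 500 kW energized: {racks_supported(500.0, rack)}")
```

Replace the placeholder wattages with measured or vendor-supplied power curves before using a calculation like this in a real capacity conversation.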

Here is the question every DevOps team should ask: can the site support your planned density and sustain it during hot weather, maintenance events, and simultaneous peak loads? If not, the advertised number is just a nameplate rating. This is where GPU/cloud contract terms become operational, because the cost of stranded power or forced derating quickly changes total cost of ownership. In practice, you want an architecture that leaves a sensible margin without making you pay for capacity you cannot use.

Checklist for immediate power evaluation

Use this checklist when speaking with colocation sales and engineering teams. Ask whether the site has utility-backed capacity, what percentage is currently energized, and how much is immediately assignable to new tenants. Confirm whether your racks will require new switchgear, new busways, or long lead-time transformer work. Validate that the site’s redundancy model still works at your target density, because some facilities can support high power only in a limited subset of rooms or only at reduced redundancy levels.

Also ask about peak and sustained delivery, not just average contract power. LLM training is not a light office load, and bursty behavior from job scheduling can create local hotspots and transient demand. That is why teams evaluating capacity planning should model worst-case power draw at the cluster level, not just the typical load. If you are not sure how to translate electrical constraints into usable cluster size, build a small acceptance test with real jobs before signing the final deployment plan.
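One way to frame the peak-versus-average question is to compare both figures against contracted power. The sketch below is a minimal check under stated assumptions: `contract_kw`, the per-rack readings, and the function names are hypothetical, and it pessimistically assumes every rack peaks at the same instant.

```python
def cluster_draw_kw(per_rack_kw: list[float]) -> float:
    """Sum per-rack draw, assuming all racks hit the given value together."""
    return sum(per_rack_kw)


def power_margin(contract_kw: float, typical_kw: list[float], peak_kw: list[float]) -> dict:
    """Compare typical and worst-case cluster draw against contracted power."""
    typical = cluster_draw_kw(typical_kw)
    worst = cluster_draw_kw(peak_kw)
    return {
        "typical_kw": typical,
        "worst_case_kw": worst,
        "typical_margin_kw": contract_kw - typical,
        "worst_case_margin_kw": contract_kw - worst,
        "fits_at_peak": worst <= contract_kw,
    }


# Illustrative readings only; gather real ones from an acceptance test.
result = power_margin(
    contract_kw=200.0,
    typical_kw=[38.0, 41.0, 36.0, 40.0],
    peak_kw=[52.0, 55.0, 50.0, 54.0],
)
print(result)
```

In this made-up case the cluster fits comfortably on typical load but exceeds the contract when all four racks peak together, which is exactly the gap an acceptance test with real jobs is meant to expose.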

Why “ready-now” beats “roadmap” in developer-scale AI

A lot of teams get seduced by future expansion promises because they sound strategic. The reality is that AI experimentation cycles are short, funding windows are finite, and product teams need wins now. When a facility can energize today, it lets you deploy private training clusters, benchmark against cloud costs, and validate operational workflows before the next model iteration. That speed advantage can be worth more than a lower advertised rate in a room that will be available “next year.”

If you want a useful analogy, think about how teams plan launch dates for campaigns, events, or product launches. The window matters more than the abstract plan, which is why good operators pay attention to immediate execution timelines in fields as varied as logistics advertising and tech contracting. In AI infrastructure, the same logic applies to power: if it is not available when your team needs it, it is functionally unavailable.

3. Cooling choice: DLC vs RDHx

Direct-to-chip cooling when density gets serious

Direct-to-chip cooling, often called DLC, routes coolant close to the heat source and is increasingly the default answer for very high-density GPU deployments. It is attractive because it can remove heat efficiently from the hottest components without forcing the room air system to do impossible work. For racks that push beyond traditional air-cooling limits, DLC helps maintain performance consistency, reduce fan noise and fan power overhead, and improve thermal headroom during sustained training. If your cluster is built for long runs on large batches, the stability dividend can be significant.

DLC is especially compelling when your planned rack density rises into the zone where air cooling would require excessive airflow, excessive aisle containment complexity, or very high chilled-air costs. It also gives operators more control over the thermal profile of the rack because the cooling path is closer to the hardware that actually produces the heat. In return, you accept more complexity around coolant distribution, leak management, maintenance procedures, and vendor interoperability. This is why vendor diligence matters as much here as it does in enterprise software procurement.

RDHx as a transitional or selective strategy

Rear-door heat exchangers, or RDHx, can be a strong fit when you need better thermal performance than plain air cooling but are not ready to build a full direct-to-chip ecosystem. RDHx captures heat at the back of the rack, which helps relieve the room and can be easier to retrofit in some facilities. It is often more approachable for teams that are scaling gradually, piloting GPU racks, or working inside a colo that already supports the physical and plumbing requirements. In other words, RDHx can be a sensible middle ground when you need better cooling without committing to the full operational shift of DLC.

The tradeoff is that RDHx still leaves some heat-generating hardware components dependent on internal airflow and room conditions. That can be fine for many deployments, but once rack density becomes extreme, RDHx may not provide enough margin to keep temperature under control without larger supporting systems. If you want a useful mental model, think of RDHx as a stronger tailwind and DLC as a more direct drive train. Both can help, but the best choice depends on the speed and load profile of your training fleet.

How to choose between DLC and RDHx

Choose DLC if you are planning very dense racks, expect sustained utilization, and can enforce disciplined service operations. Choose RDHx if you need a faster path to deployment, want a retrofit-friendly design, or need an intermediate step before moving to higher density. Either way, the decision should be driven by thermal modeling, maintenance playbooks, and the facility’s actual support capability—not just vendor marketing. If the colo cannot explain the long-term service process in plain language, that is a warning sign.

A practical rule: if your cooling solution requires heroic assumptions to keep the GPUs within spec, it is not the right solution. Sustainable performance depends on boring reliability, which is what protects throughput, budgets, and staff sanity in training infrastructure. In AI facilities, excellence often looks unremarkable because the systems simply keep working.

Pro Tip: Ask the colocation provider for measured inlet and outlet temperature data at your target rack density, not just theoretical cooling capacity. Real telemetry beats brochure math every time.
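To sanity-check measured liquid-loop data against a quoted cooling capacity, the basic heat balance Q = m_dot * c_p * delta_T is usually enough for a first pass. The sketch below assumes plain water with approximate properties; glycol mixtures and real loop losses will shift the numbers, and the function names are hypothetical.

```python
WATER_SPECIFIC_HEAT_KJ_PER_KG_K = 4.186  # approximate, plain water near room temperature
WATER_DENSITY_KG_PER_L = 1.0             # approximate; glycol mixes differ


def required_flow_l_per_min(heat_kw: float, delta_t_k: float) -> float:
    """Coolant flow needed to carry heat_kw away at a given temperature rise.

    Derived from Q = m_dot * c_p * delta_T, solved for mass flow m_dot.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_kw / (WATER_SPECIFIC_HEAT_KJ_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60.0


def heat_removed_kw(flow_l_per_min: float, inlet_c: float, outlet_c: float) -> float:
    """Heat actually removed, computed from measured flow and temperatures."""
    mass_flow_kg_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return mass_flow_kg_s * WATER_SPECIFIC_HEAT_KJ_PER_KG_K * (outlet_c - inlet_c)


print(f"{required_flow_l_per_min(80.0, 10.0):.1f} L/min for 80 kW at a 10 K rise")
print(f"{heat_removed_kw(100.0, 30.0, 40.0):.1f} kW removed at 100 L/min, 30 C in, 40 C out")
```

If the heat implied by a provider's measured flow and temperatures falls well short of your rack's expected load, that is worth raising before any contract is signed.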

4. Rack-level telemetry and observability

What you should monitor at the rack edge

For LLM training, rack-level telemetry is the difference between proactive operations and expensive guesswork. You want live visibility into power draw, inlet and outlet temperatures, coolant flow where applicable, humidity, breaker status, and network error rates. If a rack is approaching thermal limits or power constraints, your scheduler should know before jobs degrade or fail. This is not a luxury feature; it is basic operational control for high-value compute.
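A minimal sketch of what "the scheduler should know" could look like: classify each reading against warning and critical thresholds, then expose a simple signal a scheduler hook might consume. The field names, thresholds, and the `should_pause_new_jobs` hook are all assumptions for illustration, not any provider's schema or any scheduler's API.

```python
from dataclasses import dataclass


@dataclass
class RackSample:
    """One telemetry reading. Field names are illustrative, not a provider schema."""
    rack_id: str
    power_kw: float
    inlet_c: float
    coolant_l_per_min: float


def classify(value: float, warn: float, critical: float) -> str:
    """Map a reading to ok / warn / critical, where higher values are worse."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"


def check_sample(s: RackSample) -> dict[str, str]:
    # Thresholds are assumed for illustration; take real ones from
    # hardware and facility specifications.
    return {
        "power": classify(s.power_kw, warn=40.0, critical=45.0),
        "inlet": classify(s.inlet_c, warn=30.0, critical=32.0),
        # Low coolant flow is the hazard, so negate to reuse the same comparison.
        "coolant": classify(-s.coolant_l_per_min, warn=-45.0, critical=-40.0),
    }


def should_pause_new_jobs(status: dict[str, str]) -> bool:
    """A scheduler hook might stop placing work on a rack with any critical reading."""
    return "critical" in status.values()


sample = RackSample("rack-07", power_kw=41.5, inlet_c=28.0, coolant_l_per_min=38.0)
status = check_sample(sample)
print(status, should_pause_new_jobs(status))
```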

Strong telemetry also helps with capacity planning because it reveals how your real workloads behave under load. Many teams discover that the average power draw is far below the peak, but peaks cluster around synchronization phases, checkpointing, and large data movement. That insight matters when you are deciding whether to add more GPUs, redistribute jobs, or upgrade the cooling tier. Think of telemetry as the instrumentation that turns facility operations into a measurable engineering discipline.

What good telemetry looks like in practice

At minimum, a good rack should expose telemetry through APIs or monitoring tools that your operations team already uses. SNMP alone is often not enough for modern AI operations if it cannot be integrated into alerting, dashboards, and automated response workflows. Ideally, your data center design should let you correlate facility metrics with scheduler events, node health, and job runtime statistics. That correlation is what lets you answer questions like, “Did our training slowdown come from networking contention, thermal throttling, or a bad batch of hardware?”
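The correlation itself can start very simply: line up facility samples with job time windows and compare. The sketch below assumes a hypothetical export shape of `(rack_id, epoch_seconds, inlet_c)` tuples and a hand-built `Job` record; real deployments would pull these from a monitoring system and a scheduler's accounting data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Job:
    job_id: str
    rack_id: str
    start: int            # epoch seconds
    end: int              # epoch seconds
    tokens_per_s: float   # observed throughput for the run


def mean_inlet_during(job: Job, samples: list[tuple[str, int, float]]) -> Optional[float]:
    """Mean inlet temperature on the job's rack during its runtime, if any samples exist."""
    temps = [temp for rack, ts, temp in samples
             if rack == job.rack_id and job.start <= ts <= job.end]
    return mean(temps) if temps else None


# Made-up data: the second run on the same rack saw hotter inlet air.
samples = [
    ("r1", 100, 27.0), ("r1", 200, 29.0),
    ("r1", 400, 35.0), ("r1", 500, 37.0),
    ("r2", 150, 26.0),
]
jobs = [Job("run-a", "r1", 50, 250, 1200.0), Job("run-b", "r1", 350, 550, 900.0)]

for job in jobs:
    print(job.job_id, mean_inlet_during(job, samples), job.tokens_per_s)
```

A pattern like hotter inlet air alongside lower throughput does not prove thermal throttling on its own, but it tells you where to look first.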

Facilities that support change logs, maintenance windows, and clear ownership boundaries tend to be easier to operate. The same principle appears in high-trust software and marketplace systems, where evidence and transparency reduce friction. You can see that thinking in trust-signal design and in operational playbooks like automation patterns for manual workflows. When telemetry is integrated well, the infrastructure becomes easier to scale and easier to trust.

Telemetry is also a contract issue

Do not assume the facility will provide the telemetry granularity you need. Some providers offer only coarse building-level metrics, while others expose rack- or circuit-level data. Ask whether you can access raw readings, historical exports, and alert thresholds. Clarify whether telemetry is included in the service level agreement or whether it is a premium feature.

This matters because telemetry data often becomes part of your cost model and incident review process. If you cannot reconstruct what happened during a training interruption, you cannot improve the system. Good observability saves money twice: first by preventing waste, and second by reducing time spent on root-cause analysis. That is a serious competitive advantage for teams running large-model training on tight schedules.

5. Networking choices that affect real throughput

Latency is not the only metric that matters

When teams hear “networking,” they often jump straight to latency. Latency matters, but for distributed LLM training, throughput, oversubscription, congestion behavior, and fabric stability are just as important. A cluster can have low nominal latency and still perform poorly if the network fabric collapses under synchronization traffic. If your training jobs use multi-node all-reduce patterns, the wrong design can stretch training time and inflate cost much more than a small latency difference ever would.
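To see why bandwidth and oversubscription tend to dominate, consider the standard cost estimate for a ring all-reduce, where each node sends and receives roughly 2(N-1)/N times the payload. The sketch below is a simplified model: it treats oversubscription as a straight divisor on link bandwidth and ignores compute overlap, both of which are assumptions, and the example sizes are invented.

```python
def ring_allreduce_seconds(num_nodes: int, payload_bytes: float, link_gbps: float,
                           latency_s: float = 0.0, oversubscription: float = 1.0) -> float:
    """Estimate ring all-reduce time from a bandwidth term plus a per-step latency term."""
    if num_nodes < 2:
        return 0.0
    # Assumption: oversubscription simply divides the usable per-link bandwidth.
    effective_bytes_per_s = link_gbps * 1e9 / 8 / oversubscription
    steps = 2 * (num_nodes - 1)  # reduce-scatter steps plus all-gather steps
    bandwidth_term = steps / num_nodes * payload_bytes / effective_bytes_per_s
    return bandwidth_term + steps * latency_s


base = ring_allreduce_seconds(16, 10e9, 400.0, latency_s=5e-6)
slower_latency = ring_allreduce_seconds(16, 10e9, 400.0, latency_s=10e-6)
oversubscribed = ring_allreduce_seconds(16, 10e9, 400.0, latency_s=5e-6, oversubscription=3.0)

print(f"baseline:            {base:.4f} s")
print(f"double the latency:  {slower_latency:.4f} s")
print(f"3:1 oversubscribed:  {oversubscribed:.4f} s")
```

Under these assumptions, doubling latency barely moves the estimate, while 3:1 oversubscription roughly triples it.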

That is why networking must be designed as part of the training architecture, not as a facility footnote. The most relevant questions are: what east-west bandwidth is guaranteed, how much headroom remains during simultaneous jobs, and whether the facility supports the switch architecture your team wants to use. Many high-density deployments benefit from leaf-spine designs, clean cable paths, and a fabric that keeps contention predictable. If your provider cannot speak fluently about these details, you are not talking to the right technical team.

Connectivity options: private backbone, direct connect, or public cloud adjacency

For many developer-scale teams, the best deployment model is hybrid: private GPU racks for training, plus cloud or object storage for burst capacity and artifact distribution. In that case, the facility’s networking interconnect options matter as much as raw bandwidth. A colocated cluster with strong peering, good carrier choice, and direct-connect options can reduce egress pain and shorten data access times. If your dataset lives in multiple places, network path quality can materially change the economics of each training run.

As you evaluate colocation selection, ask whether you can colocate adjacent systems, use dedicated cross-connects, or integrate with cloud onramps. The right answer depends on whether your bottleneck is data ingress, synchronization, or artifact distribution. For practical negotiation guidance, the same mindset used in GPU/cloud contracts and channel-level ROI analysis applies: measure the cost per outcome, not just the list price of bandwidth.

Designing for real-world training throughput

The best way to evaluate networking is to benchmark it with workloads that resemble your actual training loop. Synthetic tests are useful, but they often miss the storage access patterns, gradient synchronization, and checkpoint write behavior that define your real bottlenecks. Build a test plan that includes model size, batch schedule, data loader pressure, checkpoint frequency, and failure recovery. The goal is not just to “pass a network test” but to determine whether the facility sustains the end-to-end job timeline you care about.
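As one small piece of such a test plan, it can help to estimate how much wall-clock time checkpointing alone will consume. The sketch below assumes fully synchronous checkpoint writes that stall training, which is a simplification, and the sizes and bandwidths are placeholders.

```python
def checkpoint_overhead_fraction(checkpoint_gb: float, write_gb_per_s: float,
                                 interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Assumes training stalls for the full write (no asynchronous checkpointing).
    """
    if write_gb_per_s <= 0 or interval_s <= 0:
        raise ValueError("bandwidth and interval must be positive")
    write_s = checkpoint_gb / write_gb_per_s
    return write_s / (interval_s + write_s)


# Placeholder plan: 500 GB checkpoint, 5 GB/s sustained write, every 30 minutes of training.
overhead = checkpoint_overhead_fraction(500.0, 5.0, 1800.0)
print(f"Checkpoint overhead: {overhead:.1%}")
```

Comparing this estimate with the stall time observed during a real benchmark run is a quick way to tell whether storage or the network path to it is underperforming.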

One useful comparison is to think about your cluster like a newsroom or operations team coordinating live updates. If the communication layer is slow or inconsistent, the whole system loses momentum. The same principle appears in operations workflows: responsiveness is an outcome, not just a feature. In AI training, network design is part of that outcome.

6. Colocation selection checklist for developer-scale LLM training

Facility questions you should ask before signing

Start with the basics: how much immediately available power can the site allocate to your deployment, what density is supported per rack, and what cooling topology is already live in production? Then move deeper: what is the lead time for cross-connects, what monitoring is available, and what is the change management process for maintenance or expansion? Ask whether the colo can support both current racks and your next two growth stages, because the cheapest move is usually the one you never have to make.

You also want clarity on redundancy and service expectations. If the provider says “N+1” without explaining which components are covered, that is not enough. A useful provider should be able to explain where the real failure domains are: utility feed, UPS, busway, coolant loop, switch fabric, or room-level environmental controls. For context on structured procurement and risk review, look at how teams approach enterprise diligence and credibility signals; you need the same rigor here.

Red flags in colo sales conversations

Beware of vague answers about “AI readiness” that do not translate into measured rack-level capabilities. Be skeptical of future capacity promises with no energized path, and of cooling claims that are not tied to your expected heat load. Also watch for hidden costs around installation labor, remote hands, custom cabling, and telemetry access. If the commercial model looks simple but the operational model is opaque, the bill usually arrives later.

Another red flag is a facility that treats AI as a side market rather than a core competency. High-density GPU environments stress utilities, operations, and maintenance in unusual ways. If the staff cannot discuss liquid cooling service, sensor calibration, or network congestion mitigation in practical terms, your deployment may become their learning project. You want a partner that has already learned the painful lessons elsewhere.

When private racks make more sense

Private racks can outperform colocation when your team needs strict control over topology, security, or specialized cooling. They also make sense if you have the engineering discipline to operate your own monitoring, patching, and capacity planning. If your workloads are stable and your organization expects sustained use, owning the stack can lower long-term costs and reduce dependency on external scheduling constraints. But private racks only work if you can afford the complexity of managing all the layers you are taking on.

That is why the decision should be business-driven, not ideology-driven. The right choice is the one that delivers the best combination of throughput, reliability, and flexibility for your model lifecycle. In some cases that means a colo with ready-now capacity; in others, it means a controlled private installation. Either way, do not start with “build versus rent.” Start with workload requirements and infrastructure realities.

7. A practical cost model for LLM training facilities

Cost per training outcome, not cost per rack

Raw rack pricing can mislead you because it ignores power restrictions, thermal headroom, network performance, and deployment delays. A cheaper rack that cannot support your target density or schedule may actually produce a higher effective cost per successful training run. Instead, calculate cost per trained token, cost per experiment completed, and cost per day of usable cluster time. That model reveals the hidden cost of throttling, idle time, and failed jobs.
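A worked comparison makes the point concrete. The sketch below computes cost per million trained tokens from monthly cost, nominal throughput, utilization, and a throttling factor. All inputs are invented for illustration, and the model ignores hardware depreciation and staffing, which a real analysis would include.

```python
def cost_per_million_tokens(monthly_cost_usd: float, nominal_tokens_per_s: float,
                            utilization: float, throttle_factor: float,
                            hours_per_month: float = 730.0) -> float:
    """Effective cost per million tokens after idle time and throttling."""
    effective_tps = nominal_tokens_per_s * utilization * throttle_factor
    tokens = effective_tps * hours_per_month * 3600.0
    return monthly_cost_usd / tokens * 1e6


# Hypothetical sites: one cheaper but thermally constrained, one pricier but stable.
cheap_site = cost_per_million_tokens(90_000, 400_000, utilization=0.70, throttle_factor=0.85)
ready_site = cost_per_million_tokens(110_000, 400_000, utilization=0.92, throttle_factor=1.0)

print(f"cheaper rack rate: ${cheap_site:.4f} per million tokens")
print(f"higher rack rate:  ${ready_site:.4f} per million tokens")
```

With these made-up inputs the site with the higher sticker price ends up cheaper per token, because less of its capacity is lost to idle time and throttling.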

Consider the economics of delayed deployment. If your team loses six weeks waiting on power delivery or cooling retrofits, the apparent savings from a lower colocation rate can evaporate quickly. In that sense, immediate power is a financial hedge against schedule risk. This is similar to how smart procurement teams prioritize marginal ROI in other domains: the cheapest input is not always the best investment if it does not move the outcome.
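The same trade-off can be expressed as a break-even calculation: how many months of cheaper rent does it take to recover the cost of waiting? The weekly idle cost and monthly saving below are hypothetical inputs you would estimate from your own hardware and team costs.

```python
def breakeven_months(delay_weeks: float, weekly_idle_cost_usd: float,
                     monthly_saving_usd: float) -> float:
    """Months of lower rent needed to recover the cost of a deployment delay."""
    if monthly_saving_usd <= 0:
        raise ValueError("monthly_saving_usd must be positive")
    return delay_weeks * weekly_idle_cost_usd / monthly_saving_usd


# Illustrative only: six weeks of delay, $40k/week of idle hardware and team time,
# against a $15k/month saving on the colocation rate.
months = breakeven_months(6, 40_000, 15_000)
print(f"Break-even after {months:.1f} months")
```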

Watch the hidden operational costs

Do not ignore service labor, liquid coolant maintenance, spare parts strategy, telemetry licensing, and network cross-connect charges. These costs are often small individually but can become material over a year of continuous experimentation. If your training team will rotate hardware frequently, make sure the facility can support rapid swaps without procedural drag. If your environment is mission-critical, spare capacity and on-site support may be worth more than a nominal discount on power.

In addition, evaluate how easily the site supports expansion. A facility that requires a full redesign every time you add another rack will quietly tax your team’s time and budget. The best environments make scaling feel incremental, not disruptive. That principle shows up in other high-growth systems too, from alternative data sourcing to search cluster planning: the better the foundation, the lower the friction at the edges.

A simple decision framework

Score each option across four categories: immediate power, cooling suitability, telemetry quality, and networking fit. Weight each category based on your workload profile, not on marketing language. If your work is experimentation-heavy, flexibility and fast onboarding may matter more than absolute peak density. If your work is continuous large-scale training, thermal stability and fabric determinism may dominate.

Use the scorecard to compare colo providers and private rack designs on equal terms. The goal is not perfect objectivity, but a shared decision framework that engineering, finance, and leadership can all understand. Once you have that, you can justify tradeoffs clearly and avoid infrastructure decisions based on gut feel alone. That kind of decision hygiene pays off far beyond AI hardware.
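The scorecard can be as simple as a weighted sum. In the sketch below, the weights, the 1-to-5 scores, and the two candidate options are invented for illustration; the only structure taken from the framework above is the four categories.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of category scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[category] * scores[category] for category in weights)


# Example weighting for an experimentation-heavy team (an assumption, not a recommendation).
weights = {"immediate_power": 0.35, "cooling": 0.25, "telemetry": 0.15, "networking": 0.25}

options = {
    "colo_a":        {"immediate_power": 5, "cooling": 3, "telemetry": 4, "networking": 3},
    "private_build": {"immediate_power": 2, "cooling": 5, "telemetry": 5, "networking": 4},
}

for name, scores in options.items():
    print(f"{name}: {weighted_score(scores, weights):.2f}")
```

Shifting weight from immediate power toward cooling and networking, as a continuous large-scale training team might, can flip the ranking, which is the point of agreeing on weights before scoring.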

8. Deployment checklist: what to verify before go-live

Pre-installation checklist

Before shipping hardware, confirm breaker allocation, rack power path, rail compatibility, floor loading, coolant interface, and network handoff details. Make sure you know who owns each step from delivery to commissioning. Verify that your remote hands process can actually support your hardware at the pace you need, especially if your team operates across time zones. If any of these steps are ambiguous, resolve them before equipment is on the truck.

It is also smart to run a tabletop incident review before go-live. What happens if a switch fails, a coolant loop alarms, or one rack trips unexpectedly during a training run? Who is on point, how are alerts surfaced, and what is the rollback or failover plan? These questions may feel bureaucratic, but they are what separate a professional AI environment from a demo lab.

Commissioning checklist

During commissioning, test real jobs under load. Monitor thermal behavior, power curves, network congestion, checkpoint latency, and any scheduler or storage bottlenecks. Log the baseline numbers carefully so future changes can be compared against something concrete. If the first run reveals unexpected hotspots or packet loss, treat that as a gift: you have caught the problem before your most expensive experiments depend on the system.
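Once baseline numbers are logged, later runs can be checked against them mechanically. The sketch below flags metrics that worsened by more than a tolerance; it assumes every tracked metric is "higher is worse" and that baselines are positive, and the metric names are hypothetical.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.05) -> dict[str, tuple[float, float]]:
    """Return metrics whose current value exceeds baseline by more than `tolerance`.

    Assumes higher values are worse and baseline values are positive.
    """
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is not None and now > base * (1.0 + tolerance):
            flagged[name] = (base, now)
    return flagged


# Made-up commissioning baseline and a later run.
baseline = {"peak_power_kw": 44.0, "max_inlet_c": 29.0, "checkpoint_s": 95.0}
later_run = {"peak_power_kw": 44.5, "max_inlet_c": 31.5, "checkpoint_s": 96.0}

print(find_regressions(baseline, later_run))
```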

For teams that want better operational adoption, it helps to think in terms of micro-milestones. Like the ideas explored in gamified achievement systems, small measurable wins keep teams engaged and disciplined. In infrastructure terms, each validated rack, each clean checkpoint, and each stable overnight run is a milestone toward a trustworthy training platform.

Operational checklist after go-live

Once live, establish weekly reviews of rack telemetry, monthly capacity reviews, and a clear incident-postmortem process. Track not just uptime but useful uptime: how often the cluster was available at the density and performance profile your training jobs actually required. Review whether your observed thermal and network behavior justify changes to rack layout, cooling settings, or scheduling policy. This is where continuous improvement turns raw infrastructure into a real platform.
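Useful uptime is easy to compute once each interval is labeled with whether the cluster was available and whether it met the required density and performance profile. The interval format and the week of data below are assumptions made for this sketch.

```python
def uptime_fractions(intervals: list[tuple[float, bool, bool]]) -> tuple[float, float]:
    """Return (raw_uptime, useful_uptime) as fractions of total hours.

    Each interval is (hours, cluster_available, met_required_profile).
    """
    total = sum(hours for hours, _, _ in intervals)
    if total <= 0:
        raise ValueError("intervals must cover a positive number of hours")
    up = sum(hours for hours, available, _ in intervals if available)
    useful = sum(hours for hours, available, met in intervals if available and met)
    return up / total, useful / total


# Illustrative week: 150 h at full profile, 12 h available but derated, 6 h down.
week = [(150.0, True, True), (12.0, True, False), (6.0, False, False)]
raw, useful = uptime_fractions(week)
print(f"raw uptime {raw:.1%}, useful uptime {useful:.1%}")
```

The gap between the two numbers is the time the cluster was technically up but not delivering what training jobs needed, which is the figure worth bringing to capacity reviews.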

If you are managing multiple stakeholders, write these outcomes down in a living runbook. Good runbooks reduce ambiguity, help new team members ramp faster, and make vendor escalations more effective. This is the same operational logic that helps teams improve reliability in other systems, from ad ops automation to two-way operations workflows: clarity creates speed.

9. Comparison table: DLC, RDHx, and air cooling for AI racks

Option	Best For	Typical Density Fit	Operational Complexity	Key Tradeoff
Air cooling	Lower-density GPU or mixed-use rooms	Low to moderate	Lower	Often insufficient for sustained high-density LLM training
RDHx	Retrofits and transitional AI deployments	Moderate to high	Medium	Improves thermal handling, but still depends partly on room conditions
DLC / direct-to-chip	Very dense GPU racks and sustained training	High to very high	High	Best thermal performance, but requires more disciplined operations
Hybrid liquid + air	Mixed environments with phased AI expansion	Moderate to high	Medium to high	Flexible, but design consistency can be harder to maintain
Immersion cooling	Specialized deployments with strong vendor support	Very high	Very high	Powerful thermally, but narrower ecosystem and service model

10. FAQ: common questions from developers and DevOps teams

How much power do I need for a developer-scale LLM training rack?

It depends on the accelerator class, number of nodes, storage footprint, and networking gear, but a useful planning range is often 40 kW and up per rack for serious training environments. Some modern deployments can exceed 100 kW per rack when fully populated and aggressively tuned. The safest approach is to model your target hardware, then validate with vendor-specific power curves and headroom for peak loads.

Is RDHx enough, or do I need direct-to-chip cooling?

RDHx is often enough for transitional or moderately dense deployments, especially in colo environments where retrofit simplicity matters. Direct-to-chip cooling becomes more attractive as density increases and sustained training loads make air movement alone impractical. If your rack design requires high thermal margin or future expansion into very dense clusters, DLC is usually the more scalable path.

What telemetry should I require from a colo provider?

At minimum, ask for power draw, circuit status, thermal readings, coolant metrics where applicable, and network health indicators. Ideally, the provider should expose historical data and API access so you can correlate facility metrics with workload performance. Without that visibility, incident response and capacity planning become much harder than they need to be.

How important is latency for LLM training?

Latency matters, but it is not the whole story. For distributed training, throughput, fabric consistency, and congestion behavior can matter just as much or more. A network with slightly higher latency but stable high throughput may outperform a “faster” network that collapses under synchronization traffic.

Should we build private racks or use colocation?

Use colocation if you want faster deployment, lower operational burden, and access to existing power and cooling infrastructure. Build private racks if you need strict control over topology, security, or special cooling and have the team to support it. The right answer depends on how much operational complexity your team can realistically own.

11. Final recommendation: design for throughput, not vanity specs

What matters most

The best AI data center design for developer-scale LLM training is the one that turns hardware into usable throughput without constant intervention. That means immediate power you can actually consume, liquid cooling suited to your rack density, telemetry you can trust, and network paths that preserve model training efficiency. Fancy specs mean little if jobs stall, throttle, or cost twice as much as expected to complete.

If you are evaluating vendors, keep asking the same question from different angles: will this facility let my team train faster, more predictably, and with fewer hidden costs? If the answer is yes, you are probably looking at a serious AI infrastructure partner. If the answer depends on future capacity, future cooling, or future upgrades, then the risk is being shifted onto your team.

What to do next

Build a shortlist of providers and score them using the same four-part framework: immediate power, cooling fit, telemetry depth, and network design. Run a real workload benchmark if possible, not just a brochure review. Compare the total cost of useful training time, not the sticker price of a rack. That is the fastest way to avoid expensive surprises and the best way to build a reliable AI platform your developers can actually depend on.

For deeper context on adjacent infrastructure and operations topics, see how resilience, trust, and operational maturity show up across different systems, from cluster-based search strategy to rapid-response frameworks. The core lesson is the same: strong systems are designed for real-world conditions, not idealized ones.

Redefining AI Infrastructure for the Next Wave of Innovation - A broader look at why immediate power and location are reshaping AI facility strategy.
Vendor Checklist: What to Negotiate in GPU/Cloud Contracts (and How to Reflect It on Invoices) - Learn the procurement details that affect real infrastructure costs.
Topic Cluster Map: Dominate 'Green Data Center' Search Terms and Capture Enterprise Leads - Useful for understanding how AI infrastructure topics cluster in search and procurement research.
Vendor Diligence Playbook: Evaluating eSign and Scanning Providers for Enterprise Risk - A transferable framework for evaluating operational vendors with rigor.
Open-Source Models for Safety-Critical Systems: Governance Lessons from Alpamayo's Hugging Face Release - Helpful for thinking about governance, reliability, and high-stakes system design.