Allocating the Unallocatable: Splitting Private-Cloud Overhead Across Teams
The Meeting Nobody Wants to Attend
Every quarter, in a conference room somewhere, a finance analyst presents a slide showing the infrastructure cost allocation for the previous quarter. Somewhere on that slide is a line item called "shared services" or "platform overhead" or, in the most honest companies, "unallocated." It is usually between 18% and 35% of the total infrastructure spend. It is allocated across the business units by headcount, or by revenue, or by some formula nobody in the room fully remembers the origin of. The engineering leaders nod. The finance team nods. Nothing changes, because the conversation cannot go anywhere useful when a third of the money is moving around according to a rule nobody believes.
This is the field note about that line item. Specifically, it is about a private-cloud environment at a financial services company with roughly 4,800 cores of on-premises compute, two colocation facilities, a dedicated MPLS network connecting them, and about 2.1 PB of shared storage across a mix of block, file, and object tiers. The total annual run cost was $11.8M. The "unallocatable" portion at the start of the engagement was $3.4M, or 29% of the bill, and the allocation methodology was headcount-weighted across six business units.
The question the CFO had asked, the question that started the engagement, was simple. "Is the trading platform team actually consuming $840K a year of shared infrastructure, or are we just dividing by six?" The honest answer, when we dug into it, was that nobody knew. The less honest answer, which was what the existing allocation produced, was yes, exactly $840K, to the penny, because that was what the formula said.
The Turn
The cloud-native FinOps literature is not very helpful here. Most of it assumes a hyperscaler environment where every resource has a tag, every tag rolls up to a cost center, and the unallocated bucket is something to be optimized down toward zero through better tagging discipline. In private cloud, that premise collapses. Colo power draws are not tagged. The core switch does not know which team's packets it is forwarding. The SAN does not emit usage events that map cleanly to a Jira project. A meaningful share of the infrastructure is, by design, shared in ways that resist per-workload attribution, and no amount of tagging discipline will change that.
The operational question, then, is not how to eliminate the unallocated bucket. It is how to split it up in a way that is defensible, stable enough to build budgets on, and approximately fair, using the signals you actually have rather than the signals you wish you had. This is allocation without a clean taxonomy, and the field note below is the worked example of how we did it at this particular company. The exact numbers will not generalize. The method should.
The Three Buckets
We started by separating the bulk of the $3.4M of unallocated cost into three buckets with different underlying dynamics, because a single allocation formula applied across all three produces nonsense.
Bucket one: colocation and power ($1.6M). The physical cost of running the data centers. Rack space lease, power, cooling, physical security, remote hands contracts, cross-connects. This cost scales with installed hardware, not with utilization. A rack of underutilized servers costs the same to keep running as a rack of fully utilized ones.
Bucket two: network ($780K). The MPLS circuits between facilities, the core and distribution switches, the edge firewalls, the load balancer fleet, the internet transit. This cost scales with capacity provisioned, not with traffic actually carried, because network hardware is sized to peak and runs mostly idle.
Bucket three: shared storage ($1.02M). The SAN tiers, the NAS filers, the object store, the backup infrastructure. This cost has two components: capacity (how much is provisioned) and performance (IOPS and throughput consumed), and the two components behave very differently.
Each bucket needed its own allocation method, and the methods needed to use signals that were already being collected, because the engagement budget did not include building a new metering infrastructure from scratch. This is the real constraint in most private-cloud allocation work. The perfect allocation requires instrumentation that does not exist. The good-enough allocation has to use what you have.
Bucket One: Colocation and Power
The temptation with colocation cost is to allocate it based on power draw, which seems like the physically correct answer. It is not wrong. It is just not available. The company had PDU-level power metering, but the PDUs fed racks that were shared across teams, and the servers inside the racks were not uniformly tagged to teams. Getting to team-level power attribution would have required either a major re-racking project or a physical inventory that took longer than the engagement itself.
What we had, instead, was a CMDB that tracked every physical server, its owner (at team granularity, about 75% reliable), its rack location, and its published power draw (the nameplate rating, which overstates actual draw by roughly 40% but is consistent across servers). We also had six months of PDU telemetry showing actual rack-level power consumption.
The method we landed on was a two-step proportional allocation. Step one: calculate each rack's actual power draw from PDU telemetry, averaged over the measurement window, and give each rack a share of the colo bucket proportional to its share of total measured draw. Step two: within each rack, distribute the rack's power cost across the servers in that rack, weighted by nameplate power rating. Sum each team's server-level allocations across all racks to get team-level power cost.
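In code, the whole method is small. Below is a minimal sketch in Python, assuming rack cost scales linearly with measured PDU draw; the field names (rack_id, nameplate_w, team) are illustrative stand-ins, not the company's actual CMDB schema.

```python
from collections import defaultdict

def allocate_colo(bucket_total, pdu_avg_kw, cmdb_servers):
    """Two-step proportional allocation of the colo/power bucket.

    pdu_avg_kw: rack_id -> average measured draw (kW) over the window
    cmdb_servers: iterable of dicts with rack_id, team, nameplate_w
    """
    # Step 1: split the bucket across racks by average measured draw.
    total_kw = sum(pdu_avg_kw.values())
    rack_cost = {rack: bucket_total * kw / total_kw
                 for rack, kw in pdu_avg_kw.items()}

    # Step 2: within each rack, split that rack's cost across its servers
    # by nameplate rating, then roll the server-level shares up to teams.
    racks = defaultdict(list)
    for server in cmdb_servers:
        racks[server["rack_id"]].append(server)

    team_cost = defaultdict(float)
    for rack, servers in racks.items():
        rack_nameplate = sum(s["nameplate_w"] for s in servers)
        for s in servers:
            team_cost[s["team"]] += rack_cost[rack] * s["nameplate_w"] / rack_nameplate
    return dict(team_cost)
```

Everything the function needs is a CMDB export and a PDU telemetry summary, which is what makes the monthly rerun cheap.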
This is not accurate in the sense that a metrology engineer would use the word. It is wrong in at least three ways. Nameplate power overstates actual draw by varying amounts across server models. Idle servers draw meaningful power too, and weighting by nameplate alone does not capture the idle-versus-active distinction. Cooling costs are not perfectly proportional to compute power draw because of how the cooling topology works in the specific facility.
But the method had three properties that mattered more than accuracy. First, it used data that already existed, so it could be run monthly without new instrumentation. Second, the errors were roughly symmetric across teams, meaning no team was systematically over- or under-allocated in a way that would blow up the negotiation. Third, and most importantly, the teams could follow the math. A senior engineer on the trading platform team could look at the allocation, see which of their servers were in which racks, see the PDU readings for those racks, and reproduce the number. Allocation methods that cannot be reproduced by the people paying the bill are allocation methods that get argued about forever. Allocation methods that can be reproduced get argued about once, and then people move on to the actual work.
The output was a team-level colo cost that ranged from $42K (the smallest user, an internal tools team) to $487K (the trading platform, which had large bare-metal hosts for latency reasons). The trading platform's previous allocation under the headcount formula had been $267K. The new number was almost twice as high. The trading platform team was not happy. The trading platform team also could not argue with the math, because the math was visible.
Bucket Two: Network
Network allocation is where most private-cloud FinOps engagements go to die. The ideal allocation would be based on actual traffic: which team's workloads generated which bytes across which links. In theory, flow data from the core switches can produce this. In practice, enabling NetFlow at the volume required to attribute the full $780K of network cost produces so much telemetry data that the storage and processing cost of the telemetry rivals the cost you are trying to allocate. We were not willing to spend six months building a traffic attribution pipeline for a quarterly finance conversation.
What we did instead was split the network bucket into two sub-buckets and handle them differently.
The first sub-bucket was the fixed network infrastructure, meaning the MPLS circuits, core switches, firewalls, and the baseline capacity of the load balancer fleet. About $620K of the $780K. This cost exists whether or not anyone sends a single packet across it, and it scales with the capacity provisioned, not with traffic. For this sub-bucket, we allocated proportionally to each team's share of installed compute and storage, using the same CMDB data we used for the colo bucket. The reasoning was that the network is sized to serve the workloads, so workloads are a reasonable proxy for what is driving the capacity requirements. This is not right. It is defensible.
The second sub-bucket was the variable portion, which at this company was mostly internet transit and burst load balancer capacity. About $160K. For this we had NetFlow data at the internet edge and load balancer logs, both of which mapped cleanly to the frontend services of the applications, and each application had a clear team owner. This sub-bucket was allocated on actual usage, and the allocation was accurate to within about 5%.
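In sketch form, the two sub-buckets reduce to two proportional splits with different weight sources. The function below is an illustration under assumed inputs: the share dictionaries would come pre-computed and normalized from the CMDB and the edge telemetry, and the names are invented for the example. Note that the output keeps the proxy and usage portions separate rather than summing them.

```python
def allocate_network(fixed_total, variable_total, infra_share, usage_share):
    """Split the network bucket per team, keeping the two methods visible.

    infra_share: team -> fraction of installed compute + storage (CMDB proxy)
    usage_share: team -> fraction of measured edge traffic (NetFlow, LB logs)
    Both inputs are assumed to be normalized to sum to 1.0.
    """
    teams = set(infra_share) | set(usage_share)
    return {
        team: {
            "proxy_based": fixed_total * infra_share.get(team, 0.0),
            "usage_based": variable_total * usage_share.get(team, 0.0),
        }
        for team in teams
    }
```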
The combined allocation was $620K allocated by proxy and $160K allocated by actual usage. When presented to the teams, we made the split explicit, showing exactly which portion of their network cost was proxy-based and which was usage-based. This mattered because it pre-empted the argument we would otherwise have had repeatedly, which was "our team does not use much network, why are we paying so much." The answer, "because you have a lot of compute and the network is sized for the compute," was visible on the page, and the conversation shifted from "this number is wrong" to "should we be running this much compute," which was the conversation we wanted to have anyway.
Bucket Three: Shared Storage
Storage was the hardest of the three, because storage cost is genuinely two-dimensional. A team can consume a lot of capacity with very little performance, or a lot of performance on very little capacity, and the cost structure of a SAN reflects both. Allocating based on capacity alone punishes the team that stores a lot of cold data. Allocating based on IOPS alone punishes the team that has a high-performance workload on a small dataset. Neither answer is fair.
The method we landed on, which I think generalizes reasonably well, was a split allocation using what we called a capacity-plus-performance model. Each storage tier was decomposed into its underlying cost drivers: raw capacity cost (the disks), performance cost (the controllers, cache, and fabric), and operational overhead (replication, snapshots, monitoring, backup infrastructure). The ratios came from the storage team's own internal cost model, which they had built for procurement purposes but had never used for allocation.
On the tier-one SAN, the decomposition came out to roughly 35% capacity, 50% performance, 15% operational overhead. On the tier-three object store, it was 85% capacity, 5% performance, 10% overhead. The tiers were structurally different, and the allocation reflected that.
For each team, we pulled the provisioned capacity from the storage management platform (which was tagged to teams reasonably reliably, about 90%) and the actual IOPS and throughput from the performance telemetry (which was tagged to LUNs, which mapped to applications, which mapped to teams). Capacity cost was allocated by provisioned capacity share. Performance cost was allocated by consumed IOPS and throughput share, weighted by tier (a tier-one IOP is more expensive than a tier-three IOP). Operational overhead was allocated proportionally to the sum of the first two.
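A compressed sketch of the model follows. The tier splits use the figures above; perf_units is an assumed stand-in for the tier-weighted IOPS-plus-throughput measure, and the dictionary shapes are illustrative rather than the storage platform's real export format.

```python
from collections import defaultdict

# Decomposition ratios from the storage team's internal cost model.
TIER_SPLIT = {
    "tier1_san":    {"capacity": 0.35, "performance": 0.50, "overhead": 0.15},
    "tier3_object": {"capacity": 0.85, "performance": 0.05, "overhead": 0.10},
}

def allocate_storage(tier_costs, capacity_gb, perf_units):
    """tier_costs: tier -> annual cost.
    capacity_gb, perf_units: (tier, team) -> provisioned GB / consumed perf."""
    team_cost = defaultdict(float)
    for tier, cost in tier_costs.items():
        split = TIER_SPLIT[tier]
        cap_total = sum(v for (t, _), v in capacity_gb.items() if t == tier)
        perf_total = sum(v for (t, _), v in perf_units.items() if t == tier)

        # Capacity and performance portions, allocated by each team's share.
        direct = defaultdict(float)
        for (t, team), gb in capacity_gb.items():
            if t == tier:
                direct[team] += cost * split["capacity"] * gb / cap_total
        for (t, team), p in perf_units.items():
            if t == tier:
                direct[team] += cost * split["performance"] * p / perf_total

        # Operational overhead follows the sum of the first two portions.
        direct_total = sum(direct.values())
        for team, d in direct.items():
            team_cost[team] += d + cost * split["overhead"] * d / direct_total
    return dict(team_cost)
```

Allocating performance within each tier makes the tier weighting implicit: a tier-one performance unit is charged against tier-one controller cost, and a tier-three unit against tier-three cost.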
The result was a per-team storage allocation that made intuitive sense when you looked at the teams. The data science team, which had enormous amounts of cold storage on tier three but very little performance consumption, came out as the biggest capacity user but a mid-range total spender. The trading platform, which had modest capacity but extreme performance requirements on tier one, came out as the biggest performance user and a top-three total spender despite using far less space. Under the old headcount allocation, both teams had been allocated the same amount of storage cost to the dollar, which was self-evidently wrong once the new numbers were visible.
There was one important caveat. About $140K of the storage bucket was backup infrastructure, which did not decompose cleanly into capacity and performance. Backups are a capacity problem from the source side and a throughput problem from the target side, and the cost of the backup platform reflects both. We allocated this sub-bucket proportionally to protected capacity, acknowledging that this under-charges teams with aggressive RPO/RTO requirements and over-charges teams with long retention but infrequent recovery needs. This was a deliberate simplification. We flagged it explicitly, told the teams what the simplification was, and committed to revisit it in a later cycle if anyone wanted to make the case that the simplification was materially unfair to them. Nobody did, which is usually what happens when you name a simplification out loud instead of hiding it.
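In sketch form, the whole backup method, simplification included, is one comment and a proportional split; protected_gb is a hypothetical team-to-protected-capacity map.

```python
def allocate_backup(backup_total, protected_gb):
    # Deliberate simplification: split by protected capacity only,
    # ignoring RPO/RTO intensity on the backup target side.
    total_gb = sum(protected_gb.values())
    return {team: backup_total * gb / total_gb
            for team, gb in protected_gb.items()}
```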
The Residual
After the three buckets were allocated, there was still a residual of about $190K, roughly 5.6% of the original unallocated total, that did not map cleanly to any of the methods above. This included things like the director of infrastructure's salary share that went into cost of goods sold, the cost of the infrastructure monitoring stack itself, and a handful of shared services (internal DNS, time servers, certificate authority) that were consumed by everyone in ways too small to meter individually.
We made a deliberate decision to leave this residual allocated on a headcount basis, and to name it explicitly as "genuine shared overhead" in the reporting. The reasoning was that trying to be precise about the last 5.6% would have cost more to implement than the precision would be worth, and leaving a small, honest, named residual was better than pretending the whole bucket had been allocated accurately. This is the part of the methodology I am most confident about. Every allocation system produces a residual. The question is whether you admit it exists.
What Changed
The mechanical output of the engagement was a new allocation methodology that moved about $1.1M of the $3.4M unallocated bucket onto teams that had previously been under-allocated, and moved about $1.1M off teams that had previously been over-allocated. The net change to the total infrastructure bill was zero, which is the point of an allocation exercise. The allocation does not change the cost. It changes who gets asked to justify the cost.
The behavioral output was more interesting. Within six months, three things had happened.
The trading platform team, newly accountable for $487K of colo cost instead of $267K, identified 14 bare-metal servers that had been provisioned for a now-defunct project and were running idle in a high-density rack. Those servers had been visible in the CMDB for over a year. Nobody had had a reason to look at them until the team started getting a bill for the rack they lived in.
The data science team, now seeing their storage allocation decomposed into capacity and performance, started a project to move cold datasets off tier-one storage where they had been living for convenience. The project produced an annualized run-rate savings of $180K within the storage bucket, about 17% of the bucket total.
The platform team, which runs the shared infrastructure, started receiving pushback from business units about specific cost drivers for the first time in the company's history. This was initially uncomfortable and eventually useful, because the pushback revealed that several "platform" costs were actually being driven by the needs of one or two specific teams, and pulling those costs out of the shared bucket and allocating them directly to the driving team produced a more honest picture of what the platform actually cost to run for the rest of the business.
None of this would have happened under the headcount allocation. The headcount allocation produces a number that teams cannot act on, because the number does not respond to their actions. A team that cuts compute usage in half under a headcount allocation sees their allocated infrastructure cost change by exactly zero. Allocation methods that do not create action are allocation methods that do not matter.
What Would Be Done Differently
Three things, in descending order of importance.
The storage decomposition should have come first, not last. The capacity-plus-performance storage model produced the most behavior change and the most savings of any of the three buckets, because storage was the place where the old allocation was most egregiously wrong. If I were doing this again, I would start with storage, because the political capital from a successful first bucket funds the harder conversations about the buckets where the method is murkier.
The CMDB cleanup should have been explicit up front. We discovered, about six weeks in, that the CMDB's team attribution data was 75% reliable, which meant a quarter of the servers were either unassigned or assigned to teams that no longer existed. We ended up running a parallel CMDB cleanup project to fix the gaps, which took longer than expected and delayed the final allocation numbers by about five weeks. If I were doing this again, I would budget the cleanup into the engagement plan explicitly, rather than discovering it mid-project. The CMDB is the foundation of every non-tagged allocation method, and its accuracy is not usually what the team thinks it is.
The monthly review cadence should have been set up on day one. We produced a beautiful one-time allocation report at the end of the engagement and then, because the monthly review process had not been built into the engagement scope, the report did not get reproduced for four months. By the time it came back, the numbers had drifted enough that the teams had lost the thread, and the second report generated more questions than it answered. The lesson is that the allocation methodology is a product, not a project, and the monthly regeneration of the numbers is the minimum viable version of that product. Without it, the entire effort is a one-time forensic exercise rather than an ongoing management system.
The Underlying Lesson
The industry has spent a lot of effort on the idea that perfect tagging produces perfect allocation, and on the corollary that private cloud environments are bad at FinOps because they cannot achieve perfect tagging. Both ideas are wrong in a way that is worth naming.
Perfect tagging does not produce perfect allocation, because a lot of infrastructure cost is genuinely shared in ways that tags cannot describe. A power strip feeds multiple servers. A network link carries multiple tenants. A storage controller serves multiple volumes. The shared-ness is physical, not metadata-level, and no tagging taxonomy will change it. What tagging produces is direct allocation, which is the easy case. The hard case, in any environment, is the allocation of the shared substrate, and the hard case does not have a clean answer in hyperscaler environments either. It has an answer that the hyperscaler has made for you, which you have chosen not to look at.
Private cloud environments are not bad at FinOps. They are honest about the allocation problem, in a way that cloud-native environments are not, because the shared substrate is too visible to pretend away. The methods above work in private cloud because they have to, and the same methods apply, with minor modifications, to the shared-services layer of every cloud-native environment: the load balancers, the egress, the logging infrastructure, the shared databases. The discipline is the same. The difference is that in private cloud, you have to do the work to even know what the allocation problem is, whereas in the cloud, the work gets hidden inside the bill and only shows up when somebody asks a sharp question.
This field note is the worked example. The method is not elegant. The answer is not perfect. The numbers are defensible, reproducible, and actionable, which are the three properties that actually matter. Perfect is for dashboards. Defensible is for meetings where decisions get made.
Field Note in a Sentence
Allocating private-cloud overhead is not about achieving perfect tagging; it is about decomposing the shared bill into buckets with different cost dynamics (physical, network, storage), applying allocation methods that use signals you already have (CMDB, PDU telemetry, storage telemetry, NetFlow at strategic points), naming the simplifications out loud, and treating the allocation report as an ongoing monthly product rather than a one-time forensic exercise.