Navigating the Complexities of Operating Large-Scale Kubernetes Environments - 1

July 13, 2022

Sayandeb Saha
NetApp

As containers become the default choice for developing and distributing modern applications and Kubernetes (k8s) the de-facto platform for deploying, running, and scaling such applications, enterprises need to scale their Kubernetes environments rapidly to keep up. However, rapidly scaling Kubernetes environments can be challenging and create complexities that may be hard for you to address and difficult to resolve without a clear strategy. This blog specifies a few common techniques that you can use to navigate the complexities of managing scaled-out Kubernetes environments.

Operating Clusters as Fleets

Most scaled-out Kubernetes environments contain hundreds, if not thousands, of clusters because Kubernetes at its core is also a cluster commoditization technology, making it extremely easy to create, run, and scale clusters.

Consequently, many large Kubernetes environments experience cluster sprawls. Operating these clusters as a fleet of compute clusters on which you apply consistent configuration, security, governance, and other policies so that they are easy to manage, monitor, upgrade, and migrate is a best practice.

Also, reduce the blast radius of your fleets (of K8s clusters) by isolating them in different geographies/public cloud regions so that a failure of one fleet because of a service impacting problem does not impact others, resulting in cascading failures, which could be catastrophic. Commercial software tools are available that can help with such tasks.

Auto Scaling Infrastructure

Large Kubernetes environments need highly elastic infrastructure to provide compute, storage, and networking resources, which is consumed on-demand to keep the environment humming. Kubernetes clusters scale up and down automatically to support application needs. Resource-constrained clusters can impact the availability of a service provided by the application implementing the service. Over-provisioning is always an option, but it's expensive to do so.

In public clouds, auto-scaling infrastructure is easier to realize if you watch the costs and instrument cost optimization tools to manage your costs. On-premises, it's much harder to build a true auto-scaling infrastructure. It means the ability to provision and (potentially de-provision) thousands of virtual or bare metal worker nodes, terabytes of storage, and networking resources in minutes to keep up with the dynamic nature of Kubernetes workloads. To mitigate the auto-scaling requirement for large Kubernetes deployments, you may want to adopt a "Namespace-as-a-Service" operating model described in the next section, which has many advantages.

Namespace-as-a-Service Operating Model

As enterprises grapple with the many challenges of managing and maintaining large-scale Kubernetes estates, they adopted an operating model called ‘Namespace-as-a-Service" for managing such environments. In the "Namespace-as-a-Service" operating model, you use a small number of very large Kubernetes clusters. You then onboard application teams on the clusters, allocate one or more namespaces (virtual clusters) for application teams based on their needs, add worker nodes as needed, and add storage and other cluster resources. You can then use role-based access control (RBAC), network policies, and ResourceQuotas at the namespace level to limit and share the consumption of aggregate resources available in the cluster in a multi-tenant environment securely.

As new application teams or applications from existing teams need cluster real estate, this process is repeated to achieve controlled scaling of your Kubernetes estate that is easier to manage and maintain. This operating model mitigates cluster sprawl and enables policy-based control over resource consumption.

Well-Architected Horizontally-Scaled Apps

Architecting the apps that run on Kubernetes properly also goes a long way towards scaling your Kubernetes environment. With Kubernetes, it is essential to design applications that scale horizontally so that it is easier to scale your Kubernetes environment as your applications scale. This design pattern is distinct from vertical scaling, where resources (CPU, memory, disk I/O) are allocated to a single application stack, which can hit limits making the environment unstable.

Ideally, Kubernetes applications should be implemented by using a set of microservices, which communicate with each other using an API. This is distinct from traditional monolithic applications, where subsystems of an application communicate with each other using internal mechanisms. Your developers can leverage Kubernetes to optimize the placement of the microservices on node(s) that are right sized to handle the resource requirements of the microservices. Designing your applications in this manner allows for offloading the complexity of managing these apps to the operational realm where Kubernetes can manage them for you.

Go to: Navigating the Complexities of Operating Large-Scale Kubernetes Environments - 2

Sayandeb Saha is Sr. Director, Product Management, at NetApp

Industry News

GitLab 18 Released

May 15, 2025

GitLab announced the launch of GitLab 18, including AI capabilities natively integrated into the platform and major new innovations across core DevOps, and security and compliance workflows that are available now, with further enhancements planned throughout the year.

Perforce Partners with Siemens

May 15, 2025

Perforce Software is partnering with Siemens Digital Industries Software to transform how smart, connected products are designed and developed.

Reply Launches Silicon Shoring

May 15, 2025

Reply launched Silicon Shoring, a new software delivery model powered by Artificial Intelligence.

Rocky Linux from CIQ for AI Introduced

May 15, 2025

CIQ announced the tech preview launch of Rocky Linux from CIQ for AI (RLC-AI), an operating system engineered and optimized for artificial intelligence workloads.

Linux Foundation and OpenSSF Release Cybersecurity Skills Framework

May 14, 2025

The Linux Foundation, the nonprofit organization enabling mass innovation through open source, announced the launch of the Cybersecurity Skills Framework, a global reference guide that helps organizations identify and address critical cybersecurity competencies across a broad range of IT job families; extending beyond cybersecurity specialists.

CodeRabbit Now Available on Visual Studio Code Editor

May 14, 2025

CodeRabbit is now available on the Visual Studio Code editor.

The integration brings CodeRabbit’s AI code reviews directly into Cursor, Windsurf, and VS Code at the earliest stages of software development—inside the code editor itself—at no cost to the developers.

Chainguard Libraries for Python Introduced

May 14, 2025

Chainguard announced Chainguard Libraries for Python, an index of malware-resistant Python dependencies built securely from source on SLSA L2 infrastructure.

Sysdig Donates Stratoshark to Wireshark Foundation

May 14, 2025

Sysdig announced the donation of Stratoshark, the company’s open source cloud forensics tool, to the Wireshark Foundation.

Pega Predictable A Agents Released

May 13, 2025

Pegasystems unveiled Pega Predictable AI™ Agents that give enterprises extraordinary control and visibility as they design and deploy AI-optimized processes.

Kong Introduces Event Gateway

May 13, 2025

Kong announced the introduction of the Kong Event Gateway as a part of their unified API platform.

Azul and Moderne Announce Partnership

May 13, 2025

Azul and Moderne announced a technical partnership to help Java development teams identify, remove and refactor unused and dead code to improve productivity and dramatically accelerate modernization initiatives.

Parasoft Adds Agentic AI to SOAtest

May 13, 2025

Parasoft has added Agentic AI capabilities to SOAtest, featuring API test planning and creation.

Zerve 2.0 Released

May 13, 2025

Zerve unveiled a multi-agent system engineered specifically for enterprise-grade data and AI development.

LambdaTest Partners with MacStadium to Power AI Workloads on Apple Silicon

May 12, 2025

LambdaTest, a unified agentic AI and cloud engineering platform, has announced its partnership with MacStadium, the industry-leading private Mac cloud provider enabling enterprise macOS workloads, to accelerate its AI-native software testing by leveraging Apple Silicon.

Tricentis Expands Capability for Integrated Toolchain within RISE with SAP

May 12, 2025

Tricentis announced a new capability that injects Tricentis’ AI-driven testing intelligence into SAP’s integrated toolchain, part of RISE with SAP methodology.

DEVOPSdigest

Operating Clusters as Fleets

Auto Scaling Infrastructure

Namespace-as-a-Service Operating Model

Well-Architected Horizontally-Scaled Apps

Industry News

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

The Latest

Hot Topics

Operating Clusters as Fleets

Auto Scaling Infrastructure

Namespace-as-a-Service Operating Model

Well-Architected Horizontally-Scaled Apps

Related Links

Industry News

Search form

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

User login

The Latest

Hot Topics