Tiered Compute Architecture

Compute Server Config:

● CPU: 5th Gen Intel® Xeon® Silver 4514Y (16 Cores, 32 Threads @ 2.00 GHz)
● RAM: 128 GB DDR5 4800 MHz ECC Registered
● Storage: 2 x 4 TB Samsung 990 Pro M.2 NVMe Gen 4 SSDs (8 TB total)

Other Key Features

● Motherboard: ASRock Rack SPC741D8-2L2T/BCM with PCIe 5.0 support
● Networking: Dual 10GbE and dual 1GbE LAN ports
●​ Management: Integrated IPMI for remote management
●​ Chassis: 1U Rack Mount with 4 hot-swap bays
●​ Power Supply: Dual 650W (80+ Platinum) redundant power supplies
●​ Warranty Options: 3-Year Return-to-Base (RTB) on all components / 3 Years Onsite

NAS Server Config:

● CPU: 5th Gen Intel® Xeon® Silver 4509Y (8 Cores, 16 Threads @ 2.60 GHz)
● RAM: 32 GB DDR5 4800 MHz ECC Registered
● Storage:
  ○ OS Drive: 1 x 500 GB Crucial P3 Plus M.2 NVMe SSD
  ○ Data Drives: 6 x 16 TB NL-SAS 7.2K RPM Enterprise HDDs

Other Key Features

● Motherboard: ASRock Rack SPC741D8-2L2T/BCM with PCIe 5.0 support
● Networking: Dual 10GbE and dual 1GbE LAN ports
●​ Management: Integrated IPMI for remote management
●​ Chassis: Chenbro RM23808-800R 2U with 8 hot-swap drive bays
●​ Power Supply: Dual 800W (80+ Platinum) redundant power supplies
●​ Cooling: Hot-swap cooling fans
●​ Warranty: 3-Year Return-to-Base (RTB)
Core Architecture: Clustered & Resilient

System Architecture Plan: 5 Compute Nodes & 4 Storage Nodes

The goal is to move beyond individual servers and create two robust clusters: a
Compute Cluster for running virtual machines and a Distributed Storage
Cluster for providing a single, unified storage pool. This design ensures high
availability, scalability, and simplified management.

Tiered Compute Architecture

We will logically divide our five compute servers into two separate groups, or
"clusters," which will share the same underlying network and storage
infrastructure.

● Compute Cluster A: Mission-Critical HA Cluster (2 Nodes)
● Compute Cluster B: General Purpose & Dev/Test Cluster (3 Nodes)

Mission-Critical HA Cluster (2 Nodes)

This cluster will be engineered for maximum uptime to host our most essential
applications, where downtime is not an option.

● Configuration: The two compute servers will be configured in a cluster
using a hypervisor like Proxmox VE or VMware ESXi, with the High
Availability feature enabled.
●​ How it Works: The VMs for our critical apps will run on these two nodes.
The cluster will constantly monitor the health of both servers.
●​ Failure Scenario: If one of the two servers fails, the HA system will
automatically and immediately restart its virtual machines on the
second, healthy server. This ensures service continuity with minimal
disruption.
● Resource Planning: To guarantee a successful failover, we should run
this cluster at less than 50% of its total resource capacity. This ensures
that if one node goes down, the surviving node has enough free CPU and
RAM to run the critical VMs from both servers combined (a quick sizing
check is sketched below).
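
To make the 50% rule concrete, here is a minimal Python sizing check. The
per-node figures come from the compute spec above; the VM inventory is a
purely illustrative assumption, not an actual workload list.

```python
# Minimal sketch of the HA sizing rule: the combined mission-critical VM load
# must fit on ONE node (16 cores / 128 GB RAM per the compute spec above).
NODE_CORES = 16
NODE_RAM_GB = 128

# Hypothetical inventory of the critical VMs spread across the two HA nodes.
critical_vms = [
    {"name": "erp-db",      "vcpus": 8, "ram_gb": 48},
    {"name": "file-server", "vcpus": 4, "ram_gb": 16},
    {"name": "auth",        "vcpus": 2, "ram_gb": 8},
]

total_vcpus = sum(vm["vcpus"] for vm in critical_vms)
total_ram_gb = sum(vm["ram_gb"] for vm in critical_vms)

print(f"Combined critical load: {total_vcpus} vCPUs, {total_ram_gb} GB RAM")
if total_vcpus <= NODE_CORES and total_ram_gb <= NODE_RAM_GB:
    print("OK: a single surviving node can absorb a full failover.")
else:
    print("Over-committed: trim the critical VM set or add capacity.")
```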

General Purpose Cluster (3 Nodes)

The remaining three compute servers will form a larger, separate resource pool
for all other tasks.

●​ Purpose: This cluster is ideal for workloads that are not business-critical.
This includes:
○​ Development and testing environments.
○​ Staging servers.
○​ Continuous Integration/Continuous Deployment (CI/CD) runners.
○​ Applications that can tolerate brief periods of downtime.
●​ Benefit of Isolation: This structure is key to our stability. A risky software
test or a resource-intensive development task running on this
general-purpose cluster cannot crash or impact the performance of our
mission-critical applications running on the separate HA cluster.
● Flexibility: With three nodes, we have a larger pool of resources to
experiment and work with, without putting our core services at risk. We can
still cluster them for easier management, but we may not require the same
aggressive HA settings.

Storage Cluster Technology Options (4 Nodes)

Our four storage servers, each equipped with six 16 TB drives, create a total raw
capacity of 384 TB. The technology we choose will determine the final usable
space and management overhead. Below are the three options for consideration;
a short calculation after them reproduces each option's usable-capacity figure.

Option 1: Ceph Cluster

This is a true distributed system where all 24 drives across all four nodes
form a single, self-healing storage pool.

● Management Complexity: Very High. Ceph is incredibly powerful but has
a steep learning curve, requiring significant command-line expertise.
●​ Usable Storage Calculation: Using the standard 3x replication for
resiliency.
○​ 384 TB Raw / 3 = ~128 TB Usable
●​ Pros: Extreme resiliency (can survive an entire server failure), unified
storage (block, file, and object), and massive scalability.
●​ Cons: Highest management overhead and lowest storage efficiency.

Option 2: TrueNAS SCALE (4 Separate Pools)

In this model, we would run an independent TrueNAS instance on each node,
managing its own RAID-Z2 array.

● Management Complexity: Medium. TrueNAS has a polished web interface
and is far simpler than Ceph, offering a great balance of power and
usability.
●​ Usable Storage Calculation: Using RAID-Z2 (two-drive parity) on each
node.
○​ Per Node: (6 drives - 2 parity) x 16 TB = 64 TB
○​ Total Usable: 64 TB/node x 4 nodes = 256 TB Usable
●​ Pros: Excellent data integrity (ZFS), great storage efficiency, and a robust,
mature platform.
●​ Cons: Results in four separate storage pools to manage, not a single
unified namespace.

Option 3: OpenMediaVault (OMV) (4 Separate Pools)

Similar to the TrueNAS approach, this option prioritises simplicity above
all else.

● Management Complexity: Very Easy. OpenMediaVault is the easiest of the
three to manage, with a clean and straightforward web interface.
●​ Usable Storage Calculation: Using standard Linux RAID 6 on each node.
○​ Per Node: (6 drives - 2 parity) x 16 TB = 64 TB
○​ Total Usable: 64 TB/node x 4 nodes = 256 TB Usable
●​ Pros: Extremely simple to set up, lightweight, and perfect for basic, reliable
network storage.
●​ Cons: Lacks the advanced features and data integrity checksumming of
ZFS found in TrueNAS.
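
The usable-capacity figures quoted for the three options come from
straightforward arithmetic, reproduced in the short Python sketch below. It
uses the nominal drive sizes from the spec and ignores filesystem overhead and
TB/TiB conversion, so the outputs are the same rough numbers used above.

```python
# Minimal sketch reproducing the usable-capacity figures quoted above for the
# 4-node, 6 x 16 TB layout (nominal TB, overheads ignored).
NODES, DRIVES_PER_NODE, DRIVE_TB = 4, 6, 16
raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB               # 384 TB raw

ceph_usable   = raw_tb / 3                                 # Option 1: 3x replication
raidz2_usable = NODES * (DRIVES_PER_NODE - 2) * DRIVE_TB   # Option 2: RAID-Z2 per node
raid6_usable  = NODES * (DRIVES_PER_NODE - 2) * DRIVE_TB   # Option 3: RAID 6 per node

print(f"Raw capacity:       {raw_tb} TB")
print(f"Ceph (3x replicas): {ceph_usable:.0f} TB usable, single pool")
print(f"TrueNAS RAID-Z2:    {raidz2_usable} TB usable across 4 separate pools")
print(f"OMV RAID 6:         {raid6_usable} TB usable across 4 separate pools")
```
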
Operational Usage Model & Data Strategy

Our architecture is built on a powerful tiered-storage principle. We will
leverage the distinct advantages of both the local NVMe SSDs in our compute
nodes for speed and the central storage cluster for capacity and data
resilience. This strategy ensures optimal performance for active workloads
while providing a robust backend for bulk data and protection.

High-Performance Tier: Local Compute Storage

Each of our five compute nodes is equipped with 8 TB of extremely fast Gen 4
NVMe storage. This tier is dedicated to workloads where speed and low latency
are critical (a simple latency probe contrasting the two storage tiers is
sketched after the list below).

●​ VM Operating Systems: The primary disk for every virtual machine (the
C:\ drive or / root partition) will be hosted on this local NVMe storage. This
ensures VMs boot quickly and feel responsive.
●​ High-I/O Applications: This is the ideal location for workloads that
perform intensive read/write operations, such as:
○​ Databases (SQL, PostgreSQL, etc.)
○​ Application Caches and temporary files.
○​ CI/CD build directories for fast code compilation.
●​ Mission-Critical VMs: For the 2-node HA cluster, running the OS disks of
our critical applications locally guarantees the highest possible
performance.
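
As a rough way to see the latency gap that motivates this placement, the
sketch below times synchronous 4 KiB writes against two directories. Both
paths are placeholders for wherever the local NVMe pool and the central
cluster eventually get mounted; they are assumptions, not paths defined in
this plan.

```python
# Minimal sketch: compare fsync latency of a local NVMe path vs. a mount
# backed by the central storage cluster. The paths below are placeholders.
import os
import time

def fsync_latency_ms(path, block=4096, iterations=200):
    """Write and fsync small blocks, returning average latency in milliseconds."""
    buf = os.urandom(block)
    test_file = os.path.join(path, "latency_probe.tmp")
    with open(test_file, "wb") as f:
        start = time.perf_counter()
        for _ in range(iterations):
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    os.remove(test_file)
    return elapsed / iterations * 1000

if __name__ == "__main__":
    targets = [
        ("local NVMe",      "/var/lib/vz"),      # assumed local NVMe-backed path
        ("central cluster", "/mnt/pve/central"),  # assumed NFS mount of the cluster
    ]
    for label, path in targets:
        print(f"{label:16s} {fsync_latency_ms(path):7.2f} ms per 4 KiB fsync")
```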

Capacity Tier: Central Storage Cluster

Our 4-node storage cluster provides a massive 256 TB usable pool. This tier
serves as the workhorse for bulk data, backups, and centralised resources. It will
be connected to all compute nodes over our redundant 10GbE network (a sketch
of how this pool could be registered on the compute nodes follows the list
below).

●​ Additional VM Data Disks: While the VM's OS runs on fast local storage,
any large data volumes (like a D:\ drive for files or a /data mount) will be
provisioned from the central cluster. This gives our VMs access to huge
amounts of storage without consuming the premium local NVMe space.
●​ Centralised Backups: The storage cluster is the primary target for all our
data protection tasks. We will configure backup software (like Proxmox
Backup Server or Veeam) to store nightly snapshots of every VM in the
entire lab. The high resilience of the storage nodes makes it a perfect
repository for our critical backups.
●​ ISO Library and VM Templates: All operating system installation files
(ISOs) and master VM templates will be stored here. This allows us to
deploy new VMs to any compute node instantly from a single, centralised
source.
●​ General Office File Shares: We can create standard SMB/NFS shares on
the storage cluster for company-wide use, such as storing project files,
archives, and other shared documents.
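
If the compute nodes run Proxmox VE, the central pool could be exported over
NFS on the storage VLAN and registered once as cluster-wide shared storage.
The sketch below uses the proxmoxer Python client as one possible way to do
that; the hostname, credentials, server address, export path, and storage ID
are illustrative assumptions rather than values fixed by this document.

```python
# Hedged sketch, assuming Proxmox VE on the compute nodes and the proxmoxer
# client library; hostname, credentials, server IP and export path are
# placeholders, not values taken from this document.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve-compute-01.lab.local", user="root@pam",
                 password="changeme", verify_ssl=False)

# Register the storage cluster's NFS export once; because storage definitions
# are cluster-wide in Proxmox, every compute node then sees the same pool.
pve.storage.post(
    storage="central-nfs",               # storage ID shown in the Proxmox UI
    type="nfs",
    server="192.168.30.10",              # storage-VLAN address of the NAS head
    export="/mnt/pool0/pve",             # NFS export on the storage cluster
    content="images,iso,vztmpl,backup",  # data disks, ISOs, templates, backups
)
```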

A Practical Workflow Example

Here’s how we would deploy a new critical file server VM (a scripted version
of these steps is sketched after the list):

1. Deployment: We deploy the VM from a master template stored on the central
storage cluster onto one of the mission-critical compute nodes.
2.​ Disk Placement:
○​ The VM's OS disk (100 GB) is created on the compute node's local
NVMe storage for speed.
○​ A data disk (20 TB) for the file share is created on the central
storage cluster and attached to the VM.
3.​ Operation: The VM runs with a snappy, responsive OS while serving a
huge amount of data from the resilient storage backend.
4.​ Backup: A nightly backup job takes a snapshot of the entire VM and
stores it safely on the central storage cluster.
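
Assuming Proxmox VE as the hypervisor, the same steps could be scripted with
the proxmoxer client roughly as follows. Node names, VM IDs, and storage IDs
are placeholders, and the nightly backup job itself would be configured
separately (for example in Proxmox Backup Server), so it appears here only as
a comment.

```python
# Hedged sketch of the workflow above using the proxmoxer client, assuming
# Proxmox VE; node names, VM IDs and storage IDs are placeholders.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve-compute-01.lab.local", user="root@pam",
                 password="changeme", verify_ssl=False)

node, template_id, new_id = "pve-compute-01", 9000, 101

# 1. Deployment: full-clone the master template onto an HA compute node,
#    placing the cloned OS disk on the node's local NVMe-backed storage.
pve.nodes(node).qemu(template_id).clone.post(
    newid=new_id, name="fileserver-01", full=1, storage="local-nvme")

# 2. Disk placement: attach a data disk allocated from the central storage
#    cluster (size in the storage:size syntax is in GiB; 20480 GiB is roughly
#    the 20 TB share in the example above).
pve.nodes(node).qemu(new_id).config.post(scsi1="central-nfs:20480")

# 3. Operation: start the VM.
pve.nodes(node).qemu(new_id).status.start.post()

# 4. Backup: the nightly vzdump / Proxmox Backup Server job targeting the
#    central cluster is configured separately in the datacenter backup settings.
```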

Network Architecture & Configuration

To support our 9-node cluster, we will implement a fast, secure, and resilient
network foundation. The design prioritises performance through 10GbE
connectivity and security through network segmentation using VLANs.

Core Switching Hardware

We will build the network around a central, high-performance switching fabric.

● Recommendation: A Layer 3 Lite or full Layer 3 switch is recommended.
This allows the switch to handle routing between our internal VLANs
directly, which is more efficient than sending that traffic to the
office's core router.
●​ Port Requirements: The switch (or switches) must have at least 18 x
10GbE SFP+ ports to connect all nine servers with redundant links. A
24-port model would be a good choice, providing room for future
expansion.
●​ Vendor Options: We can source this from a variety of reliable enterprise
vendors such as Cisco, Aruba, MikroTik, or Ubiquiti.
●​ Uplink: The switch fabric will connect to the main office core router via a
10GbE SFP+ or faster link to ensure a high-speed connection for the
services that require it.

VLAN & IP Address Strategy

We will segment the network into separate Virtual LANs (VLANs) to enhance
security and organise traffic. The plan is summarised in the table and notes
below, and a short script at the end of this section sanity-checks the
addressing plan.

| VLAN ID | Name | Purpose | Example Subnet | Internet Access |
|---------|------|---------|----------------|-----------------|
| 10 | MGMT_VLAN | For accessing server IPMI and the Proxmox/hypervisor management interfaces. | 192.168.10.0/24 | No |
| 20 | VM_TRAFFIC_VLAN | The main network for all general purpose Virtual Machines. | 192.168.20.0/24 | Yes |
| 30 | STORAGE_VLAN | (Recommended) A dedicated, isolated network for storage traffic between compute and storage nodes (NFS/iSCSI). | 192.168.30.0/24 | No |

● Management VLAN (10): This network is strictly for internal
administration. Firewall rules will be applied to block all internet
access to and from this VLAN, securing our core infrastructure.
●​ VM Traffic VLAN (20): This will be the default network for our VMs,
allowing them to communicate with each other and, where permitted,
access the internet through the office's core router.
●​ Storage VLAN (30): It is a best practice to create a dedicated VLAN for
storage traffic. This isolates the heavy, performance-sensitive traffic,
preventing it from interfering with management or VM traffic. We can also
enable optimisations like Jumbo Frames on this VLAN to improve
throughput.
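
The VLAN plan can also be captured as data and sanity-checked
programmatically. The sketch below is a minimal example using only the Python
standard library; the subnets mirror the table above and the overlap check is
simply a safety net before the plan is pushed to the switch.

```python
# Minimal sketch of the VLAN plan above as data, using the standard-library
# ipaddress module to sanity-check the subnets and print a summary.
import ipaddress

vlans = [
    {"id": 10, "name": "MGMT_VLAN",       "subnet": "192.168.10.0/24", "internet": False},
    {"id": 20, "name": "VM_TRAFFIC_VLAN", "subnet": "192.168.20.0/24", "internet": True},
    {"id": 30, "name": "STORAGE_VLAN",    "subnet": "192.168.30.0/24", "internet": False},
]

for v in vlans:
    net = ipaddress.ip_network(v["subnet"])
    print(f"VLAN {v['id']:>3} {v['name']:<16} {net}  "
          f"{net.num_addresses - 2} usable hosts  "
          f"internet={'yes' if v['internet'] else 'no'}")

# Quick overlap check between the three subnets.
nets = [ipaddress.ip_network(v["subnet"]) for v in vlans]
assert not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:]), \
    "VLAN subnets overlap"
```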
