Principal Solutions Architect (Req#1048)

San Ramon, CA

Overview


We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory-aligned infrastructure. In this highly technical, customer-facing role, you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high-performance networking, parallel storage, and the complete NVIDIA AI software stack.

You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large-scale inference fleets, and accelerating data science pipelines at scale.


Your Impact


Solution Design & Architecture

  • Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi-tenancy needs.
  • Architect full-stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
  • Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
  • Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink-Switch configurations.
  • Design RTX PRO 6000 Blackwell Server Edition deployments for inference-optimized and enterprise AI workloads.
  • Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine-tuning, and inference at scale.

Colocation & Facility Planning

  • Specify colocation requirements including critical power load (MW-scale), UPS and generator configurations, and PUE targets.
  • Design high-density GPU deployments utilizing air-cooled, direct liquid cooling (DLC), and rear-door heat exchanger configurations.
  • Define meet-me room (MMR) and cross-connect requirements; specify carrier-neutral telecom diversity strategies.
  • Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
  • Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.

GPU Compute Infrastructure

  • Architect multi-node GPU clusters optimized for large language model (LLM) pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF).
  • Size and configure DGX SuperPOD, HGX H/B-series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
  • Define server firmware, BIOS, BMC, and DGXOS baselines for production GPU infrastructure.
  • Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.

High-Performance Networking

  • Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400Gb/s and HDR 200Gb/s) for distributed training traffic.
  • Architect Spectrum-X Ethernet-based AI networking solutions for inference clusters requiring high-bandwidth, low-latency connectivity.
  • Specify ConnectX-8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
  • Integrate BlueField-3 DPUs for GPU-accelerated network functions, storage offload, zero-trust security isolation, and bare-metal provisioning.
  • Design leaf-spine and fat-tree topologies for non-blocking bisection bandwidth in GPU training clusters.
  • Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.

Parallel Storage Architecture

  • Design high-performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
  • Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
  • Architect multi-tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm object storage for model archives, and cold tape/cloud for long-term retention.
  • Configure VAST Data Universal Storage for disaggregated storage with NFS, S3, and POSIX access; tune for large sequential read performance.
  • Deploy Hammerspace Global Data Environment for distributed data management and NFS-over-RDMA acceleration across geographically dispersed GPU clusters.
  • Define data pipeline architectures ingesting from cloud object stores (S3, GCS, ABS) to local flash for GPU-local data loading without I/O bottlenecks.

AI Software Stack & Orchestration

  • Deploy and configure NVIDIA AI Enterprise (NVAIE) software stack including NVIDIA GPU Operator, NIM microservices, and RAPIDS accelerated data science libraries.
  • Architect inference serving infrastructure using NVIDIA NIM (NVIDIA Inference Microservices) for optimized LLM and vision model deployment with autoscaling.
  • Implement NVIDIA Dynamo for distributed inference and disaggregated serving of large-scale generative AI models.
  • Configure and optimize CUDA toolkit, cuDNN, NCCL communication libraries, and custom kernel environments for training workloads.
  • Deploy Base Command Manager and DGXOS for cluster lifecycle management, node provisioning, health dashboards, and job scheduling integration.
  • Integrate NVIDIA Mission Control for AI Factory operations, observability, and multi-cluster fleet management.
  • Design and deploy Kubernetes-based AI platforms using NVIDIA GPU Operator, integrating with Run:ai for dynamic GPU resource scheduling and multi-tenant workload isolation.
  • Configure SLURM workload manager for traditional HPC-style job scheduling on bare-metal GPU clusters, including preemption policies, fair-share scheduling, and burst-to-cloud integration.
  • Establish MLOps toolchain integrations with popular frameworks (PyTorch, JAX, TensorFlow) and experiment tracking platforms (MLflow, Weights & Biases).

 

Customer Engagement & Delivery

  • Serve as primary technical point of contact throughout the pre-sales and delivery lifecycle, from initial discovery through post-deployment optimization.
  • Produce and present architecture design documents, technical proposals, and executive-level briefings to CTO/CIO and VP-level stakeholders.
  • Lead proof-of-concept (POC) and pilot deployments, including benchmark design, execution, and results analysis.
  • Collaborate with procurement, logistics, and deployment teams to ensure on-time delivery of complex infrastructure programs.
  • Provide post-deployment hypercare support, performance tuning, and capacity planning advisory services.
  • Contribute to internal knowledge bases, solution playbooks, and reference architectures for repeatable AI Factory deployments.

Technology Stack

Candidates must demonstrate deep, hands-on expertise across the following technology domains:

GPU Compute

DGX B200 / B300, DGX H100 / H200, HGX B200 / B300, HGX H100 / H200, MGX platforms, GB300 NVL72 / GB200 NVL72, RTX PRO 6000 Blackwell Server Edition, NVLink Switch System, NVLink-C2C

Networking

NVIDIA Quantum InfiniBand (NDR 400G, HDR 200G), Spectrum-X Ethernet, ConnectX-8 / ConnectX-7 HCAs, BlueField-3 DPU, SHARP in-network computing, UFM Fabric Manager, RDMA / RoCEv2 / InfiniBand

Storage

VAST Data Universal Storage (NFS/S3/POSIX), Hammerspace Global Data Environment, Pure Storage FlashBlade//E (Evergreen//One), NFS-over-RDMA, parallel file systems (Lustre, GPFS/WEKA), S3-compatible object storage

AI Software

NVIDIA AI Enterprise (NVAIE), NIM Microservices, RAPIDS (cuDF, cuML, cuGraph), NVIDIA Dynamo, CUDA Toolkit, cuDNN, NCCL, TensorRT, Triton Inference Server

Cluster Mgmt

Base Command Manager, DGXOS, NVIDIA Mission Control, DGX Cloud, UFM, IPMI / Redfish BMC management

Orchestration

Kubernetes (K8s), NVIDIA GPU Operator, Run:ai GPU scheduling, SLURM, OpenMPI, Helm, Argo Workflows, Kubeflow, KServe

Colocation

Critical power design (kW – MW), UPS / generator, CRAC / CRAH / DLC / immersion cooling, hot-aisle containment, PUE optimization, carrier-neutral telecom, cross-connects, MMR design

Frameworks

PyTorch, JAX, TensorFlow, Hugging Face Transformers, DeepSpeed, Megatron-LM, vLLM, LMDeploy


Qualifications


  • Bachelor's degree in Computer Science, Electrical Engineering, Computer Engineering, or a related technical discipline; Master's degree preferred.
  • 8+ years of solutions architecture, systems engineering, or technical pre-sales experience, with at least 4 years focused on GPU infrastructure or HPC environments.
  • Proven track record designing and deploying NVIDIA DGX or HGX-based GPU clusters in production AI/ML environments.
  • Deep understanding of distributed deep learning concepts: tensor parallelism, pipeline parallelism, data parallelism, gradient checkpointing, and mixed-precision training.
  • Hands-on experience with InfiniBand or high-speed Ethernet fabric design, RDMA configuration, and collective communication tuning (NCCL, MPI).
  • Direct experience sizing and deploying parallel storage systems (VAST, Hammerspace, or Lustre/WEKA/GPFS) for AI training workloads.
  • Strong working knowledge of Kubernetes, GPU Operator, and at least one GPU workload scheduler (Run:ai or SLURM).
  • Experience with Linux system administration, CUDA development environment configuration, and GPU driver/firmware management.
  • Demonstrated ability to create compelling technical proposals, architecture diagrams (Visio/Lucidchart/draw.io), and BOM-level documentation.
  • Exceptional communication skills with proven ability to present to both deep technical audiences and C-level executives.

Preferred Qualifications:

  • NVIDIA-certified professional credentials (DCA-Core, NCP-DS, or equivalent).
  • Experience with NVIDIA Base Command Platform or Mission Control for multi-cluster AI Factory operations.
  • Familiarity with sovereign AI, government cloud, or regulated industry AI infrastructure requirements.
  • Experience integrating AI Factory infrastructure with public cloud (AWS, Azure, GCP) for hybrid and burst-to-cloud architectures.
  • Background in MLOps, LLMOps, or platform engineering for production AI model lifecycle management.
  • Prior experience with colocation data center procurement, RFP development, and SLA negotiation.
  • Contributions to open-source AI infrastructure projects or published technical content (blogs, whitepapers, conference presentations).
  • Active participation in the NVIDIA Partner Network (NPN) ecosystem or prior experience at an NVIDIA Elite Solution Provider.

Core Competencies

Technical Depth

End-to-end AI infrastructure expertise from silicon to software; ability to go deep on any layer of the stack.

Systems Thinking

Ability to reason holistically about performance, reliability, power, cost, and operability trade-offs across complex integrated systems.

Customer Obsession

Relentless focus on understanding customer AI objectives and delivering solutions that accelerate time-to-value.

Executive Presence

Confidence and clarity when presenting complex technical architectures to senior business and technology leaders.

Analytical Rigor

Data-driven approach to workload sizing, performance modeling, and TCO analysis with attention to detail.

Collaborative Leadership

Ability to lead cross-functional pursuit teams, align internal stakeholders, and orchestrate complex delivery programs.


Position Specifics


The initial base salary range for this position is expected to be between $170,000 and $190,000 annually. The final base salary offered will be determined by multiple factors, including, but not limited to, job-related knowledge, depth of experience, skills, certifications, and geographic location. In addition to the base salary, our compensation structure may include other components such as commissions and discretionary bonuses.

ePlus offers a full range of medical, financial, and/or other benefits (including 401(k) eligibility, employee stock purchase program and various paid time off benefits, such as vacation, sick time, and personal leave), dependent on the position offered. Details of participation in these benefit plans will be provided if an offer of employment is extended. 

If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.


Who We Are

At ePlus, we believe technology is a people business. Our team is passionate, skilled, and driven to deliver solutions that make a real difference. Join us and be part of a culture that values collaboration, innovation, and extraordinary results.

Corporate Values

  • Respectful communication and cooperation: We prioritize respectful communication, fostering an environment where everyone is treated with dignity and respect.
  • Teamwork and employee participation: Collaboration and teamwork thrive through diverse perspectives, both within our teams and in our interactions with our customers.
  • Work/life balance that supports our employees’ varying needs: We value the well-being of our employees, recognizing that a healthy work-life balance is pivotal to our collective success.
  • Embracing communities: We embrace and support the communities that nurture us. Our employees' dedication to fostering positive change is a source of immense pride for us.

Commitment to Diversity, Inclusion and Belonging

  • We are an equal opportunity employer that does not discriminate or allow discrimination based on race, color, religion, sex, sexual orientation, gender identity, age, national origin, citizenship, disability, veteran status, or any other classification protected by federal, state, or local law.
  • ePlus is dedicated to fostering, cultivating, and preserving a culture that represents diversity, enables inclusion, and makes our employees feel comfortable bringing their full, unique selves to work. 

Physical Requirements

  • While performing this role, you will engage in both seated and occasional standing or walking activities. We provide reasonable accommodations, in accordance with relevant laws, to support success in this position.
  • By embracing our values, you will contribute to our collective mission of making a positive impact within our organization and the broader community. We understand that this job description serves as a guide and is not an employment contract.

Notice to Recruiting Agencies: ePlus only accepts unsolicited resumes when presented directly by a candidate. Unsolicited resumes submitted to ePlus from any other source will be considered ePlus property and will not qualify for any placement or referral fees. ePlus will only pay such fees in connection with a valid written agreement between ePlus and the referring agency, and then only after providing advance written approval to the referring agency to submit resumes in connection with a particular opportunity.
