World Wide Technology

HPC Engineer - Storage

Posted 8 Days Ago
Remote
Hiring Remotely in IND
Entry level
The HPC Engineer - Storage focuses on deploying high-performance storage systems, managing configurations, automating installations, and maintaining I/O performance benchmarks in a cluster environment.
The summary above was generated by AI
Job Summary & Responsibilities

Technical Competencies

Essential Skills

High-Performance Storage:

  • Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN EXAScaler (Lustre), or IBM GPFS (Spectrum Scale).
  • Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.
  • RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.

Automation & Containerisation:

  • Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
  • Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
  • GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify if GDS is active.

Desirable Experience

  • Vendor Specifics: Certification in, or in-depth experience with, Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
  • Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
  • Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.

Certifications

Highly Desirable:

  • NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
  • Vendor Certifications:
    • VAST Certified Administrator (VCP-AD1)
    • WEKA Technical Xpert Certification
  • Red Hat Certified Specialist in Storage Administration

Success Metrics (KPIs)

  • I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
  • Mount Stability: Zero "Stale File Handles" or disconnected mounts across the cluster during the 72-hour burn-in period.
  • Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.

Preferred Qualifications

1. Storage Integration & Client Configuration

  • Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
  • Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) to bypass the CPU and deliver data directly to GPU memory.
  • Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
  • Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.

2. Validation & Performance Benchmarking

  • Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400 GB/s read throughput).
  • Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for small-file random I/O patterns common in checkpointing.
  • Acceptance Reporting: Generate "As-Built" storage validation reports, documenting effective throughput and IOPS for client sign-off.

3. Operations & Support

  • Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent "Disk Full" outages on critical scratch filesystems.
  • Ticket Resolution: Handle L2 support tickets for storage issues, such as "Stale file handles," "Slow dataset loading," or "CSI Driver crashes."
  • Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.
