Disaster Recovery Engineer

TEKsystems
$70.00 - $72.00 / hr
life insurance, sick time, 401(k), retirement plan
United States, Illinois, Riverwoods
Oct 25, 2025
Description

Note: this role sits out of Chicago, IL.

We are seeking a Lead Engineer to lead the resiliency, disaster recovery (DR), and operational continuity efforts for our mission-critical Domestic Transaction Switching Application and Diners Club International (DCI) Switch. This role requires deep technical expertise in Java, Linux, networking, and distributed systems, combined with strategic program delivery skills to coordinate multiple infrastructure, development, and operations teams. The ideal candidate has hands-on experience managing Active/Active multi-data-center architectures, low-level TCP/IP integrations, and DR orchestration within regulated financial environments.

Core Responsibilities

Application & Infrastructure Oversight
* Oversee the Domestic Transaction Switching Application, a Java-based platform running on VMs hosted in Nutanix clusters with Red Hat Linux.
* Manage all low-level TCP/IP socket communications, including connectors, listeners, and transaction routing logic.
* Coordinate with teams supporting the Diners Club International Switch, a WebSphere application, ensuring interoperability and fault-tolerant communication between domestic and international payment networks.
* Ensure high availability, scalability, and compliance of multi-data-center deployments through Active/Active/Active architecture review and validation.

Disaster Recovery (DR) Strategy & Analysis
* Own the end-to-end DR planning, testing, and documentation as outlined in Milestone 5.1 of the detailed DR plan.
* Evaluate the impact of DR events across configuration data sources, including 30+ read-only configuration files (IIN ranges, currency codes, merchant category codes, etc.) loaded into in-memory caches.
* Assess external dependencies such as the DB2 Global Database, mainframe negative files and account-level processing files, and the Oracle UI used by operations to manage client connections and routes.
* Perform criticality analysis to classify configuration dependencies (blockers, critical, non-critical) and design mitigation strategies for stale or unavailable data sources.
* Define recovery point objectives (RPO) and recovery time objectives (RTO) for all dependent systems.

Active/Active Architecture Validation
* Review and strengthen the Active/Active/Active data-center strategy for the Hydra Switching Application.
* Identify and document exceptions, such as low-volume participants operating in Active/Passive mode, and assess potential transaction impact during site failover; inventory and track remediation plans.
* Analyze inter-data-center dependencies, including the dynamic key exchange (DKE) process requiring three-way acknowledgment for encryption key rotation.
* Document functional areas that degrade or fail during partial data-center outages and propose operational mitigations.

Transaction Extracts & Event Processing
* Oversee downstream batch transaction extracts distributed to Data Warehouse, Settlement Systems, WorldPay, and regional datastores (e.g., India).
* Verify Kafka Enterprise Event Bus integrity during DR events, ensuring Active/Active message replication and recovery consistency (trust, but verify).
* Analyze downstream dependencies to validate continuity for all transaction, settlement, and compliance feeds.

Control Plane & Platform Dependencies
* Assess DR implications for control-plane components (Jenkins, GitHub, Nexus, Vault, Protegrity, Okta, etc.), which operate in Active/Passive configurations.
* Coordinate with enterprise platform teams to balance scope and minimize global outage risk during DR testing.
* Contribute to the Enterprise DR Playbook to define which components are within or excluded from DR scope.

Monitoring, Runbooks & Evidence Capture
* Maintain comprehensive monitoring coverage using Splunk (functional transaction view) and Datadog (infrastructure health).
* Develop runbooks and implementation plan templates integrating ServiceNow, Jenkins, and AutoSys workflows for deployment, validation, and rollback.
* Standardize evidence capture processes using Splunk dashboards, system logs, and console screenshots for audit and compliance reporting.

Non-Production Test Environments & Simulation
* Design and coordinate a production-like QA/Dev environment for full DR simulation testing across all dependent components.
* Execute controlled DR test events, emulating change windows and data-center failovers:
  1. Place the impacted data center into a down state.
  2. Freeze configuration and batch jobs.
  3. Redirect traffic and validate health on the remaining sites.
  4. Bring the passive site online and validate configuration/job recovery.
* Document lessons learned and integrate continuous improvement into DR planning.

Required Skills & Experience
* Strong background in Java application architecture and TCP/IP socket programming.
* Expertise with Linux (Red Hat), VM environments, and Nutanix infrastructure.
* Knowledge of multi-data-center Active/Active design patterns and high-availability systems.
* Familiarity with Kafka, DB2, Oracle, and mainframe data integration.
* Hands-on experience with Splunk, Datadog, Jenkins, ServiceNow, and AutoSys.
* Proven ability to lead technical DR exercises, coordinate multi-team execution, and present results to leadership.
* Excellent analytical, documentation, and stakeholder-communication skills.

Preferred Qualifications
* Experience in financial transaction processing, payment network systems, or card authorization platforms.
* Familiarity with WebSphere Application Server, PCI DSS, and encryption key management (DKE) processes.
* Experience developing or managing Active/Passive control-plane components in enterprise environments.
* Knowledge of site reliability engineering (SRE) principles and observability best practices.
* Bachelor's or Master's degree in Computer Science, Engineering, or related technical discipline.
Skills
linux, cloud, python, aws, kubernetes, Disaster recovery, Automation, Engineering

Top Skills Details
linux, cloud, python, aws, kubernetes

Additional Skills & Qualifications

Note: this role sits hybrid in Chicago.

High Level - Regions / Failovers
o Ansible / Puppet / Chef
  - Copy bare metal so it plays better rather than using Cloud
o AWS OpenShift
  - Automate it all
o DNS
  - Non-DNS failover strategy
  - DNS is okay for Blue/Green deployment
  - Should be able to explain why & how
o Load-balancing
o Monitoring
  - Use current AWS issues as scenario

High availability
o Disaster recovery is a property of high availability

Cloud
o AWS / Azure / GCP

Pipelines
o Jenkins, GitHub

This role sits at the core of our global payments infrastructure, ensuring that billions of transactions continue to process securely and reliably, even during adverse events. The successful candidate will help shape the resiliency architecture, automation, and DR strategy that safeguard customer trust and institutional stability across our network.

Experience Level
Expert Level

Pay and Benefits
The pay range for this position is $70.00 - $72.00/hr. Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms. If eligible, the benefits available for this temporary role may include the following:
* Medical, dental & vision
* Critical Illness, Accident, and Hospital
* 401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
* Life Insurance (Voluntary Life & AD&D for the employee and dependents)
* Short and long-term disability
* Health Spending Account (HSA)
* Transportation benefits
* Employee Assistance Program
* Time Off/Leave (PTO, Vacation or Sick Leave)

Workplace Type
This is a hybrid position in Riverwoods, IL.

Application Deadline
This position is anticipated to close on Oct 27, 2025.

About TEKsystems

We're partners in transformation.
We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.

About TEKsystems and TEKsystems Global Services

We're a leading provider of business and technology services. We accelerate business transformation for our customers. Our expertise in strategy, design, execution and operations unlocks business value through a range of solutions. We're a team of 80,000 strong, working with over 6,000 customers, including 80% of the Fortune 500 across North America, Europe and Asia, who partner with us for our scale, full-stack capabilities and speed. We're strategic thinkers and hands-on collaborators, helping customers capitalize on change and master the momentum of technology. We're building tomorrow by delivering business outcomes and making positive impacts in our global communities. TEKsystems and TEKsystems Global Services are Allegis Group companies. Learn more at TEKsystems.com.