Job Details

Site Reliability Engineer III - Operations

Position Number: 6978964
Location: Orem, UT
Position Type: Computing - Network/System Administration

Salary: $66,920.00 - $78,729.00 Annually
Job Type: FT Exempt Salaried Staff
Job Number: FY2605592
Closing: 3/25/2026 11:59 PM Mountain
Location: DX Building
Division: VP Digital Transformation/CIO

Position Announcement

Utah Valley University is seeking a Site Reliability Engineer to join our IT team and play a key role in supporting our digital transformation initiatives. In this role, you will design, implement, and maintain reliable software systems and scalable automated solutions that improve IT operations, workflows, and user experiences. You will ensure the stability, security, and performance of systems and applications, manage updates and configurations, and develop processes and tools that increase efficiency and minimize downtime.

This position offers the opportunity to apply your expertise in system engineering, administration, and IT operations while collaborating with department leadership to plan and execute updates, maintain high-performing services, and respond to operational issues. You will also contribute to monitoring, alerting, and documentation efforts, helping to ensure the organization's critical systems run reliably and efficiently. If you enjoy problem-solving, improving processes, and building systems that impact users across an organization, this role offers a dynamic and rewarding environment to grow your skills.

Summary of Responsibilities

Performs day-to-day administration, maintenance, upgrades, and operation of existing and recently developed systems, including virtualization infrastructure (on-premises and cloud), Microsoft Windows system administration, and Linux operating systems, as well as application systems and technologies. Ensures standard operating procedures, runbooks, disaster recovery, and service catalog definitions are matured and ready for production. Is responsible for the lifecycle of product set maintenance, availability, reliability, and performance reporting to decision makers and developers. Performs tasks, as needed, to augment the work of system engineers and ensure timely achievement of project plans and goals. Advocates for, contributes to, recommends, and facilitates ever-improving standards and best practices through successful adoption of change within UVU's Digital Transformation department.
Ensures that the underlying infrastructure is running smoothly and that systems and tools are working as expected. Analyzes the day-to-day functions and processes of systems and network management software to ensure they are performing within predetermined specifications. In support of core systems availability and reliability, integrates diverse monitoring solutions for emerging and existing IT infrastructure using automation and API tools in on-premises and cloud architectures. Engineers centralized, enterprise-wide alerting and key performance indicators that give timely, actionable information to subject matter experts, stakeholders, and leadership. Conducts post-incident reviews, documenting findings and acting on lessons learned; after an incident is resolved, revisits the issue to determine its cause. Builds or optimizes the incident lifecycle to bolster the reliability of services. Maintains documentation and runbooks to ensure that teams get information when they need it.
Develops operational tools and processes, builds reliable systems, ensures compliance with operational standards, and provides support to operational staff. This work can range from adjustments to monitoring and alerting to code changes in production. An SRE can be tasked with building a homegrown tool from scratch to address weaknesses in software delivery or incident response and management. SRE responsibilities include writing and developing code to automate processes, such as analyzing logs, testing production environments, and responding to issues. Such automation allows developers and engineers to focus on bug fixes and building new features rather than being burdened by the day-to-day operational requirements of their projects.
Provides the leadership, communication, development, engineering, automation, and feedback necessary for enterprise planning and architecture. Timely and responsive work is key to reporting what went well or what went badly during a change/incident/problem cycle. Participates in the after-hours and weekend on-call rotation and provides training to other on-call staff. Provides remote hands for systems and application administrators who need physical and virtual support within on-premises and cloud facilities. Performs other job-related duties as assigned.

Qualifications / Licenses / Certifications

Graduation from an accredited institution with a bachelor's degree in Information Technology or a related field, plus three years of work experience in IT, or a combination of education and experience in a related field totaling seven years.

Licenses or Certifications
SRE Professional Certificate, Azure/AWS associate/practitioner levels, ITIL/TOGAF, Docker Certified Associate, CKA

Knowledge / Skills / Abilities

Knowledge

Knowledge of ITIL Change, Incident, and Problem Management.
Knowledge of TCP/IP, firewall management, and operating system configuration.
Proficient and current knowledge of industry trends, tools, and processes.
Knowledge of Agile and iterative development processes (e.g., Scrum and Kanban).
Knowledge of automation and containerization technologies such as Docker, Kubernetes, Ansible, Terraform, and SaltStack.
Knowledge of ITSM platforms such as Jira Service Management, ServiceNow, or similar.
Knowledge of engineering practices: availability, reliability, scalability, and disaster recovery.
Knowledge of various automation tools, as SREs are usually responsible for building and integrating software tools that enhance the reliability and scalability of an organization's systems.

Skills

Recognize key design, implementation, and process issues and proactively craft and automate solutions.
Skill with system engineering and design for NOC/SOC purposes.
Skill with scripting languages such as Perl, PowerShell, Bash, and Python.
Skill with most of the common programming languages, including JavaScript, HTML5, CSS, jQuery, JSON, and PHP.
Skill with the design, implementation, and maintenance of Active Directory and/or LDAP directories.
Skill with TCP/IP, application network protocols, firewall management, operating system configuration, anti-virus software, and relational databases.
Practical experience with various monitoring solutions such as Prometheus, PRTG, Site24x7, TestCafe, Selenium, Splunk, New Relic, Azure Monitor, and AWS CloudWatch.
Expertise in the major cloud providers such as Azure, AWS, and Google Cloud.
Experience with alert management/on-call tools such as PagerDuty, VictorOps, and Opsgenie.
Experience with instant communication and team collaboration platforms such as MS Teams, Slack, or Jitsi.
Proven IT project planning and development skills.

Abilities

Ability to read, write, and interpret technical documentation, runbooks, procedures manuals, and knowledge-base articles pertaining to network systems and application management.
Ability to complete Root Cause Analysis (RCA) investigations and write post-incident reports.
Ability to improve team practices through code reviews, handoffs of work, and incidents.
Ability to participate in an on-call (PagerDuty) rotation to respond to incidents that impact availability, and to provide support for service engineers with customer incidents.
Ability to debug production issues and build monitoring that alerts on symptoms rather than on outages.
Ability to turn manual operational tasks into repeatable actions and automation.
Ability to conduct and direct research into IT issues and products, as required.
Ability to communicate technical ideas and concepts to a non-technical audience.

EEO Statement:

UVU employment decisions are made on the basis of an applicant's qualifications and ability to perform the job without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, gender expression, age (40 and over), disability, veteran status, pregnancy, childbirth, or pregnancy-related conditions, genetic information, or other bases protected by applicable federal, state, or local law.

To apply, please visit https://www.schooljobs.com/careers/uvu/jobs/5259107/site-reliability-engineer-iii-operations
