Reliability-Centered Maintenance: A Comprehensive Guide

Preventive maintenance, with few exceptions, has long been considered the gold standard for industrial and facility maintenance. Traditional Preventive Maintenance (PM) programs operated on the principle that scheduled maintenance directly impacts operating reliability, assuming that mechanical parts wear out predictably. However, groundbreaking studies revealed a critical flaw in this assumption. Nowlan and Heap concluded that basing maintenance solely on a maximum operating age, regardless of the limit, has minimal impact on failure rates. Independent studies further highlighted the discrepancy between perceived and actual design life for most equipment, with many exceeding their expected lifespan.

Reliability-Centered Maintenance (RCM) emerges as a more nuanced approach. It’s the optimal blend of reactive, time-based, condition-based, and proactive maintenance practices, illustrated in Figure 1. These strategies work in synergy, maximizing equipment reliability and minimizing lifecycle costs.

RCM encompasses these various maintenance types and requires a thorough understanding of system boundaries, equipment functions, functional failures, and failure modes.

Understanding RCM

Traditional PM assumes that failures of individual machines and components are statistically predictable, so parts are replaced or adjusted pre-emptively to avoid failure. For example, bearings are often replaced after a set number of operating hours, on the assumption that failure rate increases with service time.

However, Figure 2, Bearing Life Scatter, shows such wide variation in bearing life that a fixed replacement interval is a poor predictor of when a given bearing will fail, which undermines time-based maintenance for this kind of component.

Advancements in computer technology in the 1990s enabled the identification of failure precursors, allowing for more confident equipment condition assessment and repair scheduling. Further discoveries revealed that age or usage only account for a small percentage of equipment failure characteristics. This led to increased emphasis on Condition Monitoring (CM), also known as Condition-Based Maintenance, which reduced the reliance on time-based PM.

Interval-based maintenance remains appropriate for scenarios involving abrasive, erosive, or corrosive wear, material property changes due to fatigue, or a clear correlation between age and functional reliability.

Furthermore, maintenance should be forgone for systems or components where failures pose no significant threat to mission, environment, safety, security, or lifecycle cost. In these cases, equipment should be run until failure and then replaced.

RCM has gained traction across various government and industry sectors as a strategic maintenance approach. It tailors maintenance based on the consequences and costs of failure. By employing proactive techniques like improved design specifications, condition monitoring integration during commissioning, and Age Exploration (AE), RCM aims to minimize maintenance and improve reliability throughout the equipment lifecycle.

RCM Principles

Key principles underpinning RCM are:

  • Function-Oriented: RCM prioritizes preserving system or equipment function, not just operability. While redundancy enhances functional reliability, it also increases lifecycle costs.
  • System Focused: RCM emphasizes maintaining overall system function rather than individual component function.
  • Reliability Centered: RCM applies actuarial principles to failure statistics, focusing on the relationship between operating age and experienced failures. It emphasizes the conditional probability of failure at specific ages.
  • Design Limitations: RCM acknowledges inherent equipment design limitations, recognizing that maintenance can only maintain the reliability level provided by the design. However, maintenance feedback can contribute to design improvements. RCM addresses the discrepancy between perceived and actual design life through Age Exploration (AE).
  • Safety, Security, and Economics: Safety and security take precedence, followed by cost-effectiveness as the determining factor.
  • Failure Definition: RCM defines failure as any unsatisfactory condition, including both loss of function and loss of acceptable quality.
  • Logic Tree Screening: RCM employs a logic tree to ensure a consistent maintenance approach across various types of equipment.
  • Task Applicability & Effectiveness: RCM tasks must directly address the failure mode and its characteristics, reducing the probability of failure in a cost-effective manner.
  • Maintenance Task Types: RCM acknowledges time-directed (PM), condition-directed (CM), and failure finding tasks (Proactive Maintenance). Consciously choosing to perform no maintenance and running equipment to failure is also an acceptable option for certain equipment.
  • Living System: RCM uses collected data to continuously improve design and future maintenance, a vital component of the Proactive Maintenance element.

Types of RCM

Several approaches exist for conducting and implementing RCM programs. These range from rigorous Failure Modes and Effects Analysis (FMEA) with mathematically calculated probabilities to more intuitive, streamlined approaches based on experience and common sense. Terms used to describe these include Classical, Rigorous, Intuitive, Streamlined, Abbreviated, Concise, Preventive Maintenance (PM) Optimization, Reliability Based, and Reliability Enhanced. The best approach depends on factors such as:

  • Consequences of failure
  • Probability of failure
  • Available historical data
  • Risk tolerance
  • Resource availability

Classical/Rigorous RCM

  1. Benefits: Provides the most comprehensive knowledge of system functions, failure modes, and maintenance actions addressing functional failures. It also produces the most complete documentation.
  2. Concerns: Traditionally relies heavily on FMEA with limited analysis of historical performance data, and can be labor-intensive, delaying the implementation of obvious condition monitoring tasks.
  3. Applications: Suitable for situations where:
    • Failure consequences pose catastrophic risks to environment, health, safety, or the business unit’s economic stability.
    • Resultant reliability and maintenance costs remain unacceptable after streamlined FMEA implementation.
    • The organization lacks sufficient maintenance and operational knowledge on the system/equipment’s function and functional failures.

Abbreviated/Intuitive/Streamlined RCM

  1. Benefits: Quickly identifies and implements condition-based tasks and eliminates low-value maintenance tasks based on historical data and personnel input. The goal is to minimize analysis time and achieve early wins.
  2. Concerns: Reliance on historical data and personnel knowledge can lead to overlooking hidden failures with low probabilities of occurrence. It also requires in-depth understanding of condition monitoring technologies.
  3. Applications: Suitable when:
    • The system/equipment function is well-understood.
    • Functional failure won’t result in loss of life or catastrophic impact on the environment or business unit.

For these reasons, it has been recommended for DOS, NASA, and NAVFAC facilities, as well as for both discrete and continuous manufacturing facilities.

RCM Analysis

The RCM analysis should address the following questions:

  • What does the system or equipment do; what are its functions?
  • What functional failures are likely to occur?
  • What are the likely consequences of these functional failures?
  • What can be done to reduce the probability of failure(s), identify the onset of failure(s), or reduce the consequences of the failure(s)?

Answers to these questions, combined with the decision logic tree in Figure 3, Reliability-Centered Maintenance (RCM) Decision Logic Tree, guide the selection of the appropriate maintenance approach.

The analysis leads to one of four possible outcomes:

  • Perform Condition-Based actions (CM).
  • Perform Interval (Time- or Cycle-) Based actions (PM).
  • Redesign the system, accept the failure risk, or install redundancy.
  • Perform no action and choose to repair following failure (Run-to-Failure).
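
The four outcomes can be sketched as a small decision function. Figure 3 is not reproduced here, so the order of the questions below is an assumption made for illustration and is not the figure's actual logic. The names FailureModeAssessment and select_strategy, and the yes/no fields, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    CONDITION_BASED = "Condition-Based actions (CM)"
    INTERVAL_BASED = "Interval-Based actions (PM)"
    REDESIGN_OR_ACCEPT = "Redesign, accept the risk, or install redundancy"
    RUN_TO_FAILURE = "Run-to-Failure"


@dataclass(frozen=True)
class FailureModeAssessment:
    """Hypothetical yes/no answers an analyst supplies for one failure mode."""
    failure_matters: bool      # threat to mission, environment, safety, security, or lifecycle cost
    cm_task_applicable: bool   # an applicable, cost-effective condition-monitoring task exists
    age_related: bool          # clear correlation between age and functional reliability
    pm_task_applicable: bool   # an applicable, cost-effective interval-based task exists


def select_strategy(a: FailureModeAssessment) -> Strategy:
    """Map an assessment to one of the four outcomes (assumed question order)."""
    if not a.failure_matters:
        return Strategy.RUN_TO_FAILURE
    if a.cm_task_applicable:
        return Strategy.CONDITION_BASED
    if a.age_related and a.pm_task_applicable:
        return Strategy.INTERVAL_BASED
    return Strategy.REDESIGN_OR_ACCEPT


if __name__ == "__main__":
    lamp = FailureModeAssessment(False, False, True, True)
    print(select_strategy(lamp).value)
```

In this sketch an inconsequential failure short-circuits to Run-to-Failure before any task is considered, which mirrors the guidance that maintenance should be forgone where failure poses no significant threat.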

Failure

Failure, the cessation of proper function or performance, is examined at multiple levels: system, sub-system, component, and sometimes even parts. An effective maintenance organization strives to deliver the required system performance at the lowest cost, requiring a deep understanding of failure at each level. System components may be degraded or failed without causing a system failure. Conversely, several degraded components can combine to cause a system failure, even if no individual component has completely failed.

System and System Boundary

A system is a user-defined group of components, equipment, or facilities supporting an operational requirement, driven by mission criticality or environmental, health, safety, regulatory, quality, or other agency/business defined requirements. Complex systems can be divided into unique sub-systems along user-defined boundaries.

  • A system boundary or interface definition describes the inputs and outputs crossing each boundary.
  • The facility envelope is the physical barrier created by a building, enclosure, or structure.
  • Standardizing boundary selection is crucial. For example, a pump could include the first upstream/downstream isolation valve, the coupling, and associated gauges, while the motor includes the electrical circuit from the load side of the motor control center, but not the coupling.

The aim is to create modular FMEAs that can be assembled like Lego® blocks, selecting maintenance actions based on the consequences of risk, determined by the criticality and probability factors defined in Tables 1 and 2 respectively.

Function and Functional Failure

Function defines the expected performance, encompassing physical properties, operational performance (including output tolerances), and time requirements.

Functional failures describe the various ways a system or subsystem fails to meet its functional requirements. A system operating in a degraded state, but not impacting the requirements, has not experienced a functional failure.

Defining the functions’ non-performance is key to clearly defining the functional failure. For example, the function of a pump should be defined specifically in terms of flow rate, discharge pressure, vibration levels, B10 (L10) life, efficiency, etc. (Reliability HotWire)
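
As a minimal sketch of this idea, a pump function can be written down as explicit limits and a reading checked against them. The class names, field names, units, and all numeric limits below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PumpFunction:
    """Required performance (hypothetical limits)."""
    min_flow_gpm: float
    min_discharge_psi: float
    max_vibration_in_per_s: float


@dataclass(frozen=True)
class PumpReading:
    """Measured performance at one point in time."""
    flow_gpm: float
    discharge_psi: float
    vibration_in_per_s: float


def functional_failures(spec: PumpFunction, reading: PumpReading) -> List[str]:
    """Return the requirements the reading violates; empty means no functional failure."""
    failures = []
    if reading.flow_gpm < spec.min_flow_gpm:
        failures.append("flow below requirement")
    if reading.discharge_psi < spec.min_discharge_psi:
        failures.append("discharge pressure below requirement")
    if reading.vibration_in_per_s > spec.max_vibration_in_per_s:
        failures.append("vibration above limit")
    return failures


if __name__ == "__main__":
    spec = PumpFunction(min_flow_gpm=400.0, min_discharge_psi=60.0, max_vibration_in_per_s=0.3)
    degraded = PumpReading(flow_gpm=410.0, discharge_psi=61.0, vibration_in_per_s=0.25)
    print(functional_failures(spec, degraded))
```

The degraded reading is close to every limit but violates none, so the function returns an empty list: the pump is degraded without having experienced a functional failure.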

Failure Modes

Failure modes are equipment- and component-specific failures resulting in the functional failure of the system or subsystem. A machinery train composed of a motor and pump can fail due to complete winding failure, bearing failure, shaft failure, impeller failure, controller failure, or seal failure. Performance degradation leading to insufficient discharge pressure or flow also constitutes a functional failure.

Dominant failure modes are the most common failure modes responsible for a significant portion of all failures.

Preventive or condition-based maintenance may not be warranted for all failure modes or causes, especially if the likelihood of occurrence is low or the effect is inconsequential.

Reliability

Reliability is the probability that an item will survive a given operating period under specified conditions without failure, often expressed as B10 (L10) Life and/or Mean Time to Failure (MTTF) or Mean Time Between Failures (MTBF). The conditional probability of failure is the probability that an item entering a given age interval will fail during that interval. An increasing conditional probability of failure indicates wear-out characteristics.
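
The conditional probability of failure can be estimated from recorded failure ages by dividing the failures in each age interval by the number of items that entered that interval. The sketch below assumes every item in the sample was run to failure (no items removed early), and the function name and sample ages are hypothetical.

```python
from typing import List, Sequence, Tuple


def conditional_failure_probability(
    failure_ages: Sequence[float], interval: float
) -> List[Tuple[float, float, float]]:
    """Return (start, end, probability) for each age interval [start, end).

    probability = failures in the interval / items that survived to its start.
    Assumes all items were observed until they failed.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ages = sorted(failure_ages)
    results = []
    start = 0.0
    survivors = len(ages)
    i = 0
    while survivors > 0:
        end = start + interval
        failed = 0
        # Count items whose failure age falls inside this interval.
        while i < len(ages) and ages[i] < end:
            failed += 1
            i += 1
        results.append((start, end, failed / survivors))
        survivors -= failed
        start = end
    return results


if __name__ == "__main__":
    ages_hours = [100, 250, 300, 450, 480, 490]  # made-up failure ages
    for start, end, p in conditional_failure_probability(ages_hours, 200):
        print(f"{start:.0f}-{end:.0f} h: {p:.2f}")
```

For the made-up sample, the estimate rises from 1/6 to 2/5 to 3/3 across the three intervals, which is the increasing pattern the text associates with wear-out.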

Failure rate or frequency plays a limited role in maintenance programs because it’s too simplistic. While useful for cost decisions and determining maintenance intervals, it doesn’t inform on appropriate maintenance tasks or failure consequences. Maintenance solutions should be evaluated based on the safety, security, and economic consequences they are intended to prevent. A maintenance task must be applicable (i.e., prevent failures or ameliorate failure consequences) to be effective.

Failure Characteristics

Conditional probability of failure (Pcond) curves fall into six basic types (Pcond versus Time), as shown in Figures 4 and 5, Random Conditional Probability of Failure Curves. The percentage of equipment conforming to each wear pattern, from three separate studies, is also shown.

The failure characteristics shown in Figures 4 and 5, Random Conditional Probability of Failure Curves, were first noted in Nowlan and Heap's book, Reliability-Centered Maintenance. Follow-on studies in Sweden (1973) and by the U.S. Navy (1983) produced similar results. These studies showed that random failures accounted for 77–92% of total failures, while age-related failure characteristics accounted for the remaining 8–23%.

The difference between complex and simple items has implications for maintenance. Simple items often show a direct relationship between reliability and age, especially when factors like metal fatigue or mechanical wear are present. In these cases, age limits can be effective for improving reliability.

Complex items often show infant mortality, followed by a gradual increase or constant failure probability. A marked wear-out age is uncommon. Scheduled overhaul can increase the overall failure rate by introducing a high infant mortality rate into an otherwise stable system.
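
The difference between infant mortality, random, and wear-out behavior can be illustrated with a Weibull hazard function, h(t) = (shape/scale) * (t/scale)^(shape - 1). The Weibull model is not named in the source; it is introduced here only as a commonly used illustration, and the parameter values are arbitrary. The hazard is an instantaneous rate, closely related to (but not the same quantity as) the per-interval conditional probability plotted in the figures.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at age t."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def failure_pattern(shape: float) -> str:
    """Classify the hazard trend implied by the Weibull shape parameter."""
    if shape < 1:
        return "infant mortality (decreasing hazard)"
    if shape == 1:
        return "random (constant hazard)"
    return "wear-out (increasing hazard)"


if __name__ == "__main__":
    scale_hours = 1000.0  # arbitrary characteristic life
    for shape in (0.5, 1.0, 3.0):
        early = weibull_hazard(100.0, shape, scale_hours)
        late = weibull_hazard(900.0, shape, scale_hours)
        print(f"shape={shape}: {failure_pattern(shape)}; "
              f"h(100)={early:.5f}, h(900)={late:.5f}")
```

Under this model, an age limit only helps when the shape parameter is above 1. With a shape below 1, replacing a surviving item with a new one moves it back to the high early-life hazard, which is the mechanism by which scheduled overhaul can raise the overall failure rate.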

Preventing Failure

Every equipment item has a resistance to failure. Equipment is subjected to stress that leads to failure when it exceeds that resistance. Figure 6, Preventing Failure, illustrates this concept. Failures can be prevented or item life extended by:

  • Decreasing the amount of stress applied to the item.
  • Increasing or restoring the item’s resistance to failure.
  • Decreasing the rate of degradation of the item’s resistance.

Stress is use-dependent and can be variable. For simple items, a review of failures reveals that most failures occur at about the same age and for the same reason. Measuring resistance to failure can help select a preventive task.

Adding excess material or changing the material type can increase resistance to failure or slow degradation. Excess strength may be provided to compensate for loss from corrosion or fatigue. The most common method of restoring resistance is by replacing the item. Resistance to failure in simple items decreases with use, but a complex unit contains hundreds of interacting parts with many failure modes. The mechanisms of failure remain the same, but operate on many components simultaneously and interactively, so failures no longer occur for the same reason at the same age. Maintenance tasks are only effective if there are a few dominant or critical failure modes.

Failure Modes and Effects Analysis (FMEA)

FMEA is applied to each system, sub-system, and component identified in the boundary definition. For every function, there can be multiple failure modes. FMEA addresses each system function and failure, and the dominant failure modes associated with each failure, examining the consequences on the mission or operation, the system, and the machine.

Even with multiple failure modes, the effects are often the same or similar. Similar systems and machines will also have the same failure modes, but system use determines the failure consequences. For example, a ball bearing will have the same failure modes regardless of the machine, but the dominant failure mode will often change, and the effects of the failure will differ.

Figure 7, FMEA Worksheet, provides an example.

Criticality and Probability of Occurrence

Criticality assessment quantifies the importance of a system function relative to the identified mission. Table 1, Criticality/Severity Categories, provides a method for ranking system criticality. Adapted from the automotive industry, this system provides 10 categories, but it can be expanded or contracted for site-specific needs.

Table 1. Criticality/Severity Categories

Ranking  Effect            Comment
1        None              No reason to expect failure to have any effect on safety, health, environment, or mission.
2        Very Low          Minor disruption to facility function. Repair to failure can be accomplished during trouble call.
3        Low               Minor disruption to facility function. Repair to failure may be longer than trouble call but does not delay mission.
4        Low to Moderate   Moderate disruption to facility function. Some portion of mission may need to be reworked or process delayed.
5        Moderate          Moderate disruption to facility function. 100% of mission may need to be reworked or process delayed.
6        Moderate to High  Moderate disruption to facility function. Some portion of mission is lost. Moderate delay in restoring function.
7        High              High disruption to facility function. Some portion of mission is lost. Significant delay in restoring function.
8        Very High         High disruption to facility function. All of mission is lost. Significant delay in restoring function.
9        Hazard            Potential safety, health, or environmental issue. Failure will occur with warning.
10       Hazard            Potential safety, health, or environmental issue. Failure will occur without warning.

Credit: Reliability, Maintainability, and Supportability Guidebook, Third Edition, Society of Automotive Engineers, Inc., Warrendale, PA, 1995.

Table 2, Probability of Occurrence Categories, provides one possible method of quantifying the probability of failure. Historical data is invaluable, but if it is unavailable, estimates can be based on experience with similar systems. The statistical (“Effect”) column can be based on operating hours, days, cycles, or another unit. The statistical bases (“Comment”) may be adjusted to account for local conditions.

Table 2. Probability of Occurrence Categories

Ranking  Effect     Comment
1        1/10,000   Remote probability of occurrence; unreasonable to expect failure to occur.
2        1/5,000    Low failure rate. Similar to past designs that have had low failure rates for given volume/loads.
3        1/2,000    Low failure rate. Similar to past designs that have had low failure rates for given volume/loads.
4        1/1,000    Occasional failure rate. Similar to past designs that have had similar failure rates for given volume/loads.
5        1/500      Moderate failure rate. Similar to past designs that have had moderate failure rates for given volume/loads.
6        1/200      Moderate to high failure rate. Similar to past designs that have had moderate failure rates for given volume/loads.
7        1/100      High failure rate. Similar to past designs that have had high failure rates that have caused problems.
8        1/50       High failure rate. Similar to past designs that have had high failure rates that have caused problems.
9        1/20       Very high failure rate. Almost certain to cause problems.
10       1/10+      Very high failure rate. Almost certain to cause problems.

Credit: Reliability, Maintainability, and Supportability Guidebook, Third Edition, Society of Automotive Engineers, Inc., Warrendale, PA, 1995.
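
One simple way to combine the two rankings when prioritizing failure modes is to multiply them, in the spirit of a risk priority number. The source does not prescribe this formula; the product, the class and function names, and the example failure modes and rankings below are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    name: str
    criticality: int  # Table 1 ranking, 1 to 10
    probability: int  # Table 2 ranking, 1 to 10

    def __post_init__(self) -> None:
        for value in (self.criticality, self.probability):
            if not 1 <= value <= 10:
                raise ValueError("rankings must be between 1 and 10")

    @property
    def risk_score(self) -> int:
        """Assumed combination rule: criticality ranking times probability ranking."""
        return self.criticality * self.probability


def rank_by_risk(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Sort failure modes from highest to lowest assumed risk score."""
    return sorted(modes, key=lambda m: m.risk_score, reverse=True)


if __name__ == "__main__":
    modes = [
        FailureMode("pump seal leak", criticality=4, probability=6),
        FailureMode("motor winding failure", criticality=8, probability=2),
        FailureMode("bearing failure", criticality=7, probability=5),
    ]
    for m in rank_by_risk(modes):
        print(f"{m.risk_score:3d}  {m.name}")
```

A product treats the two scales as interchangeable, so a site may prefer a rule that always places hazard rankings (9 and 10 in Table 1) at the top regardless of probability.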

RCM Implementation

There isn’t a single path for implementing RCM successfully. It involves FMEA, condition monitoring, and optimizing maintenance through Age Exploration (AE). A successful RCM implementation process first recognizes the source of return on investment (ROI), which may be tangible (quantifiable financial benefits) or intangible (employee skills, morale, customer relations). A baseline and goal must be established through benchmarking, defining the gap between the “As-Is” and “To-Be” state, and ROI for closing the gap.

RCM isn’t for everyone, and few organizations benefit from all elements of a classical RCM program. Like all tools, RCM has diminishing returns. Not all elements applicable to a nuclear power plant are applicable to a non-production facility. However, some truths should be followed:

  1. Key performance indicators (KPIs) are essential for establishing baselines, goals, and measuring progress.
  2. Thermography works for electrical distribution, boilers, couplings, roofing systems, and building façades.
  3. Unquantified specifications for alignment, imbalance, motor circuit phase impedance, oil condition, and vibration lead to latent defects 80% of the time.
  4. Failure to commission and check the sequence of operations to a predetermined specification leads to unexpected outcomes.
  5. Pareto analysis helps determine where to start the RCM process, focusing on bottlenecks and recurring failures.
  6. RCM implementation works better in a team environment.
  7. Failure modes for identical equipment are the same; only the consequence and probability change.
  8. The impact of poor water chemistry on energy consumption and lifecycle cost is often underestimated.
  9. Most failures are random; Age Exploration can reveal hidden assets.
  10. Celebrate successes and address failures to build support for long-term success.
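
The Pareto analysis mentioned in item 5 can be sketched as a count of failures per asset, sorted in descending order and cut off once a chosen share of all failures is covered. The 80% cutoff, the function name, and the asset tags are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(failed_assets: Iterable[str], cutoff: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return (asset, failures, cumulative share) for the assets that together
    account for at least `cutoff` of all recorded failures."""
    counts = Counter(failed_assets)
    total = sum(counts.values())
    cumulative = 0
    vital_few = []
    for asset, n in counts.most_common():
        cumulative += n
        vital_few.append((asset, n, cumulative / total))
        if cumulative / total >= cutoff:
            break
    return vital_few


if __name__ == "__main__":
    # One entry per failure record; the tags are made up.
    records = ["AHU-1"] * 5 + ["Pump-3"] * 3 + ["Chiller-2"] + ["Fan-7"]
    for asset, n, share in pareto(records):
        print(f"{asset}: {n} failures, cumulative {share:.0%}")
```

With the made-up records, two of the four assets account for 8 of the 10 failures, so they would be the natural starting point for the RCM effort.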

Key Performance Indicators (KPIs) Selection

Careful thought must be given to selecting KPIs to support the maintenance program. Meaningful KPIs are invaluable, while inaccurate or inapplicable KPIs are detrimental. First, identify organizational goals, then select controllable KPIs. Identify issues of concern for KPI consideration. All process owners should have a self-selected metric to indicate goals and progress. This fosters data collection and promotes KPI use for continuous improvement. Consider data collection capabilities, the cost of obtaining data, and the value it adds to the program.

Benchmark Selection

After selecting KPIs, establish benchmarks representing organizational goals. Benchmarks may be derived from organizational goals or selected from surveys of similar organizations. These benchmarks serve as targets for growth and evaluation of risks associated with non-achievement.
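
The gap between the “As-Is” value of a KPI and its benchmark can be tracked with a small record per KPI. This is a minimal sketch; the Kpi class, its fields, and the example KPIs and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    name: str
    as_is: float                   # current measured value
    to_be: float                   # benchmark (target) value
    higher_is_better: bool = True

    @property
    def gap(self) -> float:
        """Improvement still required; zero once the benchmark is met."""
        if self.higher_is_better:
            delta = self.to_be - self.as_is
        else:
            delta = self.as_is - self.to_be
        return max(delta, 0.0)


if __name__ == "__main__":
    kpis = [
        Kpi("Planned work (% of labor hours)", as_is=60.0, to_be=80.0),
        Kpi("Emergency work orders per month", as_is=30.0, to_be=10.0, higher_is_better=False),
    ]
    for kpi in kpis:
        print(f"{kpi.name}: gap {kpi.gap:g}")
```

Recording the direction of improvement with each KPI avoids reading a falling count of emergency work orders as a shortfall.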

Utilization of KPIs

After establishing benchmarks and beginning data collection, timely action is required. Displaying KPIs in public areas keeps people informed about important goals and performance expectations. This often has an immediate effect on workers and helps the team and management set team priorities and measure productivity.
