Internet-Draft Network Fault Terminology January 2025
Davis, et al. Expires 25 July 2025
Workgroup: Network Working Group
Internet-Draft: draft-ietf-nmop-terminology-10
Published: January 2025
Intended Status: Informational
Expires: 25 July 2025
Authors:
N. Davis, Ed.
Ciena
A. Farrel, Ed.
Old Dog Consulting
T. Graf
Swisscom
Q. Wu
Huawei
C. Yu
Huawei Technologies

Some Key Terms for Network Fault and Problem Management

Abstract

This document sets out some terms that are fundamental to a common understanding of network fault and problem management within the IETF.

The purpose of this document is to bring clarity to discussions and other work related to network fault and problem management, in particular to YANG models and management protocols that report, make visible, or manage network faults and problems.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 25 July 2025.

1. Introduction

Successful operation of large or busy networks depends on effective network management. Network management comprises a virtuous circle of network control, network observability, network analytics, network assurance, and back to network control. Network fault and problem management [RFC6632] is an important aspect of network management and control solutions. It deals with the detection, reporting, inspection, isolation, correlation, and management of events within the network. The intention is to focus on those events that have a negative effect on the network's ability to forward traffic according to expected behavior. Fault and problem management extends to include actions taken to determine the causes of problems and to work toward recovery of expected network behavior.

A number of work efforts within the IETF seek to provide components of a fault management system, such as YANG models or management protocols. It is important that a common terminology is used so that there is a clear understanding of how the elements of the management and control solutions fit together, and how faults and problems will be handled.

This document sets out some terms that are fundamental to a common understanding of network fault and problem management. While "faults" and "problems" are concepts that apply at all levels of technology in the Internet, the scope of this document is restricted to the network layer and below, hence this document is specifically about "network fault and problem management." The concept of "incidents" is also touched on in this document, where an incident results from one or more problems and is the disruption of a network service.

Note that some useful terms are defined in [RFC3877] and [RFC8632]. The definitions in this document are informed by those documents, but they are not dependent on that prior work.

2. Usage of Terms

The terms defined in this document are principally intended for consistent use within the IETF. Where similar concepts are described in other bodies, an attempt has been made to harmonize with those other descriptions, but there is care needed where terms are not used consistently between bodies or where terms are applied outside the network layer. If other bodies find the terminology defined in this document useful, they are free to use it.

Other documents may make use of the terms as defined in this document. It is suggested here that such uses should use capitalization of the terms as in this document to help distinguish them from colloquial uses, and should include an early section listing the terms inherited from this document with a citation.

3. Terminology

This section contains key terms. It is split into three subsections.

3.1. Context Terminology

This section includes some terminology that helps describe the context for the rest of this work. The terms may be viewed as a cascaded hierarchy with each subsequent term building on the previous. The definitions are deliberately kept relatively terse. Further documents may expand on these terms without loss of specificity. Such contextualization (if any) should be highlighted clearly in those documents.

Network Telemetry:

This is defined in [RFC9232] and describes the process of collecting operational network data categorized into network planes. Data collected through the Network Telemetry process does not contain network or device configuration information. Nor does it contain any data related to service definitions (i.e., "intent" per Section 3.1 of [RFC9315]).

Network Monitoring:

This is the process of keeping a continuous record of a resource, function, or connectivity service. The term 'monitoring' focuses on one single dimension and measurement in dimensional data modeling [DimensionalModeling]. For example, this could be a measurement of the service state, a measurement of a network function, or the state of a network function of a resource.

Network Analytics:

Network Analytics is the process of deriving analytical insights into or from operational network data. A process could be a piece of software, a system, or a human that analyzes operational data and outputs new analytical data, ideally metadata (a symptom, for example), which is related to the operational data.

Network Observability:

This is the enablement of network behavioral assessment through analysis of observed operational network data (logs, alarms, traces, etc.) with the aim of detecting symptoms of network behavior and identifying anomalies and their causes. Network Observability begins with information gathered using Network Monitoring tools, which may be further enriched with other operational data (e.g., change records). The expected outcome of the observability processes is the identification and analysis of deviations of the observed state from the expected state of a network.

Thus, there is a cascaded sequence where:

  • Network Telemetry: the process of collecting operational data from a network.
  • Network Monitoring: the process of creating/keeping a record of data gathered in Network Telemetry.
  • Network Analytics: the process of deriving insight through the data recorded in Network Monitoring.
  • Network Observability: the process of enabling behavioral assessment of a network through Network Analytics.
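
As a non-normative illustration, the cascade above can be sketched as a chain of small functions, each consuming the output of the previous one. The Sample shape, the function names, the hard-coded readings, and the max_drop policy value are all assumptions made for this sketch and are not defined by this document.

```python
from __future__ import annotations
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One item of operational data (hypothetical shape)."""
    resource: str
    characteristic: str
    timestamp: float
    value: float


def collect() -> list[Sample]:
    """Network Telemetry: collect operational data (hard-coded here)."""
    return [
        Sample("if-1", "rx-power-dbm", 0.0, -3.0),
        Sample("if-1", "rx-power-dbm", 1.0, -3.1),
        Sample("if-1", "rx-power-dbm", 2.0, -9.5),
    ]


def record(samples: list[Sample]) -> dict:
    """Network Monitoring: keep a record per (resource, characteristic)."""
    history: dict = {}
    for s in samples:
        history.setdefault((s.resource, s.characteristic), []).append(s)
    return history


def analyze(history: dict, max_drop: float = 3.0) -> list[str]:
    """Network Analytics: derive symptoms (metadata) from the record."""
    symptoms = []
    for (resource, characteristic), series in history.items():
        latest = series[-1].value
        # Baseline is the mean of all earlier samples, if there are any.
        earlier = [s.value for s in series[:-1]]
        baseline = mean(earlier) if earlier else latest
        if baseline - latest > max_drop:
            symptoms.append(f"{resource}: {characteristic} dropped below baseline")
    return symptoms


def observe() -> list[str]:
    """Network Observability: behavioral assessment built on the analytics."""
    return analyze(record(collect()))
```

In this sketch the third reading falls well below the mean of the earlier two, so observe() reports a single symptom for "if-1".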

3.2. Core Terms

The terms are presented below in an order that is intended to flow such that it is possible to gain understanding by reading from top to bottom. The figures and explanations in Section 4 may aid in understanding the terms set out here.

System:

An assembly of components that exhibits some behavior.

Resource:

A component of a System.

Resource is a recursive concept so that a Resource may be an assembly of other Resources (for example, a network node comprises an assembly of interfaces).

Characteristic:

Observable or measurable aspect or behavior associated with a Resource.

  • A Characteristic may be considered with respect to the concept of dimensional modeling that is built on facts (see 'Value', below) and dimensions (the contexts and descriptors that identify and give meaning to the facts).

  • The term "Metric" is another word for a measurable Characteristic.

Value:

A Value is the measurement of a Characteristic associated with a Resource. It may be in the form of a categorization (e.g., high or low), an integer (e.g., a count), or a continuous variable (e.g., an analog measurement), etc.

Condition:

A Condition is an interpretation of the Values of a set of Characteristics of a Resource (with respect to working order or some other aspect relevant to the Resource purpose/application).

Change:

In the context of Network Monitoring, a Change is the variation in the Value of a Characteristic associated with a Resource.

  • Not all Changes are noteworthy (i.e., they do not have Relevance).

  • Perception of Change depends upon Detection, the sampling rate/accuracy/detail, and perspective.

Detect:

To notice the presence of something (State, Change, activity, form, etc.).

  • Hence also to notice a Change (from the perspective of an observer such as a monitoring system).

Event:

The variation in Value of a Characteristic of a Resource at a distinct moment in time (i.e., the period is negligible).

  • Compared with a Change, which may be over a period of time, an Event happens at a distinct moment in time.

State:

A particular Condition that something (e.g., a Resource) has (i.e., it is in a State) at a specific time.

  • While a State may be observed at a specific moment in time, it is actually determined by summarizing measurement over time in a process sometimes called State compression.

Relevance:

Consideration of an Event, State, or Value (through the application of policy, relative to a specific perspective, intent, and in relation to other Events, States, and Values) to determine whether it is of note to the system that controls or manages the network.

Occurrence:

An Event with Relevance.

A particular Change with Relevance.

  • An Occurrence may be an aggregation or abstraction of finer-grain Occurrences.

  • Applies to all scales and scopes, i.e., is essentially fractal (can recurse indefinitely).

  • Note that Occurrence is used here with respect to the temporal dimension.

Fault:

An Occurrence that is not desired/required (as it may be indicative of a current or future undesired State). A Fault can potentially be associated with a cause. See [RFC8632] for a more detailed discussion of network faults.

Problem:

A State regarded as undesirable and which may require remedial action. A Problem cannot necessarily be associated with a cause. The resolution of a Problem does not necessarily act on the thing that has the Problem.

  • Note that there is a historic aspect to the concept of a Problem. The current State may be operational, but there could have been a Fault that is unexplained, and the fact of that unexplained recent Fault is a Problem.

  • Note that whilst a Problem is unresolved it may continue to require attention. A record of resolved Problems may be maintained in a log.

  • Note that there may be a State that is considered to be a Problem from several perspectives. For example, consider a loss of light State that causes multiple services to fail. In this example, a State Change (so that the light recovers) may cause the Problem to be resolved from one perspective (the services are operational once more), but may leave the Problem unresolved from another perspective (because the loss of light has not been explained). Further, in this example, there could be another development (the reason for the temporary loss of light is traced to a microbend in the fiber that is repaired) resulting in that unresolved Problem now being resolved. Even then, a further Problem remains unresolved (why did the microbend occur in the first place?).

Incident:

A Network Incident is an undesired Occurrence such as an unexpected interruption of a network service, degradation of the quality of a network service, or the below-target performance of a network service. An Incident results from one or more Problems, and a Problem may give rise to or contribute to one or more Incidents. Greater discussion of Network Incident relationships, including Customer Incidents and Incident management, can be found in [I-D.ietf-nmop-network-incident-yang].

Anomaly:

A (network) Anomaly is an unusual or unexpected Event or pattern in network data in the forwarding plane, control plane, or management plane that deviates from the normal, expected behavior. See [I-D.ietf-nmop-network-anomaly-architecture] for more details.

Symptom:

An observable Characteristic, State, or Condition considered as an indication of a Problem or potential Problem.

Cause:

The Events (Detected or otherwise) that gave rise to a Fault/Problem.

Consolidation:

The process of considering multiple Faults, Problems, Symptoms, and their Causes to determine the underlying Causes.

Alert:

An indication of a Fault.

Alarm:

Per [RFC8632], an Alarm signifies an undesirable State in a Resource that requires corrective action. From a management point of view, an Alarm can be seen as a State in its own right and the transition to this State may result in an Alert being issued. The receipt of this Alert may give rise to a continuous indication (to a human operator) highlighting the potential or actual presence of a Problem.

3.3. Other Terms

Three other terms may be helpful:

Intermittent:

A State that is not continuous, but keeps occurring in some time frame.

Transient:

A State that is not continuous, and occurs once in some time frame.

Recurrent:

A Problem that is actively resolved, but reoccurs.
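
As a non-normative illustration, the three terms above can be distinguished by simple bookkeeping. The sketch below assumes that a State is reported as a list of (start, end) periods during which it held, and that a Problem's history is an ordered list of "raised" and "resolved" markers. These representations, the function names, and the extra "absent" and "continuous" labels are hypothetical and are not defined by this document.

```python
def classify_state(active_periods, window_start, window_end):
    """Classify a State within a time frame.

    active_periods: list of (start, end) tuples during which the State held.
    """
    # Keep only the periods that overlap the time frame of interest.
    inside = [(s, e) for s, e in active_periods
              if s < window_end and e > window_start]
    if not inside:
        return "absent"
    if len(inside) == 1:
        s, e = inside[0]
        if s <= window_start and e >= window_end:
            return "continuous"
        return "transient"      # occurs once in the time frame
    return "intermittent"       # keeps occurring in the time frame


def is_recurrent(history):
    """Return True if a Problem was raised again after being resolved.

    history: ordered list of "raised" / "resolved" markers for one Problem.
    """
    seen_resolved = False
    for marker in history:
        if marker == "resolved":
            seen_resolved = True
        elif marker == "raised" and seen_resolved:
            return True
    return False
```

For example, a single two-second outage in a one-minute frame is classified as "transient", while three separate outages in the same frame are classified as "intermittent".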

4. Workflow Explanations

The relationship between System, Resources, and Characteristics is shown in Figure 1. A System comprises Resources, and Resources have Characteristics.


        Characteristics
               ^
               |
            Resources
               ^
               |
             System

Figure 1: Relationship Between Elements of a System
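
As a non-normative illustration, the relationship in Figure 1, together with the recursive nature of a Resource, could be modeled as follows. The class and field names, and the example node and interface names, are assumptions made for this sketch. They are not taken from any YANG model.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Characteristic:
    """An observable or measurable aspect of a Resource."""
    name: str
    value: object = None  # the current Value, if one has been measured


@dataclass
class Resource:
    """A component of a System; may itself be an assembly of Resources."""
    name: str
    characteristics: list[Characteristic] = field(default_factory=list)
    components: list[Resource] = field(default_factory=list)

    def walk(self):
        """Yield this Resource and, recursively, every Resource inside it."""
        yield self
        for component in self.components:
            yield from component.walk()


@dataclass
class System:
    """An assembly of Resources."""
    name: str
    resources: list[Resource] = field(default_factory=list)

    def all_resources(self):
        for resource in self.resources:
            yield from resource.walk()


# A network node that comprises an assembly of interfaces.
node = Resource(
    "node-A",
    [Characteristic("cpu-load", 0.42)],
    [
        Resource("if-1", [Characteristic("oper-status", "up")]),
        Resource("if-2", [Characteristic("oper-status", "down")]),
    ],
)
system = System("lab-network", [node])
```

Walking the System visits the node first and then each of its interfaces, which mirrors the bottom-to-top reading of Figure 1.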

The Value of a Characteristic of a Resource may change over time. Specific Changes in Value may be noticed at a specific time (as digital Changes), Detected, and treated as Events. This is shown on the left of Figure 2.

The center of Figure 2 shows how the Value of a Characteristic may change over time. The Value may be Detected at specific times or periodically and give rise to States (and consequently State Changes).

In practice, the Characteristic may vary in an analog manner over time as shown on the right-hand side of Figure 2. The Value can be read or reported (i.e., Detected) periodically leading to Analog Values that may be deemed Values with Relevance, or may be evaluated over time as shown in Figure 6.


      Event                State                  Value

        ^                    ^                      ^
 Detect :             Detect :               Detect :
        :                    :                      :

   ^        ^          ^     ^     ^                   /\
   :        :          :     :     :                  /  \
   :        :          :     :     :             /\  /    \
    __    __               _____                /  \/
   |        |             |     |            /\/
 __|        |__       ____|     |____       /

Change at a time     Change over time      Change over time

Figure 2: Characteristics and Changes
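
As a non-normative illustration of the left-hand side of Figure 2, the sketch below Detects Events by comparing successive samples of a Characteristic. The (time, value) sample format and the function name are assumptions made for this sketch. It also shows that the perception of Change depends on the sampling rate: a coarser sampling of the same data can miss the Change entirely.

```python
def detect_events(samples):
    """Return Events as (time, old_value, new_value) tuples.

    samples: list of (time, value) pairs, ordered by time.
    """
    events = []
    for (_, previous), (time, current) in zip(samples, samples[1:]):
        if current != previous:
            events.append((time, previous, current))
    return events


samples = [(0, "up"), (1, "up"), (2, "down"), (3, "down"), (4, "up")]

# Every sample is examined: two Events are Detected.
fine_grained = detect_events(samples)

# Only every fourth sample is examined: the outage goes unnoticed.
coarse_grained = detect_events(samples[::4])
```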

Figure 3 shows the workflow progress for Events. As noted above, an Event is a Change in the Value of a Characteristic at a time. The Event may be evaluated (considering policy, relative to a specific perspective, with a view to intent, and in relation to other Events, States, and Values) to determine if it is an Occurrence and possibly to indicate a Change of State. An undesirable Occurrence (a Fault) can cause an Alert to be generated, may be evidence of a Problem, and could directly indicate a Cause. In some cases, an Alert may give rise to an Alarm highlighting the potential or actual presence of a Problem.



        Alert- - - - > Alarm
          ^
          |
          |     -----> Cause
          |    |
          |----------> Problem
          |
          |
        Fault
          ^
          |
          |
          |
      Occurrence
          ^
          |
          |----------> State
          |
          |
        Event

Figure 3: Event and Dependent Terms
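
As a non-normative illustration, the main path through Figure 3 (Event, Occurrence, Fault, Alert) can be sketched as two successive tests applied to each Event. The Event shape and the two policy inputs (a set of watched Characteristics and a map of undesired Values) are hypothetical. Real Relevance policies may also consider perspective, intent, and other Events, States, and Values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    resource: str
    characteristic: str
    old: str
    new: str
    time: float


def has_relevance(event, watched):
    """Policy (hypothetical): only watched Characteristics are of note."""
    return event.characteristic in watched


def is_fault(event, undesired):
    """An Occurrence is a Fault if it moves to an undesired Value."""
    return undesired.get(event.characteristic) == event.new


def process(events, watched, undesired):
    """Return the Alerts generated for a sequence of Events."""
    alerts = []
    for ev in events:
        if not has_relevance(ev, watched):
            continue  # an Event without Relevance is not an Occurrence
        # From here on, ev is an Occurrence.
        if is_fault(ev, undesired):
            alerts.append(
                f"ALERT {ev.resource} {ev.characteristic}={ev.new} at t={ev.time}"
            )
    return alerts


events = [
    Event("if-1", "oper-status", "up", "down", 2.0),
    Event("if-1", "description", "uplink", "uplink to core", 3.0),
    Event("if-1", "oper-status", "down", "up", 4.0),
]
alerts = process(events, watched={"oper-status"},
                 undesired={"oper-status": "down"})
```

Here the description change has no Relevance, the recovery at t=4.0 is an Occurrence but not a Fault, and only the transition to "down" produces an Alert.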

Parallel to the workflow for Events, Figure 4 shows the workflow progress for States. As shown in Figure 2, Change noted at a particular time gives rise to State. The State may be deemed to have Relevance considering policy, relative to a specific perspective, with a view to intent, and in relation to other Events, States, and Values. A State with Relevance may be deemed a Problem, or may indicate a Problem or potential Problem.

Problems may be considered as Symptoms and may map directly or indirectly to Causes. An Incident results from one or more Problems. An Alarm may be raised as the result of a Problem.


        Alarm
          ^
          |     ------> Incident
          |    |
          |    |   ---> Cause
          |    |  |
      Problem---------> Symptom
          ^
          |
          | Relevance
          |
          |
        State

Figure 4: State and Dependent Terms
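
As a non-normative illustration of Figure 4, the sketch below treats an Alarm as a State in its own right and issues an Alert only on the transition into or out of that State. The class name, the tuple format of the Alerts, and the set of undesired States are assumptions made for this sketch. Note that clearing the Alarm here says nothing about whether the underlying Problem has been explained; as described in Section 3.2, a Problem may remain unresolved after the State has recovered.

```python
class AlarmTracker:
    """Track which Resources are currently in an alarmed State."""

    def __init__(self, undesired_states):
        self.undesired = set(undesired_states)
        self.active = {}   # resource name -> undesired State it is in
        self.alerts = []   # Alerts issued on Alarm State transitions

    def observe(self, resource, state):
        """Record an observed State and issue an Alert on a transition."""
        is_problem = state in self.undesired
        was_active = resource in self.active
        if is_problem and not was_active:
            self.active[resource] = state
            self.alerts.append(("raised", resource, state))
        elif not is_problem and was_active:
            cleared = self.active.pop(resource)
            self.alerts.append(("cleared", resource, cleared))
        elif is_problem:
            # Still alarmed: update the State but do not repeat the Alert.
            self.active[resource] = state


tracker = AlarmTracker({"loss-of-light"})
for observed in ["ok", "loss-of-light", "loss-of-light", "ok"]:
    tracker.observe("if-1", observed)
```

Repeated observations of the same undesired State do not generate further Alerts, which keeps the continuous indication to the operator separate from the stream of observations.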

Figure 5 shows how Faults and Problems may be Consolidated to determine the Causes. The arrows show how one item may give rise to another.

A Cause can be indicated by or determined from Faults, Problems, and Symptoms. It may be that one Cause points to another, and can also be considered as a Symptom. The determination of Causes can consider multiple inputs. An Incident results from one or more Problems.


                                      ---------
                       ------------- |         |
                      |  ----------> | Symptom |
                      | |            |         |
                      | |             ---------
                      v |                 ^
                   ---------              |
          ------->|  Cause  |<---------   |
         |         ---------           |  |
         |           ^   |             |  |
         |           |   |             |  |
         |            ---              |  |
         |                             |  |
     ---------                      ---------          ----------
    |  Fault  |------------------->| Problem |------->| Incident |
     ---------                      ---------          ----------

Figure 5: Consolidation of Symptoms and Causes
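
As a non-normative illustration of Consolidation, the sketch below groups Faults and Problems by their underlying Cause, following the case where one Cause points to another. The input formats (a list of (item, cause) pairs and a map from a Cause to the Cause it points to) and all of the names used are hypothetical. The example reuses the loss of light scenario from Section 3.2.

```python
def underlying_cause(cause, points_to):
    """Follow Cause-to-Cause links until no further Cause is known."""
    seen = set()
    while cause in points_to and cause not in seen:
        seen.add(cause)  # guards against a loop in the links
        cause = points_to[cause]
    return cause


def consolidate(indications, points_to):
    """Group Faults, Problems, and Symptoms by underlying Cause.

    indications: list of (item, cause) pairs.
    points_to: dict mapping a Cause to the Cause it points to.
    """
    groups = {}
    for item, cause in indications:
        root = underlying_cause(cause, points_to)
        groups.setdefault(root, []).append(item)
    return groups


indications = [
    ("fault: loss of signal on if-1", "loss-of-light"),
    ("problem: service-A down", "link-down"),
    ("problem: service-B down", "link-down"),
]
points_to = {
    "link-down": "loss-of-light",
    "loss-of-light": "fiber-microbend",
}
groups = consolidate(indications, points_to)
```

All three inputs are consolidated under the single Cause "fiber-microbend". Why the microbend occurred remains a separate question that this grouping does not answer.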

Figure 6 shows how thresholds are important in the consideration of Analog Values and Events. The arrows in the figure show how one item may give rise to or utilize another. The use of threshold-driven Events and States (and the Alerts that they might give rise to) must be treated with caution to dampen any "flapping" (so that consistent States may be observed) and to avoid overwhelming management processes or systems. Analog Values may be read from or reported by the Resource and could cross a threshold, be deemed Values with Relevance, or be evaluated over time. Events may be counted, and the Count may cross a threshold or reach a Value of Relevance.

The Threshold Process may be implementation-specific and subject to policies. When a threshold is crossed and any other conditions are matched, an Event may be determined, and treated like any other Event.


Occurrence
     ^
     |
     |---------------------> State
     |
     |        -------                  Relevance
     |------>| Count |-----------------------------> Value
     |        -------          |                       ^
     |           |             |                       |
     |           |             |                       | Relevance
     |           |             v                       |
     |           |        -----------           ----------------
   Event         |       | Evaluated |         |                |
     ^           |       | over time |<--------|  Analog Value  |
     |           v        -----------          |                |
     |      -----------        |               |                |
     |     | Threshold |       |               |                |
     |<----|  Process  |<------                |                |
     |     |           |<----------------------|                |
     |      -----------                         ----------------
     |                                                 ^
     |                                                 |
     | Detect                                   Detect |
     |                                                 |
Change at a Time                                Change over Time

Figure 6: Counts, Thresholds, and Values
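
As a non-normative illustration of a Threshold Process that dampens flapping, the sketch below uses hysteresis: an Event is determined when the Analog Value rises to a "raise" level, and a second Event is determined only when the Value falls to a lower "clear" level. Hysteresis is one common dampening technique chosen for this sketch. This document does not mandate it, and the function name, parameter names, and Event labels are assumptions.

```python
def threshold_events(readings, raise_at, clear_at):
    """Turn Analog Values into threshold Events using hysteresis.

    readings: list of (time, value) pairs, ordered by time.
    raise_at: level at or above which the threshold counts as crossed.
    clear_at: lower level at or below which the crossing is cleared.
    """
    if clear_at >= raise_at:
        raise ValueError("clear_at must be lower than raise_at")
    events = []
    above = False
    for time, value in readings:
        if not above and value >= raise_at:
            above = True
            events.append((time, "threshold-crossed"))
        elif above and value <= clear_at:
            above = False
            events.append((time, "threshold-cleared"))
        # Values between clear_at and raise_at never change the outcome.
    return events


# Link utilization (percent) hovering around 80.
readings = [(0, 70), (1, 81), (2, 79), (3, 82), (4, 78), (5, 69), (6, 85)]
events = threshold_events(readings, raise_at=80, clear_at=70)
```

With a single threshold at 80, the same readings would toggle at times 1, 2, 3, 4, and 6. With the hysteresis band, only three Events are determined (at times 1, 5, and 6), each of which can then be treated like any other Event.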

5. Security Considerations

This document specifies terminology and has no direct effect on the security of implementations or deployments. However, protocol solutions and management models need to be aware of several aspects:

6. Privacy Considerations

In general, Fault Management should not expose information about end-user activities or user data. The main privacy concern is that a network operator should keep control of all information about Faults, both to protect the operator's own privacy and to protect the details of how the operator runs its network.

7. IANA Considerations

This document makes no requests for IANA action.

Acknowledgments

The authors would like to thank Med Boucadair, Wanting Du, Joe Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman, Stewart Bryant, Paul Kyzivat, and Jouni Korhonen for their helpful comments.

Special thanks to the team that met at a side meeting at IETF-120 to discuss some of the thorny issues:

Informative References

[DimensionalModeling]
Wikipedia, "Dimensional Modeling", <https://en.wikipedia.org/w/index.php?title=Dimensional_modeling>.
[I-D.ietf-nmop-network-anomaly-architecture]
Graf, T., Du, W., and P. Francois, "An Architecture for a Network Anomaly Detection Framework", Work in Progress, Internet-Draft, draft-ietf-nmop-network-anomaly-architecture-01, <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-network-anomaly-architecture-01>.
[I-D.ietf-nmop-network-incident-yang]
Hu, T., Contreras, L. M., Wu, Q., Davis, N., and C. Feng, "A YANG Data Model for Network Incident Management", Work in Progress, Internet-Draft, draft-ietf-nmop-network-incident-yang-02, <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-network-incident-yang-02>.
[RFC3877]
Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, <https://www.rfc-editor.org/info/rfc3877>.
[RFC6632]
Ersue, M., Ed. and B. Claise, "An Overview of the IETF Network Management Standards", RFC 6632, DOI 10.17487/RFC6632, <https://www.rfc-editor.org/info/rfc6632>.
[RFC8632]
Vallin, S. and M. Bjorklund, "A YANG Data Model for Alarm Management", RFC 8632, DOI 10.17487/RFC8632, <https://www.rfc-editor.org/info/rfc8632>.
[RFC9232]
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, <https://www.rfc-editor.org/info/rfc9232>.
[RFC9315]
Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", RFC 9315, DOI 10.17487/RFC9315, <https://www.rfc-editor.org/info/rfc9315>.

Authors' Addresses

Nigel Davis (editor)
Ciena
United Kingdom

Adrian Farrel (editor)
Old Dog Consulting
United Kingdom

Thomas Graf
Swisscom
Binzring 17
CH-8045 Zurich
Switzerland

Qin Wu
Huawei
101 Software Avenue, Yuhua District
Nanjing
Jiangsu, 210012
China

Chaode Yu
Huawei Technologies