Today, some of the main challenges of
NOC management, described in the following diagram, are:
Troubleshooting billions of service alarms
Processing around 20 million workflow management notifications handled by NOC experts
Managing millions of call center emails
Higher costs due to the low use of workflow management
Incident management is an area where
we already use specialized, rule-based systems. However, the continually evolving
nature of networks, in both technology and deployment, makes it very difficult
to maintain hand-written rules in such systems. Automated incident management
that is data-driven and domain-independent, with no need for explicit rules,
would significantly improve automation in NOCs. For example, a failure in one
node can cause cascading failures in other nodes, resulting in a series of
alarms. Machine learning techniques allow us to discover temporal patterns in a
stream of alarms and other events, so that the root cause can be identified
quickly in most failure scenarios. This frees the NOC team to focus on
more complex challenges.
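To make the cascading-failure example concrete, here is a minimal sketch. It is not a machine learning method: it uses a fixed time-gap heuristic as a stand-in, and it simply proposes the earliest alarm of each burst as a root cause candidate. The `Alarm` fields, the alarm names, and the 30-second gap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alarm:
    node: str         # hypothetical node identifier
    alarm_type: str   # hypothetical alarm name
    timestamp: float  # seconds since an arbitrary reference point


def group_bursts(alarms: List[Alarm], max_gap: float = 30.0) -> List[List[Alarm]]:
    """Split alarms into bursts; a new burst starts when the gap to the
    previous alarm (in time order) exceeds max_gap seconds."""
    bursts: List[List[Alarm]] = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if bursts and alarm.timestamp - bursts[-1][-1].timestamp <= max_gap:
            bursts[-1].append(alarm)
        else:
            bursts.append([alarm])
    return bursts


def root_cause_candidate(burst: List[Alarm]) -> Alarm:
    """Naive heuristic (an assumption): the earliest alarm in a burst."""
    return min(burst, key=lambda a: a.timestamp)


alarms = [
    Alarm("router-1", "LINK_DOWN", 0.0),
    Alarm("switch-7", "NEIGHBOR_LOST", 4.0),
    Alarm("bts-12", "NO_BACKHAUL", 9.0),
    Alarm("router-9", "HIGH_CPU", 600.0),
]
bursts = group_bursts(alarms)
for burst in bursts:
    candidate = root_cause_candidate(burst)
    print(len(burst), "alarm(s); candidate root cause:", candidate.node)
```

In this toy data the first three alarms arrive within seconds of each other and form one burst, while the last alarm stands alone. A learned model would replace the fixed gap and the "earliest alarm" rule with patterns discovered from historical data.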
What type of complexity does this imply?
Typical handling of NOC alarms
involves mapping received alarms to incidents using enrichment, aggregation,
deduplication, and correlation techniques. This is challenging because alarm
information is heterogeneous: today's telecommunications networks combine
solutions from several technologies and several vendors. This
heterogeneity makes it difficult to create a harmonized view of the system and
considerably increases the complexity of detecting and resolving
faults.
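As a rough illustration of why heterogeneity matters, the sketch below maps alarms from two vendors into one common schema and then deduplicates them. The vendor names, field names, and the choice of `(node, alarm_type)` as the deduplication key are hypothetical; real alarm formats differ and usually need far more enrichment.

```python
from typing import Dict, List

# Hypothetical per-vendor mappings: vendor field name -> common field name.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "vendor_a": {"ne": "node", "alarmName": "alarm_type", "sev": "severity"},
    "vendor_b": {"host": "node", "event": "alarm_type", "level": "severity"},
}


def normalize(vendor: str, raw: Dict[str, str]) -> Dict[str, str]:
    """Rename vendor-specific fields to the common schema."""
    mapping = FIELD_MAPS[vendor]
    return {common: raw[vendor_field] for vendor_field, common in mapping.items()}


def deduplicate(alarms: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first alarm seen for each (node, alarm_type) pair."""
    seen = set()
    unique = []
    for alarm in alarms:
        key = (alarm["node"], alarm["alarm_type"])
        if key not in seen:
            seen.add(key)
            unique.append(alarm)
    return unique


raw_alarms = [
    ("vendor_a", {"ne": "router-1", "alarmName": "LINK_DOWN", "sev": "major"}),
    ("vendor_b", {"host": "router-1", "event": "LINK_DOWN", "level": "major"}),
    ("vendor_b", {"host": "switch-7", "event": "FAN_FAIL", "level": "minor"}),
]
normalized = [normalize(vendor, raw) for vendor, raw in raw_alarms]
unique = deduplicate(normalized)
print(unique)
```

Every new vendor or alarm format needs another entry in `FIELD_MAPS`, which is exactly the kind of manual upkeep that grows with network diversity.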
Can we afford to encode long-term domain knowledge?
Current NOC solutions handle alarms
with rules fed by different sources, such as nodes, service
management systems, or element/network management systems. The rules are written
to convert domain-specific information into an overview of
the network at the NOC, and they also encode the logic that processes and
correlates alarms for appropriate grouping.
This rule development is time-consuming
and labor-intensive. Continuous changes in the network, with new types of
network nodes and the resulting new types of alarms, make developing
and maintaining rules even more complicated. In addition, the rules must be
generated and updated frequently; otherwise, the rule base
becomes incomplete or even inaccurate.
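A toy example shows how a hand-written rule base goes stale. The rule table and alarm names below are invented for illustration and do not come from any real NOC product; the point is only that an alarm type introduced after the rules were written finds no match.

```python
from typing import Dict, List, Optional

# Hypothetical hand-written rules: symptom alarm type -> presumed cause alarm type.
CORRELATION_RULES: Dict[str, str] = {
    "NEIGHBOR_LOST": "LINK_DOWN",
    "NO_BACKHAUL": "LINK_DOWN",
}


def apply_rules(alarm_types: List[str]) -> Dict[str, Optional[str]]:
    """For each active alarm, return its presumed cause, or None when no rule
    matches or the presumed cause is not among the active alarms."""
    active = set(alarm_types)
    result: Dict[str, Optional[str]] = {}
    for alarm_type in alarm_types:
        cause = CORRELATION_RULES.get(alarm_type)
        result[alarm_type] = cause if cause in active else None
    return result


# "SYNC_LOSS_5G" stands for an alarm type added by a newly deployed node type.
result = apply_rules(["LINK_DOWN", "NEIGHBOR_LOST", "SYNC_LOSS_5G"])
print(result)
```

Here `NEIGHBOR_LOST` is correlated with `LINK_DOWN`, but the new alarm type stays uncorrelated until someone writes a rule for it.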
Does this mean that we have stopped developing domain-oriented rules?
This does not mean that the
development of traditional rules is disappearing; domain-independent, data-based
approaches will augment it. In addition, automatic detection of possible
correlations between alarms can complement the rule-based approach when the rules
are incomplete or when domain-specific knowledge has not yet been acquired.
The data-based approach will help
identify correlations across domains and generate data-based insights.
Gradually, the system can evolve towards a fully automated solution.
Data-based NOC automation
We will share with you a case study on automatic incident formation, root cause identification, and self-healing scenarios
that we are working on as part of our research.
We apply the principles of Machine
Intelligence (data mining and data science) to discover patterns of behavior in
large historical datasets. These patterns essentially capture
correlations between alarms and the ways in which they co-occur. An exciting aspect of
our approach is that we do not treat the data only as time series; we also
examine how to handle the largely symbolic or categorical information collected
on the network and identify latent behaviors in it.
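One simple way to look for co-occurrence in categorical alarm data is sketched below. This is our illustrative choice rather than the specific method used in the case study: it buckets alarm types into fixed time windows and scores each pair with the Jaccard index. The function name, the 60-second window, and the sample events are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cooccurrence_scores(
    events: List[Tuple[float, str]], window: float = 60.0
) -> Dict[Tuple[str, str], float]:
    """Score alarm type pairs by Jaccard index over fixed time windows.

    events holds (timestamp_seconds, alarm_type) tuples."""
    windows: Dict[int, Set[str]] = {}
    for timestamp, alarm_type in events:
        windows.setdefault(int(timestamp // window), set()).add(alarm_type)

    type_counts: Counter = Counter()  # windows in which each type appears
    pair_counts: Counter = Counter()  # windows in which both types appear
    for types in windows.values():
        type_counts.update(types)
        pair_counts.update(combinations(sorted(types), 2))

    return {
        pair: both / (type_counts[pair[0]] + type_counts[pair[1]] - both)
        for pair, both in pair_counts.items()
    }


events = [
    (0.0, "LINK_DOWN"), (5.0, "NEIGHBOR_LOST"),
    (130.0, "LINK_DOWN"), (140.0, "NEIGHBOR_LOST"),
    (300.0, "HIGH_CPU"), (305.0, "LINK_DOWN"),
    (600.0, "HIGH_CPU"),
]
scores = cooccurrence_scores(events)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs that almost always appear together score close to 1, while incidental overlaps score low. Fixed windows are crude (two related alarms can straddle a boundary), so sliding windows or sequence mining would be natural refinements.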
This approach helps domain experts
learn evolving and previously unknown behavior patterns when the environment
is multi-technology and multi-vendor. These correlated and grouped patterns allow
the automatic grouping of alarms, which opens the way for automatic
incident detection in the network, root cause identification, and automated
repair.
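Once pairwise correlations are available, alarm types can be grouped automatically. The sketch below assumes precomputed pair scores (the values shown are made up) and merges every pair at or above a threshold using a small union-find structure, so that each resulting group could seed one incident. The threshold of 0.5 is arbitrary.

```python
from typing import Dict, List, Tuple


def group_alarm_types(
    pair_scores: Dict[Tuple[str, str], float], threshold: float = 0.5
) -> List[List[str]]:
    """Merge alarm types linked by a score >= threshold into groups
    (connected components), using union-find with path compression."""
    parent: Dict[str, str] = {}

    def find(item: str) -> str:
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # shorten the path as we go
            item = parent[item]
        return item

    for (first, second), score in pair_scores.items():
        root_first, root_second = find(first), find(second)
        if score >= threshold:
            parent[root_first] = root_second

    groups: Dict[str, List[str]] = {}
    for alarm_type in parent:
        groups.setdefault(find(alarm_type), []).append(alarm_type)
    return sorted(sorted(members) for members in groups.values())


# Made-up correlation scores between hypothetical alarm types.
pair_scores = {
    ("LINK_DOWN", "NEIGHBOR_LOST"): 0.9,
    ("NEIGHBOR_LOST", "NO_BACKHAUL"): 0.7,
    ("HIGH_CPU", "LINK_DOWN"): 0.1,
}
groups = group_alarm_types(pair_scores)
print(groups)
```

Grouping by connected components is transitive: `LINK_DOWN` and `NO_BACKHAUL` end up together even though no direct score links them. That suits cascades, but it can also chain unrelated alarms through a single spurious link, so the threshold matters.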
With this approach, we can achieve
intelligent grouping of alarms and tickets with minimal manual involvement.
We can reduce or altogether avoid the manual development of rules,
automatically identify large groups as well as groups that were previously missed, and reduce the total number of incident tickets.