With the advent of the digital economy, communications service providers (CSPs) are seeing their profits stagnate while over-the-top players erode their revenue. The accelerating movement of users to alternative services is one of the reasons CSPs decided to embark on the journey of digital transformation.
The main benefits that network functions virtualisation (NFV) brings include reduced time to market, agility, innovation, an open ecosystem that avoids vendor lock-in, and future capex and opex reductions. On the other hand, operational transformation also brings many challenges that service providers need to overcome.
Some of the service providers’ top priorities are network reliability and performance. Telecoms networks must guarantee always-on services, no matter what technology is used, because society, businesses and industries depend on reliable connections for both routine and critical communications.
Product or service development within the telecoms industry has traditionally followed rigorous standards for stability, protocol adherence and quality, reflected by the use of the term “carrier-grade” to designate equipment that demonstrates this reliability.
Over decades, service providers have engineered an extensive range of sophisticated features into their networks, to the point where high reliability is effectively guaranteed.
Service providers have built their networks, reputations and revenue streams on a foundation of carrier-grade reliability. By decoupling software and hardware and introducing a virtualisation layer, NFV creates a multi-layer environment – and in most cases each layer could be delivered by a different supplier. This brings many new assurance challenges for service providers, such as lower reliability, security risks, interoperability issues, difficulty in fault demarcation and the need for new skills and processes.
Because of these major changes introduced by the virtualisation layer and the dynamic environment it creates, a new generation of service assurance solutions is needed.
Suppliers will also play an important role in this transformation journey, since they can assist with their global experience, product competence and expertise, providing new types of services and systems to support service providers with each step.
Requirements for NFV assurance
In practice, there are three key areas in achieving NFV-driven carrier-grade reliability: the product, network deployment – design and integration – and network maintenance.
For example, applications should be deployed with resilience, the cloud OS should support VM migration, servers should have multiple NICs, storage with multiple controllers, and so on. Network designers should consider virtualised network functions (VNFs) in pools, with distributed design, proper dimensioning, resource clustering, security and network resilience.
Even if a redundant network is deployed, failures can happen. It is important to ensure network maintenance and assurance capabilities are used to minimise service downtime.
A carrier-grade network guarantees a five-nines (99.999%) availability standard, allowing no more than about 5.26 minutes of downtime per service per year.
If there is just one failure in the system, the network operations centre (NOC) should be notified and proceed with remediation in less than five minutes, so that the five-nines target can still be met. That’s not a new challenge, but the introduction of virtualisation into the network adds complexity.
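The downtime budget behind the five-nines figure is simple arithmetic, sketched below; the 365.25-day year accounts for leap years, and rounding either way lands near the commonly quoted 5.26 minutes.

```python
# Downtime budgets for common availability targets.
# 365.25 days/year accounts for leap years.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525,960

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines ({availability:.5f}): "
          f"{downtime_budget_minutes(availability):.2f} min/year")
# 5 nines works out to roughly 5.26 min/year
```

A single incident that takes more than about five minutes to detect, demarcate and recover therefore consumes the entire annual budget, which is why the NOC response window is so tight.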
Hence there is an overwhelming need to automate the entire process, including quick service recovery processes, and to rely heavily on resiliency to provide seamless transfer from the failing element to healthy elements.
In legacy networks, reliability, redundancy and recoverability were managed in a reactive manner, focused on fault detection and troubleshooting. Over the years, telecoms operators introduced more proactive tools, applying analysis of customer and network data to determine potential network performance issues, which enabled faster detection and resolution of faults, often before the customer became aware of them.
The next stage in network management is automating the detection and correction of issues through the application of “smart” or artificial intelligence technologies. These solutions provide automated responses where the network components react to policy-based thresholds, enabling greater complexity in the network and decreased operations intervention.
NFV will accelerate the movement from monitoring to real-time intelligence and analytics that respond to pre-set policies to enact orchestrated alterations in the network.
Zero-touch operations and automation are in an incubation phase, and it may take years before service providers see them materialise. Thus, service assurance still requires significant manual intervention. Service providers will need to enhance their current reactive maintenance capabilities for fault management and then gradually introduce smart maintenance capabilities.
Considering the above challenges, there are eight main requirements regarding NFV assurance and network maintenance that we shall take one at a time.
Unified monitoring

Unified monitoring seems like a basic requirement, but real experience shows that it is a very complex task. Most of the existing issues are related to interoperability, mainly due to proprietary APIs.
If there is no unified monitoring, network engineers will need to log into each node separately, collect information, and then perform troubleshooting. Alarm and log collection in a single place will enhance and shorten the fault demarcation process.
Network virtualisation will not happen overnight, so telcos will also need to manage hybrid networks for several years. Unified monitoring should therefore also cover alarm and log collection from physical network functions.
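The core of unified monitoring is normalising per-vendor alarm formats into one store. The sketch below is a minimal illustration; the two vendor formats, field names and schema are all hypothetical placeholders, not any real product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Alarm:
    """A vendor-neutral alarm record (hypothetical schema)."""
    node: str
    severity: str       # normalised: "critical", "major", "minor"
    message: str
    timestamp: datetime
    source_type: str    # "vnf" or "pnf" -- hybrid networks need both

def normalise_vendor_a(raw: dict) -> Alarm:
    # Hypothetical vendor A reports epoch seconds and numeric severity.
    sev = {1: "critical", 2: "major", 3: "minor"}[raw["sev"]]
    return Alarm(raw["host"], sev, raw["text"],
                 datetime.fromtimestamp(raw["ts"], tz=timezone.utc), raw["kind"])

def normalise_vendor_b(raw: dict) -> Alarm:
    # Hypothetical vendor B already uses ISO-8601 timestamps.
    return Alarm(raw["node_id"], raw["severity"].lower(), raw["description"],
                 datetime.fromisoformat(raw["time"]), raw["layer"])

# A single store means engineers no longer log into each node separately.
unified_store: list = []
unified_store.append(normalise_vendor_a(
    {"host": "vnf-epc-01", "sev": 1, "text": "heartbeat lost",
     "ts": 1700000000, "kind": "vnf"}))
unified_store.append(normalise_vendor_b(
    {"node_id": "router-07", "severity": "Major", "description": "link down",
     "time": "2023-11-14T22:14:00+00:00", "layer": "pnf"}))
print(len(unified_store), "alarms collected")
```

Note that the physical router's alarm lands in the same store as the VNF's, which is exactly the hybrid-network requirement described above.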
Alarm correlation

Once the important step of unified monitoring has been achieved, telcos face another challenge: there are now too many alarms in one place, and some of them appear multiple times. That complicates fault demarcation even more.
Smart filters or pre-defined rules are needed to correlate the alarms automatically and help network engineers to identify quickly the potential root cause. The list of the rules or policies can always be expanded based on experience and past incidents.
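One simple form such pre-defined rules can take is time-window grouping with duplicate suppression; the sketch below illustrates the idea with hypothetical alarms and an assumed 300-second window, not a production correlation engine.

```python
# Each alarm: (seconds_since_start, node, message). Hypothetical data.
alarms = [
    (0,   "vnf-epc-01", "heartbeat lost"),
    (2,   "vnf-epc-01", "heartbeat lost"),      # duplicate re-raise
    (3,   "host-07",    "NIC failure"),
    (5,   "vnf-epc-01", "session setup failure"),
    (900, "vnf-epc-01", "heartbeat lost"),      # separate incident
]

WINDOW = 300  # seconds; alarms this close together count as one incident

def correlate(alarms, window=WINDOW):
    """Group alarms into incidents: a new group starts when the gap
    since the previous alarm exceeds the window."""
    incidents, current, last_t = [], [], None
    for t, node, msg in sorted(alarms):
        if last_t is not None and t - last_t > window:
            incidents.append(current)
            current = []
        # Pre-defined rule: suppress exact duplicates within an incident.
        if (node, msg) not in {(n, m) for _, n, m in current}:
            current.append((t, node, msg))
        last_t = t
    if current:
        incidents.append(current)
    return incidents

incidents = correlate(alarms)
print(len(incidents), "incidents")                           # 2
print(len(incidents[0]), "unique alarms in first incident")  # 3
```

Five raw alarms collapse into two incidents, giving the engineer a far smaller set to reason about; real rules would also use topology and alarm hierarchy, as the text suggests.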
Automated root cause analysis
Alarm correlation alone still requires someone to analyse the results manually and identify the root cause. The ideal scenario would be for telcos to have an automated process that outputs the root cause analysis. That output would come from a server that analyses the collected alarms and logs, using a fault library as a reference. The fault library can be expanded with more cases over time.
The automated process will reduce manual intervention and eventually the labour cost.
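A minimal sketch of the fault-library idea follows: correlated alarm messages are matched against known signatures. The library entries, alarm texts and matching rule are hypothetical stand-ins; a real library would grow with every resolved incident, as described above.

```python
# Hypothetical fault library: (required alarm keywords, diagnosed cause).
FAULT_LIBRARY = [
    ({"NIC failure", "heartbeat lost"}, "server NIC hardware fault"),
    ({"disk latency", "heartbeat lost"}, "storage controller degradation"),
]

def diagnose(incident_alarms: set) -> str:
    """Return the root cause whose signature best matches the incident.
    A signature matches only if all its keywords are present."""
    best, best_overlap = "unknown - escalate to engineer", 0
    for signature, cause in FAULT_LIBRARY:
        overlap = len(signature & incident_alarms)
        if signature <= incident_alarms and overlap > best_overlap:
            best, best_overlap = cause, overlap
    return best

incident = {"heartbeat lost", "NIC failure", "session setup failure"}
print(diagnose(incident))            # server NIC hardware fault
print(diagnose({"heartbeat lost"}))  # unknown - escalate to engineer
```

The fallback case matters: anything the library cannot explain still goes to a human, so automation reduces rather than eliminates manual intervention.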
Network resiliency and closed-loop automation

A network should be designed with resilience to avoid a single point of failure. Even if a redundant network is deployed, failures will inevitably happen, so it is critical for operations to follow a proactive, predictive and pre-emptive approach.
Most fault conditions identified by network probes and management tools should, as far as possible, be handled by pre-programmed event responders, supported by autonomics and big data analytics.
There is a huge amount of data that service providers need to digest and analyse in real time, which is not feasible without an analytics engine and policy rules. These elements are essential for closed-loop automation. The network should be able to recover automatically where possible, through virtual machine (VM) migration, reconstruction, scaling and so on.
A robust network should not be vulnerable to any kind of threat. For example, if the key performance indicators of an important service deteriorate, the system should identify it immediately, isolate the faulty VM or VNF, and migrate traffic to healthy components.
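The KPI-driven isolation described above can be sketched as one iteration of a closed control loop. The KPI name, threshold and two-VM pool below are hypothetical placeholders chosen purely for illustration.

```python
# Hypothetical policy: isolate any VM whose call setup success rate
# drops below 98%.
POLICY = {"kpi": "call_setup_success_rate", "min": 0.98}

vm_pool = {
    "vm-a": {"call_setup_success_rate": 0.92,  "in_service": True},
    "vm-b": {"call_setup_success_rate": 0.997, "in_service": True},
}

def closed_loop(pool, policy):
    """One control-loop iteration: isolate VMs breaching the policy
    and report where traffic was migrated."""
    actions = []
    for name, state in pool.items():
        if state["in_service"] and state[policy["kpi"]] < policy["min"]:
            state["in_service"] = False  # isolate the faulty VM
            healthy = [n for n, s in pool.items() if s["in_service"]]
            actions.append(f"isolated {name}, traffic migrated to {healthy}")
    return actions

for action in closed_loop(vm_pool, POLICY):
    print(action)
```

In a real deployment the "migrate" step would be an orchestrator call rather than a flag flip, but the policy-threshold-action shape of the loop is the same.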
Fault management capabilities in a multi-vendor environment
Service providers need to enhance their capabilities and processes to manage all the elements in this multi-vendor ecosystem efficiently. Whenever there is an incident, CSPs need to be in a position to answer three questions promptly:
1: Whose fault is it?
2: To whom should I escalate the trouble ticket?
3: Who should recover the service?
In the past, the application and equipment were provided by a single supplier, so any trouble ticket was sent to that one supplier.
In an NFV environment, there may be at least three or four different suppliers providing the applications, the virtualisation layer, servers and storage. Each of them may also support different service level agreements for service restoration and resolution. Multi-vendor management becomes far more complex with NFV – and service providers need to consider it thoroughly.
Service continuity remains the priority for telcos. Thus, in case of failures, their focus is mainly on the application layer, but the fault may lie in infrastructure and this could cause a trouble ticket ping-pong game.
On the one hand, new tools for fault demarcation could help, as described earlier; on the other, multi-vendor management has its own challenges, and service providers could build a team of experts to handle these cases.
The ideal scenario would be for the application supplier to offer premium customer support services by becoming the single point of contact for telcos, regardless of where the fault lies. In this role, the supplier would be responsible for performing the initial demarcation, coordinating with the infrastructure suppliers for service restoration and resolution, and governing trouble-ticket management and reporting.
Support from other product suppliers will not be totally eliminated, but the single point of contact role could make the fault management process more efficient and shorten the resolution time.
Predictive network analytics
As mentioned earlier, the evolution from proactive to smart maintenance is under way, and this requires near real-time network data, analytics and artificial intelligence. Predictive network analytics can be used for different tasks, such as service assurance, capacity planning and efficient use of resources. Such systems can, for example, predict potential capacity bottlenecks, identify grey failures and security loopholes, and manage physical and virtual resources efficiently.
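Predicting a capacity bottleneck can be as simple as fitting a trend to utilisation samples and extrapolating to a threshold. The daily utilisation figures and the 80% threshold below are hypothetical; real predictive analytics would use far richer models and data.

```python
# A minimal predictive-analytics sketch: fit a linear trend to daily
# utilisation samples and estimate when a capacity threshold is hit.

def linear_fit(ys):
    """Least-squares slope and intercept for y over x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def days_until(ys, threshold):
    """Days from the last sample until utilisation reaches the threshold,
    or None if utilisation is not trending upwards."""
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    day = (threshold - intercept) / slope
    return max(0.0, day - (len(ys) - 1))

utilisation = [0.52, 0.55, 0.57, 0.61, 0.63]  # hypothetical daily samples
print(f"~{days_until(utilisation, 0.80):.0f} days to 80% capacity")
```

A forecast like this turns capacity planning from a reactive task into a scheduled one: the expansion can be ordered before the bottleneck affects service.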
Automated and seamless software upgrades
Each supplier provides software packages with bug fixes and requires service providers to load them into their network regularly. Suppliers may have different release dates and frequencies. A service provider needs to synchronise and plan these activities. In the past software upgrades were done by one supplier per node, but now telcos need to consider the impact of a new software package on any layer, before it is actually loaded in the live network.
Thus, multi-vendor verification – the vertical stack – is recommended for the telco’s test bed, mirroring the real network solution. Alternatively, suppliers may provide their own NFV lab accessed remotely – a lab as a service.
The software upgrade process should ensure that it doesn’t cause service downtime and there should be a quick rollback procedure in place. It also requires skills and processes more in line with IT’s agile DevOps methods than traditional network operations practices.
Some of the requirements regarding software upgrades are: online software update capability; automated testing of the new software, to detect issues quickly and before end users do; an automated rollback option in case of failure; the ability to migrate only a small amount of traffic to the new instances until services are verified; and, in the case of a VNF pool, automatic isolation of traffic towards the VNF that is ready to be upgraded.
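The gradual-migration and automated-rollback requirements above combine naturally into a canary upgrade loop, sketched below. The traffic steps and the health check are hypothetical stand-ins for real orchestrator and assurance hooks.

```python
# A canary-upgrade sketch: shift traffic to the new software version in
# steps, verify health at each step, and roll back automatically on
# failure. Returns the final traffic share on the new version.

def canary_upgrade(health_check, steps=(0.05, 0.25, 1.0)):
    share = 0.0
    for target in steps:
        share = target  # migrate this share of traffic to new instances
        if not health_check(share):
            # Automated rollback: return all traffic to the old version.
            return 0.0
    return share

# Hypothetical health checks for demonstration.
healthy = canary_upgrade(lambda share: True)
faulty = canary_upgrade(lambda share: share < 0.2)  # fails at the 25% step
print(f"healthy upgrade ends at {healthy:.0%} on new version")
print(f"faulty upgrade rolled back to {faulty:.0%} on new version")
```

Because the first step carries only a small share of traffic, a bad package is caught while most users are still on the verified old version.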
Proactive maintenance and continuous improvement

Once the virtualised appliances are handling live traffic, a continuous improvement process should start. This requires service providers to perform regular network health checks in order to identify current issues, potential threats and other network weaknesses.
Virtualisation makes network health assessment more challenging, and the traditional break-and-fix mode is no longer the right way to maintain networks. Service providers and suppliers usually pick specific elements to monitor, since there is no standard way to monitor a network, nor a well-defined framework describing exactly what needs to be checked or what the best practices are. In many cases this has proven to be unstructured, inefficient and insecure, and can cause long recovery times and high risk. If CSPs don't tackle these challenges, they risk customer churn, decreased market share and lower profits.
The first goal is to build a comprehensive assessment framework, which provides a 360°, in-depth view covering different areas such as performance, security, network risks, and even processes and capabilities.
Once the network is assessed, service providers will be able to identify the key areas that need improvement and to apply the appropriate solutions to minimise risks and mitigate any future incidents. Some of the potential solutions could involve network optimisation, troubleshooting, training, new processes, new features and tools, hardware or software licence expansion and so on.
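One simple way to structure such an assessment is to score each area by the fraction of checks it passes and surface the weakest areas first. The areas, checks and pass/fail results below are hypothetical placeholders for illustration only.

```python
# A minimal health-assessment sketch: score each area of the framework
# and rank areas from weakest to strongest.

ASSESSMENT = {
    "performance":  [("KPIs within target", True),
                     ("no congested links", False)],
    "security":     [("patches current", True),
                     ("no open management ports", True)],
    "network risk": [("no single points of failure", False),
                     ("backup capacity verified", True)],
}

def score(checks):
    """Fraction of checks that pass in one area."""
    return sum(ok for _, ok in checks) / len(checks)

weakest = sorted(ASSESSMENT, key=lambda area: score(ASSESSMENT[area]))
for area in weakest:
    print(f"{area}: {score(ASSESSMENT[area]):.0%}")
```

Ranking the areas turns a raw checklist into a prioritised improvement plan, which is the point of the continuous improvement loop described above.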
Network reliability is essential for service providers to maintain always-on services. NFV introduces new challenges to networks, and new capabilities are needed in NFV assurance and network maintenance.
Automation will play a key role in NFV assurance, but service providers and suppliers first need to take some concrete steps towards this goal. Operational transformation is a journey and collaboration between them is essential.