2019-05-21 | Voice Service Advisory

Kyle Olexa -

Event Description: Partial Voice Outage

Event Date: 05/21/2019

RFO Issue Date: 05/23/2019

Ticket Number: 70000

 

Scope of Impact

Services Impacted:

Customers on a subset of softswitch nodes experienced call completion errors on calls utilizing audio prompts and music resource files.

Event Description:

On May 21st, 2019, while performing routine, non-service-affecting maintenance, a Teleflex engineer inadvertently issued a command that reset network services on an LXD storage pool. Following this event, the storage pool was active and accessible to the majority of softswitch nodes, which had failed over to alternate storage pools and maintained normal operation. The Teleflex NOC began inspecting dependent nodes. Shortly thereafter, Teleflex received reports of call issues from users on a subset of softswitches, and engineers focused on the reported nodes to determine the cause of the isolated service issues.

Logs on the subset of impacted softswitches showed errors reading certain audio files which, in certain dialplan actions, resulted in delayed call processing and call completion errors. The impacted softswitches had immediately re-established connections to the shared storage pool following the network event; however, the file handler pointers were flagged as stale on some nodes. Because the affected softswitches had re-established the underlying storage maps to the pool linked with the network event, the process for failover to secondary storage pools was interrupted on these particular nodes.

To correct the issue, Teleflex engineers went through each of the identified softswitches to validate active file handler pointers. On the affected nodes, engineers flushed the stale pointers and forced new connections to the active storage pools. Once the pointers were re-established, each impacted softswitch node resumed normal operation with full access to audio source files. Softswitch #25 was manually restored during troubleshooting activities. Once the cause was isolated, a script to handle the refresh of storage pointers was successfully validated against Softswitch #46. The script was then deployed to the remaining identified nodes #30 and #34. Following resolution on the impacted nodes, the script was applied across all remaining (non-impacted) nodes to protect against any delayed impact on nodes not initially affected.
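
For illustration only, the kind of check described above can be sketched as follows. This is not the script Teleflex deployed, whose contents are not part of this advisory. It is a minimal sketch that assumes each storage pool is mounted at a directory path and that a small known audio file can serve as a probe. The function names, pool paths, and probe file name are all hypothetical.

```python
import errno
import os
from typing import Iterable, Optional


def is_readable(path: str) -> bool:
    """Return True if path can be opened and read without an OS error.

    A mount can look healthy while individual file handles are stale, so
    the probe performs a real open and read rather than an existence check.
    """
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except OSError as exc:
        # errno.ESTALE is what a stale network file handle typically raises
        # on Linux; any other OSError also marks the path as unusable.
        _ = exc.errno == errno.ESTALE
        return False


def first_healthy_pool(pools: Iterable[str], probe_file: str) -> Optional[str]:
    """Return the first pool directory whose probe file is readable.

    Pools are tried in the order given (primary first, then secondaries).
    Returns None when no pool passes the probe.
    """
    for pool in pools:
        if is_readable(os.path.join(pool, probe_file)):
            return pool
    return None
```

Under these assumptions, a node would call first_healthy_pool with its primary pool listed first and switch to whichever pool is returned, so a pool that is mounted but serving stale handles is skipped rather than trusted.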

Event Timeline:

Please note that all times listed in the timeline below are in 24-hour clock format, and refer to Central Daylight Time.

 

11:53 - Inadvertent restart of network services for LXD0001.

 

11:55 - LXD0001 back online and fully functional. Subsystem check completed.

 

12:03 - First reported call issues on Softswitches #34 and #25.

 

12:05 - NOC escalation to Senior Engineering.

 

12:10 - Log errors identified on #25 indicated problems opening audio resource files.

 

12:16 - Issue cause isolated to stale file handler pointers on #25.

 

12:24 - Stale file pointers on Softswitch #25 flushed and re-activated; resolution of the node issue verified.

 

12:38 - Script applied to Softswitch #46 and validated. Full service restored.

 

12:54 - Script applied to remaining identified nodes #30, #32, #34, #35 and #37. Full service restored.

 

13:14 - Script applied to all softswitches not impacted by the event as a precaution against any further impact.

 

Corrective/Preventative Actions

Maintenance schedules and practices will be reviewed to determine what can be moved to off-hour/planned maintenance windows. Teleflex will review its supervision and review processes related to technician activities. Engineering is reviewing logs on the impacted nodes to compare the operational state of each impacted node with that of the nodes that successfully rolled over to alternate connections, and will adjust failover algorithms to protect against the identified corner cases.


TeleFlex Networks

1510 Primewest Parkway | Suite 800
Katy, TX 77449