Historical record of incidents for Simwood
Report: "London: Reduced fibre network redundancy"
Last update: The span has been restored and the links have been brought back into service, restoring full redundancy around our London ring.
We are seeing one of our physical fibre spans between Telehouse North and IXN hard down currently, suggesting physical disturbance. We have preemptively shut down links using this whilst it is investigated and repaired. This reduces redundancy around our London ring but as we have multiple other paths, no service impact is expected.
Report: "London: Reduced fibre network redundancy"
Last updateThe span has been restored and the links have been brought back into service, restoring full redundancy around our London ring.
We are seeing one of our physical fibre spans between Telehouse North and IXN hard down currently, suggesting physical disturbance. We have preemptively shut down links using this whilst it is investigated and repaired. This reduces redundancy around our London ring but as we have multiple other paths, no service impact is expected.
Report: "Slough AZ – At-Risk Advisory (Power Maintenance)"
Last update: This incident has been resolved.
The Slough Availability Zone is currently operating on a single power feed due to planned power infrastructure maintenance by our datacenter provider. While all services remain fully operational, the zone should be considered at risk until full power redundancy is restored. The maintenance window is scheduled to conclude by 22:59 UTC on Saturday, 10 May. We are closely monitoring the situation and will provide updates as needed.
Report: "Reduced network redundancy in Manchester availability zone"
Last update: All alarms have been cleared. Resolving this incident.
The optical card has been replaced and the link has been restored. Status will remain in “monitoring” until we have confirmation from the engineers and vendor, and until we have satisfied our internal checks.
There is currently reduced network redundancy in our Manchester availability zone due to an identified issue with an optical transport link. The vendor has identified an issue with a line card in Leeds and an engineer is being sent to perform a replacement, expected on-site at 22:15. No service or traffic is impacted, but the site is considered at risk.
Report: "At risk - London Availability Zone"
Last update: Volta has been returned to N+3 connectivity and Telehouse East reconnected at N+1. Voice service was unaffected by this incident and full redundancy in the affected one of our 3 UK Availability Zones has been restored.
The datacentre have confirmed that these two fibre pairs were indeed separately cut within the building and are reviewing repair options.
We noticed some instability on the network around London this afternoon which resulted from two of our fibre pairs out of Volta (both East and West loops) having been disconnected/cut within the building within 5 seconds of each other. This is being investigated but the result is that Volta is reduced to N+1 from N+3 redundancy, and Telehouse East (which has no voice services whatsoever) has been temporarily isolated. The network continues to operate at 100% otherwise with no interruption to voice or related services but further events are always possible. We do not expect a rapid resolution to the cut fibre but will update when there is one.
Report: "Inbound calls to some hosted ranges failing"
Last update: Recently we discovered that calls being sent to us by BT, particularly to ported numbers, were increasingly including the destination number in a non-standard (i.e. invalid) format. These calls were as a result matching unexpected routing and causing our customers issues. However, upon reporting this non-conformity to BT they confirmed they were unable/unwilling to fix it, since the format matched their own routing plan and they did not have the flexibility in their routing engine to accommodate a fix. We therefore subsequently prepared a config change in order to accommodate the invalid numbers.

Automated testing by replaying historical live call scenarios, and continuous deployment, are standard practice for us and, following completion of that, we initiated the rollout to production during the afternoon of 14th November. The rollout was to each call routing instance, of which there are many in each of our 5 availability zones, watching channel/call levels closely for any signs of issues. Part-way through the rollout, at 15.45, we were alerted to a number of inbound calls being rejected by some customers due to the RURI and To header in the outgoing INVITEs for those calls being truncated. We halted the rollout and rolled everything back to the previous state, which remedied the situation, with normal state confirmed by 15.55.

Following further investigation it was discovered that in the config change and suite of tests we had failed to consider a particular routing scenario involving hosted ranges, applying to a very small number of customers. In all, around 2% of _inbound_ calls across the entire network were affected, which made it extremely difficult to identify from the channel metrics, particularly as we were now intentionally rejecting the improper calls BT couldn’t/wouldn’t fix. Whilst the overall impact was very small and our Community Slack was uncharacteristically silent on the issue, the 15 customers affected by this issue on their hosted ranges in some cases saw a much higher percentage of calls affected, depending on their individual traffic and configuration mix.

In terms of lessons learned from this incident, we do not believe that not making changes at all, as some would advocate, is a competent approach. Equally, we do not believe it is acceptable to deploy out of hours when some scenarios are absent, only to see issues the following day when they return, the deployment is complete, and attention has turned elsewhere - that assures bigger impact, later, and a slower response. Further, with a large distributed network, our approach of progressive automated roll-out is one we defend over manual updates to monolithic instances. Thus, as is our standard practice when our test suite fails to accommodate a scenario, it has been updated to do so. This enables us to continue to rapidly iterate, with automated testing providing the assurance it has so far through thousands of deployments, and absolute consistency around the network.

We’re sorry to those customers affected who, for the avoidance of doubt, had nothing “wrong” in their configuration at all. It was simply an edge case we’d missed, but which will now be tested automatically with every committed change in future.
At around 15.45 this afternoon we were made aware of inbound calls to some number ranges hosted on our network, in some circumstances, being delivered to customers with the destination number in the RURI and To header truncated, resulting in those calls failing to connect due to the target number not being recognised. Within 5 minutes we had identified that the issue was related to an update we were in the process of rolling out, which was intended to work around an increasing number of calls being sent to us by BT with invalid destination numbers. This was necessary because they were unwilling/unable to fix this in their own call routing, the numbers matching their routing plan despite being invalid. By 15.55 we had rolled back the update across all availability zones, and affected customers had confirmed calls were once again being delivered as expected. We are currently working through examples provided to understand why each specific routing case was impacted and will update here with our findings in due course.
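The scenario class described above lends itself to an automated regression test. Below is a minimal, hypothetical sketch (the function name, example number and prefix handling are illustrative assumptions, not Simwood's implementation) of the kind of check that would catch a hosted-range destination being truncated by a normalisation change.

```python
# Hypothetical regression test: replay a hosted-range scenario and assert the
# normalised destination is complete, never truncated. The normaliser below is a
# stand-in for the config change described above, not the real implementation.

def normalise_destination(raw: str) -> str:
    """Toy normaliser: keep only digits and rewrite a UK national '0' prefix to +44."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if digits.startswith("0"):
        return "+44" + digits[1:]
    return "+" + digits

def test_hosted_range_destination_not_truncated():
    raw = "05511234567"                      # hypothetical hosted-range number
    out = normalise_destination(raw)
    assert out == "+445511234567"            # full number survives normalisation
    assert len(out) == len("+44") + len(raw) - 1

test_hosted_range_destination_not_truncated()
```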
Report: "Elevated PDD"
Last update: At 14h44 we were notified of the failure of a primary database node in London (Volta). This is a planned failure scenario and, as designed, service failed over cleanly to a candidate replacement in Slough (LD4). At 14h50 our call monitoring reported increased PDD (Post Dial Delay) from some parts of the network. This was owing to several call-routing nodes, which were previously slaves to the failed master, resyncing and thus being unavailable for service. In this scenario, call-routing fails over to other back-up instances, which it did. Depending on the precise local state at the time of the call this can increase PDD. The first node had resynced by 14h58 and by 15h03 the last node had fully resynced and PDD had returned to normal levels everywhere. Our monitoring shows that less than 15% of calls network-wide were impacted by increased PDD but customer experiences may vary according to their own timers and failover protocols. We are however investigating utilisation of the backup routing instances which, whilst not experience-affecting, was not as evenly distributed as designed.
Report: "Delayed inbound SMS"
Last update: We have identified delays in inbound SMS delivery.
Report: "London SIP edge"
Last update: This incident has been resolved.
This has been stable since the incident was opened but we continue to investigate the root cause and will likely need to continue to do so once the incident is closed. DNS has been reverted, but we again urge customers to respect SRV, or at least DNS, for seamless failover in cases like this.
We are seeing elevated errors on our London SIP edge. Whilst this will not affect customers properly configured to use FQDN and SRV, we have manually failed DNS over to another Availability Zone for those who are not. We are investigating the underlying issue.
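For customers who are not already doing so, the following is a minimal sketch of the SRV-based failover encouraged above. It assumes the dnspython library, uses a placeholder record name rather than a published Simwood hostname, and simplifies RFC 2782 weight handling to a plain sort.

```python
# Sketch: resolve SIP SRV records and try targets in priority order instead of
# pinning a single IP address. The record name below is a placeholder.
import dns.resolver  # pip install dnspython

def ordered_sip_targets(srv_name: str) -> list[tuple[str, int]]:
    answers = dns.resolver.resolve(srv_name, "SRV")
    # Lower priority first; higher weight preferred within a priority (simplified).
    records = sorted(answers, key=lambda r: (r.priority, -r.weight))
    return [(str(r.target).rstrip("."), r.port) for r in records]

# Usage: attempt each target in turn, moving on if one Availability Zone is impaired.
for host, port in ordered_sip_targets("_sip._udp.example.net"):
    print(f"next candidate: {host}:{port}")
```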
Report: "Manchester power issues"
Last update: At 10.01 we lost reachability to equipment in Manchester over certain routes and a loss of power in Equinix Kilburn became apparent. Whilst power and service were restored a few minutes later we understand the site is running on generators and a UPS fault has been identified. We have very little active equipment in Kilburn but it is a major hub for networks and further power issues will affect reachability; it should be considered at risk.
Report: "Brief failures on inbound calling"
Last update: It appears that a routine upgrade to our call routing engine disrupted some incoming calls this evening from 20:25 to 20:57. This type of update is quite normal for us and happens very often as part of our continuous development and deployment. It was being progressively rolled around the network such that calls could fail over to the previous version in the event of any failure. However, it did not fail outright and, whilst passing all unit tests, appears to have caused unexpected call rejections for calls hitting it. The deployment was paused as soon as we were aware of issues and has been rolled back. We will investigate the underlying issue and resolve it before resuming continuous deployments. Apologies to anyone affected.
Report: "Power loss in London Volta"
Last update: All services have been fully restored in Volta.
Portal access has been fixed for all customers and we continue to monitor all services.
Whilst continuing to monitor the earlier problem, we are aware that the portal is unavailable for some customers and are addressing that issue.
We have been checking services and are confident all is working normally. We will be monitoring this until power is fully restored and services migrated back.
We have lost a power feed in Volta which has reduced redundancy across the whole site and taken some servers off-line. Affected services have already automatically migrated to other sites for those following our standard configurations and DNS has been modified for those forcing traffic to this particular site. Customers who haven't followed the interop at all (and are forcing traffic by IP address) will need to manually update their config. We are monitoring the situation and will advise on recovery in due course.
Report: "Intermittent media delays in calls"
Last update: This incident is now closed. Thanks to our customers for their assistance and patience and apologies once more for the service issue.
We are continuing to monitor for any further issues.
No new valid reports have been submitted since the above change. We remain alert to any fresh evidence but for now remain in close monitoring mode. Thank you for your patience and assistance with this issue and our apologies for any degraded service.
As part of continuing to investigate this issue, a change was made around midday, so we continue to welcome fresh examples of any issues since that time to aid the investigation. Thank you.
We have had sporadic reports of delays in audio commencing on calls and have been investigating these since late last week. We would welcome fresh examples to assist us. Please submit these through team@simwood.com. Thank you
Report: "Intermittent call issues"
Last update: We received reports of intermittent call issues from around 9:10. Engineers were working on reports from internal monitoring systems at this time and no further reports or alarms have been seen since 9:20. The service is continuing to be monitored.
Report: "Portal issues"
Last update: The Simwood Portal has been stable for some time and the issue is now resolved.
We have implemented a fix and the Simwood Portal is now accessible again. We will monitor the situation for the next hour or so.
We are investigating an issue with the Simwood Portal.
Report: "Intermittent issues with portal / API - degraded performance"
Last update: This incident has been resolved.
We are continuing to investigate the cause of the issue and to identify a robust resolution - the portal is accessible again.
The issue has been identified and a fix is being implemented.
We have received reports of intermittent issues that are impacting the API / Portal. This is currently being investigated and we'll update accordingly.
Report: "Intermittent API/portal timeout"
Last update: Reports are processed sequentially and a single customer was effectively causing a Denial of Service with thousands of requests for a heavy report, which was in turn delaying others beyond the API's timeout. The API, portal and the platform in general were otherwise fine and this functioned as expected.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
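As a rough, purely illustrative worked example of the failure mode described in the final update above (all numbers are assumptions, not measured values): with reports generated sequentially, a single burst of heavy reports from one customer pushes everyone queued behind it well past the API timeout.

```python
# Illustrative only: sequential report generation plus one customer's burst of heavy
# reports means the wait for the next customer far exceeds the API timeout.
HEAVY_REPORT_SECONDS = 20        # assumed time to build one heavy report
QUEUED_HEAVY_REPORTS = 1000      # assumed burst from a single customer
API_TIMEOUT_SECONDS = 30         # assumed API timeout

wait_for_next_customer = HEAVY_REPORT_SECONDS * QUEUED_HEAVY_REPORTS  # 20,000 s
print(wait_for_next_customer > API_TIMEOUT_SECONDS)  # True: later requests time out
```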
Report: "Reports of Network issues"
Last update: We saw instability in a single router within our Telehouse East site. This was due to a hardware limit being reached which we believe was the indirect result of overnight maintenance at a peer network. Addressing this caused widespread reconvergence across the network. We are monitoring the status of this router in case of recurrence but it has been stable since 9.34am.
A fix has been implemented and we are monitoring the results.
We are still investigating, and traffic appears to have improved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Reports of audio issues"
Last update: We saw network instability in our Telehouse North location which we believe to be due to malformed routes propagated externally from peers, triggering a reload of the BGP process. This in turn caused instability and packet loss for any calls traversing that equipment. Affected external BGP sessions were disabled, restoring stability. The underlying issue will be investigated out of hours, and mitigated over the medium term.
We received 7 examples of affected calls which helped track down a network disturbance leading to audio loss on a very small number of calls. Once isolated, we have been able to resolve the issue and have received no further reports. If you do notice any further issues, then please get in touch.
We believe we have identified the cause of these audio issues and from the examples provided we can see that the issues aren't widespread - it has only affected a very minor volume of calls (less than 10). The audio issues have been highly intermittent and we are currently working on implementing a fix to ensure these issues no longer persist.
We are continuing to investigate the cause of the audio issues and performing the necessary tests in order to identify where the cause lies. We have received a few examples which is helping in the diagnosis, and when we're able to identify and/or confirm anything further, we'll provide the relevant updates.
We have received reports of audio issues and are currently obtaining and investigating examples.
Report: "VoIP DDoS Preparations – IMPORTANT customer update"
Last update: We have completed work in connection with this potential threat and have interacted and tested with customers, so this notice is marked as resolved. The potential threat of DDoS attacks remains.
We have been monitoring the current DDoS situation and working with industry colleagues. Our adapted plans for handling such an attack against Simwood properties have been published on our blog at https://blog.simwood.com/2021/09/voip-ddos-preparations-important-customer-update/. Some of those changes require customer action in order to benefit from them. Please digest this blog posting and contact us for any advice.
Report: "Support Ticket System down"
Last update: The Zendesk service is now operating satisfactorily.
Zendesk have reported this as resolved and we are not experiencing any major issues, although some minor internal-only issues appear to remain.
Zendesk is still in the middle of a service interruption but is mostly operational for our staff now. We are mitigating the areas where it isn't. We anticipate that ticket and porting functions should work for our customers, albeit that Zendesk are still working on their service.
Service has partially been restored but it remains slow and sporadic so you may experience some issues or delay. Please continue to mail to team@simwood.com directly.
Our support system, supplied by Zendesk, is down as per status.zendesk.com. This will affect our ability to receive tickets from the portal and some porting functions. We will continue to process support tickets mailed into team@simwood.com.
Report: "Portal / API working slowly for some functions"
Last update: This incident is now resolved. Apologies again for any interruption accessing information.
A change has been implemented and we believe all services are responding correctly and promptly. Please raise a ticket or comment in our Community Slack channel should you find otherwise. We will continue to monitor the service in the interim. We apologise for any interruption to your service with us.
Some mitigation has been applied and some functions appear to be working correctly. Those are being monitored but work continues in other areas.
The API is working slowly for some functions and, as the Portal works off of that, it is also affected. We have noted that downloads of CDRs, rates and invoices are reported as being affected and are investigating.
Report: "Calls not completing to imported numbers"
Last update: This incident has been resolved. Apologies to those affected by this.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring calls to previously affected numbers.
A partial fix has been implemented which should help some calls. Work on the core problem is ongoing.
An issue has been identified for inbound calls to some ported numbers and is being worked on.
We are receiving notifications of calls not completing to some ported numbers and are investigating.
Report: "Support Ticket system failure"
Last update: This incident has been resolved.
Zendesk appears to be functioning normally now. Their status page reflects there is an incident, so we will continue to monitor performance and their updates.
Zendesk has acknowledged they have a problem. Their status page doesn't reflect this at this time.
Our Support Ticket system, provided by Zendesk, is down. We can continue to take calls and tickets emailed to team@simwood.com. We have raised an urgent ticket with Zendesk.
Report: "Google outage"
Last update: Google report the incident resolved and we're still seeing email flowing.
We are seeing some emails arriving now although the service is still down according to Google.
Google have now acknowledged the issue on https://www.google.com/appsstatus#hl=en&v=status
Google appear to have a large outage which they have yet to acknowledge. We use Google Apps for our email and thus are presently not receiving emails into our ticketing system. If you have urgent requirements, please telephone us. No Simwood services are otherwise affected.
Report: "September CDRs"
Last update: The cluster remains stable and reconciliation continues to run in the background.
The cluster is fully restored. We will monitor for an hour before closing this incident and will perform a reconciliation of CDRs in the background, which should complete over the next 24 hours.
We currently have an issue with some of the search nodes that present CDRs to customers through the portal and API. This means CDRs viewed through the portal or API are incomplete. We are working on restoring the cluster to 100% but will then need to perform a reconciliation of CDRs for September, against the master database, which we will do over the next 24 hours.
Report: "API access is failing for some services"
Last update: This incident has been resolved. Apologies for the API service interruption.
A range of errors have been reported from the Simwood API. The problem has been identified and a fix is being worked on.
Report: "Support Ticket system failure"
Last update: Our Zendesk ticketing service continues to operate satisfactorily and Zendesk are now in monitoring mode, so this incident will now be closed.
The Zendesk ticketing system service seems good for us although they continue to work on it: "Pod 18 is online in a degraded state. Our teams are working to restore service. We sincerely apologise for the disruption this has caused to your Zendesk service." So this service status remains open, but left as monitoring, and will be updated when Zendesk's work is complete.
The Zendesk ticketing service has resumed but is slow.
An update from Zendesk (we're on Pod 18): We’re seeing some improvements on Pod 17. We are still working towards a full resolution for Pod 18. Please bear with us as we work to fully resolve this issue. The next update by us will be by 3pm.
Zendesk remains unavailable, which also means you will be unable to submit porting requests via the portal at this time. No further information is available yet. Any update available will be provided by 2pm.
Our support ticketing system is hosted by Zendesk and they are experiencing server issues, which can be seen at https://status.zendesk.com/. The submission of tickets via the support centre is affected by this, as is our ability to process existing tickets. In the meantime, please mail ticket requests to team@simwood.com.
Report: "Vodafone issues"
Last update: This incident has been resolved.
Having previously taken all Vodafone interconnects out of route whilst testing, we have determined this is only affecting our Vodafone interconnects in the Leeds area. All others have been restored to service now and are passing traffic. We will leave those in the Leeds area down and escalate with Vodafone separately. We will monitor for an hour and then mark this issue resolved if that proves to be the case; we have ample redundant capacity between our networks elsewhere.
We have noticed a higher number of failures than normal to Vodafone destinations out-with the Simwood network. We have made on-net changes where we can but are investigating further.
Report: "Vodafone issues"
Last update: This incident has been resolved.
We are continuing to monitor this.
Inbound volumes have now normalised and things look much better from here across all our Vodafone interconnects. We are still seeing elevated volumes of outbound traffic directly to Vodafone, which could illustrate issues elsewhere still.
Our outbound traffic directly to Vodafone is now at elevated levels (suggesting issues elsewhere still) but we're only seeing about 15% of the usual levels of inbound traffic from the Vodafone network.
Our Leeds interconnects are now passing traffic and levels are returning to normal.
Our interconnects with Vodafone in London and Manchester are back passing calls, but we see those in Leeds still down. Traffic is running at about 40% now.
We're now seeing a low level of calls to/from Vodafone completing.
Vodafone operated services such as 101 are not available as part of this incident. 101 is the non-emergency contact number for any police force in England and Wales and it is available 24 hours a day, 7 days a week
We're now seeing 0% of normal traffic to and from Vodafone destinations. This issue appears to have started at 13.07 BST and is ongoing.
We're seeing traffic on our bilateral interconnects with Vodafone operating at about 10% of normal levels and hearing widespread reports of issues on the Vodafone network. This is not a Simwood issue and is off-net, but will be affecting a large proportion of calls. We have mitigated on-net as far as possible and will update with any resolution by Vodafone.
Report: "Reports of issues on Vodafone Network"
Last update: This issue appears to be resolved and our monitoring shows calls completing normally.
We are continuing to monitor this issue.
We are aware of an issue affecting calls to and from numbers on the Vodafone network. This fault is beyond the Simwood network, and our own services are operating normally, however we will continue to update this incident with further information as it becomes available.
Report: "London edge proxy restart"
Last update: We have been closely monitoring since the restart and are happy that everything is working as expected.
We have just performed (16.37 BST) an emergency reload of one of our London edge proxies which had stopped processing TCP traffic. Calls were failing over to Slough internally but with increased PDD.
Report: "Outbound SMS"
Last update: We have determined this incident has been resolved.
There was an issue this morning with outbound SMS via the portal and API timing out. We have identified this issue and it is now resolved. We continue to monitor this and will update accordingly.
Report: "BT Openreach porting"
Last update: BT appear to have closed their Openreach porting team. Please see our blog for background: http://blog.simwood.com/2020/03/bt-to-halt-porting-operations-an-open-letter-to-ofcom/ We will update open porting orders as and when we have more information but at the present time it looks like they will not complete.
Report: "Invoices in Portal and API"
Last update: This appears to be resolved and is now working as expected. We will continue to monitor this, and await a response from Xero regarding the underlying cause. Please accept our apologies for any inconvenience caused.
This has been identified as an integration issue between Simwood and the Xero API and is being investigated. Only retrieval of invoices is affected; invoices are still being generated as normal and this does not affect any other aspect of billing, such as CDRs.
We are currently investigating an issue preventing the display and retrieval of invoices in the Portal and API. No other services are affected and we would like to reassure you this does not affect any other aspect of billing, such as CDRs.
Report: "Calls issues"
Last update: At 11.08 today we were alerted to call volumes reporting as lower than normal and declining. We also began to receive reports of increased PDD and 503 call failures in our community Slack channel. Our investigations later identified the root cause of these intermittent failures to be excessive memory fragmentation on the master Redis node, causing increased latency and connection failures. Those connection failures caused slave nodes, which are distributed throughout the network and used for all read activities, to in some cases resynchronise. Two call-routing nodes, one in Slough and one in Volta, alerted on intermittently increased PDD as a result of one of the (many) local nodes they were querying being in this unstable state. Other nodes and those in other sites were functional at this stage.

By 11.27, mid-investigation, our system automatically elected a new master node, and call volumes immediately began to climb, reaching normal levels quite quickly. In hindsight, this was the primary issue mitigated. However, by this time we were seeing more widespread reports of 403 ‘out of call credit’ and ‘account not enabled’ errors, both for call traffic and in the portal. These were network wide and not just restricted to a few nodes. We realised that numerous accounts were marked as ‘credit blocked’ and by 11.54 had manually reset them, causing the remaining accounts to have successful calls again.

As our investigations continued, during which service was working normally, there was a second instance of some ‘credit locked’ accounts at 13.47. These were immediately corrected and this repeat assisted us in identifying the cause. We subsequently discovered that one of the services responsible for monitoring calls in progress and disabling accounts had not failed over and was still connected to the old master, and thus continuing to experience connection difficulties. A bug was identified which meant that should this (out of band) service fail to get a returned value for a particular key in Redis (rather than a negative result), the account was treated as disabled. ‘Disabled’ accounts with call attempts are blocked at a different level in our stack, to enable more efficient call rejection. This accounts for the progression of error messages some customers were seeing, i.e. they were first identified as disabled, then marked as blocked for no credit. This calls-in-progress service was stopped as a precautionary measure, the bug was patched and the service was restarted. The incident did not then recur.

Technically speaking, service was degraded from 11.07 until 11.27, with a relatively small percentage of calls affected. However, as a number of accounts experienced complete network-wide call rejection until 11.54, starting at different times for each affected account, we are treating this as an SLA-eligible incident from 11.07 until 11.54, and again, for 1 minute at 13.47. This grossly overstates the aggregate impact according to our statistics but we appreciate that those accounts that were affected experienced a complete loss of service.

We’ve learned some useful lessons through this incident and will schedule remedial work to prevent a recurrence as soon as possible. We’re sorry for the disruption caused here and very grateful to our community Slack members who provided helpful insight to complement our own telemetry, which helped us identify a potentially elusive issue.
This incident was resolved earlier but we've now determined and mitigated root cause. A Post Mortem will follow shortly. We're sorry for the disruption caused and confirm this will trigger SLA credits for eligible customers.
Calls in progress are now updating again.
We have just stopped a service that updates the "calls in progress" values in the portal and is related to credit control. This means the portal will not show calls in progress for the moment.
A fix has been implemented and we are monitoring the results.
We've seen our call volumes drop and a number of customers in our community Slack channel have reported issues. We're investigating.
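The post-mortem above describes a credit-control service that treated "no value returned from Redis" the same as "account disabled". The following is a minimal, hypothetical Python sketch of that bug class and an obvious correction; it is illustrative only and not Simwood's code, and the key names and fallback behaviour are assumptions.

```python
# Hypothetical illustration of the bug class: failing to distinguish "Redis did not
# answer" from "the key explicitly marks the account as disabled".
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

def account_enabled_buggy(account_id: str) -> bool:
    try:
        value = r.get(f"account:{account_id}:enabled")
    except redis.exceptions.ConnectionError:
        value = None
    # BUG: None may simply mean the (old) master could not be reached, yet it is
    # treated like an explicit "disabled" flag, so healthy accounts get blocked.
    return value == b"1"

def account_enabled_fixed(account_id: str) -> bool:
    try:
        value = r.get(f"account:{account_id}:enabled")
    except redis.exceptions.ConnectionError:
        return True  # fail open, or retry against the newly elected master
    return value != b"0"  # only an explicit negative result disables the account
```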
Report: "Bristol Office"
Last update: We are now back in the office and calls are being answered as normal.
Due to a fire alarm within the Bristol office we are currently unable to receive incoming calls to our main telephone number and ticket responses may be delayed. All Simwood services, and the Simwood network, remain unaffected. Please accept our apologies for any inconvenience this may cause.
Report: "Virgin Ported Numbers"
Last update: As of 1845 Virgin have confirmed that this should be resolved, and traffic has been re-routed so affected numbers should now be in service. This incident is part of a wider issue affecting the Virgin network after a third party damaged part of their fibre infrastructure and is outwith the control of Simwood.
We are aware of an issue affecting numbers ported from the Virgin network to Simwood. Virgin have confirmed they are experiencing issues with a fibre break and are working to re-route the affected traffic. This fault is outwith the Simwood network, and our own services are operating normally, however we will continue to update this incident with further information as it becomes available.
Report: "Elevated PDD"
Last update: This incident has been resolved.
The PDD issue has been identified and addressed. Traffic over the last five minutes appears to be returning to normal. We will continue to monitor traffic.
We have identified an issue with excessive PDD on some inbound and outbound calls causing calls to fail with a timeout affecting many customers. We will update as soon as we have affirmative information and by 11:50 regardless.
Whilst aggregate volumes look normal, some customers have reported high PDD or timeouts on certain calls. We're investigating.
Report: "SMS Services"
Last update: This was resolved as of 1615. We are aware that some customers are continuing to receive messages from earlier; this is a result of them being queued upstream of Simwood and we cannot control when the originating network retries where the message could not be delivered initially. Please accept our apologies for any inconvenience caused.
We are aware of an issue affecting some SMS message deliverability and are investigating. Some customers may experience delays in inbound and outbound SMS. This incident will be updated with more information as soon as it is available.
Report: "Database Performance"
Last update: Normal service was restored at 1045, and the backlogged CDRs were processed entirely by 1300. This has been monitored continuously since and has operated as expected. Please accept our apologies for any inconvenience this caused.
We believe this is now resolved and are continuing to monitor the situation. The Portal and API should function as expected, although CDRs will remain backlogged on some accounts for a short time. We will update this incident further when complete.
We are experiencing performance issues with our primary database cluster and are investigating as a priority. As a result, you may experience issues with the performance of the API and Portal; new provisioning and reconfiguration of numbering is not possible, and CDRs are currently lagging behind live. Normal calling and messaging services are unaffected, as these are completely isolated from the primary database. Please accept our apologies for any inconvenience this causes. We will update this incident with further information as soon as it becomes available.
Report: "Support Centre SSL"
Last update: Our Support Centre SSL certificate has changed. This may generate a warning in some browsers, depending on your configuration, or you may notice the "Green Padlock" provided by EV Certificates on some browsers has been replaced. This does not affect the security of your account or tickets, nor does it affect any calls made using TLS or access to the API or Portal, even where strict validation is in use.

Due to an error made by COMODO when issuing our SSL certificate, uncovered as part of a Comodo/Sectigo CA internal audit, there is an encoding error in the certificate used for our support site, support.simwood.com. Although this is an encoding issue and does not affect website security, the CA/B Forum Baseline Requirements for the Issuance and Management of Extended Validation Certificates require that the CA corrects the error by revoking the previously issued certificate and issuing a new certificate. Unfortunately, they have only given just over 24 hours' notice of this, and will not be able to issue a new certificate before the previous one is revoked, as the re-issuance process takes around one working day.

As a result we have temporarily moved the Support Centre to use a certificate provided by LetsEncrypt (https://letsencrypt.org), which is an automated CA operated by the non-profit Internet Security Research Group. This does not affect the API or Portal, which use a different SSL certificate unaffected by this issue.
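For anyone wanting to confirm which CA is currently presenting for support.simwood.com, the following is a small sketch using only the Python standard library; the printed issuer simply reflects whichever certificate is in place at the time, and the expected output is an assumption based on the update above.

```python
# Check the issuer of the certificate currently served by support.simwood.com.
import socket
import ssl

HOST = "support.simwood.com"
ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

issuer = dict(rdn[0] for rdn in cert["issuer"])
print(issuer.get("organizationName"))  # e.g. "Let's Encrypt" while the temporary certificate is in use
```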
Report: "Database cluster issues"
Last update: Billing has fully caught up. Thanks for your patience.
Failover is largely complete and CDRs are now being processed.
We are about to commence failover to the standby cluster as this query rollback is showing no signs of concluding. Once this is concluded we'll mark this incident as 'monitoring'. There are several million CDRs to catch up on so we will leave it unresolved until they are processed.
This remains ongoing but we are making progress. The offending query remains on one node and continues to be in the process of rolling back. Unfortunately, rolling back is less efficient than the problem it caused in the first place. Note this is not an issue with the query per se (a single row delete) but an internal Galera issue triggered by it. Until this rollback completes the cluster remains effectively write-locked but serviceable for reads. We know why this happened and how to prevent it going forwards and have backup nodes with current data ready to take over should we decide to fail over from the existing cluster. As we have no idea whatsoever how long the trigger query will take to roll back on the final node, we have held off failing over in anger in the hope it may be soon, but cannot delay indefinitely. Call traffic remains unaffected and our ops team have been handling most urgent customer issues such as locked balances. We will therefore continue monitoring and update here should anything change.
Whilst not affecting call traffic, we are presently unable to write to our primary database cluster. This is due to an overnight job triggering a bug. The query will eventually work through but we have no way presently of determining how long that will take. We are meanwhile investigating more invasive options. In the interim, this means portal, API and administration options which would normally update the database (e.g. billing, number allocation and pre-pay top-ups) are delayed or non-functional. We're sorry for any impact this will have but, to repeat, call traffic is not affected.
Report: "Database cluster issues"
Last update: The database issues were resolved as of 2154 UK time and CDRs were catching up thereafter. All has been back to normal for some time and we're therefore closing this incident. Thanks for your patience.
The database appears to be recovering and we're continuing to monitor the situation.
We are monitoring a situation with our primary database cluster. We have an expectation of this remedying itself in the next few hours but have an action plan in place for overnight if it does not. In the interim, whilst there is zero impact on production call traffic, CDRs, number provisioning and other configuration changes will be delayed. Given the late hour and non-impact on call traffic, we do not intend sending notifications for updates until resolution or a dramatic change in circumstances. We will however update this page where possible.
Report: "London - Reduced fibre network redundancy"
Last update: This link has been brought back into service, restoring full redundancy around our London ring.
We are seeing one of our physical fibre spans between Telehouse North and Interxion hard down currently, suggesting physical disturbance. We have preemptively shut down links using this whilst it is investigated and corrected. This reduces redundancy around our London ring but as we have multiple other paths, no service impact is expected.
Report: "Elevated PDD"
Last update: Whilst we have very few credible examples here, and all of them demonstrate non-compliance with our interop, we have been able to investigate this issue and pushed some code changes.

We use anycast at every level of our stack, with micro-services consumed by voice nodes anycasted, and the Redis nodes consumed by them similarly anycasted. It is, therefore, incumbent upon call-routing nodes to monitor the health of the services they're consuming, and fail over should the IP address respond but the service not be available. We've found that at certain times, usually after the Redis master has performed a backup, the Redis slaves which are consumed by call-routing show a slight increase in latency. This increase in latency was slight (sub-second) but the tolerances before failover were too tightly set. This caused call-routing to return a failure response, forcing a lookup against a backup [unicast] instance. However, this response was malformed but valid, causing the voice routing node to actually fail the call, and our edge proxy to try another. Further, that voice routing node would be taken out of service for a few seconds, causing something of a cascade which manifested in increased PDD.

In the first instance, we have pushed a change which prevents the false-positive trigger here, i.e. more tolerance of latency increase, and a properly formed failure response. We have however tasked further improvements to prevent the increasing latency in the first place.

Lastly, we do need to highlight that this was only present in one particular site, which anyone conforming to our interop would not have been sending traffic to. The root cause of traffic ending up here in many cases appears to be very old versions of Asterisk which do not respect DNS TTL and will continue to cache a host-name until restarted. Others are hardcoding IP addresses. Whilst there was limited scope for some inbound calls to have been affected, customers who were sending outbound traffic according to our interop, using equipment which respects DNS record expiry, were unaffected.
We are still investigating this but are pleased to say it was short-lived and we have no examples since 12.35BST. We have the grand sum of 7 example calls with PCAPs, after stripping out other issues such as invalid numbers or unrelated interop issues. We are working through those, our own telemetry and monitoring but so far, we have not found the cause. As an aside, all remaining example calls were forced to our London site, either owing to stale DNS or non-use of FQDNs.
Whilst aggregate volumes look normal, some customers have reported high PDD or timeouts on certain calls. We're investigating.
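As a minimal sketch of the kind of failover logic discussed in the post-mortem above (hypothetical names, addresses and thresholds; not Simwood's code): tolerate brief sub-second latency increases on the anycast Redis instance and only fall back to a unicast backup after repeated breaches, returning a well-formed result either way.

```python
# Hypothetical sketch: latency-tolerant lookup against an anycast Redis instance,
# falling back to a unicast backup only after consecutive slow or failed replies.
import time
import redis  # pip install redis

ANYCAST = redis.Redis(host="192.0.2.10")             # placeholder anycast address
UNICAST_BACKUP = redis.Redis(host="198.51.100.10")   # placeholder backup address

LATENCY_BUDGET = 0.5             # seconds; generous enough to ride out post-backup slowdowns
BREACHES_BEFORE_FAILOVER = 3
_consecutive_breaches = 0

def lookup(key: str):
    """Return the value for key, preferring the anycast instance."""
    global _consecutive_breaches
    start = time.monotonic()
    try:
        value = ANYCAST.get(key)
        slow = (time.monotonic() - start) > LATENCY_BUDGET
        _consecutive_breaches = _consecutive_breaches + 1 if slow else 0
        if _consecutive_breaches < BREACHES_BEFORE_FAILOVER:
            return value
    except redis.exceptions.RedisError:
        _consecutive_breaches += 1
    # Fall back to the unicast backup; the caller always gets a well-formed result.
    return UNICAST_BACKUP.get(key)
```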
Report: "Elevated PDD"
Last update: We believe this is now resolved and apologise to anyone affected. Please remember to use DNS in accordance with our interop guidance, to enable speedier resolution of issues like this. We still saw more traffic hitting London than other availability zones even after swinging it away in DNS, suggesting many are not. Thanks also to those in our community Slack for the realtime feedback.
The DNS change at 11am BST (10 UTC) rectified this for all customers using out.simwood.com rather than forcing traffic to London. New nodes have since been deployed in London and are now carrying most traffic. Former nodes are being drained down and will be destroyed once clear.
Some of the nodes in our London availability zone are showing elevated load. DNS has already been changed to Slough and we're in the process of adding new nodes to London in order to cycle the existing ones out.
Whilst aggregate volumes look normal, some customers have reported high PDD or timeouts on certain calls. We're investigating.
Report: "Intermittent issues to UK 080 destinations."
Last update: This has been resolved by rerouting affected destinations.
We're aware of intermittent issues reaching some UK 080 (Freephone) destinations. We're working to reroute traffic and this should be complete shortly. Inbound calls to Simwood or hosted 080 numbers are not affected.
Report: "Slough - spurious 503 errors"
Last update: As discussed in the community Slack channel, we had credible reports of unexpected 503 errors from some calls routed to our Slough Availability Zone. Volumes were also showing as lower than normal in our own telemetry. On investigation, one of the edge routing containers was in an unusual state whereby it was unable to route calls onwards internally. It was up and otherwise responding normally. This was rectified by restarting it. We did not force a DNS change on this occasion as the issue was resolved quickly once credible reports were received, but appropriately configured customer equipment should have respected published SRV records and retried via another site. We will log a non-conformance report internally, as our own monitoring should have detected this situation and will need work to do so.
Report: "Intermittent Calls Failing"
Last update: This incident is fully resolved and CDRs are caught up.

In an overnight configuration change, one of our customers created a situation where, for each inbound call to their numbers, they generated repeated attempts outbound to that number, sometimes thousands. This was configured across multiple numbers and affected two of them in separate incidents this morning. The consequence was tens of thousands of additional calls in-flight outbound at any one time, which were then coming back in to the Simwood network. To make matters worse, their outbound calls were also egressing over other routes, but looping back and coming into the Simwood network from various other carriers. We show over 1m such calls in the first hours of this morning. The customer's equipment was then overloaded, causing calls to eventually time out, but of course compounding the number in-flight. Calls were rejected here due to rate and channel limits but the rate and amplification were such that this didn't totally alleviate the problem.

From the Simwood side, this caused load issues predominantly in London which manifested as increased PDD for customers. Based on reports so far and what we've seen, other Availability Zones were unaffected but we were seeing this traffic across all of them. In both cases the numbers were blocked here, which restored the situation to normal, and, separately, BT NMC mitigated a problem caused for them by 'gapping' (a.k.a. rate-limiting) the affected number range on their network, for calls destined for Simwood having originated from other carriers we do not have bilaterals with.
Calls should be completing normally now but we’re aware there is a delay in CDRs being processed.
We are investigating sporadic reports of calls failing, and more information will be provided as soon as it becomes available.
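The post-mortem above mentions calls being rejected under rate and channel limits. As a purely illustrative sketch (the thresholds and data structure are assumptions, not Simwood's values), a per-number sliding-window rate limit of the kind that caps this sort of amplification might look like the following.

```python
# Hypothetical per-destination rate limit: reject further attempts to a number once
# attempts within the last window exceed a cap.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ATTEMPTS_PER_WINDOW = 100    # assumed cap, for illustration only

_attempts: dict[str, deque] = defaultdict(deque)

def allow_call(number: str, now: float | None = None) -> bool:
    """Return True if a call attempt to `number` is within the rate limit."""
    now = time.monotonic() if now is None else now
    window = _attempts[number]
    # Drop attempts that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_ATTEMPTS_PER_WINDOW:
        return False  # e.g. reject the attempt rather than route it
    window.append(now)
    return True
```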
Report: "Routing instability London"
Last update: Cogent acknowledged this incident an hour or so after we mitigated it and have now provided the following RFO:

> Your service connected at London Volta may have experienced connectivity issues for some minutes.
>
> During the execution of a non impacting maintenance in our network, we made a mistake and applied some configuration changes on the wrong device, causing the isolation of a node router. As soon as we realized about the mistake, we reverted the config changes on the affected device.
>
> This has been a human error. However we will review the maintenance process to see if there is any room of improvement to avoid a similar issue in the future.
>
> Apologies for any inconvenience caused.

Thankfully, we have multiple transit providers and multiple connections to each distributed around the network, so shutting one down in this kind of situation, or even for a prolonged period, is a non-event. Further, only 10% of our traffic flows over transit - 90% is directly on-net or flows over bilateral peering. We continue to encourage customers to connect as directly as possible - either by being on-net directly, cross-connected to us in a common data centre, or having your colocation provider peer with us if they don’t already. Please speak to us if you’d like this. We will be testing Cogent’s fix and re-enabling this session later this evening.
Between 23.21 and 23.30 (UK time), customers connecting services in our London availability zone but reaching us over Cogent transit, may have seen instability. Cogent were announcing our routes but not passing traffic. The session was shut down from our side to alleviate things. Those on-net or reaching us over peering sessions (which is the majority) and other ultimate transit providers, or reaching other availability zones over Cogent would have been unaffected. We strongly encourage all customers to connect into us directly wherever possible.