Where did our Search go?

Azure Search will break your Sitecore 8.2 website

A few days ago, one of our Azure production environments running Sitecore suddenly stopped working. At first it seemed to be just a ‘regular’ hiccup, and Azure reported auto-recovery:

The unavailability was real, but the environment didn’t recover as it should have. And the remark that no action was required wasn’t quite true, to say the least: the website went down and kept showing our neatly formatted 500 page. Application Insights quickly showed us that something was wrong with the indexes, as the following exception popped up continuously on the Content Delivery role:

However, we hadn’t deployed new code to that environment lately, and nothing had changed in the infrastructure. As it happened overnight, over the weekend, we also knew that no content changes had been made in Sitecore. We checked the indexes in Azure Search and saw that some, but not all, were missing. The ones that did exist worked fine. So the first thing that came to mind was to rebuild the indexes (the Sitecore client and Control Panel still worked). When doing so, we saw the following exception:

I restarted the slot of the Content Management role, and then all indexes threw this exception, one by one:

So obviously, Sitecore thought there was no Azure Search instance. Azure Search, on the other hand, showed no signs of malfunctioning. We tried recreating indexes manually, recreating the Azure Search resource, and even recreating a full duplicate of the production environment in a new resource group, which was easy to do because the whole environment is scripted with ARM and PowerShell, triggered by an Azure DevOps release pipeline. Still, no search. Remarkable ‘detail’: the Test and Acceptance environments were running fine! Same code, same ARM, same data. We were puzzled…

Out of the blue?

So why, out of all of our Sitecore environments running on Azure Web Apps (over 10 production websites), did only this one stop working? Why did it happen overnight, without a release? And why did only the Production environment fail, along with every new resource and environment we created afterwards? Looking for a reason why Sitecore couldn’t ‘see’ Azure Search, and thinking about connectivity and SSL trust issues, we first checked all certificates and the root certificate, and finally stumbled upon the protocol. A few PowerShell tests, invoking Azure Search from the Web App, showed that Azure Search only responded when the SecurityProtocol was explicitly set to TLS1.2. Bingo!
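
The PowerShell tests themselves are not reproduced here. As a rough illustration of the same kind of check, the Python sketch below pins a client to exactly one TLS version and reports whether a handshake with a host completes. This is not the script we ran: the function names and the host name in the usage note are hypothetical, and a failure for TLS 1.0 or 1.1 can also come from the local OpenSSL policy refusing those versions rather than from the server. Recent Python versions also emit a deprecation warning when the older versions are selected.

```python
import socket
import ssl


def pinned_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Build a client context that negotiates exactly `version` and nothing else."""
    ctx = ssl.create_default_context()
    # Set the minimum first, then cap the maximum at the same version.
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def handshake_succeeds(host: str, version: ssl.TLSVersion,
                       port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with `host` completes using only `version`."""
    ctx = pinned_context(version)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError, timeouts and DNS failures are all subclasses of OSError.
        return False


def probe_versions(host: str) -> dict:
    """Try TLS 1.0, 1.1 and 1.2 in turn and map each version name to the outcome."""
    versions = (ssl.TLSVersion.TLSv1, ssl.TLSVersion.TLSv1_1, ssl.TLSVersion.TLSv1_2)
    return {v.name: handshake_succeeds(host, v) for v in versions}
```

Calling `probe_versions("my-search-service.search.windows.net")` (a placeholder host name) returns a dictionary keyed by version name. A result where only `TLSv1_2` is true would match what we observed, although it would not by itself prove that the server is the side rejecting the older versions.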

With this information, it only took us a minute to find this Knowledge Base article: https://kb.sitecore.net/articles/688903. It didn’t mention Azure Search, and the exception was different (which was one of the reasons it didn’t pop up when Googling for the issue), but the solution seemed applicable. We installed the patch, using the config and DLL that injected the custom processor setting the default protocol for outgoing traffic to TLS1.2, and there it was… Sitecore started rebuilding the indexes and our website came back online!

Just after fixing the issue, we found the following article: http://status.cloud.sitecore.net/maintenance/402e570b-2199-4336-886c-78f81b900ca0. This exactly describes our issue, but note that this doesn’t only apply to Managed Cloud (logically).

The Root Cause

Azure started rolling out TLS1.2 as the default (required) protocol for Azure Search. The rollout started on February 26th, and all instances are being upgraded gradually, starting with the US regions. All of our Sitecore implementations are running on Sitecore 9 by now, except this one. That explains why the issue only occurred on this particular implementation: Sitecore 8.2 runs on .NET Framework 4.5.2, which doesn’t use the TLS1.2 protocol as its default. Newer versions of .NET, and thus Sitecore, do.

The hiccup on Production probably caused a restart of the resources, introducing the new Azure Search version and its underlying architecture. That’s why Test and Acceptance were still running fine. Once we restarted those environments manually, Azure Search stopped working there as well. I am not sure whether a regular hiccup triggered the automatic rollout of the new Azure Search version, or the other way around: that the automatic rollout of the new version kicked our website into its failing state.

Either way, it was difficult to find the root cause, as the exception didn’t tell much about what was going on (something concerning a connectivity issue would’ve at least pointed us in the right direction) and, at the time, Googling for it didn’t turn up much relevant info. Neither Azure nor Sitecore support came up with this cause either. After we solved the issue, a few other members of the Sitecore Slack Community reported the same problem. So by writing this elaborate blog on the issue, including the exceptions and behavior we saw, I’m hoping to add a bit of relevant information to the web, making this issue easier to trace back to its root cause for others bumping into the same thing.

Edit: I was notified that the same issue applies to running the Publishing Service on Sitecore 8.2.
