An internet connection is a complicated thing...

What happened to our internet? Our certificate authority (CA) certificate expired. It sounds simple enough, right? It does, but figuring this out was one of the most difficult IT experiences of my career, mostly because the outages were intermittent and had no clear cause. I thought about telling my complete 5-day horror story and what led me to narrow down the cause, but to avoid getting too long-winded I’ll just stick to my Wednesday evening network fun-time.

Our primary domain controller also operates as a certificate authority server. A Windows domain might need this to issue certificates that enable secure connections between devices. Here’s a Wikipedia article explaining what a Certificate Authority is, and here’s a Microsoft article demonstrating how to set one up. You might notice item 16 on that page, validity period. Our main certificate had a validity period of 5 years, from 7/23/2014 to 7/23/2019. Needless to say, this IT person isn’t thinking about the certificate authority expiring every day when the period is five years long, especially since the consequences are not immediately apparent when it does expire.
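A reminder would have saved me here, so below is a minimal sketch of the kind of check I could have scheduled. It only does the date arithmetic: the function names and the 90-day warning window are my own assumptions, and I am assuming the certificate expired at midnight UTC on 7/23/2019 because I only have the date, not the exact time.

```python
from datetime import datetime, timedelta, timezone

# Validity end taken from the post (7/23/2019). Midnight UTC is an assumption;
# a real certificate records an exact time.
CA_NOT_AFTER = datetime(2019, 7, 23, tzinfo=timezone.utc)


def days_until_expiry(not_after, now=None):
    """Return whole days left before not_after (negative once it has passed)."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days


def expiry_status(not_after, now=None, warn_days=90):
    """Classify a certificate as 'expired', 'expiring soon', or 'ok'.

    warn_days is a hypothetical warning window, not a Microsoft default.
    """
    now = now or datetime.now(timezone.utc)
    if now >= not_after:
        return "expired"
    if not_after - now <= timedelta(days=warn_days):
        return "expiring soon"
    return "ok"
```

Run from a scheduled task with the real expiry date, anything other than "ok" could send an email well before users start losing their connection.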

So what did this break? It broke the SSO directory connector that our Sonicwall (firewall) uses to authenticate Active Directory users against our Windows environment. That was not glaringly obvious either, because when I checked the connection status, all lights showed green. There were no errors in the Sonicwall interface pointing to the problem. The one thing that made it obvious to me that the firewall was involved was that, on a computer experiencing connection issues, I could log in to the Sonicwall admin page and that would instantly solve the connection problems. Logging in applies the administrator internet profile, which bypasses the content filtering (CFS policies) altogether.

I call Sonicwall support, and after a 25-minute wait, they run a test connection that reveals a failure to identify an AD user with the LDAP connector. The support agent recommends upgrading the SonicOS firmware to the latest version, 6.5.4.4, to see if this solves the problem. I opt to do this after hours, and I’m so very glad I did, as the process messed up another major component I’ll get to in a minute.

I upgrade the firmware and, of course, the entire user interface is different. Everything I was familiar with browsing through and configuring over the last several years looks completely foreign to me. On the plus side, the LDAP connector is finally providing feedback on the connection error. Not knowing how to solve this problem, I call support back and wait 35 minutes for an answer. This agent discovers that the TLS connection cannot find the LDAP server; disabling TLS, however, fixes the problem. He suggests I check my server logs, where I immediately find errors stating that my certificate authority has expired. We leave TLS disabled for now to give me some time to fix the CA expiry issue. It does need to be fixed so that the SSO connection to LDAP is secure, preventing man-in-the-middle attacks from compromising credentials.
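In hindsight, a quick probe from any workstation could have pointed at the certificate much earlier. The sketch below is not what the Sonicwall does internally; it simply attempts a verified TLS handshake and reports why it failed. The function name is made up, port 636 (LDAP over TLS) is an assumption about how the connector talks to the server, and `cafile` would be the path to an exported copy of the domain's root CA certificate.

```python
import socket
import ssl


def probe_tls(host, port=636, cafile=None, timeout=5.0):
    """Try a verified TLS handshake and return (ok, detail).

    cafile: optional path to a PEM file with the CA to trust. If omitted,
    the operating system's default trust store is used.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # wrap_socket performs the handshake and verifies the certificate chain.
            with context.wrap_socket(sock, server_hostname=host):
                return True, "TLS handshake and certificate verification succeeded"
    except ssl.SSLCertVerificationError as exc:
        # verify_message carries the verification failure text from the TLS library.
        return False, f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return False, f"TLS error: {exc}"
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return False, f"connection error: {exc}"


# Example call (hypothetical host name and file path):
# ok, detail = probe_tls("dc01.example.local", cafile="domain-root-ca.pem")
```

A result that starts with "certificate verification failed" separates a certificate problem from a plain network problem, which is exactly the distinction I was missing while every status light showed green.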

I assume this solves our internet problems. Users are now recognized inside the Sonicwall, so they should get their correct CFS policies based on their AD user group, just as before. My internet is working (because I am operating as administrator), so everyone else’s should be fine, right? As I pack up my things and head out the door, I notice the screen in the hallway has a Sonicwall error on it. Maybe that’s left over from our troubleshooting process. Maybe I should test a user’s internet before I leave. I hop on a computer, test a teacher account I recently created, and get blocked from everything! I attempt to navigate to our school website, richlandschools.org, and it’s blocked due to the category “Education.” We’ve got a major problem.

I log in to the Sonicwall, search for the CFS policies that might be blocking a category such as Education, and discover 64 individual, arbitrarily named policies. I had 4 before the firmware update! Panic sets in. I call Sonicwall support again. Twenty minutes later, an agent helps me by explaining how these policies are defined. The entire structure of how they are created and applied has changed. By looking at which groups each policy is applied to, I start to recognize what the policies were probably named before the update, and suddenly the list of 64 makes sense. One of the generated policies allows only radicalism and extremism and nothing else! I don’t think we’re going to need that policy here. I rename the policies, cut out all of the erroneously applied ones, and simplify the list down to 6. Finally, the internet is working correctly for users. I mention to the agent that my CA was broken. He even walks me through fixing that problem by renewing the certificate authority on my server and re-enabling TLS successfully!

What did I learn from all of this? I should have checked my server logs first. I should have known this, as it was a daily task at a previous IT position. The Sonicwall contains logs too; however, I am less familiar with analyzing them. I’ve been lucky in the past, fixing problems by simply restarting servers and switches. When that fails, check the logs and don’t be afraid to call support!