Juniper EX3400: How to Recover from PoE Firmware Upgrade Failure

Did you know that Juniper EX switches have PoE firmware updates that need to be applied?

[Image: Chelsea Lately: “Great question. I had no idea.”]

Well, I didn’t until about a year ago when I did an upgrade and was checking on PoE power. Looking at the controller info from show poe controller, I noticed the following:

[Image: Juniper PoE firmware available]

Huh. OK. Well, I’ve got an eight-unit stack here, and the Juniper EX software upgrade is usually pretty solid, so let’s upgrade it. It goes off without a hitch.

Fast forward nine months, and I’m running into strange issues with PoE and Mercury door controllers, particularly model ‘MRE62E’. Basically, the Juniper switches won’t provide power to this model, even though the older MRE52s had no problem. Checking the firmware version using show chassis firmware detail, I noticed that the switch had the older 1.x PoE firmware and not the new 2.x.

[Image: PoE firmware 1.6.1.21.1]

Alrighty then — let’s upgrade this stack. I upgrade the software using the latest JTAC recommended version (staying in 15.x), then upgrade the PoE firmware — no problem. Door controller is now getting power, I see a MAC address. Everything is hunky dory.

Now let’s upgrade this other stack.

No problems on EX software upgrade. Great. Now upgrade PoE firmware…

Ten minutes later, I get the following on the terminal:

[Image: Magic Thread Message]

Of note, and the thing that made me panic, was that out of nine switches in the stack, only one came back online. Checking the firmware versions, I see the following:

[Image: Various PoE firmware versions: some missing, some 0.0.0.0.0, only one 2.x]

Okay… F***. Well, let’s reboot the stack; perhaps a reboot is needed*. After reboot, I get the following:

[Image: PoE Device Fail on FPC 8. All but FPC 2 are missing.]

WTF.

[Image: Guy shaking head, mouthing “WTF”]

In the past, whenever a PoE firmware upgrade failed on me (between when I first learned about it and now), I had no recourse but to RMA the switch. Well, in this case, I don’t have eight spare switches to fill in temporarily while I wait for an RMA! WTF am I going to do?!

Solving the PoE Firmware Upgrade Failure

If you’re in the same situation as I was in, take a deep breath — you’re not dead in the water.

There are two scenarios for a PoE firmware upgrade failure that I’ve encountered, and I have a solution for both:

  • PoE Firmware Failure #1 – After firmware upgrade, you see a mixed result of firmware versions, some being 0.0.0.0.0, some being correct (2.1.1.19.3**), and some missing/blank (see picture above showing mixed/missing versions)
  • PoE Firmware Failure #2 – Perhaps you did as I did and rebooted, and now show poe controller lists a controller with the message ‘DEVICE_FAILED’ (see above)
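
If you want a quick way to triage a big stack, the version check behind failure #1 can be sketched in a few lines of Python. This is only a sketch under assumptions, not a Juniper tool: it assumes you have already copied each FPC’s PoE version out of show chassis firmware detail by hand into a dictionary, and the function names (classify_poe_version, slots_not_ok) are made up for this example.

```python
from typing import Dict, List, Optional

EXPECTED_VERSION = "2.1.1.19.3"  # target PoE firmware version named in this post
ZEROED_VERSION = "0.0.0.0.0"     # what a failed upgrade sometimes reports


def classify_poe_version(version: Optional[str],
                         expected: str = EXPECTED_VERSION) -> str:
    """Classify one FPC's reported PoE version.

    Returns 'ok', 'zeroed', 'missing', or 'other' (for example, old 1.x firmware).
    """
    if version is None or not version.strip():
        return "missing"
    cleaned = version.strip()
    if cleaned == expected:
        return "ok"
    if cleaned == ZEROED_VERSION:
        return "zeroed"
    return "other"


def slots_not_ok(versions: Dict[int, Optional[str]],
                 expected: str = EXPECTED_VERSION) -> List[int]:
    """Return the FPC slots (sorted) that are not on the expected version."""
    return sorted(
        slot for slot, version in versions.items()
        if classify_poe_version(version, expected) != "ok"
    )


if __name__ == "__main__":
    # Hypothetical stack: slot numbers and versions typed in by hand.
    stack = {0: "0.0.0.0.0", 1: None, 2: "2.1.1.19.3", 3: ""}
    for slot, version in sorted(stack.items()):
        print(slot, classify_poe_version(version))
    print("Needs attention:", slots_not_ok(stack))
```

Anything the sketch flags as zeroed or missing matches failure #1; it can’t tell you about DEVICE_FAILED, which you only see in show poe controller.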

Solution for PoE Firmware Failure #1

If you encounter this failure, DON’T REBOOT THE STACK. You’ll make your life harder if you do.

Next, Juniper TAC (finally) has a solution, and it requires remote/on-site hands. If you’re going on-site or working with someone remotely, get yourself a cup of coffee (or beverage of choice) and line up some podcasts, because you’re going to be doing this for a while (~10 minutes for each switch/FPC).

From their site, the solution is the following (with my own notes):

  1. Power cycle the affected FPC (re-seat the power cord). Do not perform a soft reboot.
  2. After the FPC joins the VC or the standalone device reboots, execute one of the following commands in operational mode:

    OR

    JTAC Note: You need to change the fpc-slot number accordingly. Also, it is recommended that you push the PoE code one by one instead of adding all members in the virtual-chassis setup. (Emphasis mine)
  3. After the above command is executed, the FPC should automatically reboot. If not, reboot from the Command Line Interface.
    Note: Be patient and wait. No, seriously…wait. It takes a while. If you need to reboot, you’re rebooting the whole unit AFAIK:
  4. After the FPC is online, check the PoE version with the show chassis firmware detail command. The PoE version should be the latest version (2.1.1.19.3) after the above steps are completed.
  5. If the version is correct, the PoE devices should work.
  6. Repeat the above steps to upgrade the PoE versions on other FPCs in the virtual-chassis setup.

One thing to note: while the upgrade is running, you can watch the progress with show poe controller, but at some point it will hang at 95%, then disappear, then come back, and then the process will be complete. In other words…WAIT, unless you want to try out the solution for failure #2. 😆

Solution for PoE Firmware Failure #2

In this scenario, you rebooted the stack and something failed. The procedure is similar to solution #1, but the failed PoE controller basically has to be upgraded twice. The steps:

  1. Execute the following command to reload the firmware on the FPC:

    Note: You need to change the fpc-slot number accordingly.
    The PoE controller will disappear when you run show poe controller, then come back and start upgrading like this:
    [Image: PoE firmware upgrading]
  2. After the firmware upgrade completes, the firmware will likely be incorrect (it always was for me). Power cycle the affected FPC (re-seat the power cord). Do not perform a soft reboot.
  3. After the FPC joins the VC or the standalone device reboots, execute one of the following commands in operational mode:

    JTAC Note: You need to change the fpc-slot number accordingly. Also, it is recommended that you push the PoE code one by one instead of adding all members in the virtual-chassis setup. (Emphasis mine)
  4. After the above command is executed, the FPC should automatically reboot. If not, reboot from the Command Line Interface.
    Note: Be patient and wait. No, seriously…wait. It takes a while. If you need to reboot, you’re rebooting the whole unit AFAIK:
  5. After the FPC is online, check the PoE version with the show chassis firmware detail command. The PoE version should be the latest version (2.1.1.19.3) after the above steps are completed.
  6. If the version is correct, the PoE devices should look like this:
    [Image: Successful PoE firmware upgrade]
  7. Repeat the above steps to upgrade the PoE versions on other FPCs in the virtual-chassis setup.

Just like solution #1, one thing to note is that when it’s doing its upgrade you can see the progress with show poe controller, but at some point it will hang at 95%, then disappear, then come back, then the process will be complete — in other words…WAIT! You don’t really want to re-apply this whole process, do you?

Final Thoughts

Here’s the kicker for me: I’ve had this work just fine on both stacks and standalone switches, and fail on both stacks and standalone switches, so I can’t find the common denominator. Perhaps some hardware builds hit this more than others, but I can’t figure it out. The official documentation doesn’t hint at a best practice for this (other than doing it during maintenance hours), so I’m uncertain about the best approach.

Here are some ideas I have for changing my PoE firmware upgrade procedure (unsure if they will help):

  • Turning off PoE on all interfaces
  • Upgrading one switch at a time
  • Trying an earlier version of the JTAC recommended software, then going to the latest recommended. Example: I had no problems with 15.1X53-D59.4 or 15.1X53-D590, but the sample size for determining that is small (only two stacks attempted).

Time will tell.

Hope this helps! If it doesn’t, I’d love to hear what others have experienced.

* I swear I saw a message that a reboot is required, but I can’t confirm this (I didn’t screencap it)

** There is a version 3.4.8.0.26, but that’s on the 18.x software version line, and it requires a whole different set of upgrade procedures. This is outside the scope of this post.

SCCM Peer Cache: When Reversing It Doesn’t Reverse It

(Note: For some reason I wrote this up in December 2017 and never published it. Maybe I forgot to add some links, but I put the work in and it still seems relevant. As noted at the bottom, this should have been resolved in 1802.)

Last week I had some SCCM woes with the peer cache feature, the gist of which is that application install steps during OSD would effectively stall out. Why? That was the great mystery that had me sweating overcaffeinated bullets as people out in the field notified me and my boss that they couldn’t image, and of course at a time when certain important devices needed to be imaged.

“Why in the world is this not working?” I asked myself. I can only presume it was the result of me enabling the feature across our organization, but there’s more to the story than that.

I know what you’re saying: “Did you freaking test it before deploying it?”

Of course I did. I had spent the last few months testing BranchCache and Peer Cache in a lab setup and then at a local site. Both were working well, and I had no indication that either was causing a problem. In fact, I was able to measure noticeable improvements in application and software update delivery as a result of enabling the changes! However, I never had an issue with OSD in my lab or at the site I tested, so I had no reason to expect one.

What I encountered in CAS.log during OSD was this on all the affected machines:

There was quite a bit more than that, but this is what peer caching is supposed to do: it effectively creates a bunch of mini-DPs across your boundary group. There’s one problem I didn’t take into consideration, though, and it’s why my test environment didn’t have this problem while production did: we have a TON, and I mean a TON, of laptops, and those laptops are mostly in carts, powered off or (hopefully) sleeping. So peer caching may not work for us.

But then why didn’t the distribution point take over? Why didn’t the client download from it? No idea, but I needed to move on, fast.

After seeing those logs (note that the URL has “BranchCache” in its name even though it’s actually peer cache, which I didn’t know at the time) and knowing the change I had made recently, I figured I’d just reverse the changes and it’d be all good, right?

[Image: Thumbs up. “We got this. We’ll just reverse the changes.”]

Wrong.

Well then what the hell is going on?

[Image: “What the hell?”]

Feeling even more under the gun now that I’m completely baffled by what’s happening, I engage Microsoft Premier support: I could keep plugging away and googling the problem to death, or I could cut to the chase and get Microsoft involved.

Microsoft gets in touch with me, and after going over all the information I sent them and the logs I had been looking at, the tech fairly quickly identifies the issue as a problem with the current build of peer cache (as of 2018.11.01-ish). Apparently, even when peer cache is disabled in client policy, the change doesn’t actually take effect and the SCCM database still contains all the super peer entries. The fix that resolved it was to delete the super peers from the DB with these SQL queries/commands:

Bam! The problem was solved. Mostly. Kind of. The tech thought OSD was working, so it must be fixed.

The problem, though, is that the database keeps filling up with super peer information, so it needs to be routinely cleared out, and the super peer clients need to update their super peer state. So after following these two blogs, and then getting annoyed with cleaning the DB manually and updating the collection, I put together this crude script to run as a scheduled task and take care of it.

(Edit 20180525): To run this script, you’ll need a few prereqs:

  • PowerShell 5.1. The script was tested on that version; you can find yours by typing $PSVersionTable in a PowerShell terminal. It may work on earlier versions, but I never tested that.
  • SCCM Admin console installed on the machine you’ll run this from.
  • The SqlServer module installed. Assuming you’re on PowerShell 5.1, you can get it by running ‘Install-Module SqlServer’, then import it with ‘Import-Module SqlServer’.
  • Finally, you’ll need to adjust the script for your own local information (site code, servers, etc.)

(Edit2 20180529): After reading this over again, it might be helpful if I explain what my script does, at least at a high level. The comments in the code explain what it does line by line. What the script below does:

  • Imports modules needed (SCCM and SQL)
  • Reads superPeers.txt and performs a SQL query to get the current Super Peers, then combines both lists
  • Creates an SCCM collection based on the resourceIDs that we just ingested
  • Invokes a client update notification telling the Super Peers to update their client policies
  • Keeps a list of all resourceIDs used for this process
  • Deletes the Super Peers and Super Peers mappings from the database

The basic idea is to get these various devices out there to update themselves and to clear them out of the database, otherwise other devices may try to still use them as Super Peer/mini-DP.
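
To make the “combine both lists and keep track of every resourceID” part concrete, here is a rough Python sketch of just that bookkeeping step. It is not my actual script (that one is PowerShell), and it rests on assumptions: that superPeers.txt holds one resource ID per line, that the IDs from the SQL query are handed in as a plain list, and that duplicates should be dropped. The function names are made up for the example.

```python
from pathlib import Path
from typing import Iterable, List


def load_resource_ids(path: Path) -> List[int]:
    """Read one resource ID per line; a missing file counts as an empty list."""
    if not path.exists():
        return []
    return [int(line) for line in path.read_text().splitlines() if line.strip()]


def merge_resource_ids(saved: Iterable[int], current: Iterable[int]) -> List[int]:
    """Combine saved and current IDs, dropping duplicates, keeping first-seen order."""
    return list(dict.fromkeys([*saved, *current]))


def save_resource_ids(path: Path, ids: Iterable[int]) -> None:
    """Write the IDs back out, one per line, for the next run."""
    path.write_text("".join(f"{resource_id}\n" for resource_id in ids))


if __name__ == "__main__":
    tracking_file = Path("superPeers.txt")
    saved_ids = load_resource_ids(tracking_file)
    # Hypothetical values standing in for the IDs returned by the SQL query.
    current_ids = [16777221, 16777300]
    all_ids = merge_resource_ids(saved_ids, current_ids)
    save_resource_ids(tracking_file, all_ids)
    print(all_ids)
```

The merged list is what would feed the collection and the client notification; keeping it on disk means machines that were powered off during one run still get picked up on the next.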

Next, what I’ve done is run this script in an elevated prompt and let it do its thing.

Script:

Update: As of December 2017, the issue still persisted, which might have been because the clients weren’t getting their client policies updated, so the Microsoft tech had me recreate some of the client policies and deploy them. The issue seems to have been fixed as those dang laptops started getting powered on. The tech also informed me that this behavior is resolved in SCCM 1802.

Also, I suspect that the issue was due not only to laptops becoming super peers and then not being powered on, but also to the boundary groups being too broad and spanning too many sites. That wasn’t the primary issue, but it definitely contributed.

We have continued to use BranchCache, and it’s amazing how well it is working in our organization, even with a ton of laptops in carts (45-53% of content is sourced from BranchCache at these sites).

AudioCodes Mediant 1000 One-Way Outbound Audio on SIP Trunk

Had a strange issue recently when I was setting up a SIP trunk between two Mediant 1000s (M1K for short). The SIP trunk had one-way audio: I could receive media/RTP from the other side, but the new M1K wasn’t sending any RTP packets whatsoever. It was the oddest thing, because there was nothing special about this SIP trunk; it sat within a secure layer 2 network (no auth, no TLS).

I had to engage AudioCodes about the issue because I was completely puzzled. This isn’t complicated (relatively speaking): point the SIP trunk to the next hop, and assuming the network configuration is correct, there shouldn’t be an issue. When I did a Wireshark capture, it showed SIP traffic, but no RTP whatsoever:

[Image: Wireshark capture showing SIP traffic but no outbound RTP]

After going through the usual initial responses from AudioCodes (adjust the IP profile, adjust this, adjust other things that I had already done or that were inconsequential to the issue I was having), they finally set up a remote support session.

Within minutes, the tech identified the issue.

The network card that you purchase from AudioCodes comes with four Ethernet ports, and those are configured in two pairs for redundancy: in my case, GE_7_1 and GE_7_2 as one pair, and GE_7_3 and GE_7_4 as the other. I had reconfigured ports GE_7_1 and GE_7_2 to be independent ports operating in what AudioCodes calls ‘Single’ mode.

Here’s the problem: in version 6.8 of the M1K software, you can configure the ports to operate this way in the GUI, but the software doesn’t actually support this function.

Why would the software allow you to configure it one way, but not support it in the back end? No idea. I’ll chalk it up to the same reason why you can use the ‘Search’ button on the top left, find settings that you actually don’t have support for and can’t find by just clicking around, configure those settings, and those settings won’t actually work.

[Image: AudioCodes Search button]

Anyway, here’s the solution: you can either stick with 6.8 and move the Ethernet group to use GE_7_3 (or any other odd-numbered interface on a network card), or upgrade to 7.0, which actually supports this configuration.

My configuration ended up looking something like this:

[Image: AudioCodes Ethernet groups configuration]
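
If the pairing is hard to keep straight, here is a tiny Python sketch of the port logic described above. It is purely illustrative and rests on an assumption: that ports on a card are always paired (1,2), (3,4) the way GE_7_1 through GE_7_4 were on mine. The helper names are made up, and nothing here talks to the M1K.

```python
import re
from typing import Tuple

# Matches names like GE_7_3: module number, then port number.
_PORT_NAME = re.compile(r"^GE_(\d+)_(\d+)$")


def parse_port(name: str) -> Tuple[int, int]:
    """Split a name like 'GE_7_3' into (module, port)."""
    match = _PORT_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized port name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def pair_partner(name: str) -> str:
    """Return the other port in the same redundancy pair (assumed (1,2), (3,4))."""
    module, port = parse_port(name)
    partner = port + 1 if port % 2 == 1 else port - 1
    return f"GE_{module}_{partner}"


def is_odd_numbered(name: str) -> bool:
    """True for the odd-numbered port of a pair, e.g. GE_7_1 or GE_7_3."""
    return parse_port(name)[1] % 2 == 1


if __name__ == "__main__":
    for port in ("GE_7_1", "GE_7_2", "GE_7_3", "GE_7_4"):
        print(port, "pairs with", pair_partner(port), "| odd:", is_odd_numbered(port))
```

On 6.8, the ports that come back as odd-numbered are the ones worth assigning to an Ethernet group.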

Hope that helps someone out there.