(Note: For some reason I wrote this up in December 2017 and never published it. Maybe I forgot to add some links, but I put the work in and it seems to still be relevant. As noted at the bottom, this should have been resolved in 1802.)
Last week I had some SCCM woes with the peer cache feature, the gist of which is that application install steps during OSD would effectively stall out. Why? That was the great mystery that had me sweating overcaffeinated bullets as people out in the field were notifying me and my boss that they couldn’t image, and of course at a time when certain important devices needed to be imaged.
“Why in the world is this not working?” I asked myself. I can only presume it was the result of me enabling the feature across our organization, but there’s more to the story than that.
I know what you’re saying: “Did you freaking test it before deploying it?”
Of course I did. I had spent the last few months testing BranchCache and Peer Cache in a lab setup and then in a local site. They both were working well, and I had no indications that either was causing a problem. In fact, I was able to measure noticeable improvements in application and software update delivery as a result of enabling the changes! However, I never had an issue with OSD in my lab or at the site I tested, and so I had no reason to expect one.
What I encountered in CAS.log during OSD was this on all the affected machines:
<![LOG[ Matching DP location found 0 - https://machine1.contoso.org:8003/sccm_branchcache$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f (Locality: ADSITEPEER)]LOG]!><time="17:07:20.657+360" date="11-02-2017" component="ContentAccess" context="" type="1" thread="3804" file="downloadcontentrequest.cpp:1020">
<![LOG[ Matching DP location found 1 - https://machine2.contoso.org:8003/sccm_branchcache$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f (Locality: ADSITEPEER)]LOG]!><time="17:07:20.657+360" date="11-02-2017" component="ContentAccess" context="" type="1" thread="3804" file="downloadcontentrequest.cpp:1020">
<![LOG[ Matching DP location found 2 - http://dp02.contoso.org/sms_dp_smspkg$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f.1 (Locality: ADSITE)]LOG]!><time="17:07:20.657+360" date="11-02-2017" component="ContentAccess" context="" type="1" thread="3804" file="downloadcontentrequest.cpp:1020">
And quite a bit more than that, but this is what peer caching is supposed to do. It effectively creates a bunch of mini-DPs across your boundary group, but there’s one problem that I didn’t take into consideration, and it’s why my environment that I tested in didn’t have this problem but the problem appeared in production: we have a TON, and I mean a TON, of laptops, and those laptops are mostly in carts powered off or (hopefully) sleeping. So peer caching may not work for us.
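To get a quick feel for how many of the candidate sources in a CAS.log excerpt are peers versus real distribution points, the "Matching DP location found" lines can be tallied by locality. This is a minimal sketch, not an official log parser: the sample lines are trimmed copies of the excerpt above, and the regex, the `$locations` shape and the `Server` property name are assumptions made for illustration.

```powershell
# Sample lines trimmed from the CAS.log excerpt above (trailing <time=...> metadata omitted).
$casLines = @(
    '<![LOG[ Matching DP location found 0 - https://machine1.contoso.org:8003/sccm_branchcache$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f (Locality: ADSITEPEER)]LOG]!>'
    '<![LOG[ Matching DP location found 1 - https://machine2.contoso.org:8003/sccm_branchcache$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f (Locality: ADSITEPEER)]LOG]!>'
    '<![LOG[ Matching DP location found 2 - http://dp02.contoso.org/sms_dp_smspkg$/content_87fa3d3b-4e22-4378-928e-fe79b2852a4f.1 (Locality: ADSITE)]LOG]!>'
)

# Assumed pattern: location index, URL, then the locality in parentheses.
$pattern = 'Matching DP location found (?<Idx>\d+) - (?<Url>\S+) \(Locality: (?<Locality>\w+)\)'

$locations = foreach ($line in $casLines) {
    if ($line -match $pattern) {
        [pscustomobject]@{
            Index    = [int]$Matches['Idx']
            Server   = ([uri]$Matches['Url']).Host
            Locality = $Matches['Locality']
        }
    }
}

# Count candidate sources per locality (ADSITEPEER entries are peer cache sources).
$localityCounts = $locations | Group-Object -Property Locality | Select-Object Name, Count
$localityCounts
```

Swapping the `$casLines` array for `Get-Content` on a real CAS.log should work the same way, since lines that do not match the pattern are simply skipped.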
But then why didn’t the distribution point take over? Why didn’t the client download from it? No idea, but I needed to move on, fast.
After seeing those logs (note the name of the URL has “BranchCache” in it, but it’s actually peer cache, but I didn’t know this at the time) and knowing the change I made recently, I figured I’ll just reverse the changes and it’ll be all good, right?
Well then what the hell is going on?
Feeling even more under the gun now that I’m completely baffled with what’s happening, I engage with Microsoft Premier support because I feel that I could keep plugging away and googling the problem to death, or I could cut to the chase and get Microsoft involved.
Microsoft gets in touch with me, and after going over all the information I sent them and looking over the logs I was noticing, the tech fairly quickly identifies the issue as being a problem with the current build of peer cache (as of 2017.11.01-ish). Apparently even though peer cache is disabled in client policy, the change doesn’t actually take effect and the database in SCCM still contains all the super peer entries. The fix that resolved it was to delete the super peers out of the DB with these SQL commands:
delete from SuperPeers
delete from SuperPeerContentMap
Bam! The problem was solved. Mostly. Kind of. The tech thought OSD was working, so it must be fixed.
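Because those two deletes are irreversible and hit the site database directly, it can help to wrap them so they can be dry-run first. The following is only a sketch under assumptions: `Clear-SuperPeerTable` is a hypothetical helper name, the table names are the ones from the commands above, and `Invoke-Sqlcmd` (from the SqlServer module) is only reached when the function is run without `-WhatIf`.

```powershell
function Clear-SuperPeerTable {
    # SupportsShouldProcess gives the function -WhatIf and -Confirm for free.
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        [Parameter(Mandatory = $true)][string]$ServerInstance,
        [Parameter(Mandatory = $true)][string]$Database
    )

    foreach ($table in 'SuperPeers', 'SuperPeerContentMap') {
        # With -WhatIf, ShouldProcess returns $false and only reports what would happen.
        if ($PSCmdlet.ShouldProcess("$Database on $ServerInstance", "delete from $table")) {
            Invoke-Sqlcmd -Query "delete from $table" -ServerInstance $ServerInstance -Database $Database
        }
    }
}

# Dry run: reports the two deletes without touching the database.
Clear-SuperPeerTable -ServerInstance 'localhost' -Database '<SCCM DB>' -WhatIf
```

Dropping `-WhatIf` performs the deletes for real, so the dry run is worth reading carefully before doing that.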
The problem though is that the database keeps filling up with super peer information, so it needs to be routinely cleared out, and the super peer clients need to update their super peer state. So after following these two blogs, and then getting annoyed with cleaning the DB manually and updating the collection, I put together this crude script as a scheduled task to take care of it.
(Edit 20180525): To run this script, you’ll need a few prereqs:
- PowerShell 5.1. This was tested running on that version. You can find your version by typing $PSVersionTable in a PowerShell terminal. This may work on earlier versions, but I never tested this on earlier versions.
- SCCM Admin console installed on the machine you’ll run this from.
- You need the SQLServer module installed. Assuming you’re on PowerShell 5.1, you can get it by just running ‘Install-Module SQLServer’, then import it with ‘Import-Module SQLServer’.
- Finally, you’ll need to adjust the script for your own local information (site code, servers, etc.)
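The prerequisites above can be checked up front before the script touches anything. This is a rough sketch, assuming the admin console sets the `SMS_ADMIN_UI_PATH` environment variable (the main script relies on it) and that the module is installed under the name `SqlServer`; `Test-SuperPeerCleanupPrereq` is a hypothetical helper name.

```powershell
function Test-SuperPeerCleanupPrereq {
    # Collects human-readable problems; an empty result means all checks passed.
    $problems = @()

    # The script was only tested on PowerShell 5.1.
    $v = $PSVersionTable.PSVersion
    if ($v.Major -lt 5 -or ($v.Major -eq 5 -and $v.Minor -lt 1)) {
        $problems += "PowerShell $v is older than the tested 5.1"
    }

    # The ConfigurationManager module is located through this variable.
    if (-not $env:SMS_ADMIN_UI_PATH) {
        $problems += 'SMS_ADMIN_UI_PATH is not set; is the SCCM admin console installed?'
    }

    # Invoke-Sqlcmd comes from the SqlServer module.
    if (-not (Get-Module -ListAvailable -Name SqlServer)) {
        $problems += 'SqlServer module not found; try Install-Module SqlServer'
    }

    # The leading comma keeps the result an array even when it is empty.
    return ,$problems
}

$problems = Test-SuperPeerCleanupPrereq
$problems | ForEach-Object { Write-Warning $_ }
```

An empty `$problems` array only means these three checks passed; site code, server names and paths still have to be adjusted by hand.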
(Edit2 20180529): After reading this over again, it might be helpful if I explain what my script does, at least at a high level. The comments in the code explain what it does at a line-by-line level. What the script below does:
- Imports modules needed (SCCM and SQL)
- Reads superPeers.txt and performs a SQL query to get current Super Peers, then concatenates both ingests
- Rebuilds the membership rule of an SCCM collection based on the resourceIDs that we just ingested
- Invokes a client update notification telling the Super Peers to update their client policies
- Keeps a list of all resourceIDs used for this process
- Deletes the Super Peers and Super Peers mappings from the database
The basic idea is to get these various devices out there to update themselves and to clear them out of the database, otherwise other devices may try to still use them as Super Peer/mini-DP.
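The "concatenate both ingests, then build a collection query" steps can be sketched on their own, without a site server. This is an illustrative sketch rather than the script's exact logic: `Merge-ResourceIdList` and `New-SuperPeerCollectionQuery` are hypothetical helper names, the resource IDs are made up, and de-duplicating the IDs is an addition that the original script does not do.

```powershell
function Merge-ResourceIdList {
    # Accepts comma-separated ID strings and returns unique numeric IDs, sorted.
    param([string[]]$IdList)

    $ids = @(foreach ($entry in $IdList) {
        foreach ($token in ($entry -split ',')) {
            $trimmed = $token.Trim()
            # Skip blanks and anything that is not purely digits.
            if ($trimmed -match '^\d+$') { [long]$trimmed }
        }
    })
    return ,@($ids | Sort-Object -Unique)
}

function New-SuperPeerCollectionQuery {
    # Builds a WQL membership query limited to the given resource IDs.
    param([long[]]$ResourceId)

    if ($null -eq $ResourceId -or $ResourceId.Count -eq 0) { throw 'No resource IDs supplied' }
    "select SMS_R_System.ResourceId, SMS_R_System.Name from SMS_R_System where SMS_R_System.ResourceId in ($($ResourceId -join ','))"
}

$fromFile = '16777220,16777221,'   # made-up contents of superPeers.txt
$fromSql  = '16777221,16777230'    # made-up result of the SuperPeers query

$merged = Merge-ResourceIdList -IdList $fromFile, $fromSql
$query  = New-SuperPeerCollectionQuery -ResourceId $merged
$query
```

Filtering out empty tokens matters because a stray leading or trailing comma inside the `in (...)` list would make the query invalid.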
Next, what I’ve done is run the script below in an elevated prompt, and then let it do its thing.
# Set Date for future use
$date = Get-Date -Format yyyyMMdd.HHmm
# Import ConfigMgr Console Module
Import-Module "$($ENV:SMS_ADMIN_UI_PATH)\..\ConfigurationManager.psd1" # Import the ConfigurationManager.psd1 module
# Import SQLServer Module (Forgot this, thank you RiDER)
Import-Module SqlServer
# Starting transcript to keep track of what the heck is going on
Start-Transcript -Path "<path to file>\superPeerCacheCleanup\superPeerLog_$($date).txt"
# Setting global 'WhatIf' and 'Verbose' parameters for testing or output
$WhatIfPreference = $false
$VerbosePreference = "Continue"
# Collection name that will contain peers
$collectionName = "Super Peers"
# Getting contents of text file that already contains Super Peers that we've already queried for
$superPeers = Get-Content "<path to file>\superPeerCacheCleanup\superPeers.txt"
# Run SQL query to get the resourceIDs of the Super Peers, and join them into one comma-separated string
$resourceIDS = (Invoke-Sqlcmd -Query "select * from SuperPeers" -ServerInstance "localhost" -Database "<SCCM DB>" | Select-Object -ExpandProperty ResourceID) -join ","
# Combine the contents of the Super Peer text file and SQL query into one comma-separated string,
# skipping whichever side is empty so the list never starts or ends with a stray comma
$newResourceIDS = (@($superPeers) + @($resourceIDS) | Where-Object { $_ }) -join ","
# Create the query rule that we'll use to indicate the membership for the SCCM collection
# This query sets the membership based on the resourceIDs that we gathered and concatenated earlier
$collectionQueryRule = "select SMS_R_SYSTEM.ResourceID,SMS_R_SYSTEM.ResourceType,SMS_R_SYSTEM.Name,SMS_R_SYSTEM.SMSUniqueIdentifier,SMS_R_SYSTEM.ResourceDomainORWorkgroup,SMS_R_SYSTEM.Client `
from SMS_R_System where SMS_R_System.ResourceId in (" + $newResourceIDS + ") order by SMS_R_System.Name"
# Set the PSPath Site Code Location. This is needed because running the SQL query changes the path to 'SQLSERVER'
# Probably a better way of doing this, but this works for this purpose
Set-Location "<site code>:"
# Capture the collection query rule into a variable. I couldn't get the pipe to work correctly for removing the rule
# so I'm just capturing it as a variable.
$membershipRule = Get-CMCollectionQueryMembershipRule -CollectionName $collectionName
# Remove the collection query membership rule in order to create and update the collection with a new one
Remove-CMCollectionQueryMembershipRule -CollectionName $collectionName -RuleName $membershipRule.RuleName -Confirm:$false -force
# Updating the collection with the new query membership rule that we created above
Add-CMDeviceCollectionQueryMembershipRule -CollectionName $collectionName -RuleName "Super Peers $($date)" -QueryExpression $collectionQueryRule -Confirm:$false
# Tell SCCM to update the membership of the SCCM collection
Invoke-CMCollectionUpdate -Name $collectionName
# Pausing for a moment to allow SCCM to update the membership of the collection. This is an arbitrary time; could be shorter/longer.
Start-Sleep -Seconds 60
# Creating a backup of the old Super Peer list
Copy-Item "<path to file>\superPeerCacheCleanup\superPeers.txt" "<path to file>\superPeerCacheCleanup\superPeersOld.txt" -Force
# Deleting the super peer list.
Remove-Item "<path to file>\superPeerCacheCleanup\superPeers.txt" -Force
# Creating a new Super Peer list based on combining the old values and new from the SQL query
Add-Content -Value $newResourceIDS -Path "<path to file>\superPeerCacheCleanup\superPeers.txt" -Force
# Sending a client notification in order to tell the Super Peer clients to request their machine policy now
Invoke-CMClientNotification -DeviceCollectionName $collectionName -NotificationType RequestMachinePolicyNow
# Deleting the Super Peer values from the SCCM DB
Invoke-Sqlcmd -Query "delete from SuperPeers" -ServerInstance "localhost" -Database "<SCCM DB>"
Invoke-Sqlcmd -Query "delete from SuperPeerContentMap" -ServerInstance "localhost" -Database "<SCCM DB>"
# Ending the transcript
Stop-Transcript
Update: As of December 2017, the issue still persisted, which might have been because the clients weren’t getting their client policies updated, so the Microsoft tech had me recreate some of the client policies and deploy them. The issue seems to have been fixed as those dang laptops started getting powered on. The tech also informed me that this behavior is resolved in SCCM 1802.
Also, I suspect that the issue was not only due to laptops becoming superpeers and not being powered on, but also because the boundary groups configured were too broad and spanned too many sites. Not the primary issue, but it definitely contributed to it.
We have continued to use BranchCache and it’s amazing how well BranchCache is working in our organization, even with a ton of laptops in carts (45-53% of content source comes from BranchCache at these sites).