After a long period without posting, here is a new article about an issue I encountered on SCOM 2019 UR2.
It all starts with a single administrative action in the SCOM console: approving agents in the pending update state after patching the Management Servers. I had to apply 'Update for event log channel in System Center Operations Manager 2019 (KB4601269)' to fix CVE-2021-1728 (System Center Operations Manager Elevation of Privilege Vulnerability).
As you all know, approving an agent in the pending update state means supplying credentials that have administrative rights on the targeted agent and clicking the Update button. This generally runs smoothly.
In my case, the Agent Management task failed with an access denied message, which generated error event 10607, source Health Service Modules, in the Operations Manager event log:
The Operations Manager Server cannot process the install/uninstall request for computer <Computer Name> due to failure of operating system version verification.
Operation: Agent Install
Install account: <Admin Account>
Error Code: 80070005
Error Description: Access is denied.
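As a side note, the error code in this event is a standard Windows HRESULT, and splitting it into its fields confirms what the description says. The sketch below is only an illustration (the helper name `decode_hresult` is mine, not something from SCOM): facility 7 is the Win32 facility and code 5 is ERROR_ACCESS_DENIED.

```python
def decode_hresult(code_hex: str) -> dict:
    """Split an HRESULT given as a hex string into its main fields."""
    value = int(code_hex, 16)
    return {
        "failure": bool((value >> 31) & 1),   # severity bit: 1 means failure
        "facility": (value >> 16) & 0x7FF,    # 11-bit facility (7 = Win32)
        "code": value & 0xFFFF,               # 16-bit code (5 = ERROR_ACCESS_DENIED)
    }


print(decode_hresult("80070005"))
# {'failure': True, 'facility': 7, 'code': 5}
```

So the failure is a plain Win32 "access denied" returned to the Management Server, which says nothing yet about why access was refused.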
This reminded me that we had implemented an Authentication Policy in 'Enforce Policy Restriction' mode in Active Directory to restrict the use of NTLM on our admin accounts.
Note: This was done because of a prioritized recommendation in the Azure AD Security assessment: Place privileged users in the Protected Users group. The Protected Users group provides additional security because its members can only authenticate using Kerberos (everything else is blocked), and the Kerberos authentication itself is hardened by enforcing AES encryption.
Checking on a Domain Controller, I found that an event 4625 was raised with status "0xc000006e" for the account used in the deployment task, at the same date and time as the failing task.
It indicates that the account attempted a network logon (type 3) with the NTLM authentication package: the authentication information was valid, but a user account restriction prevented successful authentication.
The event description adds that ‘NTLM authentication failed because access control restrictions are required.’ and gives the name of our Authentication Policy.
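To make the criteria above concrete, here is a minimal sketch of how exported 4625 events could be filtered for this exact pattern. It assumes the events have already been exported to dictionaries keyed by the event data field names (TargetUserName, LogonType, AuthenticationPackageName, Status); the sample records and account names are made up for illustration. Status 0xC000006E corresponds to STATUS_ACCOUNT_RESTRICTION.

```python
STATUS_ACCOUNT_RESTRICTION = 0xC000006E


def is_blocked_ntlm_logon(event: dict) -> bool:
    """True for a failed network logon over NTLM refused by an account restriction."""
    return (
        int(event.get("EventID", 0)) == 4625
        and str(event.get("LogonType")) == "3"
        and event.get("AuthenticationPackageName", "").upper() == "NTLM"
        and int(event.get("Status", "0"), 16) == STATUS_ACCOUNT_RESTRICTION
    )


# Hypothetical sample data, not taken from a real Security log.
sample_events = [
    {"EventID": 4625, "TargetUserName": "admin-example", "LogonType": "3",
     "AuthenticationPackageName": "NTLM", "Status": "0xc000006e"},
    {"EventID": 4625, "TargetUserName": "other-user", "LogonType": "3",
     "AuthenticationPackageName": "NTLM", "Status": "0xc000006d"},
    {"EventID": 4624, "TargetUserName": "admin-example", "LogonType": "3",
     "AuthenticationPackageName": "Kerberos"},
]

blocked = [e["TargetUserName"] for e in sample_events if is_blocked_ntlm_logon(e)]
print(blocked)
```

Only the first sample record matches, since the second has a different status and the third is a successful Kerberos logon.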
Blocking NTLM is a known consequence of configuring Authentication Policies, and reverting the Authentication Policy from ‘Enforce Policy Restriction’ to ‘Only audit policy restriction’ for the account used in the deployment task solves the issue. In audit mode, we still see events in AD for NTLM usage, but NTLM authentication remains allowed.
I had assumed that SCOM uses whatever mechanism is available to open an RPC connection at the start of the installation and then opens an SMB connection to copy the updates, but this shows that without NTLMv2 (NTLMv1 has been disabled in our environment for years), the deployment cannot succeed.
Are we talking about a SCOM bug here?
To be sure I was not missing something or had a misconfiguration somewhere, I opened a Microsoft support case for this issue.
I reproduced the issue in order to capture traces, and support confirmed that:
- The deployment starts by creating an RPC connection, and this connection does use Kerberos authentication.
- It then opens an SMB session using the IP address, and since connections by IP address use NTLM by default, Kerberos is not used at this step.
At this point, Microsoft support indicated that nothing can be configured in SCOM because this behavior is by design. However, the operating system can be configured to use Kerberos even for IP addresses by following this article:
Configuring Kerberos for IP Address | Microsoft Docs
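As I understand that article, the change has two parts: a TryIPSPN registry value on the machine that initiates the connection, and an SPN registered for each target IP address. The sketch below only builds the corresponding command lines so they can be reviewed before being run; the function name, computer names and IP addresses are hypothetical, and using the host service class for SMB is my assumption. Check the exact commands against the article before applying them.

```python
from typing import Dict, List


def kerberos_ip_commands(targets: Dict[str, str]) -> List[str]:
    """Build the command lines for a {computer_name: ip_address} mapping.

    Nothing is executed here; the strings are meant to be reviewed first.
    """
    commands = [
        # Client side: let the Kerberos client try IP-based SPNs.
        'reg add "HKLM\\SYSTEM\\CurrentControlSet\\Control\\Lsa\\Kerberos\\Parameters" '
        "/v TryIPSPN /t REG_DWORD /d 1 /f"
    ]
    for computer, ip in sorted(targets.items()):
        # Assumption: a host/<ip> SPN on the computer account covers the SMB session.
        commands.append(f"setspn -s host/{ip} {computer}")
    return commands


# Hypothetical targets (documentation IP range, made-up names).
for line in kerberos_ip_commands({"DC01": "192.0.2.10", "DC02": "192.0.2.11"}):
    print(line)
```

Keeping the SPN list in one mapping also makes it easier to clean up the registrations later if an IP address changes.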
After applying the change, new traces were captured and analysed by Microsoft, and the network trace shows that the SMB session is indeed initiated using Kerberos! However, later in the trace there is another RPC session that starts with Kerberos authentication but then reverts to NTLM within the same session, even though it was running successfully on Kerberos.
It seems we have reached the limits of the product here: the product team answered that they had never been asked for a similar requirement, nor had it been tested, which makes it unsupported.
To minimize the impact, because agent maintenance or update is not a frequent process (UR3 has to be applied at this time and I'll hit the same issue), I was advised the following:
- For non-Domain Controller servers, use a domain account that has local administrator rights. That's fair, but in my case the SCOM environment is dedicated to Active Directory servers, and the few non-DC servers are monitored using the System account to avoid managing a domain account that has local admin rights...
- For DC servers, use a Domain Admin account that is allowed to use NTLM!!! And if getting an exception is not an option, a workaround is to deploy the update through Group Policy (or a similar approach) or even manually.
Feel free to vote for it: the more votes the request gets, the more action the product group will take on it.
This posting is provided "AS IS" with no warranties.