This old CBT bug that puts your backups at risk!

An old, old CBT (Changed Block Tracking) bug has recently been discovered. It has existed since ESX 4.x! As a reminder, CBT keeps track of the changed blocks of a virtual machine disk, which is what makes incremental backups of virtual disks possible. As far as I know, all backup software vendors rely on CBT for virtual machine backups, so that’s a big issue!

The problem appears when a virtual disk is extended past one of these sizes: 128 GB, 256 GB, 512 GB or 1024 GB. When this happens, CBT reports incorrect change information and the incremental backups that rely on it become corrupt. The worst part is that the bug is completely silent: you won’t notice until you (try to) restore the virtual machine.

Some examples to understand the problem:

100 GB disk extended to 130 GB = BUG (we crossed the 128 GB boundary).
30 GB disk extended to 40 GB = no problem.
130 GB disk extended to 150 GB = no problem.
250 GB disk extended to 350 GB = BUG (we crossed the 256 GB boundary).
260 GB disk extended to 410 GB = no problem.
New disk created at 130 GB = no problem (only disk extensions trigger the bug).
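
The rule behind these examples can be sketched as a small PowerShell function. This is only an illustration: the function name Test-CbtBoundaryCrossed and its parameters are hypothetical, the boundary list is limited to the four sizes named in this article, and treating a disk that sits exactly on a boundary as "not yet past it" is an assumption.

```powershell
# Hypothetical helper, not part of PowerCLI.
# Returns $true when extending a disk from $OldSizeGB to $NewSizeGB
# takes it past one of the boundaries listed in this article.
function Test-CbtBoundaryCrossed {
    param(
        [Parameter(Mandatory = $true)][double]$OldSizeGB,
        [Parameter(Mandatory = $true)][double]$NewSizeGB,
        # Assumption: only the four boundaries named in the article are checked
        [double[]]$BoundariesGB = @(128, 256, 512, 1024)
    )
    foreach ($boundary in $BoundariesGB) {
        # The disk was at or below the boundary and ends up above it
        if ($OldSizeGB -le $boundary -and $NewSizeGB -gt $boundary) {
            return $true
        }
    }
    return $false
}

Test-CbtBoundaryCrossed -OldSizeGB 100 -NewSizeGB 130   # True  (crosses 128 GB)
Test-CbtBoundaryCrossed -OldSizeGB 130 -NewSizeGB 150   # False (stays between 128 and 256 GB)
Test-CbtBoundaryCrossed -OldSizeGB 250 -NewSizeGB 350   # True  (crosses 256 GB)
Test-CbtBoundaryCrossed -OldSizeGB 260 -NewSizeGB 410   # False (stays between 256 and 512 GB)
```

The function only makes sense for extensions of existing disks. A disk created directly at its final size is not affected, whatever that size is.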

I find it surprising that the bug took so long to surface, but now it’s here. I guess there will be a fix soon, but that future patch will not repair your existing, corrupt backups. We have to do something!

The only available fix today is to reset CBT, which forces a new full backup and clears the inconsistency. This could generate a lot of backup traffic if you have many impacted virtual machines, but there is no other choice yet. Let’s see how to proceed.

Identify the impacted virtual machines

Well, that may look easy, but it’s not. There are no virtual machine logs that trace disk capacity changes. Maybe your monitoring tool can help you. If not, the best option is to list all virtual machines with at least one disk larger than 128 GB. The following script can help you with that.

# Variables
$vCenterName = "vcenter_fqdn"
$vCenterLogin = "vcenter_login"
$vCenterPassword = "vcenter_password"
# Threshold for disk detection (in GB)
$SizeLimit = 128

# script start
#Adding VMware cmdlets
$VMwareManagement = Get-PSSnapin | where {$_.name -match "VMware.VimAutomation.Core"}
if (!$VMwareManagement) {
Add-PSSnapin VMware.VimAutomation.Core
}

# connect to the vCenter and get the list of all VMs
Connect-VIServer -server $vCenterName -User $vCenterLogin -Password $vCenterPassword
$HostList = @()
$VMs = get-vm

# add VMs with disks greater than 128 GB to the list
foreach ($VM in $VMs){
   $vmdks = $VM | get-harddisk
   foreach ($vmdk in $vmdks) {
       if ($vmdk -ne $null) {
        [INT]$vmdkSize = $vmdk.capacityKB/1MB
        if ($vmdkSize -gt $SizeLimit) {
            if ($HostList -notcontains $VM.Name) {
                $HostList += $VM.Name
                }            
            }
        }
   }
}

# Write on screen
Write-Host "There are $($HostList.Count) servers with disks bigger than $SizeLimit GB:"
$HostList | Sort-Object | Write-Host

# Export to csv (uncomment the two lines)
#$HostList = $HostList | Sort-Object
#$HostList -join "," >> c:\HostList.csv

So now you have a list of all VMs that could be affected by the bug (if their disk has been extended as described above). At this point, you have two options:

Option 1: analyze the list, identify the VMs that could have the problem, and reset CBT for those VMs only.
- Pros:
  - Limits the initial backup traffic.
- Cons:
  - You leave room for human error.
Option 2: reset CBT for all VMs in the list.
- Pros:
  - No need to think.
  - Quickest method.
- Cons:
  - You will create new full backups for VMs that don’t have the bug.

Let’s work on both options.

Reset CBT for individual VMs

That’s the best choice if you know exactly which VMs could be affected or if you’re afraid of doing too many new full backups at the same time. The following script will allow you to reset CBT on a per-VM basis.

# Variables
$vCenterName = "vcenter_fqdn"
$vCenterLogin = "vcenter_login"
$vCenterPassword = "vcenter_password"

# script start
# Adding VMware cmdlets
$VMwareManagement = Get-PSSnapin | where {$_.name -match "VMware.VimAutomation.Core"}
if (!$VMwareManagement) {
Add-PSSnapin VMware.VimAutomation.Core
}

# connect to the vCenter
Connect-VIServer -server $vCenterName -User $vCenterLogin -Password $vCenterPassword

#we loop in order to do several VMs without restarting the script
do {
    # check for valid servername
    do {$VMName = Read-Host 'On which machine do you want to reset CBT?'}
    until (get-VM $VMName)

    # reset CBT
    $vm = get-vm $VMName
    if ($vm.ExtensionData.Config.ChangeTrackingEnabled -eq $true) {
        Write-Host "    CBT is enabled for $VMName. We are going to reset it."
        $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
        $spec.ChangeTrackingEnabled = $false
        $vm.ExtensionData.ReconfigVM($spec)
        #apply setting by creating and removing a snapshot
        $snap=$vm | New-Snapshot -Name 'Disable CBT'
        $snap | Remove-Snapshot -confirm:$false
        #check
        $vm = get-vm $VMName
        if ($vm.ExtensionData.Config.ChangeTrackingEnabled -eq $true) {
            Write-Host -ForegroundColor Yellow "    There was an error. CBT was not reset."
            }
        else {
            Write-Host -ForegroundColor Green "    CBT was successfully reset."
            }
        }
    else {
        Write-Host "    CBT is not enabled for $VMName. No action is required."
        }
    $Repeat = Read-Host 'Do you want to reconfigure another machine? (y/n)'
}
while ($Repeat -eq "y")

Remark: if you read carefully, you will see that we disable CBT, but that we don’t enable it again. That’s because CBT is enabled by the backup tool at first backup.
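
Since CBT only comes back once the backup tool has run again, it can be useful to check the CBT state of your VMs before and after the next backup. Here is a minimal sketch, assuming PowerShell 3.0 or later. The function name Get-CbtStatus is hypothetical; it only reads the same ChangeTrackingEnabled property that the scripts in this article use.

```powershell
# Hypothetical helper, not part of PowerCLI.
# Takes VM objects from the pipeline and returns one row per VM
# with its name and its current CBT state.
function Get-CbtStatus {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $VM
    )
    process {
        [PSCustomObject]@{
            Name       = $VM.Name
            # Same property as in the reset scripts; an empty value is reported as False
            CbtEnabled = [bool]$VM.ExtensionData.Config.ChangeTrackingEnabled
        }
    }
}

# Usage against a connected vCenter (requires PowerCLI and Connect-VIServer):
# Get-VM | Get-CbtStatus | Sort-Object Name | Format-Table -AutoSize
```

A VM that was reset should show False right after the script and True again after its next backup, provided your backup tool enables CBT as described above.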

CBT mass-reset

In this case we are simply going to reset CBT on all VMs with a disk bigger than 128 GB. This will certainly trigger full backups for some VMs that don’t need them, but at least you are sure to include all impacted VMs! You can use this script.

# Variables
$vCenterName = "vcenter_fqdn"
$vCenterLogin = "vcenter_login"
$vCenterPassword = "vcenter_password"
# Threshold for disk detection (in GB)
$SizeLimit = 128

# script start
#Adding VMware cmdlets
$VMwareManagement = Get-PSSnapin | where {$_.name -match "VMware.VimAutomation.Core"}
if (!$VMwareManagement) {
Add-PSSnapin VMware.VimAutomation.Core
}

# connect to the vCenter and get the list of all VMs
Connect-VIServer -server $vCenterName -User $vCenterLogin -Password $vCenterPassword
$VMs = get-vm
$HostList = @()

# cycle through all VMs and get VMs with a disk bigger than size limit
foreach ($vm in $VMs){
   $vmdks = $vm | get-harddisk
   foreach ($vmdk in $vmdks) {
       if ($vmdk -ne $null) {
        [INT]$vmdkSize = $vmdk.capacityKB/1MB
        if ($vmdkSize -gt $SizeLimit) {
            if ($HostList -notcontains $VM.Name) {
                $HostList += $VM.Name
                }            
            }
        }
   }
}

# cycle through our list of VM names and reset CBT
foreach ($vmName in $HostList) {
    $vm = get-vm $vmName
    #check if CBT is enabled
    if ($vm.ExtensionData.Config.ChangeTrackingEnabled -eq $true) {
        Write-Host "    CBT is enabled for $vmName. We are going to reset it."
        $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
        $spec.ChangeTrackingEnabled = $false
        $vm.ExtensionData.ReconfigVM($spec)
        #apply setting by creating and removing a snapshot
        $snap=$vm | New-Snapshot -Name 'Disable CBT'
        $snap | Remove-Snapshot -confirm:$false
        #check
        $vm = get-vm $vmName
        if ($vm.ExtensionData.Config.ChangeTrackingEnabled -eq $true) {
            Write-Host -ForegroundColor Yellow "    There was an error. CBT was not reset."
            }
        else {
            Write-Host -ForegroundColor Green "    CBT was successfully reset."
            }
        }
    else {
        Write-Host "    CBT is not enabled for $vmName. No action is required."
        }
}
Write-Host "Done."

Next steps

With CBT reset, you are sure to get consistent backups again. But the bug itself isn’t fixed yet! If you extend a disk past one of the boundaries listed above, its backups will become corrupt again. Therefore, you may have to reset CBT from time to time until the bug is fixed for good!

Update 26/02/15: the bug has been fixed in versions 5.0 patch 10, 5.1 update 3 and 5.5 patch 4. These patches only prevent the bug from happening again; you still need to reset CBT on disks that have already encountered the issue. No fix is planned for ESX 4!