Fix inconsistent Openstack volumes and instances from Cinder and Nova via the database

22-12-2014 | Remy van Elst


Table of Contents


When running Openstack, sometimes the state of a volume or an instance can be inconsistent on the cluster. Nova might find a volume attached while Cinder says the volume is detached or otherwise. Sometimes a volume deletion hangs, or a detach does not work. If you've found and fixed the underlying issue (lvm, iscsi, ceph, nfs etc...) you need to bring the database up to date with the new consistent state. Most of the time a reset-state works, sometimes you need to manually edit the database to correct the state. These snippets show you how.

Please note that it is important to find and fix the underlying issue. If you for example have a volume which hangs on detaching, resetting the database is a quick hack and not a real fix. Make sure you first fix the underlying issue and cause before you update the database.

These examples were tested with all components on Juno and on Icehouse with MySQL as the backing database.

Please be extermely carefull with these examples.

Delete an instance

Your NFS backing storage might have crashed halfway during a VM delete. You've manually deleted all the related files (disk, config etc) and removed the VM domain from the backing hypervisor (virsh, esxi etc). However nova show still sees the VM as active (or error). A nova reset-state --active doesn't fix the delete part. The following query can be used to set an instance as deleted:

$ mysql nova_db
> update instances set deleted='1', vm_state='deleted', deleted_at='now()'' where uuid='$vm_uuid' and project_id='$project_uuid';

Normaly a nova delete $uuid is the correct way to delete a VM.

If you want to actually delete a from the database instead of marking it as deleted, the following queries should do that:

$ mysql nova_db
> delete from instance_faults where instance_faults.instance_uuid = '$vm_uuid';
> delete from instance_id_mappings where instance_id_mappings.uuid = '$vm_uuid';
> delete from instance_info_caches where instance_info_caches.instance_uuid = '$vm_uuid';
> delete from instance_system_metadata where instance_system_metadata.instance_uuid = '$vm_uuid';
> delete from security_group_instance_association where security_group_instance_association.instance_uuid = '$vm_uuid';
> delete from block_device_mapping where block_device_mapping.instance_uuid = '$vm_uuid';
> delete from fixed_ips where fixed_ips.instance_uuid = '$vm_uuid';
> delete from instance_actions_events where instance_actions_events.action_id in (select id from instance_actions where instance_actions.instance_uuid = '$vm_uuid');
> delete from instance_actions where instance_actions.instance_uuid = '$vm_uuid';
> delete from virtual_interfaces where virtual_interfaces.instance_uuid = '$vm_uuid';
> delete from instances where instances.uuid = '$vm_uuid';

Change the compute host of a VM

A nova migrate or nova resize might have failed. The disk could be already migrated or still on your shared storage but nova is confused. Make sure the VM domain is only one compute node (preferably the on it came from, use nova migration-list to find that out) and the backing disk/config files are also only on one hypervisor node (lsof and tgt-adm are your friends here). The following query changes the VM hypervisor host for nova:

$ mysql nova_db
> update instances set host='compute-hostname.domain',node='compute-hostname.domain' where uuid='$vm_uuid' and project_id='$project_uuid';

Normally a nova migrate $vm_uuid or a nova resize $vm_uuid $flavor should be enough.

Set a volume as detached in Cinder

Your backing cinder storage might have issues or bugs which cause nova volume-detach $vm_uuid $volume_uuid to fail sometimes. It might be detached in Nova but still have the state Detaching in Cinder. Make sure the VM domain has the actual disk removed. Also check our backing storage (ceph, lvm, iscsi etc..) to make sure it is actually detached and not in use anymore.

Try a cinder reset-state --state available $volume_uuid first. If that fails, the following cinder mysql query sets the Cinder state to available:

$ mysql cinder_db
> update cinder.volumes set attach_status='detached',status='available' where id ='$volume_uuid';

Absolutely make sure that there is no data being written from to the volume, it might cause data loss otherwise.

Do note that the cinder python api (import cinderclient.v2) also has the cinder.volumes.detach(volume_id) call. You do need to write some tooling around that. (http://docs.openstack.org/developer/cinder/devref/volume.html?highlight=detach_volume).

Detach a volume from Nova

Sometimes the volume is detached from Cinder but Nova still shows it as attached. Same caution warnings as above count, make sure you check your backing storage first to see if the volume is actually detached and not in use, data loss otherwise.

The followng query removes the nova block device mapping:

$ mysql nova_db
> delete from block_device_mapping where not deleted and volume_id='$volume_uuid' and project_id='$project_uuid';

The correct way is, of course, nova volume_detach $vm_uuid $volume_uuid.

If you use virsh make sure you also nova reboot --hard $vm_uuid to rebuild the virsh domain. If you don't do that, the volume might fail to attach because virsh can't attach it at the mount point (/dev/vdX) since it thinks it is already in use.

Delete a volume from Cinder

It might be that a volume has an error deleting. It ends up in the Error_deleting state. Try a cinder reset-state --state available $volume_uuid first. If all fails, check your backing storage to see what happened and if the volume is actually removed or not. If not, remove it. Then you can update the cinder database to set it as deleted:

$ mysql cinder_db
> update volumes set deleted=1,status='deleted',deleted_at=now(),updated_at=now() where deleted=0 and id='$volume_uuid';

The correct way is cinder delete $volume_uuid.

Word of caution

If you have these inconsistencies you have bigger problems you need to fix instead of manually setting state and updating components. Openstack should make that part easier, remember?

If you execute these queries wrong you can cause serious data loss!

Check your logging, set it to debug everywhere and get a reproducable scenario. Then find a solution, report a bug, test the fix and deploy it in your test, accept and then production environment.


Tags: cinder, cloud, compute, iscsi, lvm, mysql, nfs, nova, openstack, volume, zfs,