
Burn in testing for new Hypervisor and Storage server hardware

Published: 08-04-2017 | Author: Remy van Elst | Text only version of this article


This article walks through how and why to do burn-in testing on hypervisor and storage servers. I work at a fairly large cloud provider, where we have a lot of hardware. Think thousands of hardware servers and multiple tens of thousands of hard disks. It's all technology, so stuff breaks, and at our scale, stuff breaks often. One of my pet projects for the last period has been to automate the burn-in testing for our virtualisation servers and the storage machines. We run OpenStack and use KVM for the hypervisors and a combination of different storage technology for the volume storage servers. Before they go into production, they are tested for a few days with very intensive automated usage. We've noticed that they either fail then, or not at all. This saves us from having to migrate customers off of new production servers just a few days after they've gone live. The testing is of course all automated.


A very busy hypervisor node

Preface

This article is not a copy-and-paste tutorial. It's more a walkthrough of the processes used and the thought process behind them.

As said, I currently work at an OpenStack public cloud provider. That means that you can order a virtual server with us and get full administrative access to it. We use OpenStack, so you can also automate that part using the API and deploy instances automatically. Those virtual machines do need to run on actual hardware, which is where this article comes in.

The regular process for deploying new hardware is fully automated. We've got capacity management nailed down: once the OpenStack environment reaches a certain threshold, PDFs with investment requests are automatically generated and sent off to the finance department, and the hardware is ordered. Then, after some time, the nodes are shipped to the datacenters. Our datacenter team handles the racking and stacking. They set up the remote out-of-band management (ILO, iDRAC, IPMI) and put the credentials into the PXE deployment servers (MaaS).

The machine gets installed with the required OS automatically and after that the software stack (OpenStack) is installed. The machine firmwares are then updated to the latest versions, the rest of the cluster configuration is updated and it's ready to go. This is all done using Ansible; after the racking and stacking no human is involved. The only thing I have to do is enable the nova or cinder service in OpenStack for that machine, and it's ready to go.

The machine is automatically put into our monitoring system as well. We not only monitor the software side of things, like CPU load, disk usage, network connections and required services running, but also the hardware itself, either via the remote out-of-band management or vendor provided tools (omreport anyone?). This means that when a disk breaks, or a faulty memory module is found, our monitoring system alerts us and takes action automatically. When defective disks are detected, for example, the vendor automatically gets an RMA sent from our monitoring. Once a week a bunch of disks arrive at the office or the datacenter, depending on the vendor, and the datacenter team replaces the faulty ones. Even the list of disks for that team to replace is automatically sent from the monitoring.

This level of automation is required when you reach a scale like this. By automating all of this, our sysadmin team can focus on other things than the gruntwork of installing software or ordering hardware. This level of automation and monitoring also provides a layer to build stuff on top of, which we will be doing here.

The Problem

Stuff breaks. It's technology, so just like your car, things break. That's not a problem if you've built your environment redundantly and highly available, but my experience is that not a lot of people do that.

Hardware breakage usually doesn't mean downtime right away. Most parts are redundant. Multiple network interfaces that are bonded. Hard disks are in a form of RAID. Multiple PSUs on different power feeds. Multiple CPUs. The same goes for the network and other hardware. If a NIC fails, the bond will make sure the system keeps working. A drive dies? One of the spares is automatically put in the vdev or array and it is rebuilt. Power goes out or a PSU blows? The other feed keeps running. However, it does mean that the faulty part needs to be replaced, and possibly that customers that have instances or storage running on the hypervisor need to be migrated off of it.

Migrating instances and storage is something that we do a lot. Not just when there are problems with hardware, but also for regular maintenance. Servers need updates to their software and firmware. The updates are done with Ansible and are automated. We've written software that checks if there is enough cluster capacity and no issues in the monitoring. If so, it checks if a node has firmware updates outstanding, or more than 50 package updates or a security package update outstanding, and if so it schedules that node for emptying out.

We use live migration and that process is automated as well, but it does add extra workload. Especially larger Windows VMs on KVM tend to lock up and break during a live-migrate; those need some tender loving care. Migrating storage (OpenStack Cinder Volumes) goes fine, we've never had any issues with that.

Depending on the configuration used, OpenStack can use local storage or Ceph. With Ceph, live migration of Volumes and Instances is very easy and fast. Local storage takes longer, since then the disks need to be copied over.
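As an illustration (this is not our actual tooling), emptying a node by hand boils down to something like this with the nova CLI; with Ceph-backed instances a plain live-migration is enough, with local storage you would add --block-migrate. The instance UUID and target host below are placeholders:

nova live-migration 00000000-0000-0000-0000-000000000000 compute-x-y
nova live-migration --block-migrate 00000000-0000-0000-0000-000000000000 compute-x-y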

Once the node is empty, it's updated, both software and firmware, with the same playbooks we use when a new node is installed. When that's done, it's rebooted and enabled again in OpenStack. This process is done a few times a day, making sure all the hardware is updated regularly. Due to our environment getting bigger and bigger, it takes longer to update all nodes, so we are testing if we can do more than one node at a time.

New hardware, or hardware that has had issues and got replacement parts, tends to break more often than hardware that we have had in use for a few months. Mostly hard disks or memory modules (ECC RAM), but I've also seen PSUs blow, and once we had drops of solder break off a fibre channel NIC and cause a short. As said, it's technology, so stuff breaks. It's all covered by warranty, so not a big problem.

Simulating usage patterns

New hardware, or hardware that is put back into production after a repair, wasn't getting a burn-in test beforehand. Mostly because the number of problems we experienced was small, just a one-off most of the time. Now that we are a lot larger, there is new hardware to be racked almost every week, so we see more problems. Just a scale issue, nothing to worry about. I suspect that when you are a large car shop you get more faulty returns than when you're a local garage around the corner.

Since we saw that hardware can break when it's being used regularly, we also must test it using mostly regular usage patterns. For compute nodes that means: install OpenStack and run VMs. For storage machines, that means: install OpenStack and the specific storage (ZFS, Ceph, Linux LVM) and generate IO.

Because we do want to stress the node a bit, we generate usage that under normal conditions would count as abuse. We thought of bitcoin miners or BOINC (SETI@home), but decided that that wouldn't be reproducible enough. Therefore we went with regular tools, like stress, stress-ng, memtester, dd and iperf.

Using OpenStack Userdata we provide a script to the instance that installs thepackages and runs them in a specified order for a set amount of time. In myexample every tool, testing a specific aspect (CPU, RAM, etc) runs for 15minutes and then continues on to the next part. By creating a VM every 10minutes, all usage patterns are equal. With that I mean that the CPU isn'thammered for 15 minutes, then the RAM, then the NIC. No, one instance ishammering the disk while another is using all it's CPU.
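As a minimal sketch (not our actual playbook), a userdata script along these lines does the job. The 900-second phases match the 15-minute slots mentioned above; the iperf target is a placeholder you would replace with your own server:

#!/bin/bash
# Hypothetical burn-in userdata sketch: install the tools, then run one 15-minute phase per resource.
apt-get update
apt-get -y install vnstat memtester stress stress-ng iperf

stress --cpu "$(nproc)" --timeout 900                                          # CPU
stress-ng --vm 60 --vm-bytes 1G -t 900s --vm-method zero-one --metrics-brief   # RAM
stress --hdd "$(nproc)" --hdd-bytes 100G --timeout 900                         # disk writes
stress --io "$(nproc)" --timeout 900                                           # sync()
iperf --port 5002 --client iperf.example.org --dualtest --time 900 --format M  # network (placeholder server)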

The image below shows htop on one of our empty compute nodes:

One thing we do not test enough is a huge amount of small VMs, and thus a lot of hardware interrupts and context switching. My testing uses a few large instances, which, in our case, tests the stuff we need. These burn-in tests have saved over two dozen nodes with issues from going into production in the first three months of using this new procedure. In saved man-hours on RMA and replacement that's almost three people fulltime for a week. Huge cost savings.

Let's continue on to the actual testing to see how much this node can be hammered.

Compute servers: stress, stress-ng and other tools

For the compute node benchmarking I'm using an Ubuntu 14.04 VM. I'm not able to share the Ansible playbooks we are using, but I can give you the manual commands. It's not that hard to put two and two together and create your own version. Or, if you don't get new hardware that often, just do it manually.

In our specific OpenStack deployments we have an administrative user. We need this to override the scheduler and control on which compute node the VMs get spawned. We also have a special flavor with no IOPS or other usage limits. Regular customer instances have CPU, network and IOPS limits set, which we don't want in this case.

By specifying the specific node in the nova boot command, we can force a VM to get spawned on that hypervisor:

nova boot --flavor="nolimit" --image "Ubuntu 14.04" --nic net-id=00000000-0000-0000-0000-000000000000  --availability-zone NL1:compute-x-y
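To stagger the instances as described earlier (one new VM every 10 minutes), a simple loop around nova boot will do. This is a sketch; the instance names and the burnin.sh userdata file are placeholders for the script shown above:

for i in $(seq 1 6); do
  nova boot --flavor="nolimit" --image "Ubuntu 14.04" \
    --nic net-id=00000000-0000-0000-0000-000000000000 \
    --availability-zone NL1:compute-x-y \
    --user-data burnin.sh "burnin-${i}"
  sleep 600
done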

This does allow you to overload the hypervisor with instances. In our case it either cannot boot them and the logs say "cannot allocate memory", or the out-of-memory killer on the hypervisor just kills the qemu process and the instance ends up in the stopped state. The nova instance action list doesn't show that action (because it happened outside of nova).

Out of memory

Inside the instances it's important to disable the Linux OOM killer, otherwise it will stop your stress tests. You can do this on the hypervisor as well, but that might have unexpected side effects. Make sure you have out-of-band access so that you can reboot a node when it has hung itself.

Here's how to disable the OOM killer inside of your VM:

sysctl vm.overcommit_memory=2

If you want it to survive a reboot, place it in sysctl.conf:

echo "vm.overcommit_memory=2" >> /etc/sysctl.conf

Install packages

The software we're using is all in the Ubuntu repositories. Install the packages:

apt-get -y install vnstat memtester stress stress-ng iperf

CPU

The first part takes the CPU and uses stress to generate usage:

stress --cpu $(nproc) --timeout 900

nproc gives the number of cores available to the VM.

Stress with the --cpu parameter spins up processes in a tight loop calculating the sqrt() of a random number acquired with rand().

Memory

For RAM, I found that stress doesn't get as much of a result as its modern counterpart stress-ng does. stress doesn't have an option to utilize the RAM, stress-ng does:

stress-ng --vm 60 --vm-bytes 1G -t 900s --vm-method zero-one --metrics-brief

As per the manpage, --vm starts N workers continuously calling mmap(2)/munmap(2) and writing to the allocated memory. --vm-method zero-one sets all memory bits to zero and then checks if any bits are not zero; next it sets all the memory bits to one and checks if any bits are not one. A simple but effective test. My instance has 64 GB RAM, so 60 workers using 1 GB each will fill the RAM up nicely.

This memory abuse has caught a lot of servers where one of the RAM DIMMs was bad, even though we always have ECC RAM. It only shows after a few hours of running, not right away.

Disk

In our setup Ceph is mostly used, but we also have local disks or volume servers (boot from volume) with ZFS. When a compute node has local disks we test them; otherwise there isn't much use, since the storage servers are stressed differently.

stress has two different use cases for disk IO. The first does actual writes.

stress --hdd $(nproc) --hdd-bytes 100G --timeout 900 

--hdd spawns N workers spinning on write()/unlink(). --hdd-bytes writes B bytes per hdd worker (default is 1GB).

stress --io $(nproc) --timeout 900 

--io spawns N workers spinning on sync().

Below is a screenshot of iotop on the compute node when a burn-in test is running these disk IO tests:

There have been four SSDs so far that needed replacing after these tests had run for a day. That surprised me, since I suspected it would be a lot more.

Network

For the sake of testing we also do a speed test to an internal iperf server. I haven't seen any network cards fail yet. It is however nice that we can utilize the network card to its fullest potential. I did find one card that had been configured in 10 mbit half-duplex mode because of a bad cable while running this test. However, that was only because I happened to be looking at it at the time; it's not something the monitoring tools report on (yet):

iperf --port 5002 --client speedtest.serverius.net --dualtest --time 900 --format M 

I replaced our iperf server address with a public iperf server address.

Storage servers: dd, dd and some more dd

Our storage servers are tested by doing a lot of disk activity. Since we do want to simulate actual usage, we install them as OpenStack Cinder servers and create volumes on them. Those volumes are then attached to an instance. On the instance, using ZFS, we create a big pool (named tank, no raidZ or mirroring) and one dataset (named bullet). Then, using dd, we first write zeros for a few hours. Afterwards, also using dd and random data from openssl (/dev/urandom is too slow), actual data is written.

This could of course also be done with regular LVM volumes or mdraid. The tests however are adapted from work I already had lying around; that playbook already did ZFS, so there's not much use in re-inventing the wheel.

I'm using Ubuntu 16.04 because of the ZFS support in there. First install it:

apt-get -y install zfsutils-linux

Using parted with ansible, create a gpt partition:

- name: partition gpt
  parted:
    device: "{{ item }}"
    label: gpt
    state: present
  with_items:
    - /dev/sdb
    - /dev/sdc
    - /dev/sdd
    - /dev/...

Create the zpool and the dataset (again, with Ansible):

- name: Create the zpool
  shell: |
    zpool create tank /dev/sdb /dev/sdc /dev/sdd /dev/... && \
    zfs set compression=lz4 tank && \
    zfs set atime=off tank

- name: Create fs
  zfs:
    name: tank/bullet
    state: present

It will be automatically mounted and available under /tank/bullet.
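A quick way to check that the dataset exists and where it is mounted:

zfs list tank/bullet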

The first dd command:

dd if=/dev/zero of=/tank/bullet/bench bs=20000M count=1024 conv=fdatasync,notrunc

In my case the pool is 20 TB; change where needed. Writing zeros will go at full speed. If you want to test actual data writing, and thus bypass any cache or RAID controllers, you need to generate random data. /dev/random and /dev/urandom are way too slow. I've found the following command to reach acceptable speeds:

openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero > /tank/bullet/bench
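While this runs, it's worth watching whether the writes actually reach all the disks at a sane rate; zpool iostat on the instance gives a per-device view (here refreshing every 5 seconds):

zpool iostat -v tank 5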

This will use the AES-NI CPU instructions if your VM supports them. Let this run for a few hours and you will have written to all your disks. Do note that a full zfs pool will be extremely slow and behave weirdly. Throw away the VM and boot up a new one, rinse and repeat.

The result

Writing zeros at near line speed

A busy hypervisor node

An even more busy hypervisor node

You might get crashing servers or overheated switches. At one of my previous employers we actually found out that the fan rig wasn't correctly wired, because CPU heat alarms were going off in the IPMI. Once again, make sure you have good monitoring and out-of-band access. This will happen often:

Which is of course a good thing when the hardware is not yet in use.

Tags: blog, cinder, cloud, compute, nova, openstack, storage