A health check is a special type of test job, designed to validate that the a test device and the infrastructure around it are suitable for running LAVA tests. Health checks jobs are run periodically to check for equipment and/or infrastructure failures that may have happened. If a health check fails for any device, that device will be automatically taken offline. Reports are available which show these failures and track the general health of the lab.
https://validation.linaro.org/scheduler/reports
For any one day where at least one health check failed, there is also a table providing information on the failed checks:
https://validation.linaro.org/scheduler/reports/failures?start=-1&end=0&health_check=1
Health checks are defined in
/etc/lava-server/dispatcher-config/health-checks
according to the template
used by the device. Health checks are run as the lava-health user.
Note
To generate the filename of the health check of a V2 device, the
scheduler takes the name of the template extended in the device dictionary
(for instance qemu.jinja2
for qemu devices) and replace the extension
with .yaml
. The health check will be called
/etc/lava-server/dispatcher-config/health-checks/qemu.yaml
. The health
check job database field in the device-type is used for devices which are
able to run V1 test jobs. To migrate health checks out of the database, use
the lava-server manage migrate-health-checks
command. For more
information, see
https://lists.linaro.org/pipermail/lava-announce/2017-March/000027.html
Note
Admins can temporarily disable health checks for all devices of a given type in the device-type admin page.
Note
Before enabling a pipeline health check, ensure that all devices of the specified type have been enabled as pipeline devices or the health check will force any remaining devices offline.
It is recommended that the YAML health check follows these guidelines:
The rest of the job needs no changes.
The health check needs to at least check that the device will boot and deploy a test image. Multiple deploy tasks can be set, if required, although this will mean that each health check takes longer.
Wherever a particular device type has common issues, a specific test for that behaviour should be added to the health check for that device type.
It is a mistake to think that lava_test_shell should not be run in health checks. The consequence of a health check failing is that devices of the specified type will be automatically taken offline but this applies to a job failure, not a fail result from a single lava-test-case.
It is advisable to use a minimal set of sanity check test cases in all health checks, without making the health check unnecessarily long:
- test:
timeout:
minutes: 5
definitions:
- repository: git://git.linaro.org/qa/test-definitions.git
from: git
path: ubuntu/smoke-tests-basic.yaml
name: smoke-tests
These tests run simple Debian/Ubuntu test commands to do with networking and
basic functionality - it is common for linux-linaro-ubuntu-lsusb
and/or
linux-linaro-ubuntu-lsb_release
to fail as individual test cases but these
failed test cases will not cause the health check to fail or cause devices
to go offline.
Using lava_test_shell
in all health checks has several benefits:
lava_test_shell
lava_test_shell
can provide a lot more information than
simply booting an image and each device type can have custom tests to pick
up common hardware issuesSee also Writing a LAVA test shell definition.
When a device is taken online in the web UI, there is an option to skip the manual health check. Health checks will still run in the following circumstances when “Skip Health check” has been selected: