The LCOO working group (https://wiki.openstack.org/wiki/LCOO) has a spec (https://review.openstack.org/#/c/443504) out for extreme/destructive testing. We will have a demo of the new framework and some sample test cases.
This forum session will discuss what sorts of tests the community wishes to be in scope.
- What sort of tests do you run today that are destructive in nature?
- What is your desired workflow for issues that come up?
- How do you determine success/failure of your testing scenarios?
- Do you publicly publish your results (if so where)?
- What KPIs do you evaluate today?
- What workloads do you run and against what architectures?
The forum's focus will be to take stock of existing work and to chart the project's path for future work, so that contributions and development are in line with what the community wants.
The entire scope of "extreme testing" is fairly large and splits into three major parts:
1. References: Extreme testing is non-deterministic, so results are only meaningful relative to the following references.
Architecture: Software and hardware architecture of the deployed cloud.
Workload: Proposed workload injected into the control and data plane.
KPI/SLO: The KPIs/SLOs that are measured for the reference architecture(s) under the reference workload(s).
2. Test Suite: What test scenarios are we trying to cover?
Control Plane Performance: Benchmarks for control plane performance should be derived from an Eris test suite.
Data Plane Performance: Benchmarks for data plane performance should be derived from an Eris test suite.
Resiliency to Failure: Failure injection scenarios coupled with control and data plane load should provide KPI on the cloud installation’s resiliency to failure.
Resource scale limits: Identify the limits of how far resources can scale. Examples include: what is the maximum memory for a VM? How many compute nodes can we support? How many subnets can be created? What is the maximum size of a Cinder volume, and how many Cinder volumes can exist?
Resource concurrency limits: Identify how many concurrent operations can be handled per resource. Example: when reconfiguring a network on a large tenant with 300+ VMs, how many concurrent operations can a single subnet handle?
Operational readiness: This has different meanings for open source Eris vs. an AT&T version. For open source Eris this will include a smoke test of a specific number of tests to run at the OpenStack QA gate. For AT&T it will include an expanded set of tests for running in production (including destructive tests).
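The scale- and concurrency-limit scenarios above amount to ramping load until a threshold is breached. The sketch below is a hypothetical illustration of such a probe; `create_subnet` is a stub standing in for a real OpenStack API call and is not part of the Eris spec:

```python
import concurrent.futures

def create_subnet(index):
    """Stub standing in for a real API call (hypothetical).

    Pretends the cloud starts failing requests past 250 concurrent operations.
    """
    return index < 250

def probe_concurrency_limit(operation, start=10, step=10, max_failure_rate=0.05):
    """Ramp concurrency until the failure rate exceeds the allowed threshold.

    Returns the last concurrency level that stayed within the threshold.
    """
    level = start
    while True:
        with concurrent.futures.ThreadPoolExecutor(max_workers=level) as pool:
            results = list(pool.map(operation, range(level)))
        failure_rate = results.count(False) / len(results)
        if failure_rate > max_failure_rate:
            return level - step
        level += step

print(probe_concurrency_limit(create_subnet))
```

A real probe would substitute actual resource operations (subnet creation, volume attach, etc.) and would likely bisect rather than ramp linearly to converge faster.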
3. Frameworks & Tools: How do we enable the test suite?
Repeatable Experiments: Eris should have the capability to create repeatable experiments and reliably reproduce results of non-deterministic test scenarios.
Test creation: Eris should have the capability to create test cases using an open specification like YAML or JSON and encourage test case reuse.
Test orchestration: Eris should have the capability to orchestrate test scenarios of various types (distributed load, underlay faults, etc.).
Extensibility: The framework should be extensible to various open source and proprietary tools (e.g. a plugin to use HP Performance Center instead of OpenStack Rally for load injection, a plugin by Juniper for router fault injection, cLCP support, vLCP support, etc.).
Automation: The entire test orchestration and validation should be automated by the orchestration mechanism (no eyeballing graphs to check for success/failure; the verdict should be determined by automated KPI/SLO comparison).
Simulators & Emulators: Competent simulators and emulators should provide scale testing (e.g. 10,000 compute nodes, 1 million VMs, etc.).
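As a concrete illustration of the "Test creation" point, a scenario could be declared in an open format such as JSON and loaded by the framework. The field names below are hypothetical, not a committed Eris schema:

```python
import json

# Hypothetical declarative scenario; the schema is illustrative only.
SCENARIO = """
{
  "name": "kill-rabbitmq-under-load",
  "load": {"tool": "rally", "task": "boot-and-delete", "concurrency": 20},
  "faults": [{"type": "process-kill", "target": "rabbitmq-server", "node": "controller-1"}],
  "slo": {"api_error_rate_max": 0.01, "boot_p95_seconds_max": 60}
}
"""

def load_scenario(text):
    """Parse and minimally validate a declarative test scenario."""
    scenario = json.loads(text)
    for key in ("name", "load", "faults", "slo"):
        if key not in scenario:
            raise ValueError(f"scenario missing required key: {key}")
    return scenario

scenario = load_scenario(SCENARIO)
print(scenario["name"])
```

Declaring load, faults, and SLOs together in one document is what makes the scenario reusable across architectures: only the reference values change.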
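The "Automation" point implies a machine-checkable verdict. One possible shape for the automated KPI/SLO comparison, with made-up KPI names and the assumed convention that SLO keys ending in `_max` are upper bounds, is:

```python
def evaluate_slo(measured, slo):
    """Return (passed, violations) by comparing measured KPIs to SLO ceilings.

    Assumed convention (illustrative, not an Eris API): each SLO key ends in
    '_max' and is an upper bound on the KPI of the same base name; a missing
    measurement counts as a violation.
    """
    violations = []
    for key, limit in slo.items():
        kpi = key[:-len("_max")]
        value = measured.get(kpi)
        if value is None or value > limit:
            violations.append((kpi, value, limit))
    return (not violations, violations)

# Example run: boot latency is within bounds, but the error rate is too high,
# so the automated verdict is a failure with one violation.
passed, violations = evaluate_slo(
    {"api_error_rate": 0.03, "boot_p95_seconds": 42},
    {"api_error_rate_max": 0.01, "boot_p95_seconds_max": 60},
)
print(passed, violations)
```

The key property is that the pass/fail decision is data-driven, so the same comparison can gate a CI pipeline with no human in the loop.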