Operating RabbitMQ at large scale comes with its own set of challenges. This talk follows Cisco's journey operating a large (800+ node) environment on a single RabbitMQ cluster. We will share the pain points, the lessons learned, and the best practices we adopted to stabilize messaging and improve its performance and reliability.
This talk includes:
- OpenStack service configurations related to messaging
- Kombu driver enhancements
- Considerations when virtualizing the control plane, and how default network buffer settings can be insufficient
- RabbitMQ Erlang arguments related to TCP_USER_TIMEOUT and their impact
- The overhead of queue mirroring
- Kernel-level network settings to improve RabbitMQ failover and speed up service reconnection
- Alerting and monitoring for RabbitMQ
- Recovering from a cluster partition
- Architectural decisions
Attendees will walk away with best practices and configuration changes they can apply to improve the reliability and performance of messaging in OpenStack.