Our team hopes to get Project Meniscus into OpenStack incubator status, so right from the start we’ve been using OpenStack software components like Oslo Logging and Oslo Config (the same Python modules that other OpenStack projects use) as well as OpenStack best practices (we are learning as we go along).
I’m going to take this time to give an update on what we are working on and how others can get involved.
Note: All comments here are my own; I do not speak for or represent my employer, Rackspace.
We began Project Meniscus just a few months ago. Since that time we’ve fleshed out much of the architecture and the scaling issues we foresee. We’ve created a grid of virtual machines (powered by OpenStack) to handle the expected traffic from logging customers, or tenants as we refer to them. Our grid consists of numerous workers, each with their own persona. The front-line workers are Correlators (syslog, HTTP); then we have Normalizers, Storage, Tenant, Coordinators, and Broadcasters, with possibly more to come. Each persona has a specific job to do, and once that job is complete the log data moves on to the next worker in the pipeline.
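To make the pipeline idea concrete, here is a minimal sketch of how a worker might look up the next persona in line. The persona names come from the post, but the list, ordering, and function are my own illustration, not Meniscus code.

```python
# Illustrative ordering of the front-line processing personas; the real grid
# has more personas (Tenant, Coordinator, Broadcaster) with other roles.
PIPELINE = ["correlator", "normalizer", "storage"]

def next_persona(current):
    """Return the persona that should receive the log data next,
    or None when the current worker is the end of the pipeline."""
    idx = PIPELINE.index(current)
    return PIPELINE[idx + 1] if idx + 1 < len(PIPELINE) else None
```

For example, a Correlator would hand its output to a Normalizer, and a Storage worker, being last, would hand off to no one.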
As traffic increases or decreases, we plan to have logic in place to dynamically increase or decrease the number of workers in the “elastic” grid. If, for example, we see the need for more Normalizer workers, we can automatically spin up new virtual machines and assign them the Normalizer persona.
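A rough sketch of what such a scaling decision could look like, assuming we measure load as queue depth per worker. The thresholds and the queue-depth metric are my own assumptions for illustration; the real logic would look at sustained load, not a single sample.

```python
def scale_decision(queue_depth, workers, high=1000, low=100):
    """Decide whether a persona's worker pool should grow, shrink, or hold.

    queue_depth: total pending log messages for this persona (assumed metric)
    workers:     current number of workers with this persona
    high/low:    illustrative per-worker thresholds, not real tuning values
    """
    per_worker = queue_depth / max(workers, 1)
    if per_worker > high:
        return "scale_up"      # spin up a new VM with this persona
    if per_worker < low and workers > 1:
        return "scale_down"    # retire a worker, but never the last one
    return "hold"
```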
The incoming logging data (syslog) will arrive via TCP or HTTP. The structured or unstructured log data will quickly be parsed and converted into CEE format and moved through the grid, being compressed or encrypted as required. If a tenant has requested that their data be durable, we will take special steps to ensure the message is not lost along the way and is guaranteed to reach the data store. Tenants will be able to verify their message(s) arrived in the data store by running a simple HTTP GET against the public API using a job ID. Once the log data has been stored, tenants will be able to query and perform metrics on it. Log data is first stored in a short-term data store (for a limited time) and then moved into a longer-term storage container for a time period the tenant specifies.
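As a rough illustration of the conversion step, here is a minimal sketch of wrapping a parsed syslog message in a CEE-style JSON payload. The field names are a loose approximation of the CEE JSON form, not the exact schema Meniscus uses.

```python
import json

def to_cee(hostname, appname, msg):
    """Wrap an already-parsed syslog message in a minimal CEE-style
    JSON payload (field names are illustrative, not the real schema)."""
    return "@cee:" + json.dumps({
        "host": hostname,   # originating host from the syslog header
        "pname": appname,   # process/application name
        "msg": msg,         # the free-form message body
    })
```

Downstream workers can then treat every message uniformly as structured JSON, whether the original input was structured or not.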
Any new worker that spins up will first pair with a Coordinator to receive its configuration and the routing information for upstream workers. We also make workers responsible for reporting any downed workers. If a downed worker is found, it is reported to the Coordinator(s). After repeated reports that a worker is down, the Coordinator will remove that worker from the routing list and dispatch a message to the Broadcaster, which informs the grid workers to update their routing lists. This allows us to simply route around any downed worker and continue to process log data. If at any point the network between workers goes down, we start caching the log data locally. Once the network (or whatever the issue is) recovers, we send the log data in the local cache to the next worker in the pipeline. This allows us to keep the service running through most unforeseen issues.
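The cache-and-retry behavior can be sketched like this. This is my own simplified model of what the post describes, not Meniscus code: `send` stands in for whatever transport forwards a message to the next worker, and the cache is flushed in order before new messages go out.

```python
from collections import deque

class ForwardingWorker:
    """Simplified model of a worker that caches log messages locally
    while the link to its upstream worker is down (illustrative only)."""

    def __init__(self, send):
        self.send = send      # callable that forwards a message upstream
        self.cache = deque()  # local cache used while the network is down

    def process(self, message):
        self.flush()  # drain cached messages first to preserve ordering
        if self.cache or not self._try(message):
            self.cache.append(message)

    def flush(self):
        """Send cached messages in arrival order until one fails."""
        while self.cache and self._try(self.cache[0]):
            self.cache.popleft()

    def _try(self, message):
        try:
            self.send(message)
            return True
        except ConnectionError:
            return False
```

When the network comes back, the next call to `process` (or an explicit `flush`) drains the cache before forwarding anything new, so messages still arrive in order.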
Keep in mind that there is more to it than this but that is the high level view of it.
A lot of work has yet to be done, but we are hoping to take on our first internal alpha customer in the next couple of months. From there we can work out any issues and modify the code base as needed.
Feel free to contribute pull requests or talk to us via the group. If you do contribute code please make sure it is verified with a unit test.