In the last blog I suggested that analytical capabilities need to move to the core of resource managers. This is very much needed for autonomous controlled large scale systems which figure out the biggest chunk of decisions to be made themselves. While the benefits from this might be obvious, the how to inject the insights/intelligence back into the resource manager might not be. Hence this blog post series documenting a bit how to let systems (they are just examples – trying to cover most domains :-)) like Kubernetes, OpenStack, Mesos, YARN and OpenLava make smarter decisions.
The blog posts are going to cover some generic concepts as well as point to specific documentation bits of the individual resource managers. Some of this is already covered in past blog posts but to recap let’s look at the 5(+1) Ws for resource managers decision making (click to skip to the technical details):
- What decision needs to be made? – Decisions – and the actuations they lead too – can roughly be categorized into: doing initial placement of workloads on resources, the re-balancing of workload and resource landscapes (through either pausing/killing, migrating or tuning resource and workloads) and capacity planning activities (see ref).
- Who is involved? – The two driving forces for data center resource management are the customer and the provider. The customer looking for good performance and user experience while the provider looking for maximizing his ROI & lowering TCO of his resources. The customer is mostly looking for service orchestration (e.g. doesn’t care where and how the workload runs, as long as it performs and certain policies and rules – like for auto-scaling are adhered – see e.g. google’s instance size recommendation feature) while the provider looks at resource orchestration of larger scale geo-distributed infrastructures with multiple workloads from different customers (tenants are not equal btw – some are low playing non important workloads/customers some high paying important workloads/customers with priorities and SLAs).
- When does the decision/actuation apply? – Decisions can either be made immediately (e.g. an initial placement) or be more forward/backward looking (e.g. handle a maintenance/forklift upgrade request for certain resources).
- Where does the decision need to be made?- This is probably one of the most challenging questions. First of all this covers the full stack from physical resources (e.g. compute hosts, air-conditioning, …), software defined resources (e.g. virtual machines (VM), containers, tasks, …) all the way to the services the customers are running, as well as across domains of compute (e.g CPUs, VMs, containers, …), network (e.g. NICs, SDN, …) and storage (e.g. Disks, block/object storage, …). Decisions are done on individual resource, aggregated, group, data center or a global level. For example the NIC, the Virtual machine/container/tasks hosting the workload, or even the power supply can be actuated upon (feedback control is great for this). The next level actuations can be carried out on the aggregate level – in which a set of resources make up a compute hosts, ToR-switch, SAN (e.g. by tuning the TCP/IP stack in the kernel). Next up is the group level for which e.g. polices across a set of aggregates can be defined (e.g. over-subscription policy for all Xeon E5 CPUs, a certain rack determined to run small unimportant jobs vs. a rack needing to run high performance workloads). Next is the data center level for which we possibly want to enforce certain efficiency goals driven by business objective (e.g. lowering the PuE). Finally the global level captures possible multiple distributed data centers for which decisions need to be made which enable e.g. high availability and fault tolerance.
- Why does the decision need to be made? Most decision are made for efficiency reasons derived from business objectives of the provider and customer. This means ultimately the right balance between the customer deploying the workload and asking for performance and SLA compliance (customers tend to walk away if the provider doesn’t provide a good experience) and the provider improving TCO (not being able to have a positive cashflow normally lead to a provider running out of business).
- How is the decision/actuation made? This is the focus for this article series. In case it is determined a decision needs to be made, it needs to be clear on how to carry out the actual actuation(s) for all the kinds of decision that can be made described above.
Decision most of the time cannot be made generic – e.g. decisions made in HPC/HTC systems do not necessarily apply to a telco environments in which the workloads and resource are different. Hence the context of workloads and resource in place play a huge role. Ultimately Analytics which embraces the context (in all sorts and forms: deep/machine learning, statistical modelling, artificial intelligence, …) of the environment can drive the intelligence in the decision making through insights. This can obviously in multiple places/flows (see the foreground and background flow concepts here) and ultimately enables autonomous control.
Enhancing Kubernetes
For the Kubernetes examples let’s focus on a crucial decision point – doing the initial placement of a workloads (aka a POD in Kubernetes language) in a cluster. Although much of today’s research focuses on initial placement I’d urge everybody not to forget about all the other decisions that can be made more intelligent.
Like most Orchestrators and Schedulers Kubernetes follows a simple approach of filtering and ranking. After shortlisting possible candidates, the first step involves filtering those resource which do not meet the workloads demands. The second step involves prioritization (or ranking) of the resources best suited.
This general part is described nicely in the Kubernetes documentation here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md
This filtering part is mostly done based on capacities, while the second can involve information like the utilization. If you want to see this code have a look at the generic scheduling implementation: here. The available algorithms for filtering (aka predicates) and prioritization can be found here. The default methods that Kubernetes filters upon can be seen here: here – the default prioritization algorithms here: here. Note that weights can be applied to the algorithms based on your own needs as a provider. This is a nice way to tune and define how the resource under the control of the provider can be used.
While the process and the defaults already do a great job – let’s assume you’ve found a way on when and how to use an accelerator. Thankfully like most scheduling systems the scheduler in Kubernetes is extendable. Documentation for this can be found here. 3 ways are possible:
- recompile and alter the scheduler code,
- implement your own scheduler completely and run it in parallel,
- or implement an extension which the default scheduler calls when needed.
The first option is probably hard to manage in the long term, the second option requires you to deal with the messiness or concurrency while the third option is interesting (although adds latency to the process of scheduling due to an extra HTTP(s) call made). The default scheduler can basically call an external process to either ‘filter’ or ‘prioritize’. In the first case a list of possible candidate hosts is returned, in the the second case a prioritized list if returned. Now unfortunately the documentation get’s a bit vague, but luckily some code is available from the integration tests. For example here you can see some external filtering code, and here the matching prioritization code. Now that just needs to be served up over HTTP and you are ready to go, next to adding some configurations documented here.
So now an external scheduler extension can make a decisions if an accelerator should be assigned to a workload or not. The intelligent decision implemented in this extender could e.g. decide if an SR-IOV port is needed based on a bandwidth requirement, or if it is even a good idea to assign a Accelerator to a workload par the previous example.
Corrections, feedback and additional info are more then welcome. I’ve some scheduler extender code running here – but that is not shareable yet. I will update the post once I’ve completed this. In the next posts OpenStack (e.g. service like Nova, Watcher, Heat and Neutron), Mesos (how e.g. allocator modules can be used to inject smarts) and OpenLava (for which e.g. elims can be used to make better scheduling decisions) and obviously others will be introduced