A Taxonomy of Schedulers - Operating Systems, Clusters and Big Data Frameworks

Leszek Sliwko
Author: Axis Applications Ltd, London, UK. e-mail: Lsliwko@gmail.com

Abstract- This review analyzes deployed and actively used workload scheduler solutions and presents a taxonomy in which those systems are divided into several hierarchical groups based on their architecture and design. While other taxonomies do exist, this review focuses on the key design factors that affect the throughput and scalability of a given solution, as well as the incremental improvements which bettered such an architecture. This review gives special attention to Google's Borg, which is one of the most advanced and well-published systems of this kind.

Keywords: schedulers, workload, cluster, cloud, big data, borg.

# I. Taxonomy of Schedulers

Although managing workload in a Cloud system is a modern challenge, scheduling strategies are a well-researched field, as well as being an area where there has been considerable practical implementation. This background review started by analyzing deployed and actively used solutions and presents a taxonomy in which schedulers are divided into several hierarchical groups based on their architecture and design. While other taxonomies do exist (e.g., Krauter et al., 2002; Tyagi and Gupta, 2018), this review has focused on the most important design factors that affect the throughput and scalability of a given solution, as well as the incremental improvements which bettered such an architecture. Figure 1 visualizes how the scheduler groups are split. The sections which follow discuss each of these groups separately.

# II. Metacomputing

The concept of connecting computing resources has been an active area of research for some time. The term 'metacomputing' was established as early as 1987 (Smarr and Catlett, 2003), and since then the topic of scheduling has been the focus of many research projects, such as: (i) a service localizing idle workstations and utilizing their spare CPU cycles - HTCondor (Litzkow et al., 1988); (ii) the Mentat - a parallel run-time system developed at the University of Virginia (Grimshaw, 1990); (iii) blueprints for a national supercomputer (Grimshaw et al., 1994); and (iv) the Globus metacomputing infrastructure toolkit (Foster and Kesselman, 1997). Before the work of Foster et al. (2001), there was no clear definition of what 'grid' systems referred to. Following this publication, the principle that grid systems should allow a set of participants to share several connected computer machines and their resources became established. A list of rules defines these shared system policies, including which resources are being shared, who is sharing these resources, the extent to which they can use those resources, and what quality of service they can expect. As shown in the following sections, the requirements of a load balancer in a decentralized system vary significantly compared to scheduling jobs on a single machine (Hamscher et al., 2000). One significant difference is network resources: transferring data between machines is expensive because the nodes tend to be geographically distributed. In addition to the high-impact spreading of tasks across networked machines, the load balancer in Clusters generally provides a mechanism for fault-tolerance and user session management. The sections below also explain the workings of several selected current and historical schedulers and distributed frameworks.
If we can understand these, we will know more about how scheduling algorithms developed over time, as well as the different ways they have been conceptualized. This paper does not purport to be a complete taxonomy of all available designs, but rather presents an analysis of some of the most important concepts and aspects of the history of schedulers.

# III. OS Schedulers

The Operating System (OS) Scheduler, also known as a 'short-term scheduler' or 'CPU scheduler', works within very short time frames, i.e., time-slices. During scheduling events, an algorithm must examine planned tasks and assign them appropriate CPU times (Bulpin, 2005; Arpaci-Dusseau and Arpaci-Dusseau, 2015). This setting requires schedulers to use highly optimized algorithms with very small overheads. Process schedulers face the challenge of maintaining the balance between throughput and responsiveness (i.e., minimum latency). This is generally achieved by prioritizing the execution of processes with a higher sleep/processing ratio (Pabla, 2009). At present, the most advanced strategies also take into consideration the CPU core on which the process ran the previous time, which is known as 'Non-Uniform Memory Access (NUMA) awareness'. The aim is to reuse the same CPU cache memory wherever possible (Blagodurov et al., 2010). The differences in memory access latency can be very substantial, for example ca. 3-4 cycles for L1 cache, ca. 6-10 cycles for L2 cache and ca. 40-100 cycles for L3 cache (Drepper, 2007). A related technique is 'Hyper-Threading (HT) awareness', in which a truly idle physical core is chosen before the logical SMT sibling of a busy one. Given this, NUMA awareness is a crucial element in the design of modern OS schedulers. With a relatively high data load to examine in a short period, the implementation needs to be strongly optimized to ensure fast execution. OS Schedulers tend to provide only a very limited set of configurable parameters, and access to modify them is not straightforward. Some of the parameters, such as the compile-time options CONFIG_FAIR_USER_SCHED and CONFIG_FAIR_CGROUP_SCHED, can change only during the kernel compilation process and require rebooting, while others can be adjusted on the fly using the low-level Linux kernel tool 'sysctl'.

# a) Cooperative Multitasking

Early multitasking Operating Systems, such as Windows 3.1x, Windows 95, 98 and Me, and Mac OS before X, adopted a concept known as Cooperative Multitasking or Cooperative Scheduling (CS). In early implementations of CS, applications voluntarily ceded CPU time to one another. This was later supported natively by the OS, although Windows 3.1x used a non-pre-emptive scheduler which did not interrupt the program; instead, the program needed to explicitly tell the system that it no longer required the processor time. Windows 95 introduced a rudimentary pre-emptive scheduler, although this was for 32-bit applications only (Hart, 1997). The main issue in CS is the hazard posed by a poorly designed program. CS relies on processes regularly giving up control to other processes in the system, meaning that if one process consumes all the available CPU power then the whole system will hang.

# b) Single Queue

Before Linux kernel version 2.4, a simple Circular Queue (CQ) algorithm was used to support the execution of multiple processes on the available CPUs, with a Round Robin policy determining the next process to run (Shreedhar, 1995).
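As an illustration, this single-queue behaviour can be sketched as a circular run queue served in Round Robin order. The following is a toy Python model for illustration only, not actual kernel code:

```python
from collections import deque

def round_robin(processes, quantum):
    """Toy model of a circular run queue served in Round Robin order.

    processes: list of (name, remaining_time) pairs, in queue order.
    quantum: fixed time slice granted per scheduling turn.
    """
    queue = deque(processes)
    while queue:
        name, remaining = queue.popleft()   # take the process at the head
        run_for = min(quantum, remaining)   # run it for at most one time slice
        remaining -= run_for
        print(f"{name} ran for {run_for} units")
        if remaining > 0:                   # unfinished: re-queue at the tail
            queue.append((name, remaining))

round_robin([("A", 7), ("B", 3), ("C", 5)], quantum=2)
```

Every runnable process simply cycles through the same queue, which is why such a scheduler needs to inspect all planned processes and scales poorly as their number grows.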
In kernel version 2.2, processes were further split into non-real-time/real-time categories, and scheduling classes were introduced. This algorithm was replaced by the O(n) scheduler in Linux kernel versions 2.4-2.6. In O(n), processor time is divided into epochs, and within each epoch every task can execute up to its allocated time slice before being pre-empted. The time slice is given to each task at the beginning of each epoch; it is based on the task's static priority added to half of any remaining time-slices from the previous epoch (Bulpin, 2005). Thus, if a task does not use its entire time slice in the current epoch, it can execute for longer in the next one. During a scheduling event, the O(n) scheduler requires iteration through all the processes which are currently planned (Jones, 2009), which can be seen as a weakness, especially for multi-core processors. Linux kernel versions 2.6-2.6.23 implemented the O(1) scheduler. O(1) further splits the process list into active and expired arrays. Here, the arrays are switched once all the processes from the active array have exhausted their allocated time and have been moved to the expired array. The O(1) algorithm analyses the average sleep time of the process, with more interactive tasks being given higher priority to boost system responsiveness. The calculations required are complex and subject to potential errors, where O(1) may cause non-interactive behavior from an interactive process (Wong et al., 2008; Pabla, 2009).

# c) Multilevel Queue

With the O(n) and O(1) algorithms failing to efficiently support application interactivity, the design of OS Schedulers evolved into a multilevel queue. In this queue, repeatedly sleeping (interactive) processes are pushed to the top and executed more frequently. Simultaneously, background processes are pushed down and run less frequently, although for extended periods. Perhaps the most widespread scheduler algorithm is the Multilevel Feedback Queue (MLFQ), which is implemented in all modern versions of Windows NT (2000, XP, Vista, 7 and Server), Mac OS X, NetBSD and Solaris kernels (up to version 2.6, when it was replaced with the O(n) scheduler). MLFQ was first described in 1962 in a system known as the Compatible Time-Sharing System (Corbató et al., 1962). Fernando Corbató was awarded the Turing Award by the ACM in 1990 'for his pioneering work organizing the concepts and leading the development of the general-purpose, large-scale, time-sharing and resource-sharing computer systems, CTSS and Multics'. MLFQ organizes jobs into a set of queues Q_0, Q_1, …, Q_i, wherein a job is promoted to the next queue if it does not finish within 2^i time units. The algorithm always processes the job from the front of the lowest queue, meaning that short processes have preference. Although it has a very poor worst-case scenario, MLFQ turns out to be very efficient in practice (Becchetti et al., 2006); a minimal sketch of its core rule is given below. Staircase Scheduler (Corbet, 2004), Staircase Deadline Scheduler (Corbet, 2007), Brain F. Scheduler (Groves et al., 2009) and Multiple Queue Skiplist Scheduler (Kolivas, 2016) constitute a line of successive schedulers developed by Con Kolivas since 2004, based on the design of the Fair Share Scheduler from Kay and Lauder (1988). None of these schedulers have been merged into the source code of mainstream kernels; they are available only as experimental '-ck' patches. Although the concept behind those schedulers is similar to MLFQ, the implementation details differ significantly.
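For reference, the core MLFQ rule described above (a job on queue Q_i moves to the next queue once it has consumed 2^i time units; the front of the lowest non-empty queue always runs first) can be sketched as a toy Python model. This is a didactic simplification, not the staircase design and not any production kernel's code:

```python
from collections import deque

def mlfq(jobs, levels=4):
    """Toy MLFQ: queue i grants a budget of 2**i time units; a job that
    does not finish within its budget moves to the next (longer-budget,
    less frequently served) queue."""
    queues = [deque() for _ in range(levels)]
    for job in jobs:                     # job = (name, remaining_time)
        queues[0].append(job)
    while any(queues):
        i = next(i for i, q in enumerate(queues) if q)  # lowest non-empty queue
        name, remaining = queues[i].popleft()
        budget = 2 ** i                  # time units allowed at this level
        run = min(budget, remaining)
        remaining -= run
        print(f"{name}: ran {run} units at level {i}")
        if remaining > 0:                # unfinished within 2**i units:
            queues[min(i + 1, levels - 1)].append((name, remaining))

mlfq([("interactive", 2), ("batch", 13)])
```

Short, interactive-style jobs finish while still in the low queues and are therefore served promptly, while long batch jobs sink to the higher queues, where they run for longer stretches but less often.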
The central element of Kolivas' designs is a single, ranked array of processes for each CPU (the 'staircase'). Initially, each process (P1, P2, …) is inserted at the rank determined by its base priority; the scheduler then picks the highest-ranked process (P) and runs it. When P has used up its time slice, it is reinserted into the array at a lower rank, where it will continue to run but at a lower priority. When P exhausts its next time-slice, its rank is lowered again. P then continues until it reaches the bottom of the staircase, at which point it is moved up to one rank below its previous maximum and is assigned two time-slices. When P exhausts these two time-slices, it is reinserted once again in the staircase at a lower rank. When P once again reaches the bottom of the staircase, it is assigned another time-slice and the cycle repeats. P is also pushed back up the staircase if it sleeps for a predefined period. The result of this is that interactive tasks, which tend to sleep more often, should remain at the top of the staircase, while CPU-intensive processes should continuously expend more time-slices but at a lower frequency. Additionally, each rank level in the staircase has its quota, and once the quota has expired all processes on that rank are pushed down. Most importantly, Kolivas' work introduced the concept of 'fairness', meaning that each process gets a comparable share of CPU time to run, proportional to its priority. If a process spends much of its time waiting for I/O events, then its spent CPU time value is low, meaning that it is automatically prioritized for execution. When this happens, interactive tasks which spend most of their time waiting for user input get execution time when they need it, which is where the term 'sleeper fairness' derives from. This design also prevents a situation in which a process is 'starved', i.e., never executed.

# d) Tree-Based Queue

While the work of Con Kolivas has never been merged into the mainstream Linux kernel, it introduced the central concept of 'fairness', which is the crucial feature of the design of most current OS schedulers. At the time of writing, the Linux kernel implements the Completely Fair Scheduler (CFS), which was developed by Ingo Molnár and introduced in kernel version 2.6.23. A central element in this algorithm is a self-balancing red-black tree structure in which processes are indexed by spent processor time. CFS implements the Weighted Fair Queueing (WFQ) algorithm, in which the available CPU time-slices are split between processes in proportion to their priority weights ('niceness'). WFQ is based on the idea of the 'ideal processor', meaning that each process should have an equal share of CPU time adjusted for its priority and the total CPU load (Jones, 2009; Pabla, 2009). Lozi et al. (2016) present an in-depth explanation of the algorithm's workings, noting potential issues regarding the CFS approach. The main criticism revolves around four problematic areas:

- Group Imbalance - The authors' experiments have shown that not every core of their 64-core machine was equally loaded: some cores ran only one process or sometimes no processes at all, while the rest of the cores were overloaded. It appears that the scheduler was not balancing the load because of the hierarchical design and complexity of the load tracking metric. To limit the complexity of the scheduling algorithm, the CPU cores are grouped into scheduling groups, i.e., nodes.
When an idle core attempts to steal work from another node, it compares only the average load of its node with that of its victim's node, and it will steal work only if the average load of its victim's group is higher than its own. The result is inefficiency, since idle cores are concealed by their nodes' average load.

- Scheduling Group Construction - This concern relates to the way scheduling groups are constructed, which is not adapted to modern NUMA machines. Applications in Linux can be pinned to a subset of available cores. CFS might assign the same cores to multiple scheduling groups, with those groups then being ranked by distance; for example, nodes one hop apart, two hops apart and so on. This feature was designed to increase the probability that processes would remain close to their original NUMA node. However, it could result in an application being pinned to particular cores which are separated by more than one hop, with work never being migrated outside the initial core. This might mean that an application uses only one core.

- Overload-on-Wakeup - This problem occurs when a process goes to sleep on a particular node and is then awoken by a process on the same node. In such a scenario, only cores in this scheduling group will be considered to run this process. The aim of this optimization is to improve cache utilization by running a process close to the waker process, meaning that there is the possibility of them sharing the last-level memory cache. However, the result might be the scheduling of a process on a busy core when there are idle cores in alternative nodes, resulting in the severe underutilization of the machine.

- Missing Scheduling Domains - This is the result of the omission of a line of code while refactoring the Linux kernel source code. The number of scheduling domains is incorrectly updated when a particular code path is disabled and then re-enabled, and a loop exits early. As a result, processes can be run only in the same scheduling group as their parent process.

Lozi et al. (2016) have provided a set of patches for the above issues and have presented experimental results after applying the fixes. They have also made available a set of tools on their site which could be used to detect those glitches early in the Linux kernel lifecycle. Moreover, it has been argued that the sheer number of optimizations and modifications implemented in the CFS scheduler changed the initially simple scheduling policy into one which is very complex and bug-prone. As of 12th February 2019, there had been 780 commits to the CFS source code (the 'fair.c' file in the github.com/torvalds/linux repository) since November 2011. As such, an alternative approach is perhaps required, such as a scheduler architecture based on pluggable components. This work demonstrates the immense complexity of scheduling solutions catering to the complexities of modern hardware.

# IV. Cluster Schedulers

There are many differences between distributed computing and traditional computing. For example, the physical size of the system means that there may be thousands of machines involved, with thousands of users being served and millions of API calls or other requests needing processing. While responsiveness and low overheads are often the focus of process schedulers, cluster schedulers focus on high throughput, fault-tolerance, and scalability. Cluster schedulers usually work with queues of jobs spanning hundreds of thousands, and indeed sometimes even millions, of jobs.
They also tend to be more customized and tailored to the needs of the organization which is using them. Cluster schedulers often provide complex administration tools with a wide spectrum of configurable parameters and flexible workload policies. All configurable parameters can generally be accessed via configuration files or the GUI interface. However, it appears that site administrators seldom stray from the default configuration (Etsion and Tsafrir, 2005). The most used scheduling algorithm is simply a First-Come-First-Serve (FCFS) strategy with backfilling optimization. Another challenge, although one which is rarely tackled by commercial schedulers, is minimizing total power consumption. Typically, idle machines consume around half of their peak power (McCullough et al., 2011). Therefore, a Data Center can decrease the total power it consumes by concentrating tasks on fewer machines and powering down the remaining nodes (Pinheiro et al., 2001; Lang and Patel, 2010). The proposed grouping of Cluster schedulers loosely follows the taxonomy presented in Schwarzkopf et al. (2013).

# a) Monolithic Scheduler

The earliest Cluster schedulers had a centralized architecture in which a single scheduling policy allocated all incoming jobs. The tasks would be picked from the head of the queue and scheduled on system nodes in a serial manner by an allocation loop. Examples of centralized schedulers include Maui (Jackson et al., 2001) and its successor Moab (Adaptive Computing, 2015), as well as Univa Grid Engine (Gentzsch, 2001). Monolithic schedulers implement a wide array of policies and algorithms, such as FCFS, FCFS with backfilling and gang scheduling, Shortest Job First (SJF), and several others. The Kubernetes (Greek: 'κυβερνήτης') scheduler implements a range of scoring functions, such as node or pod affinity/anti-affinity, resource best-fit and worst-fit, required image locality, etc., which can be additionally weighted and combined into a node's score value (Lewis and Oppenheimer, 2017). As an interesting note, one of the functions implemented in Kubernetes (the Balanced Resource Allocation routine) evaluates the balance of utilized resources (CPU and memory) on a scored node. Monolithic schedulers often face a 'head-of-queue' blocking problem, in which shorter jobs are held while a long job is waiting for a free node. To counter this problem, the schedulers often implement 'backfilling' optimization, whereby shorter jobs are allowed to execute while the long job is waiting (a sketch follows at the end of this subsection). Perhaps the most widespread scheduler is the Simple Linux Utility for Resource Management (SLURM) (Yoo et al., 2003). SLURM uses a best-fit algorithm which is based on either Hilbert curve scheduling or fat-tree network topology, and it can scale to thousands of CPU cores (Pascual, 2009). At the time of writing, the fastest supercomputer in the world is Sunway TaihuLight (Chinese: '神威·太湖之光'), which uses over 40k CPU processors, each of which contains 256 cores. Sunway TaihuLight's workload is managed by SLURM (TOP500 Project, 2017). The Fuxi (Chinese: '伏羲') scheduler presents a unique strategy in that it matches newly-available resources against the backlog of tasks, rather than matching tasks to available resources on nodes. This technique allowed Fuxi to achieve very high utilization of Cluster resources, namely 95% utilization of memory and 91% utilization of CPU. Fuxi has been supporting Alibaba's workload since 2009, and it scales to ca. 5k nodes (Zhang et al., 2014).
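Returning to the backfilling optimization mentioned above, the core idea can be sketched in a few lines: jobs are taken in arrival order, but while the head job waits for enough free nodes, shorter jobs from further back may run if they fit the currently free nodes. This is a simplification; real implementations (e.g., conservative or EASY backfilling) also check that backfilled jobs cannot delay the head job's reservation:

```python
def fcfs_backfill(jobs, free_nodes):
    """Toy FCFS + backfilling pass.
    jobs: list of dicts with 'name' and 'nodes' (node count requested),
    in arrival order. Returns names of jobs started in this pass."""
    started = []
    if not jobs:
        return started
    head = jobs[0]
    if head["nodes"] <= free_nodes:          # head of the queue fits: run it
        free_nodes -= head["nodes"]
        started.append(head["name"])
        jobs = jobs[1:]
    else:
        # Head is blocked: backfill later jobs that fit the remaining nodes.
        for job in jobs[1:]:
            if job["nodes"] <= free_nodes:
                free_nodes -= job["nodes"]
                started.append(job["name"])
    return started

queue = [{"name": "big", "nodes": 8}, {"name": "small1", "nodes": 2},
         {"name": "small2", "nodes": 3}]
print(fcfs_backfill(queue, free_nodes=4))    # ['small1'] — 'big' keeps waiting
```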
While Cluster scheduler designs have generally moved towards solutions which are more parallel, as demonstrated in the next subsection, centralized architecture is still the most common approach in High-Performance Computing. Approximately half the world's supercomputers use SLURM as their workload manager, while Moab is currently deployed on about 40% of the top 10, top 25 and top 100 systems on the TOP500 list (TOP500 Project, 2017).

# b) Concurrent Scheduling

Historically, monolithic schedulers were frequently built on the premise of supporting a single 'killer-application' (Barroso et al., 2003). However, the workload of the data center has become more heterogeneous, and a modern Cluster system runs hundreds of unique programs with distinctive resource requirements and constraints. A single code base of a centralized workload manager means that it is not easy to add a variety of specialized scheduling policies. Furthermore, as workload size increases, the time to reach a scheduling decision is progressively limited. The result of this is a restriction of the selection of scheduling algorithms to less sophisticated ones, which affects the quality of allocations. To tackle those challenges, Cluster schedulers developed designs which are more parallel.

# i. Statically Partitioned

The solution to the numerous policies and the lack of parallelism in central schedulers was to split the Cluster into specialized partitions and manage them separately. Quincy (Isard et al., 2009), a scheduler managing the workload of Microsoft's Dryad, follows this approach. An application for Dryad is modeled as a Directed Acyclic Graph (DAG) in which the developer defines an application dataflow model and supplies subroutines to be executed at specified graph vertices. The scheduling policies and tuning parameters are specified by adjusting weights and capacities on a graph data structure. Quincy implements a Greedy strategy. In this approach, the scheduler assumes that the currently scheduled job is the only job running on the cluster and so always selects the best node available. Tasks are run by remote daemon services. From time to time these services update the job manager about the execution status of the vertex, which in the case of failure might be re-executed. Should any task fail more than a configured number of times, the entire job is marked as failed (Isard et al., 2007). Microsoft has built several frameworks on top of Dryad, such as COSMOS (Helland and Harris, 2011). The main criticism of static partitioning is inflexibility, that is, exclusive sets of machines in a Cluster are dedicated to certain types of workload. This might result in one part of the Cluster being relatively idle while other nodes are very active. This issue leads to the Cluster's fragmentation and the suboptimal utilization of available nodes, since no machine sharing is allowed.

# ii. Two-Level Hierarchy

The solution to the inflexibility of static partitioning was to introduce a two-level architecture in which a Cluster is partitioned dynamically by a central coordinator. The actual task allocations take place at the second level of the architecture, in one of the specialized schedulers. The first two-level scheduler was Mesos (Hindman et al., 2011). It was developed at the University of California, Berkeley, and is now hosted by the Apache Software Foundation. Mesos was a foundation for other Cluster systems such as Twitter's Aurora (Aurora, 2018) and Marathon (Mesosphere, 2018).
Mesos introduces a two-level scheduling mechanism in which a centralized Mesos Master acts as a resource manager. It dynamically allocates resources to different scheduler frameworks, e.g., Hadoop, Spark and Kafka, via Mesos Agents. Mesos Agents are deployed on cluster nodes and use Linux's cgroups or Docker containers (depending upon the environment) for resource isolation. Resources are distributed to the frameworks in the form of 'offers' which contain currently unused resources. Scheduling frameworks have autonomy in deciding which resources to accept and which tasks to run on them. Mesos is most effective when tasks are relatively small, short-lived and have a high resource churn rate, i.e., they relinquish resources frequently. In the current version (1.4.1), only one scheduling framework can examine a resource offer at any given time. This resource is effectively locked for the duration of a scheduling decision, meaning that concurrency control is pessimistic. Campbell (2017) presents several practical considerations for using Mesos in a production environment, in addition to advice on best practice. Two-level schedulers offered a working solution to the lack of parallelization in central schedulers and the low efficiency of statically partitioned Clusters. Nevertheless, the mechanism used causes resources to remain locked while a specialized scheduler examines a resource offer, meaning that the benefits from parallelization are limited by pessimistic locking. Furthermore, the schedulers do not coordinate with each other and must rely on a centralized coordinator to make them offers. This further restricts their visibility of the resources in a Cluster.

# iii. Shared State

To address the limited parallelism of the two-level scheduling design, the alternative approach taken by some organizations was to redesign the scheduler's architecture into several schedulers, all working concurrently. The schedulers work on shared Cluster state information and manage their resource reservations using an optimistic concurrency control method. A sample of such systems includes Microsoft's Apollo (Boutin et al., 2014), Google's Omega (Schwarzkopf et al., 2013), and HashiCorp's Nomad (HashiCorp, 2018). By default, Nomad runs one scheduling worker per CPU core. Scheduling workers pick job submissions from the broker queue and then submit them to one of the three schedulers: a long-lived services scheduler, a short-lived batch jobs scheduler, or a system scheduler, which is used to run internal maintenance routines. Additionally, Nomad can be extended to support custom schedulers. Schedulers process and generate an action plan, which constitutes a set of operations to create new allocations, or to evict and update existing ones (HashiCorp, 2018). Microsoft's Apollo design seems to be primarily tuned for a high task churn, and at peak times it is capable of handling more than 100k scheduling requests per second on a ca. 20k-node cluster. Apollo uses a set of per-job schedulers called Job Managers (JM), wherein a single job entity contains a multiplicity of tasks which are then scheduled and executed on computing nodes. Tasks are generally short-lived batch jobs (Boutin et al., 2014). Apollo has a centralized Resource Monitor (RM), while each node runs its own Process Node (PN) with a queue of tasks. Each PN is responsible for local scheduling decisions and can independently reorder its job queue to allow smaller tasks to be executed immediately, while larger tasks wait for resources to become available.
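This local reordering can be illustrated with a toy model in which a node lets small tasks jump ahead whenever they fit into the currently free resources. This is a hypothetical simplification of the behaviour described in Boutin et al. (2014), not Apollo's actual implementation:

```python
def reorder_queue(tasks, free_cpu):
    """Toy Process-Node-style reordering: run small tasks immediately when
    they fit the node's free CPU; larger tasks keep waiting in queue order.
    tasks: list of (name, cpu_needed) pairs in arrival order."""
    run_now, still_waiting = [], []
    for name, cpu in tasks:
        if cpu <= free_cpu:            # small task fits: execute immediately
            free_cpu -= cpu
            run_now.append(name)
        else:                          # keep waiting for resources to free up
            still_waiting.append((name, cpu))
    return run_now, still_waiting

print(reorder_queue([("t-large", 12), ("t-small", 2), ("t-mid", 6)], free_cpu=8))
# (['t-small', 't-mid'], [('t-large', 12)])
```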
In addition, a PN computes a wait-time matrix based on its queue, which publicizes the future availability of the node's resources. Scheduling decisions are made optimistically by JMs based on the shared cluster resource state, which is continuously retrieved and aggregated by the RM. Furthermore, Apollo categorizes tasks as 'regular' and 'opportunistic'. Opportunistic tasks are used to fill resource gaps left by regular tasks. The system also prevents overloading the cluster by limiting the total number of regular tasks that can be run on the cluster. Apollo implements locality optimization by taking into consideration the location of data for a given task. For example, the system will score nodes higher if the required files are already on the local drive, as opposed to machines needing to download the data (Boutin et al., 2014). Historically, Omega was a spinoff from Google's Borg scheduler. Despite the various optimizations acquired by Borg over the years to address head-of-line blocking and scalability problems, including internal parallelism and multi-threading, Google decided to create the Omega scheduler from the ground up (Schwarzkopf et al., 2013). Omega introduced several innovations, such as storing the state of the cluster in a centralized Paxos-based store that was accessed by multiple components simultaneously, with optimistic concurrency control resolving the conflicts which emerged. This feature allowed Omega to run several schedulers at the same time and improve throughput. Many of Omega's innovations have since been folded into Borg (Burns et al., 2016). Omega's authors highlight the disadvantages of the shared state and the parallel reservation of resources, namely: (i) the state of a node could have changed considerably while the allocation decision was being made, such that it is no longer possible for this node to accept the job; (ii) two or more allocations to the same node could have conflicted, in which case both scheduling decisions are nullified; and (iii) this strategy introduces significant difficulties when gang-scheduling a batch of jobs, as (i) or (ii) may be happening (Schwarzkopf et al., 2013). In this research, Google's Borg received special attention as one of the most advanced and well-published schedulers. Moreover, while other schedulers are designed to support either a high churn of short-term jobs, e.g., Microsoft's Apollo (Boutin et al., 2014) and Alibaba's Fuxi (Zhang et al., 2014), or else a limited number of long-term services, such as Twitter's Aurora (Aurora, 2018), Google's engineers have created a system which supports a mixed workload. Borg replaced two previous systems, Babysitter and the Global Work Queue, which were used to manage long-running services and batch jobs separately (Burns et al., 2016). Given the significance of Borg's design for this research, it is discussed separately in section V.

# iv. Decentralised Load Balancer

The research in Sliwko (2018) proposes a new type of Cluster workload orchestration model, in which the actual scheduling logic is processed on the nodes themselves. This is a significant step towards completely decentralized Cluster orchestration. The cluster state is retrieved from a subnetwork of BAs, although this system does not rely on the accuracy of this information and uses it exclusively to retrieve an initial set of candidate nodes where a task could potentially run. The actual task-to-machine matching is performed between the nodes themselves.
As such, this design avoids the pitfalls of concurrent resource locking, which include conflicting scheduling decisions and non-current node state information. Moreover, the decentralization of the scheduling logic also lifts complexity restrictions on that logic, meaning that a wider range of scheduling algorithms can be used, such as metaheuristic methods.

# c) Big Data Schedulers

In the taxonomy presented in this paper, Big Data schedulers are visualized as a separate branch from Cluster schedulers. Although Big Data schedulers seem to belong to one of the Cluster scheduler designs discussed previously, this separation signifies their overspecialization and the fact that only a very restricted set of operations is supported (Isard et al., 2007; Zaharia et al., 2010). The scheduling mechanisms are often intertwined with programming language features, with Big Data frameworks often providing their own API (Zaharia et al., 2009; White, 2012) and indeed sometimes even their own custom programming language, as seen with Skywriting in CIEL (Murray et al., 2011). Generally speaking, Big Data frameworks provide very fine-grained control over how data is accessed and processed across the cluster, such as Spark RDD objects' persist operations or partitioners (Zaharia et al., 2012). Such a deep integration of scheduling logic with applications is a distinctive feature of Big Data technology. At the time of writing, Big Data is also the most active distributed computing research area, with new technologies, frameworks and algorithms being released regularly. Big Data is the term which describes the storage and processing of any data sets so large and complex that they become unrealistic to process using traditional data processing applications based on relational database management systems. It depends on the individual organization as to how much data is described as Big Data. The following example provides an idea of scale: the Large Hadron Collider produces about fifteen petabytes of data per year (White, 2012). As a result of the massive size of the stored and processed data, the central element of a Big Data framework is its distributed file system, such as the Hadoop Distributed File System (Gog, 2012), the Google File System (Ghemawat et al., 2003) and its successor Colossus (Corbett et al., 2013). The nodes in a Big Data cluster fulfill the dual purposes of storing the distributed file system parts, usually in a few replicas for fault-tolerance, and providing a parallel execution environment for system tasks. The speed difference between locally-accessed and remotely stored input data is very substantial, meaning that Big Data schedulers are very focused on providing 'data locality', i.e., running a given task on the node where the input data are stored or in the closest proximity to it. The origins of Big Data technology are in the 'MapReduce' programming model, which implements the concept behind Google's inverted search index. Developed in 2003 (Dean and Ghemawat, 2010) and later patented in 2010 (U.S. Patent 7,650,331), the Big Data design has evolved significantly in the years since, as presented in the subsections below.

# i. MapReduce

MapReduce is the most widespread principle which has been adopted for processing large sets of data in parallel. Originally, the name MapReduce only referred to Google's proprietary technology, but the term is now broadly used to describe a wide range of software, such as Hadoop, CouchDB, Infinispan, and MongoDB.
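The canonical illustration of the model is a word count, in which 'map' emits a (word, 1) pair for every word in an input split and 'reduce' sums the counts per word. A minimal, framework-free sketch in plain Python (not any vendor's API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """'Map': emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """'Reduce': sum the values of every group sharing the same key.
    Real frameworks shuffle and sort between the phases; sorted() stands in here."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

splits = ["big data big clusters", "data locality matters"]
pairs = [pair for split in splits for pair in map_phase(split)]
print(dict(reduce_phase(pairs)))
# {'big': 2, 'clusters': 1, 'data': 2, 'locality': 1, 'matters': 1}
```

In a real deployment, the map calls run in parallel on the nodes holding the input blocks, and only the intermediate (key, value) pairs are shuffled across the network to the reducers.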
The most important features of MapReduce are its scalability and fine-grained fault-tolerance. The 'map' and 'reduce' operations present in Lisp and other functional programming languages inspired the original thinking behind MapReduce (Dean and Ghemawat, 2010):

- 'Map' is an operation used in the first step of computation and is applied to all available data; it performs the filtering and transforming of all key-value pairs from the input data set. The 'map' operation is executed in parallel on multiple machines on a distributed file system. Each 'map' task can be restarted individually, and a failure in the middle of a multi-hour execution does not require restarting the whole job from scratch.

- The 'reduce' operation is executed after the 'map' operations complete. It performs finalizing operations, such as counting the number of rows matching specified conditions and yielding field frequencies. The 'reduce' operation is fed using a stream iterator, thereby allowing the framework to process the list of items one at a time, thus ensuring that the machine memory is not overloaded (Dean and Ghemawat, 2010; Gog, 2012).

Following the development of the MapReduce concept, Yahoo! engineers began the Open Source project Hadoop. In February 2008, Yahoo! announced that its production search index was being generated by a 10k-core Hadoop cluster (White, 2012). Subsequently, many other major Internet companies, including Facebook, LinkedIn, Amazon and Last.fm, joined the project and deployed it within their architectures. Hadoop is currently hosted by the Apache Software Foundation as an Open Source project. As in Google's original MapReduce, Hadoop's users submit jobs which consist of 'map' and 'reduce' operation implementations. Hadoop splits each job into multiple 'map' and 'reduce' tasks, which subsequently process each block of input data, typically 64MB or 128MB (Gog, 2012). Hadoop's scheduler allocates a 'map' task to the closest possible node to the required input data - the so-called 'data locality' optimization. In so doing, we can see the following allocation order: the same node, the same rack, and finally a remote rack (Zaharia et al., 2009). To further improve performance, the Hadoop framework uses 'backup tasks', in which a speculative copy of a task is run on a separate machine in order to finish the computation more quickly. If the original node is available but behaving poorly (a so-called 'straggler'), the job is as slow as the misbehaving task. This behavior can occur for many reasons, such as faulty hardware or misconfiguration. Google estimated that using 'backup tasks' could improve job response times by 44% (Dean and Ghemawat, 2010). At the time of writing, Hadoop comes with a selection of schedulers, as outlined below:

- 'FIFO Scheduler' is the default scheduling system, in which user jobs are scheduled using a queue with five priority levels. Typically, jobs use the whole cluster, so they must wait their turn. When the scheduler chooses the next job to run, it selects the job with the highest priority, which can result in low-priority jobs being endlessly delayed (Zaharia et al., 2009; White, 2012).

- 'Fair Scheduler' is part of the cluster management technology Yet Another Resource Negotiator (YARN) (Vavilapalli et al., 2013), which replaced the original Hadoop engine in 2012. In Fair Scheduler, each user has their own pool of jobs, and the system focuses on giving each user a proportional share of cluster resources over time.
The scheduler uses a version of 'max-min fairness' (Bonald et al., 2006) with minimum capacity guarantees, specified as the number of 'map' and 'reduce' task slots, to allocate tasks across users' job pools. When one pool is idle and its minimum share of the task slots is not being used, other pools can use its available task slots.

- 'Capacity Scheduler' is the second scheduler introduced within the YARN framework. Essentially, this scheduler is a number of separate MapReduce engines, providing FCFS scheduling for each user or organization. The queues can be hierarchical, with a queue having child queues, and each queue is allocated a task slot capacity which can be divided between 'map' and 'reduce' tasks. Task slot allocation between queues is similar to the sharing mechanism between pools found in Fair Scheduler (White, 2012).

The main criticism of MapReduce is its acyclic dataflow programming model. The stateless 'map' task must be followed by a stateless 'reduce' task, which is then executed by the MapReduce engine. This model makes it challenging to repeatedly access the same dataset, a common action during the execution of iterative algorithms (Zaharia et al., 2009).

# ii. Iterative Computations

Following the success of Apache Hadoop, several alternative designs were created to address Hadoop's suboptimal performance when running iterative MapReduce jobs. Examples of such systems include HaLoop (Bu et al., 2010) and Spark (Zaharia et al., 2010). HaLoop was developed on top of Hadoop, with various caching mechanisms and optimizations added. These make the framework loop-aware, for example by adding programming support for iterative applications and storing output data on the local disk. Additionally, HaLoop's scheduler keeps a record of every data block processed by each task on physical machines, and it considers inter-iteration locality when scheduling subsequent tasks. This feature helps to minimize costly remote data retrieval, meaning that tasks can use data cached on a local machine (Bu et al., 2010; Gog, 2012). Like HaLoop's developers, Spark's authors noted the suboptimal performance of iterative MapReduce jobs in the Hadoop framework. In certain kinds of application, such as iterative Machine Learning algorithms and interactive data analysis tools, the same data are repeatedly accessed in multiple steps and then discarded, so it does not make sense to send them back and forth to a central node. In such scenarios, Spark will outperform Hadoop (Zaharia et al., 2012). Spark is built on top of HDFS, but it does not follow the two-stage model of Hadoop. Instead, it introduces resilient distributed datasets (RDD) and parallel operations on these datasets (Gog, 2012):

- 'reduce' - combines dataset elements using a provided function;
- 'collect' - sends all the elements of the dataset to the user program;
- 'foreach' - applies a provided function to every element of a dataset.

Spark also provides two types of shared variables:

- 'accumulators' - variables onto which each worker can apply associative operations, meaning that they are efficiently supported in parallel;
- 'broadcast variables' - sent once to every node, with nodes then keeping a read-only copy of those variables (Zecevic, 2016).

A short sketch of these primitives in use follows below.
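The following PySpark-flavoured sketch exercises each of the primitives just listed. It assumes a locally installed Spark; the method names follow the public PySpark API, though details vary between versions:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

rdd = sc.parallelize([1, 2, 3, 4])          # a resilient distributed dataset

total = rdd.reduce(lambda a, b: a + b)      # 'reduce': combine elements (10)
elements = rdd.collect()                    # 'collect': ship all elements back

acc = sc.accumulator(0)                     # 'accumulator': associative adds only
rdd.foreach(lambda x: acc.add(x))           # 'foreach': apply to every element

lookup = sc.broadcast({"threshold": 3})     # 'broadcast': read-only copy per node
above = rdd.filter(lambda x: x > lookup.value["threshold"]).collect()

print(total, elements, acc.value, above)    # 10 [1, 2, 3, 4] 10 [4]
sc.stop()
```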
The Spark job scheduler implementation is conceptually similar to that of Dryad's Quincy; however, it also considers which partitions of an RDD are available in memory. The framework then re-computes missing partitions, and tasks are sent to the closest possible node to the required input data (Zaharia et al., 2012). Another significant feature implemented in Spark is the concept of 'delayed scheduling'. When a head-of-line job that should be scheduled next cannot launch a local task, Spark's scheduler delays the task's execution and lets other jobs start their tasks instead. However, if the job has been skipped for long enough, typically a period of up to ten seconds, it launches a non-local task. Since a typical Spark workload consists of short tasks, and therefore has a high task-slot churn, tasks have a high chance of being executed locally. This feature helps to achieve near-optimal 'data locality' with a very small effect on fairness; in addition, cluster throughput can be almost doubled, as shown in an analysis performed on Facebook's workload traces (Zaharia et al., 2010).

# iii. Distributed Stream Processing

The core concept behind distributed stream processing engines is the processing of incoming data items in real time by modelling a data flow in which there are several stages which can be processed in parallel. Other techniques include splitting the data stream into multiple sub-streams and redirecting them into a set of networked nodes (Liu and Buyya, 2017). Inspired by Microsoft's research into DAG models (Isard et al., 2009), Apache Storm (Storm) is a distributed stream processing engine used by Twitter following extensive development (Toshniwal et al., 2014). Its initial release was on 17 September 2011, and by September 2014 it had become open-source and an Apache Top-Level Project. Programs in Storm are designed as a topology in the shape of a DAG, consisting of 'spouts' and 'bolts'; the defined topology acts as a distributed data transformation pipeline:

- 'Spouts' read data from external sources and emit them into the topology as a stream of 'tuples'. This structure is accompanied by a schema which defines the names of the tuples' fields. Tuples can contain primitive values such as integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays; additionally, custom serializers can be defined to interpret this data.

- The processing stages of a stream are defined in 'bolts', which can perform data manipulation, filtering, aggregations, joins, and so on. Bolts can also constitute more complex transforming structures that require multiple steps (thus, multiple bolts). Bolts can communicate with external applications such as databases and Kafka queues (Toshniwal et al., 2014).

In comparison to MapReduce and the iterative algorithms introduced in the subsections above, Storm topologies, once created, run indefinitely until killed. Given this, an inefficient scattering of an application's tasks among Cluster nodes has a lasting impact on performance. Storm's default scheduler implements a Round Robin strategy, and for resource allocation purposes Storm assumes that every worker is homogenous. This design results in frequent resource over-allocation and the inefficient use of inter-system communications (Kulkarni et al., 2018). To solve this issue, more complex solutions have been proposed, such as D-Storm (Liu and Buyya, 2017), whose scheduling strategy is based on a greedy metaheuristic algorithm which monitors the volume of the incoming workload and is resource-aware.
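The difference between the two placement philosophies can be shown with a toy contrast between homogeneity-assuming round-robin placement and a resource-aware greedy placement in the spirit of D-Storm. This is schematic only; D-Storm's actual algorithm is more elaborate and adapts to the observed workload:

```python
def place_round_robin(executors, workers):
    """Storm-default style: spread executors over workers in turn,
    ignoring the executors' actual resource demands."""
    return {ex: workers[i % len(workers)][0] for i, (ex, _) in enumerate(executors)}

def place_greedy(executors, workers):
    """Resource-aware greedy placement: biggest executor first, onto the
    worker with the most remaining capacity."""
    capacity = dict(workers)
    placement = {}
    for ex, demand in sorted(executors, key=lambda e: -e[1]):
        target = max(capacity, key=capacity.get)   # worker with most free capacity
        capacity[target] -= demand                 # may over-allocate when full
        placement[ex] = target
    return placement

executors = [("spout-1", 4), ("bolt-1", 8), ("bolt-2", 1)]
workers = [("w1", 8), ("w2", 8)]
print(place_round_robin(executors, workers))  # alternates w1, w2, w1 blindly
print(place_greedy(executors, workers))       # bolt-1 -> w1; the rest fit on w2
```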
Typical examples of Storm's usage include:

- processing a stream of new data and updating databases in real time, for example in trading systems wherein data accuracy is crucial;
- continuously querying and forwarding the results to clients in real time, for example streaming trending topics on Twitter into browsers; and
- parallelizing a computing-intensive query on the fly, i.e., a distributed Remote Procedure Call (RPC) wherein a large number of sets are probed (Marz, 2011).

Storm has gained widespread popularity and is used by companies such as Groupon, Yahoo!, Spotify, Verisign, Alibaba, Baidu, Yelp, and many more. A comprehensive list of users is available at the storm.apache.org website. At the time of writing, Storm is being replaced at Twitter by a newer distributed stream processing engine, Heron (Kulkarni et al., 2018), which continues the DAG model approach but focuses on various architectural improvements such as reduced overheads, testability, and easier access to debug data.

# V. Google's Borg

To support its operations, Google utilizes a high number of data centers around the world, which at the time of writing number sixteen. Borg admits, schedules, starts, restarts and monitors the full range of applications run by Google. Borg users are Google developers and system administrators, and they submit their workload in the form of jobs. A job may consist of one or more tasks that all run the same program (Burns et al., 2016).

# a) Design Concepts

The central module of the Borg architecture is the BorgMaster, which maintains an in-memory copy of most of the state of the cell. This state is also saved in a distributed Paxos-based store (Lamport, 1998). While the BorgMaster is logically a single process, it is replicated five times to improve fault-tolerance. The main design priority of Borg was resilience rather than performance. Google services are seen as very durable and reliable, the result of a multi-tier architecture in which no component is a single point of failure. Current allocations of tasks are saved to the Paxos-based storage, and the system can recover even if all five BorgMaster instances fail. Each cell in the Google Cluster is managed by a single BorgMaster controller. Each machine in a cell runs a BorgLet, an agent process responsible for starting and stopping tasks and restarting them should they fail. The BorgLet manages local resources by adjusting local OS kernel settings and reports the state of its node to the BorgMaster and other monitoring systems. The Borg system offers extensive options to control and shape its workload, including priority bands for tasks (i.e., monitoring, production, batch, and best effort), resource quotas and admission control. Higher-priority tasks can pre-empt locally-running tasks to obtain the resources they require; an exception is made for production tasks, which cannot be pre-empted. Resource quotas are part of admission control and are expressed as a resource vector at a given priority, for some period of time (usually months). Jobs with insufficient quotas are rejected immediately upon submission, as sketched below. Production jobs are limited to the actual resources available to the BorgMaster in a given cell. The Borg system also exposes a web-based interface called Sigma, which displays the state of all users' jobs, shows details of their execution history and, if a job has not been scheduled, also provides a 'why pending?' annotation with guidance on how to modify the job's resource requests to better fit the cell (Verma et al., 2015).
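A minimal sketch of such quota-based admission control follows: a job is rejected at submission time if its resource vector exceeds the remaining quota at the requested priority band. This is a hypothetical simplification for illustration; Borg's real admission logic is not public beyond Verma et al. (2015):

```python
def admit(job, quotas):
    """Toy admission check: compare a job's resource vector against the
    remaining quota at the job's priority band; reject on any shortfall.
    quotas: {priority: {"cpu": ..., "ram_gb": ...}} remaining per band."""
    remaining = quotas.get(job["priority"], {})
    for resource, amount in job["request"].items():
        if amount > remaining.get(resource, 0):
            return False                     # insufficient quota: reject now
    for resource, amount in job["request"].items():
        remaining[resource] -= amount        # reserve quota for the admitted job
    return True

quotas = {"production": {"cpu": 100, "ram_gb": 512},
          "batch": {"cpu": 20, "ram_gb": 64}}
job = {"priority": "batch", "request": {"cpu": 30, "ram_gb": 16}}
print(admit(job, quotas))   # False — the CPU request exceeds the batch quota
```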
The dynamic nature of the Borg system means that tasks might be started, stopped and then rescheduled on an alternative node. Google engineers have therefore created the concept of a static Borg Name Service (BNS), which is used to identify a task run within a cell and to retrieve its endpoint address. The BNS address is predominantly used by load balancers to transparently redirect RPC calls to the endpoint of a given task. Meanwhile, Borg's resource reclamation mechanisms help to reclaim under-utilized resources from cell nodes for non-production tasks. Although in theory users may request high resource quotas for their tasks, in practice these are rarely fully utilized continuously; instead, tasks have peak times of the day, or resources are reserved for coping with a denial-of-service attack. The BorgMaster has routines that estimate resource usage levels for a task and reclaim the rest for low-priority jobs from the batch or best effort bands (Verma et al., 2015).

# b) Jobs Schedulers

Early versions of Borg had a simple, synchronous loop that accepted job requests and evaluated on which node to execute them. The current design of Borg deploys several schedulers working in parallel; the scheduler instances use a shared state of the available resources, and resource offers are not locked during scheduling decisions (optimistic concurrency control). Where two or more schedulers allocate jobs to the same resources, all the jobs involved are returned to the jobs queue (Schwarzkopf et al., 2013). When allocating a task, Borg's scheduler scores a set of available nodes and selects the most feasible machine for the task. Initially, Borg implemented a variation of the Enhanced Parallel Virtual Machine algorithm (E-PVM) (Amir et al., 2000) for calculating the task allocation score. Although this resulted in a fair distribution of tasks across nodes, it also increased fragmentation and caused later difficulties when fitting large jobs which required most of a node's resources, or even a whole node. The opposite of the E-PVM approach is a best-fit strategy, which packs tasks very tightly. The best-fit approach may, in turn, result in the excessive pre-empting of other tasks running on the same node, especially when a user miscalculates the resources required, or when the application has frequent load spikes. The current model used by Borg's scheduler is a hybrid approach that tries to reduce resource usage gaps (Verma et al., 2015); the sketch below contrasts the three strategies. Borg also takes advantage of resource pre-allocation using 'allocs' (short for allocation). Allocs can be used to pre-allocate resources for future tasks, to retain resources between task restarts, or to gather class-equivalent or related tasks, such as web applications and associated log-saver tasks, onto the same machine. If an alloc is moved to another machine, its tasks are also rescheduled. One point to note is that, similar to MetaCentrum users (Klusáček and Rudová, 2010), Google's users tend to overestimate the memory resources needed to complete their jobs, in order to prevent jobs being killed for exceeding the allocated memory. In over 90% of cases, users overestimate how many resources are required, which in certain cases can waste up to 98% of the requested resource (Moreno et al., 2013; Ray et al., 2017).
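To make the trade-off concrete: an E-PVM-style spread score favours the least-loaded node, a best-fit score favours the tightest remaining gap, and a hybrid can, for example, pack while penalizing 'stranded' leftovers that are too small for a typical future task. The scoring below uses illustrative weights invented for this sketch, not Borg's actual formula:

```python
TYPICAL_TASK = 4   # assumed size of a typical future task (illustrative)

def score_node(free, demand, mode="hybrid"):
    """Score one node for a task; higher is better (single CPU dimension)."""
    if demand > free:
        return float("-inf")                 # node cannot host the task at all
    leftover = free - demand
    if mode == "spread":                     # E-PVM-like: keep the load even
        return leftover
    if mode == "best_fit":                   # pack as tightly as possible
        return -leftover
    # Hybrid: pack, but heavily penalize leaving a gap too small for a
    # typical future task ('stranded' resources no one can use).
    stranded = leftover if 0 < leftover < TYPICAL_TASK else 0
    return -leftover - 10 * stranded

nodes = {"n1": 16, "n2": 8, "n3": 5}         # free CPUs per node
for mode in ("spread", "best_fit", "hybrid"):
    best = max(nodes, key=lambda n: score_node(nodes[n], 4, mode))
    print(mode, "->", best)                  # spread->n1, best_fit->n3, hybrid->n2
```

With these example weights, the hybrid avoids both stranding a single unusable CPU on n3 and needlessly spreading onto the emptiest node n1.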
# c) Optimisations

Over the years, the Borg design has acquired several optimizations, namely:

- Score caching - checking a node's feasibility and scoring it is a computationally expensive process; therefore, scores for nodes are cached, and small differences in the required resources are ignored.

- Relaxed randomization - instead of evaluating a task against all available nodes, Borg examines machines in random order until it finds enough feasible nodes, and then selects the highest scoring node in this set.

While the Borg architecture remains heavily centralized, this approach does seem to be successful. Although it eliminates head-of-line job blocking problems and offers better scalability, it also generates additional overheads for resolving resource collisions. Nevertheless, the benefits of better scalability often outweigh the additional computation costs incurred once scalability targets are achieved (Schwarzkopf et al., 2013).

# VI. Summary and Conclusions

This paper has presented a taxonomy of available schedulers, ranging from early implementations to modern versions. Aside from optimizing throughput, different classes of schedulers have evolved to solve different problems. For example, while OS schedulers maximize responsiveness, Cluster schedulers focus on scalability, support for a wide range of unique (often legacy) applications, and fairness. Big Data schedulers are specialized to solve issues accompanying operations on large datasets, and their scheduling mechanisms are often extensively intertwined with programming language features. Table 1 presents a comparison of the presented schedulers with their main features and deployed scheduling algorithms. OS schedulers have evolved in such a way that their focus is on maximizing responsiveness while still providing good performance. Interactive processes which sleep more often should be allocated time-slices more frequently, while background processes should be allocated longer, but less frequent, execution times. The CPU switches between processes extremely rapidly, which is why modern OS scheduling algorithms were designed with very low overheads (Wong et al., 2008; Pinel et al., 2011). Most end-users of this class of schedulers are non-technical; as such, those schedulers usually have a minimal set of configuration parameters (Groves et al., 2009). OS scheduling was previously deemed to be a solved problem (Torvalds, 2001), but the introduction and popularization of multi-core processors by Intel (Intel Core 2 Duo) and AMD (AMD Phenom II) in the mid-2000s enabled applications to execute in parallel, meaning that scheduling algorithms needed to be reimplemented to be efficient once more. Modern OS schedulers also consider NUMA properties when deciding which CPU core a task will be allocated to. Furthermore, the most recent research explores the potential application of dynamic voltage and frequency scaling technology in scheduling to minimize power consumption by CPU cores (Sarood et al., 2012; Padoin et al., 2014). Given that it is hard to build a good universal solution which caters to the complexities of modern hardware, it is reasonable to develop the modular scheduler architecture suggested in Lozi et al. (2016). Cluster schedulers have a difficult mission in ensuring 'fairness'. In this context, namely a very dynamic environment consisting of a variety of applications, fairness means sharing cluster resources proportionally while simultaneously ensuring a stable throughput. Cluster systems tend to allow administrators to implement complex resource sharing policies with multiple input parameters (Adaptive Computing, 2002).
Cluster systems implement extensive fault-tolerance strategies and sometimes also focus on minimizing power consumption (Lang and Patel, 2010). Surprisingly, it appears that the most popular scheduling approach is a simple FCFS strategy with variants of backfilling. However, due to rapidly increasing cluster sizes, current research focuses on parallelization, as seen in systems such as Google's Borg and Microsoft's Apollo. Big Data systems are still developing rapidly. Nodes in Big Data systems fulfill the dual purposes of storing distributed file system parts and providing a parallel execution environment for system tasks. Big Data schedulers inherit their general design from cluster job schedulers; however, they are usually much more specialized for their framework and are intertwined with programming language features. Big Data schedulers often focus on 'locality optimization', i.e., running a given task on the node where the input data is stored or in the closest proximity to it. The design of modern scheduling strategies and algorithms is a challenging and evolving field of study. While early implementations often used simplistic approaches, such as CS, modern solutions use complex scheduling schemas. Moreover, the literature frequently mentions the need for a modular scheduler architecture (Vavilapalli et al., 2013; Lozi et al., 2016) which could customize scheduling strategies to the hardware configuration or applications.

Figure 1: Schedulers taxonomy

Table 1: Comparison of the presented schedulers, their main features and deployed scheduling algorithms

| Scheduler class | Requirements known pre-execution | Fault-tolerance mechanisms | Configuration | Common algorithms | Scheduling decision overhead | Design focus (aside from throughput) |
|---|---|---|---|---|---|---|
| OS Schedulers | No | No | Simple (compile-time and runtime parameters) | CS, CQ, MLFQ, O(n), O(1), Staircase, WFQ | very low - low | single machine; NUMA awareness; responsiveness; simple configuration |
| Cluster Schedulers | Yes¹ | Yes | Complex (configuration files and GUI) | FCFS (backfilling and gang-scheduling), SJF, Best-Fit, Scoring | low - high | distributed nodes; fairness; complex sharing policies; power consumption |
| Big Data Schedulers | Yes² | Yes | Complex (configuration files and GUI) | Best-Fit, FCFS (locality and gang-scheduling), Greedy, Fair | low - medium | specialized frameworks; data parallelism; distributed storage |

² A 'map' task processes roughly the same amount of data (the input data block size is constant), while 'reduce' task requirements are directly correlated to the size of the returned data.
6. Amir, Yair, Baruch Awerbuch, Amnon Barak, R. Sean Borgstrom, and Arie Keren (2000). An opportunity cost approach for job assignment in a scalable computing cluster. IEEE Transactions on Parallel and Distributed Systems, 11.
7. Arpaci-Dusseau, Remzi H., and Andrea C. Arpaci-Dusseau (2015). Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books.
8. Barroso, Luiz André, Jeffrey Dean, and Urs Hölzle (2003). Web search for a planet: The Google cluster architecture. IEEE Micro, 23.
9. Becchetti, L., Stefano Leonardi, Alberto Marchetti-Spaccamela, Guido Schäfer, and Tjark Vredeveld (2006). Average-case and smoothed competitive analysis of the multilevel feedback algorithm. Mathematics of Operations Research, 31(1).
10. Blagodurov, Sergey, Sergey Zhuravlev, Alexandra Fedorova, and Ali Kamali (2010). A case for NUMA-aware contention management on multicore systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM.
11. Bode, Brett, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson (2000). The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters. In Annual Linux Showcase & Conference.
12. Bonald, Thomas, Laurent Massoulié, Alexandre Proutiere, and Jorma Virtamo (2006). A queueing analysis of max-min fairness, proportional fairness and balanced fairness. Queueing Systems, 53(1).
13. Boutin, Eric, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou (2014). Apollo: Scalable and coordinated scheduling for cloud-scale computing. In OSDI, vol. 14.
14. Bu, Yingyi, Bill Howe, Magdalena Balazinska, and Michael D. Ernst (2010). HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3.
15. Bulpin, James R. (2005). Operating system support for simultaneous multithreaded processors. Technical Report UCAM-CL-TR-619, University of Cambridge, Computer Laboratory.
16. Burns, Brendan, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes (2016). Borg, Omega, and Kubernetes. Communications of the ACM, 59(5).
17. Campbell, Matthew (2017). Distributed Scheduler Hell. SREcon17 Asia/Australia, DigitalOcean, May 24, 2017.
18. Corbató, Fernando J., Marjorie Merwin-Daggett, and Robert C. Daley (1962). An experimental time-sharing system. In Proceedings of the May 1-3, 1962, Spring Joint Computer Conference. ACM.
19. Corbet, Jonathan (2004). The staircase scheduler. LWN.net, June 2, 2004.
20. Corbet, Jonathan (2007). The Rotating Staircase Deadline Scheduler. LWN.net, March 2007.
21. Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, et al. (2013). Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3).
22. Dean, Jeffrey, and Sanjay Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1).
23. Drepper, Ulrich (2007). What every programmer should know about memory. Red Hat, Inc.
24. Etsion, Yoav, and Dan Tsafrir (2005). A short survey of commercial cluster batch schedulers. School of Computer Science and Engineering, The Hebrew University of Jerusalem.
25. Foster, Ian, and Carl Kesselman (1997). Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2).
26. Foster, Ian, Carl Kesselman, and Steven Tuecke (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of High Performance Computing Applications, 15(3).
27. Gabriel, Edgar, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, et al. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. Springer, Berlin Heidelberg.
28. Gentzsch, Wolfgang (2001). Sun Grid Engine: Towards creating a compute power grid. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE.
29. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37.
30. Gog, I. (2012). Dron: An Integration Job Scheduler. Imperial College London.
31. Grimshaw, Andrew S. (1990). The Mentat run-time system: Support for medium grain parallel computation. In Proceedings of the Fifth Distributed Memory Computing Conference, vol. 2. IEEE.
32. Grimshaw, Andrew S., William A. Wulf, James C. French, Alfred C. Weaver, and Paul Reynolds Jr. (1994). Legion: The next logical step toward a nationwide virtual computer. Technical Report CS-94-21, University of Virginia.
33. Groves, Taylor, Jeff Knockel, and Eric Schulte (2009). BFS vs. CFS - Scheduler Comparison. The University of New Mexico, December 11, 2009.
34. Hamscher, Volker, Uwe Schwiegelshohn, Achim Streit, and Ramin Yahyapour (2000). Evaluation of job-scheduling strategies for grid computing. In GRID 2000.
35. Hart, Johnson M. (1997). Win32 Systems Programming. Addison-Wesley Longman Publishing Co., Inc.
36. Helland, Pat, and Ed Harris (2011). Cosmos: Big Data and Big Challenges. Stanford University, October 26, 2011.
37. Hindman, Benjamin, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica (2011). Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, vol. 11.
38. Isard, Michael, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly (2007). Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41.
39. Isard, Michael, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg (2009). Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM.
40. Jackson, David, Quinn Snell, and Mark Clement (2001). Core algorithms of the Maui scheduler. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, Berlin, Heidelberg.
41. Jones, M. Tim (2009). Inside the Linux 2.6 Completely Fair Scheduler: Providing fair access to CPUs since 2.6.23. IBM developerWorks, December 15, 2009.
42. Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, and Joseph F. Skovira (2001). Workload Management with LoadLeveler. IBM Redbooks.
43. Kay, Judy, and Piers Lauder (1988). A fair share scheduler. Communications of the ACM, 31(1).
44. Klusáček, Dalibor, and Hana Rudová (2010). The use of incremental schedule-based approach for efficient job scheduling. In Sixth Doctoral Workshop on Mathematical and Engineering Methods in Computer Science.
45. Klusáček, Dalibor, Václav Chlumský, and Hana Rudová (2013). Optimizing user oriented job scheduling within TORQUE. In The 25th International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13).
46. Kolivas, Con (2016). MuQSS version 0.114. -ck hacking, linux-4.8-ck2, October 21, 2016.
47. Krauter, Klaus, Rajkumar Buyya, and Muthucumaru Maheswaran (2002). A taxonomy and survey of grid resource management systems for distributed computing. Software: Practice and Experience, 32(2).
48. Kulkarni, Sanjeev, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja (2015). Twitter Heron: Stream processing at scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM.
49. Lamport, Leslie (1998). The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2).
50. Lang, Willis, and Jignesh M. Patel (2010). Energy management for MapReduce clusters. Proceedings of the VLDB Endowment, 3.
51. Lewis, Ian, and David Oppenheimer (2017). Advanced scheduling in Kubernetes. Kubernetes.io, Google, Inc., March 31, 2017.
52. Litzkow, Michael J., Miron Livny, and Matt W. Mutka (1988). Condor - a hunter of idle workstations. In 8th International Conference on Distributed Computing Systems. IEEE.
53. Liu, Xunyun, and Rajkumar Buyya (2017). D-Storm: Dynamic resource-efficient scheduling of stream processing applications. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). IEEE.
54. Lozi, Jean-Pierre, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, and Alexandra Fedorova (2016). The Linux scheduler: A decade of wasted cores. In Proceedings of the Eleventh European Conference on Computer Systems. ACM.
55. Marz, Nathan (2011). A Storm is coming: More details and plans for release. Twitter Engineering Blog, August 4, 2011.
56. McCullough, John C., Yuvraj Agarwal, Jaideep Chandrashekar, Sathyanarayan Kuppuswamy, Alex C. Snoeren, and Rajesh K. Gupta (2011). Evaluating the effectiveness of model-based power characterization. In USENIX Annual Technical Conference, vol. 20.
57. Moreno, Ismael Solis, Peter Garraghan, Paul Townend, and Jie Xu (2013). An approach for characterizing workloads in Google cloud to derive realistic resource utilization models. In 2013 IEEE 7th International Symposium on Service-Oriented System Engineering. IEEE.
58. Murray, Derek G., Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand (2011). CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th ACM/USENIX Symposium on Networked Systems Design and Implementation.
59. Naik, Nitin (2016). Building a virtual system of systems using Docker Swarm in multiple clouds. In 2016 IEEE International Symposium on Systems Engineering (ISSE). IEEE.
60. Pabla, Chandandeep Singh (2009). Completely fair scheduler. Linux Journal, 184.
61. Padoin, Edson L., Márcio Castro, Laércio L. Pilla, Philippe O. A. Navaux, and Jean-François Méhaut (2014). Saving energy by exploiting residual imbalances on iterative applications. In 2014 21st International Conference on High Performance Computing (HiPC). IEEE.
62. Pascual, Jose, Javier Navaridas, and Jose Miguel-Alonso (2009). Effects of topology-aware allocation policies on scheduling performance. In Job Scheduling Strategies for Parallel Processing. Springer, Berlin/Heidelberg.
63. Pinel, Frédéric, Johnatan E. Pecero, Pascal Bouvry, and Samee U. Khan (2011). A review on task performance prediction in multi-core based systems. In 2011 IEEE 11th International Conference on Computer and Information Technology (CIT). IEEE.
64. Pinheiro, Eduardo, Ricardo Bianchini, Enrique V. Carrera, and Taliver Heath (2001). Load balancing and unbalancing for power and performance in cluster-based systems. In Workshop on Compilers and Operating Systems for Low Power, vol. 180.
65. Pop, Florin, Ciprian Dobre, Gavril Godza, and Valentin Cristea (2006). A simulation model for grid scheduling analysis and optimization. In 2006 International Symposium on Parallel Computing in Electrical Engineering (PARELEC). IEEE.
66. Ray, Biplob R., Morshed Chowdhury, and Usman Atif (2017). Is High Performance Computing (HPC) ready to handle big data? In International Conference on Future Network Systems and Security. Springer, Cham.
67. Rodriguez, Maria Alejandra, and Rajkumar Buyya (2017). A taxonomy and survey on scheduling algorithms for scientific workflows in IaaS cloud computing environments. Concurrency and Computation: Practice and Experience, 29(8).
68. Sarood, Osman, Phil Miller, Ehsan Totoni, and Laxmikant V. Kale (2012). "Cool" load balancing for high performance computing data centers. IEEE Transactions on Computers, 61(12).
69. Schwarzkopf, Malte, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes (2013). Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems. ACM.
70. Shreedhar, Madhavapeddi, and George Varghese (1995). Efficient fair queueing using deficit round robin. ACM SIGCOMM Computer Communication Review, 25.
71. Singh, Ajit (2017). New York Stock Exchange: Oracle Exadata - Our Journey. Oracle, Inc., November 17, 2017.
72. Sliwko, Leszek (2018). A Scalable Service Allocation Negotiation for Cloud Computing. Journal of Theoretical and Applied Information Technology, 96.
73. Smanchat, Sucha, and Kanchana Viriyapant (2015). Taxonomies of workflow scheduling problem and techniques in the cloud. Future Generation Computer Systems, 52.
74. Smarr, Larry, and Charles E. Catlett (2003). Metacomputing. In Grid Computing: Making the Global Infrastructure a Reality.
75. Thain, Douglas, Todd Tannenbaum, and Miron Livny (2005). Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 17.
76. Torvalds, Linus (2001). Re: Just a second... The Linux Kernel Mailing List, December 15, 2001.
77. Toshniwal, Ankit, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, et al. (2014). Storm @Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.
78. Tyagi, Rinki, and Santosh Kumar Gupta (2018). A survey on scheduling algorithms for parallel and distributed systems. In Silicon Photonics & High Performance Computing. Springer, Singapore.
79. Vavilapalli, Vinod Kumar, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, et al. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing. ACM.
80. Verma, Abhishek, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes (2015). Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems. ACM.
81. White, Tom (2012). Hadoop: The Definitive Guide. O'Reilly Media, Inc.
82. Wong, C. S., I. K. T. Tan, R. D. Kumari, J. W. Lam, and W. Fun (2008). Fairness and interactive performance of O(1) and CFS Linux kernel schedulers. In 2008 International Symposium on Information Technology, vol. 4. IEEE.
83. Vohra, Deepak (2017). Scheduling pods on nodes. In Kubernetes Management Design Patterns. Apress, Berkeley, CA.
84. Vagata, Pamela, and Kevin Wilfong (2014). Scaling the Facebook data warehouse to 300 PB. Facebook, Inc., April 10, 2014.
85. Yoo, Andy B., Morris A. Jette, and Mark Grondona (2003). SLURM: Simple Linux Utility for Resource Management. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, Berlin, Heidelberg.
86. Yu, Jia, and Rajkumar Buyya (2005). A taxonomy of scientific workflow systems for grid computing. ACM SIGMOD Record, 34(3).
87. Zaharia, Matei, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica (2009). Job scheduling for multi-user MapReduce clusters. Technical Report UCB/EECS-2009-55, EECS Department, University of California, Berkeley.
88. Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica (2010). Spark: Cluster computing with working sets. HotCloud, 10.
89. Zaharia, Matei, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.
90. Zakarya, Muhammad, and Lee Gillam (2017). Energy efficient computing, clusters, grids and clouds: A taxonomy and survey. Sustainable Computing: Informatics and Systems, 14.
91. Zecevic, Petar, and Marko Bonaci (2016). Spark in Action. Manning Publications.
92. Zhang, Zhuo, Chao Li, Yangyu Tao, Renyu Yang, Hong Tang, and Jie Xu (2014). Fuxi: A fault-tolerant resource management and job scheduling system at internet scale. Proceedings of the VLDB Endowment, 7.