I wrote this to facilitate my own work. All mistakes are my own.
Feedback and error corrections are always appreciated.
Grid Engine 6 is a distributed resource management (DRM) software layer developed
and distributed under an open source license. A commercial version is sold
by Sun Microsystems as "N1 Grid Engine 6". The project lives at http://gridengine.sunsource.net.
It should also be noted that managing policies is one of the Grid Engine
admin tasks that is often easier and more straightforward when done via the
graphical "qmon" utility. The author of this document tends to work
remotely on clusters via low bandwidth SSH connections or VPN setups that
do not allow X11 traffic. The purpose of this document is to highlight the
specific command-line methods which sometimes can be under-documented in the
official Grid Engine manuals.
Configuration Goals & Resource Allocation Policy
(1) Create a Functional
Share policy using command-line Grid Engine
tools to enable resource allocation on a percentage basis between
departments.
(2) When the cluster is idle, anyone and any department can use cluster resources.
(3) When the cluster is busy, each Department gets a configured percentage of
available cluster resources.
(4) When contention for resources exists on a busy cluster, running jobs
will not be killed or otherwise manipulated. Resource allocation will
be done only within the pending job list, by bumping up the priority
of pending jobs belonging to departments with higher entitlements.
Essentially we can't muck with running jobs because we have no clean way of
suspending, checkpointing or moving them.
(5) Users within each department should be considered equal from a resource
allocation standpoint.
Desired cluster resource allocation mix:
unassigned: 18% of cluster resources
Dept_A : 18% of cluster resources
Dept_B : 18% of cluster resources
Dept_C : 11% of cluster resources
Dept_D : 35% of cluster resources
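The percentage mix maps directly onto the 100,000-ticket functional share
pool used later in this document. A throwaway shell sketch of the arithmetic
(the loop and variable names are ours, not part of any SGE tooling):

```shell
# Map the desired percentage mix onto the 100,000-ticket functional pool.
POOL=100000
for entry in unassigned:18 Dept_A:18 Dept_B:18 Dept_C:11 Dept_D:35; do
  dept=${entry%:*}   # name before the colon
  pct=${entry#*:}    # percentage after the colon
  echo "$dept => $((POOL * pct / 100)) tickets"
done
```

This yields 18,000 tickets each for unassigned, Dept_A and Dept_B, 11,000
for Dept_C and 35,000 for Dept_D, which are the fshare numbers used later
in this document.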
In an ideal world, share-tree is the policy that most people probably should
be using. It nicely remembers past usage and works to average out usage such
that eventually entitlements trend back over time to being in harmony with
the configured policies. Users and groups with little past usage are compensated
with higher resource allocation when they start submitting work. Heavy cluster
users will find their current entitlements dropping so the under-represented
users and groups can get up to speed more rapidly. It works, and it is fair.
Sadly though, even though users and managers understand share-tree when the
method is explained to them, they tend to forget these details when they notice
their jobs pending in the wait list. Users who have been told to expect a
50% entitlement to cluster resources get frustrated when they launch their
jobs and don't get to take over half of the cluster instantly. Explaining
to them that the 50% entitlement is a goal that the scheduler is working to
meet "as averaged over time..." falls upon deaf ears. Heavy
users get upset to learn that their current entitlement is being "penalized"
because their past usage greatly exceeded their allotted share. Cluster admins
then spend far too much time attempting to "prove" to the user community
that they are not getting shortchanged.
For a cluster administrator, it is often less hassle to dump the share-tree
and convert to a functional policy which has no concept or memory of past
cluster usage and simply tries to meet resource allocation policies each time
a scheduling run is performed. The resource allocation is far more obvious
and users can watch the pending list to see how the scheduler bumps jobs up
in the queue according to the configured entitlements.
I've given up using share-tree at customer sites and now pretty much use
the functional policy exclusively.
Implementation Step by Step
- Functional share policy activated within SGE scheduler
- 100,000 functional share tickets added to the pool
- Algorithm adjusted to make Department membership more important
- Algorithm adjusted to make user slightly more important
- Algorithm adjusted to make project and job less important
- User objects created within grid engine matching given user list.
- Assign arbitrary but equal number of user tickets to each user so they
are each treated equally within department.
- Departments created within grid engine matching given list
- Assign tickets to departments in proportional value to the total number
of available configured tickets.
Steps 1,2: Activate functional share resource allocation
The functional share policy is activated by adding tickets to the functional
share pool. The pool is defined as weight_tickets_functional
in the Grid Engine scheduler configuration.
Run the command "qconf -msconf" to edit the scheduler configuration and
assign 100000 to the value of weight_tickets_functional.
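A sketch of the relevant scheduler configuration line after the change (the
rest of the "qconf -msconf" output is omitted here):

```
weight_tickets_functional         100000
```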
Steps 3,4,5: Adjust algorithm weights for Department
The functional share algorithm can assign relative weight or importance values
to "user", "project", "department" and "job".
In the default configuration these values are all treated equally. The sum
of these 4 weights must add up to "1". The defaults are defined
in the scheduler configuration:
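Since all four weights default to equal values that must sum to 1, the
relevant scheduler configuration lines look like this:

```
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
```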
We want to make "Department" more important than anything else
while also slightly raising the importance of "user" because we
are going to give out some functional share tickets to users as well (to enforce
user equality within a department).
The new values (changed via "qconf -msconf") are:
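One possible mix that makes Department dominant while slightly raising user
(the exact values here are illustrative; any set summing to 1 will work):

```
weight_user                       0.300000
weight_project                    0.050000
weight_department                 0.600000
weight_job                        0.050000
```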
Update: Stephan Grell pointed out a huge weakness in the
suggested configuration if one only adjusts the parameters shown above.
By ignoring the other weight_*
parameters (weight_ticket, weight_priority,
weight_urgency, etc.) we enable a scenario by which a user can use
the POSIX Priority policy to bypass the intended resource allocation mix.
We need to either disable those mechanisms entirely or make them "less
important" within the scheduler than the functional ticketing scheme.
"...In your described setting a "qsub
-p 1000" or or a "qsub
-pe make 10" will invert your fair scheduling policy. If your
scheduling should only be based on the functional tickets, you need to
If you want to support the posix priority and/or urgency, their weight
values have to be a lot smaller, than the weight_ticket. Such as:
This allows a user to set the priorities within his jobs and he will not exceed his percentage from the ticket setup. The weight parameters are difficult to handle and can completely compromise the ticket configuration."
Stephan's suggestions have been taken into consideration. Since we want users
to be able to use the Priority mechanism to prioritize their own pending jobs
we are going to make changes to the scheduler configuration that keep the
weight_urgency and weight_priority
mechanisms enabled but "less important" overall than the functional
ticket policy.
Verified by running the command "qconf
-ssconf" to view current config:
Steps 6,7: Creating users
The command "qconf -auser"
is run for each new username. We want to create user entries within Grid Engine
where each user has been allocated 100 functional share tickets. Giving the
users an equal number of shares should ensure that users are treated equally
within Department groups when it comes to resource entitlements.
The default user values (notably fshare, which ships as 0) need to be
changed so that fshare is 100.
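A sketch of the user object before and after the change (field names as
reported by "qconf -suser"; default_project is left at NONE):

```
# default user object
name             userA
oticket          0
fshare           0
default_project  NONE

# after assigning 100 functional share tickets
name             userA
oticket          0
fshare           100
default_project  NONE
```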
I threw together a simple perl script to automate the process of adding users
with 100 functional share tickets. The script writes a template
to a temp location and then calls "qconf
-Auser /path-to-template" - Grid Engine will read
in the file and accept the new settings.
This is the script:
use POSIX qw(tmpnam);
my $user = shift or die "usage: $0 username\n";
my $tmp  = POSIX::tmpnam();
print "User=($user), Configfile=($tmp)\n";
open(TPL, ">$tmp") or die "cannot write $tmp: $!";
print TPL "name $user\noticket 0\nfshare 100\ndefault_project NONE\n";
close(TPL);
system("qconf -Auser $tmp");
This is what the script looks like when run for several users:
root# ./create-sge-user.pl userA
added "userA" to user list
root# ./create-sge-user.pl userB
added "userB" to user list
root# ./create-sge-user.pl userC
added "userC" to user list
root# ./create-sge-user.pl userD
added "userD" to user list
Steps 8,9: Creating and defining Department lists
Within Grid Engine DEPARTMENTS are considered to be userlists similar to access
control lists. To create a new userlist of
type department one would do:
For our example department "Dept_A":
qconf -mu Dept_A
And we set the values to:
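The userset would end up looking roughly like this (only "entries" grows as
more users are added; exact field order may vary by SGE version):

```
name     Dept_A
type     DEPT
fshare   18000
oticket  0
entries  userA
```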
The important values are:
"type" -- need to make this a DEPT rather than
"fshare" -- 18000 is 18% of the 100,000 available
functional share tickets
"entries" -- userA is the first configured member
of the department named "Dept_A". Additional usernames are comma-separated.
To show the contents of a departmental user list:
root# qconf -su Dept_A
To show a list of all userset objects:
cat:~ root# qconf -sul
Note that the configuration goals called for roughly 18% of cluster resources
unassigned and available for general use.
This is what the pre-existing Department object "defaultdepartment"
is for. Any user not assigned to a given Department will be considered for
scheduling purposes to be a member of the "defaultdepartment" group.
Although this document deals with the command-line methods for manipulating
the Department based Functional Share policy a screenshot is available showing
what these settings would look like when viewed via the graphical 'qmon' program.
The screenshot is quite large (~ 324KB).