

SLURM computes the overall priority of each job from six factors: job age, user fairshare, job size, partition, QOS, and TRES. Each factor takes a value between 0 and 1 and is calculated as described below:

Age = The value is based on the time the job has been pending (since it became eligible), normalized against the PriorityMaxAge parameter. PriorityMaxAge is currently set to 7-00:00:00 (7 days).

JobSize = The job size factor correlates with the number of nodes or CPUs the job has requested: the larger the job, the closer the factor is to 1. Currently the contribution from this factor is negligible.

Partition = The value is calculated as the ratio of the priority of the partition requested by the job to the maximum partition priority. The maximum partition priority is currently set to 14 (for the partition interactive).

QOS = The Quality of Service factor is calculated as the ratio of the job's QOS priority to the maximum QOS priority. By default each job is submitted with the QOS "normal", which has a priority value of zero.

TRES = Not currently active; this factor should always be zero.

FairShare = This value depends on the ratio between the resources made available to each user and the amount of resources consumed by the user submitting the job; see below for details.
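As an illustration of the two simplest factors above, the following minimal Python sketch computes the Age and Partition factors from the values quoted on this page (PriorityMaxAge of 7 days, maximum partition priority of 14). The function names are hypothetical and are not part of SLURM; capping the age factor at 1 follows from the statement that all factors lie between 0 and 1.

```python
from datetime import timedelta

PRIORITY_MAX_AGE = timedelta(days=7)   # PriorityMaxAge = 7-00:00:00
MAX_PARTITION_PRIORITY = 14            # priority of the partition "interactive"


def age_factor(pending: timedelta) -> float:
    """Pending time since eligibility, normalized to PriorityMaxAge and capped at 1."""
    return min(pending / PRIORITY_MAX_AGE, 1.0)


def partition_factor(partition_priority: int) -> float:
    """Ratio of the requested partition's priority to the maximum partition priority."""
    return partition_priority / MAX_PARTITION_PRIORITY


print(age_factor(timedelta(days=3, hours=12)))  # 0.5
print(age_factor(timedelta(days=10)))           # 1.0 (capped)
print(partition_factor(14))                     # 1.0
```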


Each of these factors is then scaled by a custom multiplier, and the results are summed to obtain the overall JobPriority value according to the formula:

JobPriority = Age * PriorityWeightAge +
              FairShare * PriorityWeightFairShare +
              JobSize * PriorityWeightJobSize +
              Partition * PriorityWeightPartition +
              QOS * PriorityWeightQOS +
              TRES * PriorityWeightTRES


where the multipliers are currently set to the values:

PriorityWeightAge = 5000
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 100
PriorityWeightPartition = 4000
PriorityWeightQOS = 20000
PriorityWeightTRES = (null)
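
The weighted sum can be reproduced with a few lines of Python. This is only a sketch: the function and dictionary names are hypothetical, the unset PriorityWeightTRES is treated as 0, and the example factors (age 1.0, fairshare 0.4061, partition 12/14) are assumed values chosen to resemble the first row of the sprio listing further down this page.

```python
# Multipliers as listed above; PriorityWeightTRES is unset, treated here as 0 (assumption).
WEIGHTS = {
    "age": 5000,
    "fairshare": 10000,
    "jobsize": 100,
    "partition": 4000,
    "qos": 20000,
    "tres": 0,
}


def job_priority(**factors: float) -> float:
    """Sum of factor * weight; factors that are not passed count as 0."""
    total = 0.0
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} must be between 0 and 1, got {value}")
        total += WEIGHTS[name] * value
    return total


# Assumed factors: fully aged job, fairshare 0.4061, partition factor 12/14.
print(job_priority(age=1.0, fairshare=0.4061, partition=12 / 14))  # about 12489.6
```

Because PriorityWeightQOS is the largest multiplier, a job submitted with a non-zero QOS priority can outweigh the age and fairshare contributions combined.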

FairShare Calculation

Each user's fairshare is currently calculated as

F = 2**(-U/(S*d))

where: 

S is the normalized number of shares made available to each user. In our current setup all users get the same number of raw shares.

U is the normalized usage. It is calculated as U = Uh / Rh, where Uh is the user's historical usage subject to the half-life decay and Rh is the total historical usage across the cluster, also subject to the half-life decay.

Uh and Rh are calculated as

Uh = Ucurrent_period + (0.5 * Ulast_period) + ((0.5**2) * Uperiod-2) + ...

Rh = Rcurrent_period + (0.5 * Rlast_period) + ((0.5**2) * Rperiod-2) + ...

and the periods are based on the PriorityDecayHalfLife time interval, currently set to 6:00:00 (6 hours).  

Usage is currently calculated as: Allocated_Ncpus * elapsed_seconds + Allocated_Mem_GB * 0.0625 * elapsed_seconds. With the 0.0625 weight, 16 GB of allocated memory counts the same as one allocated CPU.

d is the FairShareDampeningFactor. It is used to reduce the impact of resource consumption on the fairshare value and to account for the ratio of active users to total users. The value is currently set to 10 and is changed dynamically as needed.
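
The usage terms defined above can be sketched in Python as follows. The function names are hypothetical, and the per-period usage lists (most recent period first) are made-up inputs for illustration; in practice SLURM tracks these values internally.

```python
MEM_GB_WEIGHT = 0.0625  # weight applied to allocated memory in GB


def raw_usage(ncpus: int, mem_gb: float, elapsed_seconds: float) -> float:
    """Allocated_Ncpus*elapsed_seconds + Allocated_Mem_GB*0.0625*elapsed_seconds."""
    return ncpus * elapsed_seconds + mem_gb * MEM_GB_WEIGHT * elapsed_seconds


def decayed_usage(per_period: list[float]) -> float:
    """Historical usage: the usage k periods ago is weighted by 0.5**k.

    per_period[0] is the current period, per_period[1] the last period, etc.
    """
    return sum(usage * 0.5 ** k for k, usage in enumerate(per_period))


def normalized_usage(user_periods: list[float], cluster_periods: list[float]) -> float:
    """U = Uh / Rh; returns 0 when the cluster has no recorded usage."""
    rh = decayed_usage(cluster_periods)
    return decayed_usage(user_periods) / rh if rh else 0.0


print(raw_usage(ncpus=4, mem_gb=16, elapsed_seconds=3600))  # 18000.0
print(decayed_usage([100, 100, 100]))                       # 175.0
print(normalized_usage([100, 0], [1000, 1000]))             # 100 / 1500
```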

The initial fairshare value (with zero normalized usage) for each user is equal to 1. If a user is consuming exactly their share of the available resources (U = S), the formula gives F = 2**(-1/d): 0.5 with d = 1, and about 0.93 with the current value d = 10.
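
To see how the dampening factor changes the result, the formula F = 2**(-U/(S*d)) can be evaluated for a few inputs. This is a minimal sketch with a hypothetical function name; the share value 0.001 is an arbitrary example.

```python
def fairshare_factor(norm_usage: float, norm_shares: float, dampening: float = 10.0) -> float:
    """F = 2**(-U / (S * d)) with U = norm_usage, S = norm_shares, d = dampening."""
    return 2.0 ** (-norm_usage / (norm_shares * dampening))


share = 0.001  # arbitrary example value for S

print(fairshare_factor(0.0, share))                   # 1.0: no recorded usage
print(fairshare_factor(share, share, dampening=1.0))  # 0.5: usage equals share, d = 1
print(fairshare_factor(share, share))                 # about 0.933: usage equals share, d = 10
print(fairshare_factor(10 * share, share))            # 0.5: ten times the share, d = 10
```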

It takes approximately 48 hours (eight half-lives of 6 hours) for a fully depleted fairshare to return from 0 to 1, assuming the user accumulates no additional usage during those ~48 hours.
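
The 48-hour figure can be related to the 6-hour half-life with a short calculation. The sketch below assumes the historical usage decays smoothly with that half-life and that no new usage is added; the function name is hypothetical.

```python
HALF_LIFE_HOURS = 6.0  # PriorityDecayHalfLife = 6:00:00


def remaining_usage_fraction(hours_idle: float) -> float:
    """Fraction of the historical usage still counted after hours_idle without new usage."""
    return 0.5 ** (hours_idle / HALF_LIFE_HOURS)


print(remaining_usage_fraction(6))   # 0.5
print(remaining_usage_fraction(48))  # 0.00390625, i.e. 1/256
```

After eight half-lives less than 0.4% of the original usage is still counted, which is why the fairshare value is effectively back to 1 by then.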


Two useful commands for viewing the priority of pending jobs and the fairshare values are sprio and sshare:




login02:~ sprio -l
         JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS        NICE                 TRES
        6444966    uid13      12489       5000       4061          0       3429          0           0
        6445056    uid13      13061       5000       4061          0       4000          0           0
        6445068    uid13      10775       5000       4061          0       1714          0           0
        6445078    uid13      10204       5000       4061          0       1143          0           0
        6445083    uid13      10204       5000       4061          0       1143          0           0
        6586939    uid45       6583       4812         57          0       1714          0           0
        6586940    uid45       6583       4812         57          0       1714          0           0
        6586941    uid45       6583       4812         57          0       1714          0           0
        6586942    uid45       6583       4812         57          0       1714          0           0
        6586943    uid45       6583       4812         57          0       1714          0           0
        6586944    uid45       6583       4812         57          0       1714          0           0
        6586945    uid32       6583       4812         57          0       1714          0           0
        6586946    uid32       6583       4812         57          0       1714          0           0
        6586947    uid32       6583       4812         57          0       1714          0           0
        6586948    uid32       6583       4812         57          0       1714          0           0




login02:~ sshare -u $USER -U
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
rccg                      rp189          1    0.000787         320      0.000002   0.999832


Partition Priority Tiers

The scheduler first tries to dispatch jobs in the partition interactive, then jobs in the partition priority, and finally jobs submitted to all remaining partitions. As a consequence, interactive and priority jobs will most likely be dispatched first, even if they have a lower overall priority than jobs pending in other partitions (short, medium, long, mpi, etc.).
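
The ordering described above can be pictured as a two-level sort: first by partition tier, then by overall job priority. The sketch below is a simplification that ignores resource availability and backfill; the rank numbers, the job records, and the function name are hypothetical.

```python
# Hypothetical ranks: a lower rank is considered first by the scheduler.
TIER_RANK = {"interactive": 0, "priority": 1}
DEFAULT_RANK = 2  # all remaining partitions (short, medium, long, mpi, ...)


def dispatch_order(jobs: list[dict]) -> list[dict]:
    """Order jobs by partition tier first, then by descending overall priority."""
    return sorted(
        jobs,
        key=lambda job: (TIER_RANK.get(job["partition"], DEFAULT_RANK), -job["priority"]),
    )


pending = [
    {"id": 1, "partition": "short", "priority": 13000},
    {"id": 2, "partition": "interactive", "priority": 6000},
    {"id": 3, "partition": "priority", "priority": 9000},
    {"id": 4, "partition": "interactive", "priority": 7000},
]

print([job["id"] for job in dispatch_order(pending)])  # [4, 2, 3, 1]
```

Job 1 has the highest overall priority but is considered last because its partition belongs to the lowest tier.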



Backfill scheduling

Low-priority jobs may be dispatched before high-priority jobs only if doing so does not delay the expected start time of the high-priority jobs and if the resources required by the low-priority jobs are free and idle.
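
The two conditions can be expressed as a simple check. This is a deliberately conservative sketch, not the scheduler's actual algorithm: it assumes a single pool of CPUs, assumes the low-priority job would occupy resources the high-priority job is waiting for, and uses the job's requested time limit as its runtime. All names are hypothetical.

```python
def can_backfill(
    job_cpus: int,
    job_time_limit_s: int,
    free_cpus: int,
    seconds_until_high_priority_start: int,
) -> bool:
    """True if the low-priority job fits in idle CPUs and ends before the
    expected start of the waiting high-priority job."""
    fits_now = job_cpus <= free_cpus
    ends_in_time = job_time_limit_s <= seconds_until_high_priority_start
    return fits_now and ends_in_time


# A 1-hour, 4-CPU job with 8 idle CPUs and 2 hours until the high-priority job starts.
print(can_backfill(4, 3600, 8, 7200))    # True
# The same job with a 3-hour time limit would delay the high-priority job.
print(can_backfill(4, 10800, 8, 7200))   # False
```

Under these assumptions, requesting a shorter and more accurate time limit makes a job more likely to be backfilled.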






