Discussion:
[DRMAA-WG] DRMAA2 comments
Nadav Brandes
2011-04-29 11:10:39 UTC
Hi guys,

My team and I have finished going over the latest draft of DRMAA2, and we
have some comments, suggestions and questions about it.
We want to hear your opinion about these issues.


1. Given a *jobId*, you can easily get its *Job* object using the method
*JobSession::getJobs(in JobInfo filter)* by passing as the filter a
*JobInfo* with the wanted *jobId* (a method *JobSession::getJob(string jobId)*
might be an easier shorthand, but this is a different issue). *But*, given a
*jobArrayId*, there is no way to get its *JobArray* object. This is a serious
limitation of DRMAA: it keeps users from using the *JobArray* feature the way
it is used in most batch systems. I think a similar method
*JobSession::getJobArrays(in JobArrayInfo filter)* should be added, or at
least a method *JobSession::getJobArray(string jobArrayId)*.
2. A very important feature that many batch systems support is the
ability to limit the number of jobs in a job array that may run
simultaneously (in LSF it's called "Slot Limit" and you can read about it at
http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618).
I think that DRMAA can also support this feature by:
   1. Changing the method *JobSession::runBulkJobs* so that it also accepts
   an optional argument *in long slotLimit* (if it's *UNSET*, no slot
   limit is assigned to the new job array).
   2. Adding a new method *JobArray::changeSlotLimit(in long slotLimit)*.
3. There are some parameters that most batch systems allow changing for
already submitted jobs, but DRMAA doesn't support changing them. For
example, DRMAA doesn't let you change the priority or queue of an already
submitted job. I think that methods *Job::changePriority(in long
priority)* and *Job::changeQueue(in string queueName)* should be added.
4. Many batch systems allow rerunning existing jobs. Although DRMAA has a
field called *rerunnable* in the *JobTemplate* struct, it doesn't allow
users to actually rerun jobs. Maybe a method *Job::rerun()* could be
added to DRMAA.
5. I have a question: does DRMAA support Generic Resources? For example,
if some nodes of my cluster have GPU cards and I want to submit jobs that
require a certain number of GPUs, I would like the batch system to manage
that for me, as many batch systems can.
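
To make the slot limit semantics from item 2 concrete, here is a minimal Java simulation. It is only a sketch under stated assumptions: the class `SlotLimitSim`, its `tick` method, and the use of a non-positive value as a stand-in for *UNSET* are invented for illustration and are not part of the DRMAA2 draft or of LSF.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical simulation of a job array whose tasks are throttled by a
// slot limit. None of these names come from the DRMAA2 draft.
class SlotLimitSim {
    private final Deque<Integer> pending = new ArrayDeque<>();
    private int running = 0;
    private int slotLimit;       // <= 0 means "no limit" (stand-in for UNSET)
    private int maxObserved = 0;

    SlotLimitSim(int taskCount, int slotLimit) {
        for (int i = 1; i <= taskCount; i++) {
            pending.add(i);
        }
        this.slotLimit = slotLimit;
    }

    // Mirrors the proposed JobArray::changeSlotLimit.
    void changeSlotLimit(int newLimit) {
        slotLimit = newLimit;
    }

    // One scheduling step: mark `finished` running tasks as done, then
    // start pending tasks until the slot limit is reached.
    void tick(int finished) {
        running -= Math.min(finished, running);
        while (!pending.isEmpty() && (slotLimit <= 0 || running < slotLimit)) {
            pending.poll();
            running++;
        }
        maxObserved = Math.max(maxObserved, running);
    }

    int running() { return running; }
    int pending() { return pending.size(); }
    int maxObserved() { return maxObserved; }
}
```

With ten tasks and a limit of three, the first `tick(0)` starts three tasks and leaves seven pending; raising the limit through `changeSlotLimit` lets the next tick start more.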


Thank you for reading all of this. I would very much like to hear what you
think about each of the points above.

Regards,
Nadav
Peter Tröger
2011-04-29 12:53:04 UTC
Hi Nadav,

thanks (again) for your in-depth analysis. Here are my comments.
Given a jobId, you can easily get its Job object using the method JobSession::getJobs(in JobInfo filter) by passing as the filter a JobInfo with the wanted jobId (a method JobSession::getJob(string jobId) might be an easier shorthand, but this is a different issue). But, given a jobArrayId, there is no way to get its JobArray object. This is a serious limitation of DRMAA: it keeps users from using the JobArray feature the way it is used in most batch systems. I think a similar method JobSession::getJobArrays(in JobArrayInfo filter) should be added, or at least a method JobSession::getJobArray(string jobArrayId).
Symmetry is always good, I see no problem with adding "JobSession::getJobArrays(in JobArrayInfo filter)".
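
The symmetry between a filter-based lookup and an id-based shorthand can be sketched in Java as follows. This is an in-memory illustration only; `SessionSketch`, `JobArrayHandle` and `JobArrayInfoFilter` are hypothetical stand-ins, and a `null` field plays the role of an UNSET filter attribute.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; real DRMAA2 types would carry many more fields.
class JobArrayInfoFilter {
    String jobArrayId;   // null plays the role of UNSET: matches everything
}

class JobArrayHandle {
    final String jobArrayId;

    JobArrayHandle(String jobArrayId) {
        this.jobArrayId = jobArrayId;
    }
}

class SessionSketch {
    private final Map<String, JobArrayHandle> arrays = new LinkedHashMap<>();

    void register(JobArrayHandle array) {
        arrays.put(array.jobArrayId, array);
    }

    // Counterpart of getJobs(filter): every job array matching the filter.
    List<JobArrayHandle> getJobArrays(JobArrayInfoFilter filter) {
        List<JobArrayHandle> result = new ArrayList<>();
        for (JobArrayHandle array : arrays.values()) {
            if (filter.jobArrayId == null || filter.jobArrayId.equals(array.jobArrayId)) {
                result.add(array);
            }
        }
        return result;
    }

    // Shorthand built on the filter method; returns null when nothing matches.
    JobArrayHandle getJobArray(String jobArrayId) {
        JobArrayInfoFilter filter = new JobArrayInfoFilter();
        filter.jobArrayId = jobArrayId;
        List<JobArrayHandle> hits = getJobArrays(filter);
        return hits.isEmpty() ? null : hits.get(0);
    }
}
```

Building the shorthand on top of the filter method keeps a single matching rule, so the two lookups cannot drift apart.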
Change the method JobSession::runBulkJobs so it will also accept an optional argument in long slotLimit (if it's UNSET then no slot limit will be assigned to the new job array).
Add a new method JobArray::changeSlotLimit(in long slotLimit)
This is what JobTemplate::maxSlots is expected to provide.
There are some parameters that most batch systems allow changing for already submitted jobs, but DRMAA doesn't support changing them. For example, DRMAA doesn't let you change the priority or queue of an already submitted job. I think that methods Job::changePriority(in long priority) and Job::changeQueue(in string queueName) should be added.
We discussed the general possibility of changing the attributes of running jobs. There are tons of issues with making such a concept available in a generalized API. One reason is that the DRM system may silently change attributes at queuing time; Grid Engine is one example. In such a case, you cannot know what job attribute state you are actually changing, so you need better monitoring, and so on. The possibilities and supported attributes for online changes also vary widely between systems.
For this reason, DRMAA intentionally leaves out the whole idea, at least until enough people complain ;-)
Many batch systems allow rerunning existing jobs. Although DRMAA has a field called rerunnable in the JobTemplate struct, it doesn't allow users to actually rerun jobs. Maybe a method Job::rerun() could be added to DRMAA.
The rerunnable flag is intended to allow the DRM system itself to re-run a job. We never had a proposal for such functionality from the user's perspective. What would be the expected job state flow in this case? And what is the use case for such functionality if you don't have interactive job support?
I have a question. Does DRMAA support Generic Resources? (for example, if I have a cluster where some of its nodes have GPU cards, and I want to submit jobs that require a certain amount of GPUs, so I would like the batch system to manage it for me, as many batch systems know how to manage).
Requesting non-standardized resource types and configurations is expected to be covered by the "jobCategory" concept. Examples of job categories are different MPI libraries, OpenMP environments, Java environments, or GPU environments. We hope to organize a community-based list of recommended job category names, which would raise the chances of portability for such job submission applications. A later DRMAA2 version could then integrate these names as an official part of the spec.
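
One way to picture the job category idea is a site-side lookup table that turns a portable category name into whatever the local DRM system needs. The Java sketch below is an assumption-laden illustration: `JobCategoryResolver` is a hypothetical class, and the category names and option strings are placeholders, not values defined by DRMAA or by any batch system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a site translates portable job category names into
// its own native submission options. Names and option strings are invented.
class JobCategoryResolver {
    private final Map<String, String> nativeOptions = new HashMap<>();

    // Called by the site administrator to define what a category means locally.
    void define(String category, String options) {
        nativeOptions.put(category, options);
    }

    // Returns the site-specific options for a category, if the site defines it.
    Optional<String> resolve(String jobCategory) {
        return Optional.ofNullable(nativeOptions.get(jobCategory));
    }
}
```

An application would only set the category name on its job template; a site that does not define the name returns an empty result, which is where an agreed list of recommended names would help portability.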

Best regards,
Peter.

Nadav Brandes
2011-05-02 17:53:56 UTC
Thanks for your quick response. Here is my response:

1. Awesome :)
2. As I understood from the spec, *JobTemplate::maxSlots* relates to the
number of cores requested on one machine for a single job, and not to the
number of jobs that may run at the same time. Am I wrong?
3. Fair enough. Let's hope that more people will complain about it ;-)
4. When working with big clusters and distributing complicated jobs,
there are often cases where jobs fail arbitrarily for some temporary
reason such as network problems. For example, if one submits 1,000 jobs,
then 10 of them might just randomly fail and have to be rerun for the
whole job array to finish successfully. If DRMAA had a rerun
functionality, then the user could do something like this (the example is
in Java):

       for (Job job : myJobArray.jobs) {
           if (job.getState() == JobState.FAILED) {
               // Will change the job state back to QUEUED, and later on to
               // RUNNING (the job will run again from the beginning)
               job.rerun();
           }
       }
5. *Sounds great.*
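
The state flow implied by the proposed rerun call in item 4 can be written out as a small self-contained Java sketch. Everything here is hypothetical: `Job::rerun()` does not exist in the DRMAA2 draft, and `SketchJob`, `SketchState` and `RerunSketch` are invented names that only model the FAILED to QUEUED transition described above.

```java
import java.util.List;

// Hypothetical types: Job::rerun() does not exist in the DRMAA2 draft.
enum SketchState { QUEUED, RUNNING, DONE, FAILED }

class SketchJob {
    private SketchState state;

    SketchJob(SketchState initial) {
        state = initial;
    }

    SketchState getState() {
        return state;
    }

    // Proposed behaviour: a failed job goes back to QUEUED and starts over.
    void rerun() {
        if (state != SketchState.FAILED) {
            throw new IllegalStateException("only FAILED jobs can be rerun, was " + state);
        }
        state = SketchState.QUEUED;
    }
}

class RerunSketch {
    // Reruns every failed job and reports how many were requeued.
    static int rerunFailed(List<SketchJob> jobs) {
        int requeued = 0;
        for (SketchJob job : jobs) {
            if (job.getState() == SketchState.FAILED) {
                job.rerun();
                requeued++;
            }
        }
        return requeued;
    }
}
```

Note that this model makes FAILED a non-terminal state, since a job can leave it again through `rerun()`.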


Regards,
Nadav

Mariusz Mamoński
2011-05-04 19:07:01 UTC
Post by Nadav Brandes
Awesome :)
As I understood from the spec, JobTemplate::maxSlots relates to the number
of cores requested on one machine for a single job, and not to the number of
jobs that may run at the same time. Am I wrong?
No, you are right (except for the "one machine" part, and slots do not
always mean cores).
Post by Nadav Brandes
Fair enough. Let's hope that more people will complain about it ;-)
When working with big clusters and distribution of complicated jobs, there
are often cases when jobs might arbitrarily fail, for any temporary reason
such as network problems. For example, if one submits 1,000 jobs, then 10 of
them might just randomly fail, and have to be rerun in order to finish the
whole job-array running successfully. If DRMAA had a rerun functionality,
then the user could do something like this: (The example is in Java)
    for (Job job : myJobArray.jobs) {
        if (job.getState() == JobState.FAILED) {
            // Will change the job state back to QUEUED, and later on to
            // RUNNING (the job will run again from the beginning)
            job.rerun();
        }
    }
This would require transitions from FAILED to other states, so FAILED
would no longer be terminal, and that starts an avalanche... ;-)

But the DRMS can be configured to do that on behalf of the user automagically.
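
For comparison, a client can retry failed work while FAILED stays terminal by submitting a brand-new job for each failed one. The Java sketch below only illustrates that idea; `ResubmitSketch`, `RetryJob` and the `submit` stand-in are hypothetical and do not correspond to DRMAA calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: FAILED stays terminal; a retry is a brand-new job.
class RetryJob {
    final String id;
    final String command;
    final boolean failed;

    RetryJob(String id, String command, boolean failed) {
        this.id = id;
        this.command = command;
        this.failed = failed;
    }
}

class ResubmitSketch {
    private int nextId = 1;

    // Stand-in for a real submission call; it only hands out a fresh id.
    RetryJob submit(String command) {
        return new RetryJob("job-" + (nextId++), command, false);
    }

    // Submits a new job for each failed one. The failed jobs are not modified.
    List<RetryJob> resubmitFailed(List<RetryJob> finished) {
        List<RetryJob> retries = new ArrayList<>();
        for (RetryJob job : finished) {
            if (job.failed) {
                retries.add(submit(job.command));
            }
        }
        return retries;
    }
}
```

The trade-off is that each retry gets a new job id, so the application has to track which new job replaces which failed one.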
Post by Nadav Brandes
Sounds great.
Regards,
Nadav
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
Mariusz
Mariusz Mamoński
2011-05-04 18:58:34 UTC
Post by Nadav Brandes
A very important feature that many batch systems support is the ability to
limit the number of jobs in a job array that may run simultaneously (in LSF
it's called "Slot Limit" and you can read about it at
http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618).
Change the method JobSession::runBulkJobs so it will also accept an optional
argument in long slotLimit (if it's UNSET then no slot limit will be
assigned to the new job array).
Torque also supports this feature. What about Grid Engine?
--
Mariusz
Daniel Gruber
2011-05-04 19:03:51 UTC
Post by Mariusz Mamoński
Post by Nadav Brandes
A very important feature that many batch systems support is the ability to
limit the number of jobs in a job array that may run simultaneously (in LSF
it's called "Slot Limit" and you can read about it at
http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618).
Change the method JobSession::runBulkJobs so it will also accept an optional
argument in long slotLimit (if it's UNSET then no slot limit will be
assigned to the new job array).
Torque also supports this feature. What about Grid Engine?
Grid Engine supports limiting the maximum number of *tasks* running
at the same time. That's somewhat different.

Daniel