Discussion:
[DRMAA-WG] DRMAA2
Nadav Brandes
2010-12-13 18:47:43 UTC
Permalink
Hello,

I have a few questions about version 2 of DRMAA. My main question is: when is
this version planned to be released?
I'm not sure whether I have reached the right address, so if I'm in the wrong
place, please let me know.

Thanks in advance,
Nadav
Daniel Templeton
2010-12-13 19:19:50 UTC
Permalink
Funny you should ask. Just this weekend we made an attempt to flesh
out the IDL in the wiki so that we can turn it into a full specification
document. The goal is to do that by the end of the year, or perhaps
more realistically, by the end of January.

Daniel
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
Nadav Brandes
2010-12-14 18:24:28 UTC
Permalink
I'm glad to hear that; I was concerned that it might take longer.

Now, a little belatedly, allow me to introduce myself.

My name is Nadav Brandes, and I work for an HPC organization. We currently
work with SGE and LSF, and soon we also plan to work with MOAB (wrapping
Torque).

Recently we have taken a major interest in DRMAA, which we believe can make our
lives with distributed resource managers a lot easier.

A few weeks ago, I downloaded a DRMAA implementation for SGE and tested it
for a while. I must admit that I was a little disappointed with the limited
interface and possibilities offered by this version of DRMAA.

I have now seen the documentation of the planned interface for DRMAA2, and I find
it to be a great improvement, and very useful for my organization. I am
truly eager to try it, and have some more questions about its release:

- Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimate of when
that will happen, and when I will be able to download a trial version of it?
- Is it still possible to suggest ideas that we have about the interface
of DRMAA2? If so, how is it done? Is it customary to share ideas in this
forum, or do you prefer it to be done through the wiki?



Thanks,

Nadav



Nadav Brandes
2011-01-12 16:03:12 UTC
Permalink
Hello everyone,

I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).



If it's not too late, we have a few questions/suggestions:

- Can one get a 'Job' object representing a job that was already submitted,
given only the job index (as an integer)?

- It seems that the 'JobInfo' interface is missing a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if one only has a 'Job' object in hand,
and not the 'JobTemplate'?

- Does DRMAA support the job-array feature (meaning submitting a group
of tasks as one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have the 'runBulkJobs' feature,
which submits a sequence of jobs all together, but it returns a sequence of
'Job' objects, and not a single job with a single ID.

- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could specify a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.

- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such a field.

- Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (some of which might be out of order),
jobs might fail randomly, without any apparent reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers, like LSF, allow). Such a 'rerun()' method could be added to
the 'Job' interface.

- Modern schedulers (like Moab and LSF) support advanced features for
managing memory, cores, and general resources
(like GPUs). In general, this means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). The scheduler then knows how to schedule the jobs so
that each running job has all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.



That is it for now; thanks for reading.

I would like to hear what you think.



Best Regards,

Nadav


2010/12/21 Peter Tröger <peter at troeger.eu>
Hi Nadav,
Post by Nadav Brandes
Now I saw the documentation of the planned interface for DRMAA2, and I find
it to be a great improvement, and very useful for my organization. I am
- Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?
Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, who might take care of non-OGE
systems. We expect to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.
Post by Nadav Brandes
- Is it still possible to suggest ideas that we have about the
interface of DRMAA2? If so, how is it done? Is it customary to share ideas
in this forum, or do you prefer it to be done through Wiki?
The best thing is to start a discussion on the list. The Wiki is good as a
reference. Comments on the Wiki pages might get lost ...
Best regards,
Peter.
Mariusz Mamoński
2011-01-12 16:28:46 UTC
Permalink
Hi Nadav,
Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).
Please use the wiki, as it is the most up-to-date version of the DRMAA spec:
http://wikis.sun.com/display/DRMAAv2/Home
Post by Nadav Brandes
- Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?
It is supported. The JobSession has a method:
sequence<Job> getJobs(JobInfo filter);
which, as I remember, is not constrained to jobs submitted via DRMAA.
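
As a rough illustration of how such a filter lookup might behave, here is a small Python sketch (not DRMAA code). It assumes that fields left unset in the filter act as wildcards; the field names job_id, job_owner and queue_name and the helper get_jobs are hypothetical and only stand in for JobInfo and JobSession.getJobs.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class JobInfo:
    # Hypothetical subset of JobInfo fields; None means "not set" (wildcard).
    job_id: Optional[str] = None
    job_owner: Optional[str] = None
    queue_name: Optional[str] = None


def matches(info: JobInfo, job_filter: JobInfo) -> bool:
    """Return True if every field set in job_filter equals the same field in info."""
    for f in fields(JobInfo):
        wanted = getattr(job_filter, f.name)
        if wanted is not None and wanted != getattr(info, f.name):
            return False
    return True


def get_jobs(known_jobs: List[JobInfo], job_filter: JobInfo) -> List[JobInfo]:
    """Stand-in for JobSession.getJobs(filter), working on an in-memory list."""
    return [info for info in known_jobs if matches(info, job_filter)]


known = [
    JobInfo(job_id="4711", job_owner="nadav", queue_name="short"),
    JobInfo(job_id="4712", job_owner="nadav", queue_name="long"),
    JobInfo(job_id="4713", job_owner="peter", queue_name="short"),
]
# Look up a single job by its ID only; all other filter fields stay unset.
found = get_jobs(known, JobInfo(job_id="4712"))
```

Under this assumption, a filter that sets only the job ID is enough to get back the one matching job.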
Post by Nadav Brandes
- It seems like the 'JobInfo' interface misses few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if he only has a 'Job' object in hand,
and not the 'JobTemplate'?
- Does DRMAA support job-arrays feature (meaning submitting a group
of tasks in one job, that has a single ID)? Most schedulers support this
feature (include LSF, Moab and SGE). You do have a feature of 'runBulkJobs'
that sends a sequence of jobs altogether, but it also returns a sequence of
'Job' objects, and not a single job with a single ID.
IMHO most batch systems return many job IDs for job arrays, but
they let you perform some operations on the whole array
(in bulk) by giving the common suffix of those job IDs. Having one job ID,
and thus one Job, complicates the state model (what if half of the array
sub-jobs are running and the rest are queued? What should the state of
the whole array job be?)
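
To make the state question concrete, here is a minimal Python sketch. The state names RUNNING and QUEUED are placeholders, not the DRMAA state model. It shows that a mixed array has no single obvious state, so a per-state count may be the most one can report honestly.

```python
from collections import Counter
from typing import Dict, List


def summarize_states(task_states: List[str]) -> Dict[str, int]:
    """Count how many sub-jobs of an array are in each state."""
    return dict(Counter(task_states))


# Half of the sub-jobs are running, the other half are still queued.
states = ["RUNNING"] * 5 + ["QUEUED"] * 5
summary = summarize_states(states)
```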
Post by Nadav Brandes
- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could determine a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.
This was already addressed (see the wiki!), except for altering the target queue
of an already submitted job.
Post by Nadav Brandes
- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such field.
It has: it is called accountingId.
Post by Nadav Brandes
- Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (which some of them might be out of order),
jobs might fail randomly, without any justified reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers allow, like LSF). Such 'rerun()' method could be added to
the 'Job' interface.
We have the rerunnable attribute of the JobTemplate, so one can configure
the batch system to rerun jobs that failed due to resource failures.
Post by Nadav Brandes
- Modern schedulers (like Moab and LSF) support advanced features of
memory management, cores management, and also general resources management
(like GPUs). In general, it means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). Then the scheduler knows how to schedule the jobs so
each running job will have all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.
Resources that are common to all schedulers are expressed as
JobTemplate attributes, e.g. minPhysMemory.
Other DRMS-specific options (including resource requirements)
should go into: attribute Dictionary drmsSpecific;
// must be supported
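
A small Python sketch of this split, using the 5 cores / 12GB RAM / 2 GPUs example from the question. The class below only models the idea: the unit of the memory field (megabytes) and the dictionary keys "cores" and "gpus" are assumptions for illustration, not names defined by DRMAA or by any particular DRMS.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobTemplate:
    """Hypothetical model of a job template, not the DRMAA2 definition."""
    remote_command: str
    # Common resource, modelled after minPhysMemory; megabytes are an assumption.
    min_phys_memory_mb: Optional[int] = None
    # Everything scheduler-specific goes into the free-form dictionary.
    drms_specific: Dict[str, str] = field(default_factory=dict)


tmpl = JobTemplate(
    remote_command="/bin/simulate",
    min_phys_memory_mb=12 * 1024,                  # 12GB RAM as a common attribute
    drms_specific={"cores": "5", "gpus": "2"},     # hypothetical DRMS-specific keys
)
```

The price of the dictionary is portability: a template that relies on such keys only works with a DRMS that understands them.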
Post by Nadav Brandes
This is it for now, thank for reading it.
Thanks for providing your comments, and sorry that you lost so much
time reading a very old version of the specification (@Peter: maybe
it would be better to delete the reference to the September 2009 DRMAA2
Draft 5).
Best Regards,
--
Mariusz
Nadav Brandes
2011-01-13 08:23:17 UTC
Permalink
The newer API specification does look a great deal better, and obviously
some of my questions turned out to be irrelevant.

I'll let you decide what you think about the issues I mentioned that are
still relevant, but first I want to elaborate a little on
the job-array feature, which is the most crucial feature for us.

When dealing with job-arrays, each task actually has two IDs (the ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that the current
DRMAA specification performs on jobs are actually performed
on tasks, which are identified by two IDs instead of one and are otherwise
just like single jobs.

What I have said so far makes no significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform array-wide queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very tedious for
users).
An example of more advanced logic that one might want to apply to
job-arrays is rerunning all the failed tasks in a given job-array.
Another example is limiting the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, where only 10 tasks are allowed to run
at any given time).
The greatest advantage of job-arrays is that users can "remember"
many tasks with a single ID, which is impossible when submitting many
single jobs.

Many schedulers (like LSF) support all these features, and you can see them
implemented in a growing number of schedulers.

We believe that DRMAA should support these features as well, by being more
"job-array oriented". I truly believe that DRMAA will be better if it
supports job-arrays.


Mariusz Mamoński
2011-01-13 09:40:17 UTC
Permalink
Hi,
Post by Nadav Brandes
The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit about
the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (The ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that are performed
on jobs according to the current DRMAA specification, are actually performed
upon tasks, which are identified by two IDs instead of one, and except of
that are perfectly similar to single jobs.
All I said so far doesn't make any significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very exhausting for
users).
An example for a more advanced logic that one might want to perform on
job-arrays is to rerun all the failed tasks in a given job-array.
Another advanced logic might be to limit the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, where only 10 tasks are allowed to run simultaneously
at a given time).
This would be possible via the drmsSpecific attribute
Post by Nadav Brandes
The greatest advantage of job-arrays, is the ability of users to "remember"
In my personal opinion, DRMAA is an API, not a command-line tool for
users. Issues like "remembering" are much easier if you have
random access memory and not just your synapses ;-) So you can implement
functionality such as "terminate all of the tasks" on top of DRMAA
with a single loop. Maybe some other members of the group can comment on
this?
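
A minimal Python sketch of that idea, assuming the application itself keeps track of which jobs belong to which bulk submission. The Job class is a hypothetical stand-in for a DRMAA job handle, and ArrayRegistry is an illustrative helper, not part of any DRMAA binding.

```python
from collections import defaultdict
from typing import Dict, List


class Job:
    """Hypothetical stand-in for a DRMAA Job handle."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


class ArrayRegistry:
    """Remembers, in memory, which jobs were submitted together."""

    def __init__(self) -> None:
        self._tasks: Dict[str, List[Job]] = defaultdict(list)

    def add(self, array_id: str, job: Job) -> None:
        self._tasks[array_id].append(job)

    def terminate_all(self, array_id: str) -> int:
        """Terminate every job recorded under array_id with a single loop."""
        jobs = self._tasks.get(array_id, [])
        for job in jobs:
            job.terminate()
        return len(jobs)


registry = ArrayRegistry()
array_a = [Job(f"100.{i}") for i in range(1, 4)]
array_b = [Job("200.1")]
for job in array_a:
    registry.add("100", job)
registry.add("200", array_b[0])

count = registry.terminate_all("100")
```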
Post by Nadav Brandes
many tasks with a single ID, what is impossible to do when submitting many
single jobs.
Many schedulers (like LSF) support all these features, and you can see it
implemented in a growing number of scheduler.
We believe that DRMAA should support these features as well, by being more
"job-arrays oriented". I truly believe that DRMAA will be better if it
supports job-arrays.
--
Mariusz
Peter Tröger
2011-01-14 21:36:31 UTC
Permalink
Hi Nadav,

First, let me thank you again for the in-depth analysis, which gives us confidence that the current spec design is the right way to go.

I personally like your job-array argument, since the proposal extends the existing bulk job facility in a natural way. Another argument in favor is that the corresponding bulk operations could also be implemented by the DRMAA library itself if the DRMS does not support them. As usual, every feature extension carries the danger of overlooking nasty side effects in the job control flow, as Mariusz mentioned.

I am willing to open up a discussion on that, so here is a first proposal. If we just take the current status and introduce a "set of jobs" representation, we end up with something like this:

==== snip ===

interface JobSession {
    ...
    Job runJob(in DRMAA::JobTemplate jobTemplate);
    JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex, in long endIndex, in long step);
    ...
};

interface JobArray {
    readonly attribute string jobArrayId;
    readonly attribute sequence<Job> jobs;
    readonly attribute JobSession session;
    readonly attribute JobTemplate jobTemplate;
    readonly attribute Reservation reservation;
    void suspend();    // suspend all jobs of the array, partial failures in changing the state are ok
    void resume();     // resume all jobs of the array, partial failures in changing the state are ok
    void hold();       // put a queued bulk job on hold
    void release();    // release an array job on hold
    void terminate();  // terminate a running job
    Job waitAnyStarted(in TimeAmount timeout);     // similar to JobSession function
    Job waitAnyTerminated(in TimeAmount timeout);  // similar to JobSession function
};

==== snip ===

Fetching status information only makes sense at the job level, so the corresponding getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of getJobs(JobInfo filter), since the filter semantics would become horrible to specify.

All functions should be implementable with the 'loop' fallback in the library, if we allow partial success in the bulk control functions.
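
To illustrate the 'loop' fallback with partial success, here is a rough Python sketch. It is only a model of the proposal above: the Job class is a hypothetical stand-in, and returning the list of failures from the bulk calls is an illustrative choice (the proposed IDL operations return void).

```python
from typing import Callable, List, Tuple


class Job:
    """Hypothetical stand-in for a single DRMAA job."""

    def __init__(self, job_id: str, fail: bool = False) -> None:
        self.job_id = job_id
        self.fail = fail          # simulate a job whose state cannot be changed
        self.state = "RUNNING"

    def suspend(self) -> None:
        if self.fail:
            raise RuntimeError(f"cannot suspend job {self.job_id}")
        self.state = "SUSPENDED"


class JobArray:
    """Library-side fallback: bulk control implemented as a loop over jobs."""

    def __init__(self, job_array_id: str, jobs: List[Job]) -> None:
        self.job_array_id = job_array_id
        self.jobs = jobs

    def _for_each(self, action: Callable[[Job], None]) -> List[Tuple[str, Exception]]:
        # Apply the action to every job; remember failures instead of aborting.
        failures: List[Tuple[str, Exception]] = []
        for job in self.jobs:
            try:
                action(job)
            except Exception as exc:
                failures.append((job.job_id, exc))
        return failures

    def suspend(self) -> List[Tuple[str, Exception]]:
        return self._for_each(lambda job: job.suspend())


array = JobArray("42", [Job("42.1"), Job("42.2", fail=True), Job("42.3")])
failures = array.suspend()
failed_ids = [job_id for job_id, _ in failures]
```

The loop keeps going after the failing sub-job, so the remaining jobs still change state; that is the partial-success behaviour the fallback depends on.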

DRMAA folks, your comments please. Is this a feasible interface for the DRM systems mentioned that have direct job array control support?

Best,
Peter.
The newer API specification does look a great deal better, and obviously I came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are still relevant, but first I want to elaborate a little bit about the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (The ID of the whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that are performed on jobs according to the current DRMAA specification, are actually performed upon tasks, which are identified by two IDs instead of one, and except of that are perfectly similar to single jobs.
All I said so far doesn't make any significant difference, and is only a matter of terminology. But the important thing about job-arrays is the ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a single command (supplying only the ID of the whole job-array, without needing to give the ID of each task, which might be very exhausting for users).
An example for a more advanced logic that one might want to perform on job-arrays is to rerun all the failed tasks in a given job-array.
Another advanced logic might be to limit the number of tasks that may run simultaneously in a job-array (for example, submitting a job-array containing 1000 tasks, where only 10 tasks are allowed to run simultaneously at a given time).
The greatest advantage of job-arrays, is the ability of users to "remember" many tasks with a single ID, what is impossible to do when submitting many single jobs.
Many schedulers (like LSF) support all these features, and you can see it implemented in a growing number of scheduler.
We believe that DRMAA should support these features as well, by being more "job-arrays oriented". I truly believe that DRMAA will be better if it supports job-arrays.
2011/1/12 Mariusz Mamo?ski <mamonski at man.poznan.pl>
Hi Nadav,
Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).
please use the wiki, as it is the most up-to-date version of the DRMAA
http://wikis.sun.com/display/DRMAAv2/Home
Post by Nadav Brandes
* Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?
sequence<Job> getJobs(JobInfo filter);
which as i remember is not constrained to jobs submitted via DRMAA.
Post by Nadav Brandes
* It seems like the 'JobInfo' interface misses a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if he only has a 'Job' object in hand,
and not the 'JobTemplate'?
* Does DRMAA support the job-arrays feature (meaning submitting a group
of tasks in one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have a feature of 'runBulkJobs'
that sends a sequence of jobs altogether, but it also returns a sequence of
'Job' objects, and not a single job with a single ID.
IMHO most batch systems return many job ids for job arrays, but
they offer to perform some of the operations on the whole array
(bulk) by giving the common suffix of those job ids. Having one job id,
and thus one Job, complicates the state model (what if half of the array
sub-jobs are running and the rest queued? What should be the state of
the whole array job?)
Post by Nadav Brandes
* Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could determine a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.
this was already addressed (wiki!), except alteration of target queue
of already submitted job.
Post by Nadav Brandes
* Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such field.
has: it is called accountingId
Post by Nadav Brandes
* Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (which some of them might be out of order),
jobs might fail randomly, without any justified reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers allow, like LSF). Such 'rerun()' method could be added to
the 'Job' interface.
We have: rerunnable attribute of the JobTemplate. So one can configure
batch system to rerun jobs that failed due to resources failure
Post by Nadav Brandes
* Modern schedulers (like Moab and LSF) support advanced features of
memory management, cores management, and also general resources management
(like GPUs). In general, it means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). Then the scheduler knows how to schedule the jobs so
each running job will have all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.
resources that are common for all schedulers are expressed as
JobTemplate attributes, e.g.: minPhysMemory
others DRMS specific options (also resources requirements)
should go to: attribute Dictionary drmsSpecific;
// must be supported
Post by Nadav Brandes
This is it for now, thank for reading it.
thanks for providing your comments, and sorry that you lost much of
it would be better to delete reference to the September 2009, DRMAA2
Draft 5)
Post by Nadav Brandes
I would like to hear what you think.
Best Regards,
Nadav
2010/12/21 Peter Tröger <peter at troeger.eu>
Hi Nadav,
Now I saw the documentation of the planned interface for DRMAA2, and I
find it to be a great improvement, and very useful for my organization. I am
Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?
Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, which might take care of non-OGE
systems. We assume to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.
Is it still possible to suggest ideas that we have about the interface of
DRMAA2? If so, how is it done? Is it customary to share ideas in this forum,
or do you prefer it to be done through Wiki?
The best thing is to start a discussion on the list. The Wiki is good as
reference. Comments on the Wiki pages might get lost ...
Best regards,
Peter.
Best Regards,
--
Mariusz
Andre Merzky
2011-01-14 22:34:27 UTC
Permalink
Hi Peter,

in your proposal below, I am missing the waitAllStarted /
waitAllTerminated versions (which would return void IMHO). Otherwise
looks great to me. waitAll is easily implementable in the library
(max cost: 2n*waitAny).
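As a rough illustration of how a waitAll could be layered on top of waitAny in the library, here is a small Python sketch. The function name `wait_all_terminated` and the callable `wait_any_terminated` are hypothetical stand-ins and not part of any DRMAA binding; timeouts and error handling are left out on purpose.

```python
from typing import Callable, Iterable, Set


def wait_all_terminated(job_ids: Iterable[str],
                        wait_any_terminated: Callable[[], str]) -> None:
    """Block until every job in job_ids has been reported as terminated.

    wait_any_terminated stands in for a waitAny-style call: each call
    blocks and returns the ID of one job that has terminated.
    """
    pending: Set[str] = set(job_ids)
    while pending:
        finished = wait_any_terminated()
        # IDs that are not in the pending set are simply ignored.
        pending.discard(finished)
```

A waitAllStarted variant would look the same with a waitAnyStarted-style callable passed in.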

My $0.02,

Andre.
Post by Peter Tröger
==== snip ===
interface JobSession {
  ...
  Job runJob(in DRMAA::JobTemplate jobTemplate)
  JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex, in long endIndex, in long step)
  ...
}
interface JobArray {
  readonly attribute string jobArrayId;
  sequence<Job> jobs;
  readonly attribute JobSession session;
  readonly attribute JobTemplate jobTemplate;
  readonly attribute Reservation reservation;
  void suspend()    // suspend all jobs of the array, partial failures in changing the state are ok
  void resume()     // resume all jobs of the array, partial failures in changing the state are ok
  void hold()       // put a queued bulk job on hold
  void release()    // release an array job on hold
  void terminate()  // terminate a running job
  Job waitAnyStarted(in TimeAmount timeout)     // similar to JobSession function
  Job waitAnyTerminated(in TimeAmount timeout)  // similar to JobSession function
};
==== snip ===
Fetching status information makes only sense on job level, so the according
getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of
getJobs(JobInfo filter), since the filter semantics would become horrible to
specify.
All functions should be implementable with the 'loop' fallback in the
library, when we allow partial success in the bulk control functions.
DRMAA folks, your comments please. Is this a feasible interface for the
denoted DRM systems with direct job array control support ?
Best,
Peter.
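The 'loop' fallback with partial success could look roughly like the following Python sketch. `apply_to_all` and the per-job `action` callable are hypothetical names used only for illustration; a real library would call the DRM system's per-job control operation in their place.

```python
from typing import Callable, Dict, Iterable


def apply_to_all(job_ids: Iterable[str],
                 action: Callable[[str], None]) -> Dict[str, Exception]:
    """Apply action to every job, collecting failures instead of aborting."""
    failures: Dict[str, Exception] = {}
    for job_id in job_ids:
        try:
            action(job_id)
        except Exception as exc:
            # Partial failure is tolerated: remember it and keep going.
            failures[job_id] = exc
    return failures
```

Whether the collected failures are reported to the caller or silently dropped is a design decision this sketch leaves open.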
--
Nothing is ever easy...
Nadav Brandes
2011-01-21 10:12:40 UTC
Permalink
Thank you all for your comments.

And thank you Peter for your comment and draft. I really like it, and it
looks great.

There are only a few things I would change:

(1) I agree that offering an option to filter jobs with any generic
user-made filter would be pretty horrible, and it's better just to
let the user iterate over the jobs himself and filter them as he likes. But I
would add a feature that allows filtering jobs by state (as many
batch systems support). Something like this:

interface JobArray {
  ...
  sequence<Job> getJobsOfState(in JobState state)
}

(2) Also, I think that DRMAA should allow giving job-arrays more
arguments than regular jobs can get (in the JobTemplate struct). For
example, as I mentioned before, you might want to give a 'slotsLimit'
argument to a newly submitted job-array (in order to limit the number of tasks
in the job-array that may run simultaneously). Therefore, I would change the
interface to something like this:

struct JobArrayTemplate extends JobTemplate {
  // Contains all the attributes that JobTemplate contains
  // Also contains the following attributes:
  attribute long beginIndex;
  attribute long endIndex;
  attribute long step;
  attribute long slotsLimit;  // limits the number of tasks in the job-array that may run at any one time
  // I guess that more attributes will be added here over time
}

interface JobSession {
  ...
  JobArray runBulkJobs(in DRMAA::JobArrayTemplate jobArrayTemplate)
}

By the way, you can see that all the features that I mentioned here are
supported by LSF:

http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618

Best Regards,

Nadav
Peter Tröger
2011-01-25 07:31:16 UTC
Permalink
Hi Nadav,
Post by Nadav Brandes
Interface JobArray {
...
Sequence<Job> getJobsOfState(in JobState state)
I can understand the practical usefulness, but this falls into a category of functions we typically reject. First, it has a timing issue: what happens if an already identified job changes its state before the collected result is returned? Second, it does not automatically have a consistent mapping to DRM-specific interfaces, since the job model is DRMAA-specific. There might be DRMAA states that do not exist in the particular system. There might also be DRMAA states that are not specific enough, i.e. ones that map to some DRMS state only together with a sub-state.

Since this function is easily implementable with either manual looping or the new event callback features, I would prefer to reject it, as long as we don't have more enthusiastic feedback for other DRM systems. DRMAA tries to stay minimalistic, and this one gives you only a small performance advantage, but no real functional advantage.
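For reference, the manual-looping variant is only a few lines on the application side. The following Python sketch uses a hypothetical `Job` stand-in with a plain `state` field; it is not the DRMAA Job interface, and the state strings are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Job:
    job_id: str
    state: str  # stand-in for a DRMAA job state


def jobs_in_state(jobs: Iterable[Job], state: str) -> List[Job]:
    # The result is only a snapshot: a job may change its state right
    # after it has been checked, which is the timing issue noted above.
    return [job for job in jobs if job.state == state]
```

Because the loop runs in the application, the caller decides how to deal with states that changed in the meantime, for example by re-checking each returned job before acting on it.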
Post by Nadav Brandes
struct JobArrayTemplate extends JobTemplate {
// Contains all the attributes that JobTemplate contains
attribute long beginIndex
attribute long endIndex
attribute long step
attribute long slotsLimit; // In order to limit the number of tasks in the job-array that may run at any one time
// I guess that more attributes will be added here over-time
}
Job templates are intended for re-use, so I don't know if it makes sense to have the index attributes in the template.

For slots in general, we had long and painful discussions. We were able to agree that slots are an opaque concept to DRMAA, since there is absolutely no common mapping in the different DRM systems. Talking about a slots limit for jobs gives them semantics, which we cannot do. Sorry, but this one is a candidate for the still-existing native options.
Post by Nadav Brandes
http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618
Great, it is very good to have an LSF guy in the group. I would be happy if you could also do some sanity check for the existing concepts with respect to LSF.

Since we got no heavy objection against my JobArray proposal, I will add it to the spec draft.
Post by Andre Merzky
Hi Peter,
in your proposal below, I am missing the waitAllStarted /
waitAllTerminated versions (which would return void IMHO). Otherwise
looks great to me. waitAll is easily implementable in the library
(max cost: 2n*waitAny).
Thanks Andre, but WaitAll* functions are not part of the spec at all, so they will also not be part of JobArray. Same argumentation as above. Since waitAll() is so easy to implement, there is no reason to clutter up the DRMAA interface with it.

Best,
Peter.
Post by Nadav Brandes
My $0.02,
Andre.
Post by Peter Tröger
==== snip ===
interface JobSession {
...
Job runJob(in DRMAA::JobTemplate jobTemplate)
JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in
long beginIndex, in long endIndex, in long step)
...
}
interface JobArray {
readonly attribute string jobArrayId;
sequence<Job> jobs;
readonly attribute JobSession session;
readonly attribute JobTemplate jobTemplate;
readonly attribute Reservation reservation;
void suspend() // suspend all jobs of the array, partial failures in
changing the state are ok
void resume() // resume all jobs of the array, partial failures in
changing the state are ok
void hold() // put a queued bulk job on hold
void release() // release an array job on hold
void terminate() // terminate a running job
Job waitAnyStarted(in TimeAmount timeout) // similar to JobSession function
Job waitAnyTerminated(in TimeAmount timeout) // similar to JobSession
function
};
==== snip ===
Fetching status information makes only sense on job level, so the according
getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of
getJobs(JobInfo filter), since the filter semantics would become horrible to
specify.
All functions should be implementable with the 'loop' fallback in the
library, when we allow partial success in the bulk control functions.
DRMAA folks, your comments please. Is this a feasible interface for the
denoted DRM systems with direct job array control support ?
Best,
Peter.
The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit about
the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (The ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that are performed
on jobs according to the current DRMAA specification, are actually performed
upon tasks, which are identified by two IDs instead of one, and except of
that are perfectly similar to single jobs.
All I said so far doesn't make any significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very exhausting for
users).
An example of more advanced logic that one might want to perform on
job-arrays is to rerun all the failed tasks in a given job-array.
Another example is limiting the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, of which only 10 are allowed to run at any
given time).
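As a rough sketch of these two operations, assuming the per-task states are available as a simple mapping from task index to a state string: the state names ("RUNNING", "QUEUED", "FAILED", "DONE") and both helper functions are assumptions made for this example and are not taken from the DRMAA drafts or from any scheduler.

```python
from typing import Dict, List


def failed_tasks(states: Dict[int, str]) -> List[int]:
    """Indices of the tasks that would be candidates for a rerun."""
    return sorted(index for index, state in states.items() if state == "FAILED")


def tasks_to_start(states: Dict[int, str], limit: int) -> List[int]:
    """Queued tasks that may start without exceeding `limit` running tasks."""
    running = sum(1 for state in states.values() if state == "RUNNING")
    free_slots = max(0, limit - running)
    queued = sorted(index for index, state in states.items() if state == "QUEUED")
    return queued[:free_slots]


states = {1: "RUNNING", 2: "FAILED", 3: "QUEUED", 4: "QUEUED", 5: "DONE"}
print(failed_tasks(states))       # [2]
print(tasks_to_start(states, 2))  # [3]
```

A scheduler with native job-array support would apply such a limit itself; the sketch only shows the selection logic the feature implies.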
The greatest advantage of job-arrays is that users can "remember"
many tasks with a single ID, which is impossible when submitting many
single jobs.
Many schedulers (like LSF) support all these features, and they are
implemented in a growing number of schedulers.
We believe that DRMAA should support these features as well, by being more
"job-array oriented". I truly believe that DRMAA will be better if it
supports job-arrays.
2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>
Post by Mariusz Mamoński
Hi Nadav,
Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).
http://wikis.sun.com/display/DRMAAv2/Home
Post by Nadav Brandes
* Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?
sequence<Job> getJobs(JobInfo filter);
which, as I remember, is not constrained to jobs submitted via DRMAA.
Post by Nadav Brandes
* It seems like the 'JobInfo' interface is missing a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if one only has a 'Job' object in hand,
and not the 'JobTemplate'?
* Does DRMAA support the job-arrays feature (meaning submitting a group
of tasks in one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have a 'runBulkJobs' feature
that sends a sequence of jobs together, but it returns a sequence of
'Job' objects, and not a single job with a single ID.
IMHO most of the batch systems return many job ids for job arrays, but
they allow performing some of the operations on the whole array
(bulk) by giving the common suffix of those job ids. Having one job id,
and thus one Job, complicates the state model (what if half of the array
sub-jobs are running and the rest queued? What should be the state of
the whole array job?)
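The state-model concern can be illustrated with a small sketch. Instead of forcing one state onto the whole array, a library could report per-state counts; the `summarize` helper and the state names are hypothetical and only serve this example.

```python
from collections import Counter
from typing import Dict, List


def summarize(states: List[str]) -> Dict[str, int]:
    """Count how many sub-jobs of an array are in each state."""
    return dict(Counter(states))


# Half of the array is running, the rest is still queued.
sub_job_states = ["RUNNING"] * 5 + ["QUEUED"] * 5
print(summarize(sub_job_states))  # {'RUNNING': 5, 'QUEUED': 5}
```

With a mixed result like this there is no single obvious state for the array as a whole, which is the ambiguity raised in the reply above.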
Post by Nadav Brandes
* Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could specify a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.
this was already addressed (wiki!), except for altering the target queue
of an already submitted job.
Post by Nadav Brandes
* Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such a field.
It has: it is called accountingId
Post by Nadav Brandes
* Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (some of which might be out of order),
jobs might fail randomly, without any justified reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers, like LSF, allow). Such a 'rerun()' method could be added to
the 'Job' interface.
We have the rerunnable attribute of the JobTemplate, so one can configure
the batch system to rerun jobs that failed due to resource failure.
Post by Nadav Brandes
* Modern schedulers (like Moab and LSF) support advanced features for
memory management, core management, and general resource management
(like GPUs). In general, it means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). Then the scheduler knows how to schedule the jobs so
each running job will have all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.
Resources that are common to all schedulers are expressed as
JobTemplate attributes, e.g.: minPhysMemory
Other DRMS-specific options (including resource requirements)
should go to: attribute Dictionary drmsSpecific;
// must be supported
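A minimal sketch of this split between portable attributes and a DRMS-specific dictionary, using the example request from the question (5 cores, 12GB RAM, 2 GPUs). Only the names minPhysMemory and drmsSpecific come from the thread; the `JobTemplateSketch` class, the command path, the MiB unit and the "cores" and "gpus" keys are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobTemplateSketch:
    """Hypothetical stand-in for a job template, not the DRMAA2 definition."""
    remote_command: str
    # Portable attribute (minPhysMemory); the MiB unit is an assumption here.
    min_phys_memory: Optional[int] = None
    # Catch-all for options that only some DRM systems understand (drmsSpecific).
    drms_specific: Dict[str, str] = field(default_factory=dict)


template = JobTemplateSketch(
    remote_command="/bin/simulate",             # hypothetical command
    min_phys_memory=12 * 1024,                  # 12GB expressed in MiB
    drms_specific={"cores": "5", "gpus": "2"},  # hypothetical, DRMS-dependent keys
)
print(template.min_phys_memory)        # 12288
print(sorted(template.drms_specific))  # ['cores', 'gpus']
```

The design trade-off is portability: an application that sets only the common attributes should run against any DRM system, while entries in the dictionary tie it to a particular one.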
Post by Nadav Brandes
This is it for now, thanks for reading it.
thanks for providing your comments, and sorry that you lost much of
it would be better to delete reference to the September 2009, DRMAA2
Draft 5)
Post by Nadav Brandes
I would like to hear what you think.
Best Regards,
Nadav
2010/12/21 Peter Tröger <peter at troeger.eu>
Hi Nadav,
Now I saw the documentation of the planned interface for DRMAA2, and I
find it to be a great improvement, and very useful for my organization. I am
Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?
Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, who might take care of non-OGE
systems. We expect to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.
Is it still possible to suggest ideas that we have about the interface of
DRMAA2? If so, how is it done? Is it customary to share ideas in this forum,
or do you prefer it to be done through Wiki?
The best thing is to start a discussion on the list. The Wiki is good as
reference. Comments on the Wiki pages might get lost ...
Best regards,
Peter.
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
Best Regards,
--
Mariusz
--
Nothing is ever easy...
Peter Tröger
2011-01-27 10:47:34 UTC
Permalink
Hi,

sitting here with Thijs from Platform, we figured out that LSF does not have as extensive JobArray operations support as suggested below. In fact, you can only kill submitted job arrays as a whole.

I would add an accordingly restricted version of the interface to the spec.

Best,
Peter.
Post by Mariusz Mamoński
Hi Nadav,
first let me thank you again for the in-depth analysis, which gives us confidence that the current spec design is the right way to go.
I personally like your job array argumentation, since the proposal extends the existing bulk job facility in a natural way. Another argument in its favor is that the corresponding bulk operations would also be implementable by the DRMAA library itself, if the DRMS does not support them. As usual, every feature extension carries the danger of forgetting nasty side effects in the job control flow, as Mariusz mentioned.
==== snip ===
interface JobSession {
  ...
  Job runJob(in DRMAA::JobTemplate jobTemplate);
  JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex, in long endIndex, in long step);
  ...
};
interface JobArray {
  readonly attribute string jobArrayId;
  readonly attribute sequence<Job> jobs;
  readonly attribute JobSession session;
  readonly attribute JobTemplate jobTemplate;
  readonly attribute Reservation reservation;
  void suspend();   // suspend all jobs of the array, partial failures in changing the state are ok
  void resume();    // resume all jobs of the array, partial failures in changing the state are ok
  void hold();      // put a queued bulk job on hold
  void release();   // release an array job on hold
  void terminate(); // terminate a running job
  Job waitAnyStarted(in TimeAmount timeout);    // similar to JobSession function
  Job waitAnyTerminated(in TimeAmount timeout); // similar to JobSession function
};
==== snip ===
Fetching status information only makes sense at the job level, so the corresponding getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of getJobs(JobInfo filter), since the filter semantics would become horrible to specify.
All functions should be implementable with the 'loop' fallback in the library, if we allow partial success in the bulk control functions.
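The 'loop' fallback with partial success could look roughly like the sketch below. It is an illustration under stated assumptions, not the planned DRMAA2 behavior: `BulkResult`, `bulk_control`, `fake_suspend` and the sample job IDs are hypothetical, and a real binding would catch its own DRMAA error types rather than a generic exception.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BulkResult:
    """Outcome of a bulk control call when partial success is allowed."""
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # job id -> error text


def bulk_control(job_ids: List[str], action: Callable[[str], None]) -> BulkResult:
    """Apply `action` to every job and record failures instead of aborting."""
    result = BulkResult()
    for job_id in job_ids:
        try:
            action(job_id)
            result.succeeded.append(job_id)
        except Exception as exc:  # assumption: generic stand-in for DRMAA errors
            result.failed[job_id] = str(exc)
    return result


def fake_suspend(job_id: str) -> None:
    """Stand-in for a per-job suspend call; job '42.2' cannot be suspended."""
    if job_id == "42.2":
        raise RuntimeError("job is not running")


outcome = bulk_control(["42.1", "42.2", "42.3"], fake_suspend)
print(outcome.succeeded)  # ['42.1', '42.3']
print(outcome.failed)     # {'42.2': 'job is not running'}
```

Because one failing job does not stop the loop, the same function can back suspend, resume, hold, release and terminate for DRM systems without native job array control.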
DRMAA folks, your comments please. Is this a feasible interface for the denoted DRM systems with direct job array control support?
Best,
Peter.