[DRMAA-WG] DRMAA2

Post by Nadav Brandes
Hello,
I have a few questions about version 2 of DRMAA. My main question is:
when is this version planned to be released?
I'm not sure whether I have the right address, so if I'm in the wrong
place, please let me know.
Thanks in advance,
Nadav
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg


Nadav Brandes

2010-12-14 18:24:28 UTC

I'm glad to hear that, I was concerned that it might take longer.

Now, with a little delay, allow me to introduce myself.

My name is Nadav Brandes, and I work for an HPC organization. We currently
work with SGE and LSF, and soon we also plan to work with MOAB (wrapping
Torque).

Recently we have taken a major interest in DRMAA, which we believe can make
our lives with distributed resource managers a lot easier.

A few weeks ago, I downloaded a DRMAA implementation for SGE and tested it
for a while. I must admit that I was a little disappointed with the limited
interface and possibilities given by this version of DRMAA.

Now I saw the documentation of the planned interface for DRMAA2, and I find
it to be a great improvement, and very useful for my organization. I am
truly anxious to try it, and have some more questions about its release:

- Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have an estimate of when
that will happen, and when I will be able to download a trial version of it?
- Is it still possible to suggest ideas that we have about the interface
of DRMAA2? If so, how is it done? Is it customary to share ideas in this
forum, or do you prefer it to be done through the Wiki?

Thanks,

Nadav

On Mon, Dec 13, 2010 at 9:19 PM, Daniel Templeton <

Funny you should ask. Just this weekend we made an attempt to flesh out
the IDL in the wiki so that we can turn it into a full specification
document. The goal is to do that by the end of the year, or perhaps more
realistically, by the end of January.
Daniel

Nadav Brandes

2011-01-12 16:03:12 UTC

Hello everyone,

I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).

If it's not too late, we have a few questions and suggestions:

- Can one get a 'Job' object representing a job that was already
submitted, given only the job index (as an integer)?

- It seems that the 'JobInfo' interface is missing a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if one only has a 'Job' object in hand,
and not the 'JobTemplate'?

- Does DRMAA support the job-array feature (meaning submitting a group
of tasks in one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have the 'runBulkJobs' feature,
which sends a sequence of jobs all together, but it returns a sequence of
'Job' objects, and not a single job with a single ID.

- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could specify a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.

- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such a field.

- Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (some of which might be out of order),
jobs might fail randomly, for no apparent reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers, like LSF, allow). Such a 'rerun()' method could be added to
the 'Job' interface.

- Modern schedulers (like Moab and LSF) support advanced features for
memory management, core management, and general resource management
(like GPUs). In general, this means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). The scheduler then knows how to schedule the jobs so
that each running job has all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.

That's it for now; thanks for reading.

I would like to hear what you think.

Best Regards,

Nadav

2010/12/21 Peter Tröger <peter at troeger.eu>

Hi Nadav,

Post by Nadav Brandes
Now I saw the documentation of the planned interface for DRMAA2, and I find
it to be a great improvement, and very useful for my organization. I am
truly anxious to try it, and have some more questions about its release:
- Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?

Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, who might take care of non-OGE
systems. We expect to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.

Post by Nadav Brandes
- Is it still possible to suggest ideas that we have about the
interface of DRMAA2? If so, how is it done? Is it customary to share ideas
in this forum, or do you prefer it to be done through Wiki?

The best thing is to start a discussion on the list. The Wiki is good as a
reference. Comments on the Wiki pages might get lost ...

Best regards,
Peter.


Mariusz Mamoński

2011-01-12 16:28:46 UTC

Hi Nadav,

Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).

Please use the wiki, as it is the most up-to-date version of the DRMAA spec:
http://wikis.sun.com/display/DRMAAv2/Home

Post by Nadav Brandes
- Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?

It is supported. The JobSession has a method:
sequence<Job> getJobs(JobInfo filter);
which, as I remember, is not constrained to jobs submitted via DRMAA.
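
To make the filter idea concrete, here is a minimal Python sketch, assuming that unset filter fields match any job. It is not the DRMAA2 API: the names `JobInfo`, `Job`, `JobSession`, `get_jobs`, and the three filter fields are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class JobInfo:
    """Hypothetical subset of job attributes; None means 'do not filter on this'."""
    job_id: Optional[str] = None
    job_owner: Optional[str] = None
    queue_name: Optional[str] = None


@dataclass
class Job:
    info: JobInfo


class JobSession:
    def __init__(self, jobs: List[Job]) -> None:
        self._jobs = list(jobs)

    def get_jobs(self, job_filter: Optional[JobInfo] = None) -> List[Job]:
        """Return the jobs whose attributes equal every non-None filter field."""
        if job_filter is None:
            return list(self._jobs)
        wanted = {
            f.name: getattr(job_filter, f.name)
            for f in fields(job_filter)
            if getattr(job_filter, f.name) is not None
        }
        return [
            job for job in self._jobs
            if all(getattr(job.info, name) == value for name, value in wanted.items())
        ]


session = JobSession([
    Job(JobInfo(job_id="4711", job_owner="nadav", queue_name="short")),
    Job(JobInfo(job_id="4712", job_owner="nadav", queue_name="long")),
])
matches = session.get_jobs(JobInfo(job_id="4712"))
```

With a filter that sets only the job ID, the caller gets back a one-element sequence, which is how a single known job could be looked up.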

Post by Nadav Brandes
- It seems that the 'JobInfo' interface is missing a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, if one only has a 'Job' object in hand,
and not the 'JobTemplate'?
- Does DRMAA support the job-array feature (meaning submitting a group
of tasks in one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have the 'runBulkJobs' feature,
which sends a sequence of jobs all together, but it returns a sequence of
'Job' objects, and not a single job with a single ID.

IMHO most batch systems return many job IDs for job arrays, but
they allow some operations to be performed on the whole array
(in bulk) by giving the common suffix of those job IDs. Having one job ID,
and thus one Job, complicates the state model (what if half of the array
sub-jobs are running and the rest are queued? What should be the state of
the whole array job?)
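
The state-model concern can be illustrated with a small Python sketch. The state names and both helper functions are assumptions made for the example and are not taken from the specification.

```python
from collections import Counter
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


def summarize(task_states):
    """Count tasks per state instead of forcing one state onto the whole array."""
    return Counter(task_states)


def single_state(task_states):
    """Return the common state if all tasks agree, else None (no clear answer)."""
    distinct = set(task_states)
    return distinct.pop() if len(distinct) == 1 else None


# Half of the sub-jobs are running, the rest are still queued.
states = [JobState.RUNNING] * 2 + [JobState.QUEUED] * 2
summary = summarize(states)
array_state = single_state(states)
```

For the mixed array, `single_state` has nothing sensible to return, while the per-state counts remain well defined.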

Post by Nadav Brandes
- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could specify a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.

This was already addressed (see the wiki!), except for altering the target
queue of an already submitted job.

Post by Nadav Brandes
- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such a field.

It has one: it is called accountingId.

Post by Nadav Brandes
- Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (some of which might be out of order),
jobs might fail randomly, for no apparent reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers, like LSF, allow). Such a 'rerun()' method could be added to
the 'Job' interface.

We have the rerunnable attribute of the JobTemplate, so one can configure
the batch system to rerun jobs that failed due to a resource failure.

Post by Nadav Brandes
- Modern schedulers (like Moab and LSF) support advanced features for
memory management, core management, and general resource management
(like GPUs). In general, this means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). The scheduler then knows how to schedule the jobs so
that each running job has all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.

Resources that are common to all schedulers are expressed as
JobTemplate attributes, e.g. minPhysMemory.
Other DRMS-specific options (including resource requirements)
should go into:
attribute Dictionary drmsSpecific; // must be supported
Post by Nadav Brandes
That's it for now; thanks for reading.

Thanks for providing your comments, and sorry that you lost so much
time reading a very old version of the specification (@Peter: maybe
it would be better to delete the reference to the September 2009 DRMAA2
Draft 5).


Best Regards,

--
Mariusz

Nadav Brandes

2011-01-13 08:23:17 UTC

The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.

I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit about
the job-arrays feature, which is the most crucial feature for us.

When dealing with job arrays, each task actually has two IDs (the ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that the current
DRMAA specification performs on jobs are actually performed on tasks, which
are identified by two IDs instead of one and are otherwise just like
single jobs.

All I said so far doesn't make any significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very tedious for
users).
An example of more advanced logic that one might want to perform on
job-arrays is rerunning all the failed tasks in a given job-array.
Another example is limiting the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, where only 10 tasks are allowed to run
at any given time).
The greatest advantage of job-arrays is the ability of users to "remember"
many tasks with a single ID, which is impossible when submitting many
single jobs.

Many schedulers (like LSF) support all these features, and you can see them
implemented in a growing number of schedulers.

We believe that DRMAA should support these features as well, by being more
"job-arrays oriented". I truly believe that DRMAA will be better if it
supports job-arrays.
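
As a small illustration of the two-ID view described above, the following Python sketch keys each task by (array ID, index) and selects the failed tasks of one array as rerun candidates. The `TaskId` type, the state strings, and the sample IDs are hypothetical and not part of DRMAA.

```python
from typing import Dict, List, NamedTuple


class TaskId(NamedTuple):
    """A task is identified by the ID of its job-array plus its index within it."""
    array_id: str
    index: int


def failed_tasks(states: Dict[TaskId, str], array_id: str) -> List[TaskId]:
    """Select the failed tasks of one job-array, addressed only by the array ID."""
    return sorted(
        task for task, state in states.items()
        if task.array_id == array_id and state == "FAILED"
    )


states = {
    TaskId("4711", 1): "DONE",
    TaskId("4711", 2): "FAILED",
    TaskId("4711", 3): "FAILED",
    TaskId("4712", 1): "FAILED",  # belongs to a different job-array
}
to_rerun = failed_tasks(states, "4711")
```

The caller supplies one array ID and never has to list the individual task indices.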


Mariusz Mamoński

2011-01-13 09:40:17 UTC

Hi,

Post by Nadav Brandes
The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit about
the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (the ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that the current
DRMAA specification performs on jobs are actually performed on tasks, which
are identified by two IDs instead of one and are otherwise just like
single jobs.
All I said so far doesn't make any significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very tedious for
users).
An example of more advanced logic that one might want to perform on
job-arrays is rerunning all the failed tasks in a given job-array.
Another example is limiting the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, where only 10 tasks are allowed to run
at any given time).

This would be possible via the drmsSpecific attribute.

Post by Nadav Brandes
The greatest advantage of job-arrays is the ability of users to "remember"

In my personal opinion, DRMAA is an API, not a command line tool for
users. Some issues like "remembering" are much easier if you have
random access memory and not just your synapses ;-) So you can implement
functionality like "terminate all of the tasks" on top of DRMAA
with a single loop. Maybe some other members of the group can comment on
this?
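
A minimal Python sketch of that single-loop approach, assuming the application keeps the jobs returned by a bulk submission under a name of its own. The `Job` class, the `submissions` dictionary, and the job IDs are hypothetical stand-ins, not DRMAA calls.

```python
from typing import Dict, List


class Job:
    """Hypothetical job handle with only what the sketch needs."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


# The application "remembers" each bulk submission under a name of its choosing.
submissions: Dict[str, List[Job]] = {
    "nightly-run": [Job(f"4711.{i}") for i in range(1, 4)],
}


def terminate_all(name: str) -> int:
    """Terminate every job of a remembered submission; return how many were hit."""
    jobs = submissions[name]
    for job in jobs:
        job.terminate()
    return len(jobs)


count = terminate_all("nightly-run")
```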

Post by Nadav Brandes
many tasks with a single ID, which is impossible when submitting many
single jobs.
Many schedulers (like LSF) support all these features, and you can see them
implemented in a growing number of schedulers.
We believe that DRMAA should support these features as well, by being more
"job-arrays oriented". I truly believe that DRMAA will be better if it
supports job-arrays.

--
Mariusz

Peter Tröger

2011-01-14 21:36:31 UTC

Hi Nadav,

First, let me thank you again for the in-depth analysis, which gives us confidence that the current spec design is the right way to go.

I personally like your job array argumentation, since the proposal extends the existing bulk job facility in a natural way. Another argument in favor is that the corresponding bulk operations could also be implemented by the DRMAA library itself if the DRMS does not support them. As usual, every feature extension carries the danger of overlooking nasty side effects in the job control flow, as Mariusz mentioned.

I am willing to open up a discussion on that, so here is a first proposal. If we just take the current status and introduce a "set of jobs" representation, we end up with something like this:

==== snip ===

interface JobSession {
    ...
    Job runJob(in DRMAA::JobTemplate jobTemplate);
    JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex, in long endIndex, in long step);
    ...
};

interface JobArray {
    readonly attribute string jobArrayId;
    sequence<Job> jobs;
    readonly attribute JobSession session;
    readonly attribute JobTemplate jobTemplate;
    readonly attribute Reservation reservation;
    void suspend();   // suspend all jobs of the array, partial failures in changing the state are ok
    void resume();    // resume all jobs of the array, partial failures in changing the state are ok
    void hold();      // put a queued bulk job on hold
    void release();   // release an array job on hold
    void terminate(); // terminate a running job
    Job waitAnyStarted(in TimeAmount timeout);    // similar to JobSession function
    Job waitAnyTerminated(in TimeAmount timeout); // similar to JobSession function
};

==== snip ===

Fetching status information only makes sense at the job level, so the corresponding getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of getJobs(JobInfo filter), since the filter semantics would become horrible to specify.

All functions should be implementable with the 'loop' fallback in the library, if we allow partial success in the bulk control functions.
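
A sketch of what such a 'loop' fallback might look like in Python, assuming partial failures are collected rather than aborting the bulk call. The `ControlError` exception, the state strings, and the returned failure dictionary are assumptions of this example; the IDL above declares the control functions as returning void.

```python
from typing import Dict, List


class ControlError(Exception):
    """Raised when a single job cannot perform the requested state change."""


class Job:
    def __init__(self, job_id: str, state: str = "RUNNING") -> None:
        self.job_id = job_id
        self.state = state

    def suspend(self) -> None:
        if self.state != "RUNNING":
            raise ControlError(f"cannot suspend a job in state {self.state}")
        self.state = "SUSPENDED"


class JobArray:
    def __init__(self, job_array_id: str, jobs: List[Job]) -> None:
        self.job_array_id = job_array_id
        self.jobs = list(jobs)

    def suspend(self) -> Dict[str, str]:
        """Loop fallback: try every job and report failures instead of stopping."""
        failures: Dict[str, str] = {}
        for job in self.jobs:
            try:
                job.suspend()
            except ControlError as exc:
                failures[job.job_id] = str(exc)
        return failures


array = JobArray("4711", [
    Job("4711.1"),
    Job("4711.2"),
    Job("4711.3", state="QUEUED"),  # cannot be suspended, so it is reported
])
failures = array.suspend()
```

The queued job is left untouched and reported, while the two running jobs are suspended, which is the partial success the fallback relies on.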

DRMAA folks, your comments please. Is this a feasible interface for the DRM systems mentioned that have direct job array control support?

Best,
Peter.


Andre Merzky

2011-01-14 22:34:27 UTC

Hi Peter,

in your proposal below, I am missing the waitAllStarted /
waitAllTerminated versions (which would return void IMHO). Otherwise
looks great to me. waitAll is easily implementable in the library
(max cost: 2n*waitAny).
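
One way such a library-side waitAll could be sketched in Python is shown below, built only on a wait-any primitive. The toy `JobArray` is an assumption for the example: its jobs "terminate" one per call, so it needs exactly one wait-any call per job. The cost bound quoted above is not derived here, and an overall deadline across calls is left out.

```python
from typing import List, Optional


class Job:
    def __init__(self, job_id: str) -> None:
        self.job_id = job_id


class JobArray:
    """Toy array whose jobs 'terminate' one per wait_any_terminated call."""

    def __init__(self, jobs: List[Job]) -> None:
        self.jobs = list(jobs)
        self._pending = list(jobs)
        self.wait_any_calls = 0

    def wait_any_terminated(self, timeout: Optional[float] = None) -> Job:
        self.wait_any_calls += 1
        return self._pending.pop(0)


def wait_all_terminated(array: JobArray, timeout: Optional[float] = None) -> None:
    """Call wait-any repeatedly until every job of the array has been seen."""
    remaining = {job.job_id for job in array.jobs}
    while remaining:
        finished = array.wait_any_terminated(timeout)
        # discard() tolerates a wait-any that reports the same job more than once
        remaining.discard(finished.job_id)


array = JobArray([Job("4711.1"), Job("4711.2"), Job("4711.3")])
wait_all_terminated(array)
```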

My $0.02,

Andre.

Post by Peter Tröger
==== snip ===
interface JobSession {
    ...
    Job runJob(in DRMAA::JobTemplate jobTemplate);
    JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex, in long endIndex, in long step);
    ...
};
interface JobArray {
    readonly attribute string jobArrayId;
    sequence<Job> jobs;
    readonly attribute JobSession session;
    readonly attribute JobTemplate jobTemplate;
    readonly attribute Reservation reservation;
    void suspend();   // suspend all jobs of the array, partial failures in changing the state are ok
    void resume();    // resume all jobs of the array, partial failures in changing the state are ok
    void hold();      // put a queued bulk job on hold
    void release();   // release an array job on hold
    void terminate(); // terminate a running job
    Job waitAnyStarted(in TimeAmount timeout);    // similar to JobSession function
    Job waitAnyTerminated(in TimeAmount timeout); // similar to JobSession function
};
==== snip ===
Fetching status information makes only sense on job level, so the according
getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of
getJobs(JobInfo filter), since the filter semantics would become horrible to
specify.
All functions should be implementable with the 'loop' fallback in the
library, when we allow?partial success in the bulk control functions.
DRMAA folks, your comments please. Is this a feasible interface for the
denoted DRM systems with direct job array control support ?
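
To make the 'loop' fallback with partial success more concrete, here is a small Python sketch. The `Job` class, its `can_suspend` flag and the `suspend_all` helper are hypothetical names used only for illustration; they are not taken from the draft.

```python
from typing import List, Tuple


class Job:
    """Hypothetical single job; can_suspend simulates a DRM-side refusal."""

    def __init__(self, job_id: str, can_suspend: bool = True) -> None:
        self.job_id = job_id
        self.can_suspend = can_suspend
        self.suspended = False

    def suspend(self) -> None:
        if not self.can_suspend:
            raise RuntimeError(f"job {self.job_id} cannot be suspended")
        self.suspended = True


def suspend_all(jobs: List[Job]) -> List[Tuple[str, str]]:
    """Loop fallback: try every job and collect failures instead of aborting."""
    failures = []
    for job in jobs:
        try:
            job.suspend()
        except RuntimeError as exc:
            # Partial failure is tolerated: remember it and keep going.
            failures.append((job.job_id, str(exc)))
    return failures
```

Whether such failures should be reported back to the caller or silently ignored is a design question the proposal leaves open; the sketch simply returns them.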
Best,
Peter.
The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit about
the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (The ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that are performed
on jobs according to the current DRMAA specification are actually performed
upon tasks, which are identified by two IDs instead of one and apart from
that behave exactly like single jobs.
All I said so far doesn't make any significant difference, and is only a
matter of terminology. But the important thing about job-arrays is the
ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very exhausting for
users).
An example for a more advanced logic that one might want to perform on
job-arrays is to rerun all the failed tasks in a given job-array.
Another advanced logic might be to limit the number of tasks that may run
simultaneously in a job-array (for example, submitting a job-array
containing 1000 tasks, where only 10 tasks are allowed to run simultaneously
at a given time).
The greatest advantage of job-arrays is the ability of users to "remember"
many tasks with a single ID, which is impossible when submitting many
single jobs.
Many schedulers (like LSF) support all these features, and you can see them
implemented in a growing number of schedulers.
We believe that DRMAA should support these features as well, by being more
"job-arrays oriented". I truly believe that DRMAA will be better if it
supports job-arrays.
2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>

Post by Mariusz Mamoński
Hi Nadav,

Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).

http://wikis.sun.com/display/DRMAAv2/Home

Post by Nadav Brandes
- Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?

sequence<Job> getJobs(JobInfo filter);
which as i remember is not constrained to jobs submitted via DRMAA.

this was already addressed (wiki!), except alteration of target queue
of already submitted job.

has: it is called accountingId

We have: rerunnable attribute of the JobTemplate. So one can configure
batch system to rerun jobs that failed due to resources failure

Post by Nadav Brandes
This is it for now, thank for reading it.

thanks for providing your comments, and sorry that you lost much of
it would be better to delete reference to the September 2009, DRMAA2
Draft 5)

Post by Nadav Brandes
I would like to hear what you think.
Best Regards,
Nadav
2010/12/21 Peter Tröger <peter at troeger.eu>

Best Regards,
--
Mariusz

--
Nothing is ever easy...

Nadav Brandes

2011-01-21 10:12:40 UTC

Thank you all for your comments.

And thank you Peter for your comment and draft, I really like it, and it
looks great.

Only a few things I would change:

(1) I agree that offering a way to filter jobs with any generic
user-made filter would be pretty horrible, and it is better just to
let users iterate over the jobs themselves and filter them as they like. But I
would add a feature that allows filtering jobs by a certain status (as many
batch systems support). Something like this:

interface JobArray {
...
sequence<Job> getJobsOfState(in JobState state)
}

(2) Also, I think that DRMAA should allow giving job-arrays more
arguments than what regular jobs can get (in the JobTemplate struct). For
example, as I mentioned before, you might want to give a 'slotsLimit'
argument to a newly submitted job-array (in order to limit the number of tasks
in the job-array that may run simultaneously). Therefore, I would change the
interface to something like this:

struct JobArrayTemplate extends JobTemplate {
// Contains all the attributes that JobTemplate contains
// Also contains the following attributes:
attribute long beginIndex
attribute long endIndex
attribute long step
attribute long slotsLimit; // In order to limit the number of tasks in the job-array that may run at any one time
// I guess that more attributes will be added here over-time
}

interface JobSession {
...
JobArray runBulkJobs(in DRMAA::JobArrayTemplate jobArrayTemplate)
}

By the way, you can see that all the features that I mentioned here are
supported by LSF:

http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618

Best Regards,

Nadav

2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>

Post by Mariusz Mamoński
Hi Nadav,

Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).

please use the wiki as it is the most up to date version of the DRMAA
http://wikis.sun.com/display/DRMAAv2/Home

Post by Nadav Brandes
- Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?

sequence<Job> getJobs(JobInfo filter);
which as i remember is not constrained to jobs submitted via DRMAA.

Post by Nadav Brandes
this feature (include LSF, Moab and SGE). You do have a feature of 'runBulkJobs'
that sends a sequence of jobs altogether, but it also returns a sequence of
'Job' objects, and not a single job with a single ID.
- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could determine a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.

this was already addressed (wiki!), except alteration of target queue
of already submitted job.

Post by Nadav Brandes
- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such field.

has: it is called accountingId

Post by Nadav Brandes
added to the 'Job' interface.

We have: rerunnable attribute of the JobTemplate. So one can configure
batch system to rerun jobs that failed due to resources failure

Post by Nadav Brandes
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). Then the scheduler knows how to schedule the jobs so
each running job will have all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.

Post by Nadav Brandes
This is it for now, thank for reading it.

thanks for providing your comments, and sorry that you lost much of
it would be better to delete reference to the September 2009, DRMAA2
Draft 5)

Post by Nadav Brandes
I would like to hear what you think.
Best Regards,
Nadav
2010/12/21 Peter Tröger <peter at troeger.eu>

Hi Nadav,

Post by Nadav Brandes
Now I saw the documentation of the planned interface for DRMAA2, and I
find it to be a great improvement, and very useful for my organization. I am
Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?

Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, which might take care of non-OGE
systems. We assume to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.

Post by Nadav Brandes
Is it still possible to suggest ideas that we have about the interface of
DRMAA2? If so, how is it done? Is it customary to share ideas in this forum,
or do you prefer it to be done through Wiki?

The best thing is to start a discussion on the list. The Wiki is good as
reference. Comments on the Wiki pages might get lost ...
Best regards,
Peter.

Best Regards,
--
Mariusz

--
Nothing is ever easy...

Peter Tröger

2011-01-25 07:31:16 UTC

Hi Nadav,

Post by Nadav Brandes
interface JobArray {
...
sequence<Job> getJobsOfState(in JobState state)

I can understand the practical usefulness, but this falls into a category of functions we typically reject. First, it has a timing issue: what happens if an already identified job changes its state before the collected result is returned? Second, it does not automatically have a consistent mapping to DRM-specific interfaces, since the job model is DRMAA-specific. There might be DRMAA states that do not exist in the particular system. There might also be DRMAA states that are not specific enough, namely the ones that only map to some DRMS state together with a sub-state.

Since this function is easily implementable with either manual looping or the new event callback features, I would prefer to reject it, as long as we don't have more enthusiastic feedback for other DRM systems. DRMAA tries to stay minimalistic, and this one gives you only a small performance advantage, but no real functional advantage.
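
For reference, the manual-looping variant could be sketched in Python roughly as follows. The `JobState` values, the `Job` class and its `get_state` method are hypothetical placeholders rather than names from the spec draft. The result is only a snapshot, so the timing issue described above still applies: a job may change state right after it has been inspected.

```python
from enum import Enum
from typing import Iterable, List


class JobState(Enum):
    # Illustrative states only; the real DRMAA state model may differ.
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


class Job:
    """Hypothetical job object exposing its current state."""

    def __init__(self, job_id: str, state: JobState) -> None:
        self.job_id = job_id
        self._state = state

    def get_state(self) -> JobState:
        return self._state


def get_jobs_of_state(jobs: Iterable[Job], state: JobState) -> List[Job]:
    """Client-side filter: loop over the jobs and keep those in the given state."""
    return [job for job in jobs if job.get_state() is state]
```

A caller who wants to rerun failed tasks would iterate over `get_jobs_of_state(array_jobs, JobState.FAILED)` and resubmit each one.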

Post by Nadav Brandes
struct JobArrayTemplate extends JobTemplate {
// Contains all the attributes that JobTemplate contains
attribute long beginIndex
attribute long endIndex
attribute long step
attribute long slotsLimit; // In order to limit the number of tasks in the job-array that may run at any one time
// I guess that more attributes will be added here over-time
}

Job templates are intended for re-usage, so I don't know if it makes sense to have the index attributes in the template.
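
A small Python sketch of the re-usage argument: the template stays immutable and the index range is passed per submission. All names here (`JobTemplate`, `task_indices`, `run_bulk_jobs`) are hypothetical, and the sketch assumes that endIndex is inclusive, which the proposal does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class JobTemplate:
    """Hypothetical reusable template; it carries no index information."""
    remote_command: str
    args: Tuple[str, ...] = ()


def task_indices(begin: int, end: int, step: int) -> List[int]:
    """Expand (beginIndex, endIndex, step) into task indices, end inclusive."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(begin, end + 1, step))


def run_bulk_jobs(template: JobTemplate, begin: int, end: int, step: int) -> List[str]:
    # Mock submission: return one label per task instead of contacting a DRM system.
    return [f"{template.remote_command}[{i}]" for i in task_indices(begin, end, step)]
```

With this split, the same `JobTemplate` instance can be submitted several times with different ranges, which would not be possible without copying if the indices lived inside the template.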

For slots in general, we had long and painful discussions. We were able to agree that slots are an opaque concept to DRMAA, since there is absolutely no common mapping in the different DRM systems. Talking about a slots limit for jobs gives them semantics, which we cannot do. Sorry, but this one is a candidate for the still-existing native options.

Post by Nadav Brandes
http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618

Great, it is very good to have an LSF guy in the group. I would be happy if you could also do some sanity check for the existing concepts with respect to LSF.

Since we got no heavy objections to my JobArray proposal, I will add it to the spec draft.

Post by Andre Merzky
Hi Peter,
in your proposal below, I am missing the waitAllStarted /
waitAllTerminated versions (which would return void IMHO). Otherwise
looks great to me. waitAll is easily implementable in the library
(max cost: 2n*waitAny).

Thanks Andre, but WaitAll* functions are not part of the spec at all, so they will also not be part of JobArray. Same argumentation as above. Since waitAll() is so easy to implement, there is no reason to clutter up the DRMAA interface with it.

Best,
Peter.

Peter Tröger

2011-01-27 10:47:34 UTC