waitAllTerminated versions (which would return void IMHO). Otherwise
looks great to me. waitAll is easily implementable in the library
(max cost: 2n*waitAny).
Andre.
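For illustration, here is a minimal sketch of how a library might build a waitAll on top of waitAny, as suggested above. `FakeSession`, `wait_any_terminated` and `wait_all_terminated` are hypothetical names invented for this sketch and are not part of any DRMAA binding. The loop makes one waitAny call per job when everything terminates; the 2n bound quoted above is not derived here.

```python
import time


class FakeSession:
    """Hypothetical stand-in for a job session (assumption, not a DRMAA API)."""

    def __init__(self, finish_order):
        # Jobs that will terminate, in the order in which they finish.
        self._pending = list(finish_order)

    def wait_any_terminated(self, jobs, timeout):
        # Return one terminated job out of `jobs`, or None if none terminates.
        for job in self._pending:
            if job in jobs:
                self._pending.remove(job)
                return job
        return None


def wait_all_terminated(session, jobs, timeout):
    """Return None once every job in `jobs` has terminated.

    Raises TimeoutError if `timeout` seconds pass first or waitAny gives up.
    """
    remaining = set(jobs)
    deadline = time.monotonic() + timeout
    while remaining:
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError(f"{len(remaining)} job(s) still not terminated")
        done = session.wait_any_terminated(remaining, left)
        if done is None:
            raise TimeoutError(f"{len(remaining)} job(s) still not terminated")
        remaining.discard(done)
```

The function returns nothing on success, matching the suggestion that the waitAll variants return void.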
Post by Peter Tröger
==== snip ===
interface JobSession {
    ...
    Job runJob(in DRMAA::JobTemplate jobTemplate)
    JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex,
                         in long endIndex, in long step)
    ...
}
interface JobArray {
    readonly attribute string jobArrayId;
    sequence<Job> jobs;
    readonly attribute JobSession session;
    readonly attribute JobTemplate jobTemplate;
    readonly attribute Reservation reservation;
    void suspend()    // suspend all jobs of the array, partial failures in
                      // changing the state are ok
    void resume()     // resume all jobs of the array, partial failures in
                      // changing the state are ok
    void hold()       // put a queued bulk job on hold
    void release()    // release an array job on hold
    void terminate()  // terminate a running job
    Job waitAnyStarted(in TimeAmount timeout)    // similar to JobSession function
    Job waitAnyTerminated(in TimeAmount timeout) // similar to JobSession function
};
==== snip ===
Fetching status information only makes sense at the job level, so the
corresponding getInfo() call is not part of the JobArray interface.
I would also resist the temptation to add a JobArray counterpart of
getJobs(JobInfo filter), since the filter semantics would become horrible to
specify.
All functions should be implementable with the 'loop' fallback in the
library, when we allow partial success in the bulk control functions.
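A minimal sketch of such a 'loop' fallback, for illustration only. The `Job` and `JobArray` classes, the `ControlError` exception and the state strings are hypothetical stand-ins invented for this sketch. Returning the list of failures is also an assumption; the interface above declares suspend() as void.

```python
class ControlError(Exception):
    """Hypothetical error raised when a single job cannot change state."""


class Job:
    def __init__(self, job_id, state="RUNNING"):
        self.job_id = job_id
        self.state = state

    def suspend(self):
        # Only a running job can be suspended in this toy model.
        if self.state != "RUNNING":
            raise ControlError(f"job {self.job_id} is {self.state}")
        self.state = "SUSPENDED"


class JobArray:
    def __init__(self, job_array_id, jobs):
        self.job_array_id = job_array_id
        self.jobs = list(jobs)

    def suspend(self):
        """Loop fallback: try every job and tolerate partial failures.

        Returns a list of (job_id, reason) pairs for jobs left unchanged.
        """
        failures = []
        for job in self.jobs:
            try:
                job.suspend()
            except ControlError as exc:
                failures.append((job.job_id, str(exc)))
        return failures
```

A DRM system with native array control could replace the loop with one bulk call while keeping the same partial-success behaviour.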
DRMAA folks, your comments please. Is this a feasible interface for the
DRM systems mentioned that support direct job array control?
Best,
Peter.
The newer API specification does look a great deal better, and obviously I
came up with some irrelevant questions.
I'll let you decide what you think about those issues I mentioned that are
still relevant, but first I want to elaborate a little bit on
the job-arrays feature, which is the most crucial feature for us.
When dealing with job arrays, each task actually has two IDs (the ID of the
whole job-array, and the index of the task within the job-array).
Therefore, in job-arrays, all of the queries and actions that the current
DRMAA specification performs on jobs are actually performed on tasks, which
are identified by two IDs instead of one and are otherwise just like single
jobs.
Everything I have said so far doesn't make any significant difference, and
is only a matter of terminology. But the important thing about job-arrays is
the ability to perform inclusive queries and operations on them.
For example, one can terminate all of the tasks in a job-array using a
single command (supplying only the ID of the whole job-array, without
needing to give the ID of each task, which might be very tedious for
users).
An example of more advanced logic that one might want to perform on
job-arrays is to rerun all the failed tasks in a given job-array.
Another might be to limit the number of tasks that may run simultaneously
in a job-array (for example, submitting a job-array containing 1000 tasks,
where only 10 tasks are allowed to run at any given time).
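To make the two examples concrete, here is a small sketch of the selection logic only. The function names, the state strings and the dict that maps a task index to its state are assumptions made for this sketch, not part of DRMAA or of any scheduler.

```python
def failed_indices(task_states):
    """Return the sorted indices of tasks whose state is FAILED.

    `task_states` maps a task index within the job-array to a state string.
    """
    return sorted(i for i, s in task_states.items() if s == "FAILED")


def startable(task_states, max_running):
    """Return the queued task indices that may start now.

    At most `max_running` tasks of the array may run at the same time.
    """
    running = sum(1 for s in task_states.values() if s == "RUNNING")
    free_slots = max(0, max_running - running)
    queued = sorted(i for i, s in task_states.items() if s == "QUEUED")
    return queued[:free_slots]
```

With a job-array ID, the result of `failed_indices` is enough to name the tasks to resubmit, since each task is identified by the array ID plus its index.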
The greatest advantage of job-arrays is that users can "remember" many
tasks with a single ID, which is impossible when submitting many single
jobs.
Many schedulers (like LSF) support all these features, and a growing number
of schedulers implement them.
We believe that DRMAA should support these features as well, by being more
"job-arrays oriented". I truly believe that DRMAA will be better if it
supports job-arrays.
2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>
Post by Mariusz Mamoński
Hi Nadav,
Post by Nadav Brandes
Hello everyone,
I went over your API description with my team (as described in
http://www.drmaa.org/drmaav2_draft5.pdf).
http://wikis.sun.com/display/DRMAAv2/Home
Post by Nadav Brandes
- Can one get a 'Job' object representing a job already submitted
once, given only the job index (as an integer)?
    sequence<Job> getJobs(JobInfo filter);
which, as I remember, is not constrained to jobs submitted via DRMAA.
Post by Nadav Brandes
- It seems like the 'JobInfo' interface is missing a few parameters given
in the 'JobTemplate' interface. For example, can one get the 'remoteCommand'
of a job that was already submitted, given only a 'Job' object and not the
'JobTemplate'?
- Does DRMAA support the job-arrays feature (meaning submitting a group
of tasks in one job that has a single ID)? Most schedulers support this
feature (including LSF, Moab and SGE). You do have a 'runBulkJobs' feature
that sends a sequence of jobs altogether, but it also returns a sequence of
'Job' objects, and not a single job with a single ID.
IMHO most batch systems return many job IDs for job arrays, but
they allow some operations to be performed on the whole array
(bulk) by giving the common suffix of those job IDs. Having one job ID,
and thus one Job, complicates the state model (what if half of the array
sub-jobs are running and the rest queued? What should be the state of
the whole array job?)
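The ambiguity can be shown with a short sketch. The helper names and state strings are hypothetical and chosen for this illustration: a single state is only well defined when all sub-jobs agree, whereas a per-state count is always available.

```python
from collections import Counter


def array_state_summary(sub_job_states):
    """Count how many sub-jobs are in each state."""
    return dict(Counter(sub_job_states))


def single_state(sub_job_states):
    """Return the common state if all sub-jobs share it, otherwise None."""
    distinct = set(sub_job_states)
    return distinct.pop() if len(distinct) == 1 else None
```

For an array that is half running and half queued, `single_state` has no answer to give, which is the case raised in the question above.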
Post by Nadav Brandes
- Does DRMAA support the notion of queues (a feature that is
supported by all of the schedulers I know)? We believe that it could be very
useful if one could determine a queue in 'JobTemplate', change the queue of
an existing job, and also get a list of all the queues in the cluster.
This was already addressed (wiki!), except for altering the target queue
of an already submitted job.
Post by Nadav Brandes
- Many batch systems have a feature that allows giving a 'project
name' to submitted jobs. We believe that it could also be very useful if
'JobTemplate' had such a field.
It has one: it is called accountingId
Post by Nadav Brandes
- Sometimes, especially when dealing with large clusters containing
a large number of compute nodes (some of which might be out of order),
jobs might fail randomly, for no apparent reason. We think it could be
useful if DRMAA supported a feature that allows rerunning failed jobs (as
many schedulers allow, like LSF). Such a 'rerun()' method could be added to
the 'Job' interface.
We have the rerunnable attribute of the JobTemplate, so one can configure
the batch system to rerun jobs that failed due to resource failure.
Post by Nadav Brandes
- Modern schedulers (like Moab and LSF) support advanced features of
memory management, cores management, and also general resources management
(like GPUs). In general, it means giving a list of required resources to
each submitted job (for example, submitting a job that requires 5 cores,
12GB RAM, and 2 GPUs). Then the scheduler knows how to schedule the jobs so
each running job will have all the resources it needs. If 'JobTemplate' had
a resources dictionary field, it could also be very useful.
Resources that are common to all schedulers are expressed as
JobTemplate attributes, e.g.: minPhysMemory
Other DRMS-specific options (also resource requirements)
should go to:
    attribute Dictionary drmsSpecific;  // must be supported
Post by Nadav Brandes
This is it for now, thanks for reading it.
thanks for providing your comments, and sorry that you lost much of
it would be better to delete reference to the September 2009, DRMAA2
Draft 5)
Post by Nadav Brandes
I would like to hear what you think.
Best Regards,
Nadav
2010/12/21 Peter Tröger <peter at troeger.eu>
Hi Nadav,
Now I saw the documentation of the planned interface for DRMAA2, and I
find it to be a great improvement, and very useful for my organization. I am
Do you know which distributed resource manager will be the first to
implement DRMAA2? (SGE maybe?) Also, do you have any estimation on when
it'll happen, and when will I be able to download a trial version of it?
Since we have the Oracle Grid Engine Product Manager as one of the
co-chairs, I leave the implementation estimation to you ;-) .... We also
have very capable people in Poznan, who might take care of non-OGE
systems. We expect to put out the spec in January, and from there, the group
can only hope. From experience, I would expect nothing useful before Summer
2011.
Is it still possible to suggest ideas that we have about the interface of
DRMAA2? If so, how is it done? Is it customary to share ideas in this forum,
or do you prefer it to be done through Wiki?
The best thing is to start a discussion on the list. The Wiki is good as
reference. Comments on the Wiki pages might get lost ...
Best regards,
Peter.
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
Best Regards,
--
Mariusz