[DRMAA-WG] Report from D-Grid conference
Peter Tröger
2010-03-25 23:44:08 UTC
Dear all,

this week, I gave a DRMAAv2 presentation at the conference of the
German grid initiative (D-Grid). Even though it was the last session
on the last day, attendance was pretty good. I got some interesting
remarks I wanted to share:

- Typical D-Grid installations have PBS or SGE, sometimes Torque. No
Condor. LSF is on the agenda.
- With the ability to check for core dump file existence in JobInfo,
they wondered if DRMAA could also offer to actually get this file.
- One user community in D-Grid typically has "pre-jobs" that prepare a
node for the real work with some software installation. DRMAAv2 with
its waitAnyTerminated() looked good enough for them.
- One request from the audience was automated re-queueing - if a job
goes to Failed state, it should be re-queued automatically. This is a
typical problem of massive-scale clusters and grids, where machine
outages are normal. Condor (of course) has that, I am not sure about
the others.
- Another commonly agreed request was intermediate result preview. The
problem is that some simulations run for hours, and you want to know
pretty early if it is worthwhile to complete the run. LSF has a
feature where you can look at a job's stdout while it runs, even with
non-interactive jobs. I don't know about other systems.
- One SLA expert in the auditorium was happy about the startTime /
endTime / duration approach in the AR template. He called that
"relaxed reservation".
- Another guest recommended GLUE2 as input for our monitoring
attributes. It's like JSDL and DCIM - everything optional, but maybe
good for semantics.
- It was requested that we check the monitoring attributes against
Globus MDS and Unicore TSI.

I was also asked about the time frame for DRMAAv2 implementations -
really. Not only the D-Grid audience seems to be highly interested in
using DRMAAv2, I got the same kind of feedback also at OGF28. I hope
this is enough motivation for everybody in the upcoming finalization
phase ...

Slides are attached, feel free to re-use them.

Best,
Peter.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: dgrid-dresden.pdf
Type: application/pdf
Size: 1386166 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/drmaa-wg/attachments/20100326/87c7c144/attachment-0001.pdf
-------------- next part --------------
Mariusz Mamoński
2010-03-26 10:34:26 UTC
Hi Peter,

It's very nice feedback. I took this mail as an opportunity to share
some comments.
Post by Peter Tröger
Dear all,
this week, I gave a DRMAAv2 presentation at the conference of the German grid
initiative (D-Grid). Even though it was the last session on the last day,
attendance was pretty good. I got some interesting remarks I wanted to share:
- Typical D-Grid installations have PBS or SGE, sometimes Torque. No Condor.
LSF is on the agenda.
- With the ability to check for core dump file existence in JobInfo, they
wondered if DRMAA could also offer to actually get this file.
Yes, this is also a use case of one of the DRMAA for LSF users. For
now this is realized by setting the core file limit in
nativeSpecification. This attribute is explicitly supported by both
SGE and LSF. I believe for Torque/PBSPro it could be implemented quite
easily on top of the DRMS. So why not add it as a JobTemplate attribute?
Post by Peter Tröger
- One user community in D-Grid typically has "pre-jobs" that prepare a node
for the real work with some software installation. DRMAAv2 with its
waitAnyTerminated() looked good enough for them.
- One request from the audience was automated re-queueing - if a job goes to
Failed state, it should be re-queued automatically. This is a typical
problem of massive-scale clusters and grids, where machine outages are normal.
Condor (of course) has that, I am not sure about the others.
For me this is only a DRMS configuration issue, not a DRMAA one.
However, as I remember, in many systems a job must be marked as
reRunnable in order for the DRM to do this (rerunning a job may cause
the partial results from the failed run to be overwritten). I will try
to do some research on this topic.
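To make the re-queueing idea concrete, here is a minimal Python sketch
of how a client-side loop could behave if the DRMS does not re-queue by
itself. It is only an illustration under stated assumptions: JobSpec,
its rerunnable field, the submit_and_wait callable and the "DONE" /
"FAILED" state strings are hypothetical names invented for this
example, not part of DRMAA or of any DRMS.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class JobSpec:
    """Hypothetical job description used only in this sketch."""
    command: str
    rerunnable: bool = False  # assumption: mirrors the DRMS "rerunnable" marking


def run_with_requeue(
    spec: JobSpec,
    submit_and_wait: Callable[[JobSpec], str],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Submit the job and resubmit it while it ends in "FAILED".

    submit_and_wait is a hypothetical callable that submits the job,
    blocks until it terminates and returns its final state string.
    Returns (final_state, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        state = submit_and_wait(spec)
        if state != "FAILED":
            return state, attempts
        # Never rerun a job that was not marked rerunnable: a second run
        # may overwrite partial results of the failed one.
        if not spec.rerunnable or attempts >= max_attempts:
            return state, attempts
```

The max_attempts bound is a design choice of the sketch, so that a job
which fails for reasons unrelated to machine outages is not resubmitted
forever.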
Post by Peter Tröger
- Another commonly agreed request was intermediate result preview. The
problem is that some simulations run for hours, and you want to know pretty
early if it is worthwhile to complete the run. LSF has a feature where you
can look at a job's stdout while it runs, even with non-interactive jobs. I
don't know about other systems.
We observed the same. This is also vital for SaaS use cases, as it
allows remote execution of an application to be emulated as a local
one. In LSF there is a special command/function in the API called
bpeek (needed because stdout/stderr are redirected to temporary files
until the job ends). In SGE, stdout/stderr are simply redirected into
the stdout/stderr file names given upon submission, so the user can
simply read them (tested!). Torque can be configured to do the same
(by default, during execution stdout/stderr are redirected to files in
the worker node spool directory, which is not accessible from the
front end).
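For the case where the stdout file is readable from the submission
host while the job runs (as described for SGE above), a preview could
be approximated by polling the file for newly appended bytes. A minimal
Python sketch, assuming a shared file system; read_new_output is a
hypothetical helper invented for this example, and nothing here uses
bpeek or any DRMAA call.

```python
import os
from typing import Tuple


def read_new_output(path: str, offset: int = 0) -> Tuple[str, int]:
    """Return the text appended to `path` since byte `offset`.

    The second element of the result is the new offset to pass in on
    the next poll. If the file does not exist yet (job not started),
    an empty string and the unchanged offset are returned.
    """
    if not os.path.exists(path):
        return "", offset
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Decode leniently: a poll may cut a multi-byte character in half.
    return data.decode("utf-8", errors="replace"), offset + len(data)
```

A caller would keep the returned offset and call the function again
every few seconds until the job reaches a terminal state.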
Post by Peter Tröger
- One SLA expert in the auditorium was happy about the startTime / endTime /
duration approach in the AR template. He called that "relaxed reservation".
- Another guest recommended GLUE2 as input for our monitoring attributes.
It's like JSDL and DCIM - everything optional, but maybe good for semantics.
A long time ago I was thinking about providing the monitoring info in
our service using the GLUE schema. I found it way too complex, but
maybe I did not put enough effort into understanding it...
Post by Peter Tröger
- It was requested that we check the monitoring attributes against Globus
MDS and Unicore TSI.
I was also asked about the time frame for DRMAAv2 implementations - really.
Not only the D-Grid audience seems to be highly interested in using DRMAAv2,
I got the same kind of feedback also at OGF28. I hope this is enough
motivation for everybody in the upcoming finalization phase ...
Slides are attached, feel free to re-use them.
Best,
Peter.
Cheers,
--
Mariusz