[DRMAA-WG] DRMAA2 Draft 6, next steps, no conf call

Discussion:

Peter Tröger

2011-06-21 21:14:20 UTC

Dear all,

after a very productive face-to-face meeting in Potsdam, we ended up
with the new draft 6 of the DRMAAv2 spec. Please find attached the
document. I would like to thank Mariusz, Daniel G. and Andre Merzky for
investing their time and effort.

The good news is that we were able to clarify all pending functional
issues. We are now in a sanity check phase, where the text itself gets
some proofreading to find inconsistencies.

Since at least three group members are now reading and editing, I
will drop the call for this week. If no serious (I mean *really*
serious) things are found, we will wrap up in a couple of days and
perform the official "last call" for comments on the list.

Besides that, we started some initial debate on the C binding. Please
understand that this discussion will go public only after the IDL spec
has been submitted, in order to avoid redundant efforts.

Best regards,
Peter.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drmaav2.pdf
Type: application/pdf
Size: 637769 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/drmaa-wg/attachments/20110621/1fcaa86d/attachment-0001.pdf

Mariusz Mamoński

2011-06-23 21:39:41 UTC

Hi,

Results of my proofreading (most of them fortunately minor ;-)

line 19: "The scope is limited to job submission, job control, and
retrival.." -> "The scope is limited to job submission, job control,
reservation management and retrival..."
line 100: act as execution -> act as an execution
line 206: mention JobTemplate
line 255: if possible, I would add the following statement. It does not
change anything, but it draws the reader's attention to an important
concept of DRM systems.

"It is worth mentioning that WALLCLOCK_TIME in most DRM systems is
not only a resource limit but also a key job attribute taken into
account in the scheduling process"

line 291: line runs past the margin
line 364: missing "\"
line 382: Maybe we should clarify that the value should be
normalized, e.g.:

"The load value MUST be always within the <0;1> range (inclusive). The
value 0 should indicate that machine is idling, while the 1 that all
computing units are used"

line 481: as JobSubState is an opaque object, passing
"sub-state is not supported by the impl.." may simply lead to a SEG FAULT
;-) so filtering by sub-state should be permitted if one knows
which implementation is used.

line 513: "The accumulated CPU time" -> "The accumulated, over all
job's processes, CPU time" (just a proposal)

line 686: expressed by the expressed by -> expressed by

line 762: "being allowed on one machine" -> "being allowed to run"
(@see maxSlots)

line 863: a Uns.. -> an Unsup...

line 878: missing space after "support"

line 890: missing space after "reservation."
line 895: missing space after "machines."

line 1069: Should we state that it is enough for session names to be
unique per (DRMS, user) tuple?

line 1097: Should we explicitly mention when one can call
destroySession? If yes, I would propose "only for not opened session".

line 1183: sessionName can be also generated by the implementation...

line 1374: what about job objects returned in the monitoring session?
Which session should be referred to then?

line 1384: maybe we should warn here that this operation might not be atomic.

footnote 39: "start and time" -> "start and end time"

line 1837: poznan -> poznan.pl

Cheers,

--
Mariusz

Andre Merzky

2011-06-23 21:48:38 UTC

Hi Mariusz,

some comments inlined :-)

Cheers, Andre.

Post by Mariusz Mamoński
"The load value MUST be always within the <0;1> range (inclusive). The
value 0 should indicate that machine is idling, while the 1 that all
computing units are used"

Sounds sensible to me, although I have often seen load values >1,
mostly indicating that a machine is overloaded. You may want to
change the MUST into a SHOULD thus?

Post by Mariusz Mamoński
line 1069: Should we state that it is enough for session names to be
unique per (DRMS, user) tuple?
line 1097: Should we explicitly mention when one can call
destroySession? If yes, I would propose "only for not opened session".

These two items together imply that it is an error if I open a session
in one application instance, and destroy it in another instance which
runs at the same time. Which instance will show the error? Both?
How is synchronization done?

The fundamental problem seems to be that the spec introduces stateful
sessions which do not (necessarily) have any state management in the
backend. If your library itself is maintaining the state, you will
introduce race conditions.
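
To make the race concrete, here is a minimal Python sketch. It is not taken from the DRMAA spec: it assumes a library that keeps its session state in files under a local directory, and the names try_open_session, close_session and state_dir are hypothetical. Atomic exclusive file creation lets exactly one instance claim a session, so a second concurrent open can be reported as an error instead of racing silently.

```python
import os


def try_open_session(state_dir: str, session_name: str) -> bool:
    """Return True if this process claimed the session, False if it is already open."""
    lock_path = os.path.join(state_dir, session_name + ".lock")
    try:
        # O_CREAT | O_EXCL makes the creation atomic: only one caller can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        # Record the owner so a human can inspect who holds the session.
        lock_file.write(str(os.getpid()))
    return True


def close_session(state_dir: str, session_name: str) -> None:
    """Release the session so that a later run can open it again."""
    os.remove(os.path.join(state_dir, session_name + ".lock"))
```

A crashed instance would leave a stale lock file behind, and exclusive creation may not be reliable on every network file system, so this only illustrates the idea and does not settle the question of who should own the state.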

Cheers, Andre.

--
Nothing is ever easy...

Mariusz Mamoński

2011-06-23 21:55:39 UTC

Post by Andre Merzky
Hi Mariusz,
some comments inlined :-)
Cheers, Andre.

Post by Mariusz Mamoński
"The load value MUST be always within the <0;1> range (inclusive). The
value 0 should indicate that machine is idling, while the 1 that all
computing units are used"

Sounds sensible to me, although I have often seen load values >1,
mostly indicating that a machine is overloaded. You may want to
change the MUST into a SHOULD thus?

I basically wanted to avoid the situation where this value is "number of
cores specific" ;-)

Post by Andre Merzky

These two items together imply that it is an error if I open a session
in one application instance, and destroy it in another instance which
runs at the same time. Which instance will show the error? Both?
How is synchronization done?

I think opening the same session **concurrently** in two applications
falls into "invalid usage".

Post by Andre Merzky
The fundamental problem seems to be that the spec introduces stateful
sessions which do not (necessarily) have any state management in the
backend. If your library itself is maintaining the state, you will
introduce race conditions.
Cheers, Andre.
--
Nothing is ever easy...

--
Mariusz

Andre Merzky

2011-06-23 22:16:23 UTC

Post by Mariusz Mamoński

Post by Andre Merzky
Hi Mariusz,

I think opening the same session **concurrently** in two applications
falls into "invalid usage".

Then that needs to be documented in the spec.

FWIW, this will be very hard on the end user. For example, tool developers
which build tools upon DRMAA have no control over how the tools are used,
and how instances are synchronized. This will be particularly difficult as
sessions are supposed to be persistent, and thus are *supposed* to be used
(i.e. opened) in different application instances.

I don't see a better solution - just saying. I guess at the end this will
only really work if the DRM system can support the session state's
persistence...

Cheers, Andre.

--
Nothing is ever easy...

Mariusz Mamoński

2011-06-24 05:24:12 UTC

Post by Andre Merzky

Post by Mariusz Mamoński

Post by Andre Merzky
Hi Mariusz,

I think opening the same session **concurrently** in two applications
falls into "invalid usage".

Then that needs to be documented in the spec.
FWIW, this will be very hard on the end user. For example, tool developers
which build tools upon DRMAA have no control over how the tools are used,
and how instances are synchronized. This will be particularly difficult as
sessions are supposed to be persistent, and thus are *supposed* to be used
(i.e. opened) in different application instances.

This is still possible, but sequentially rather than concurrently, and I
think it serves most of the use cases. I guess it would typically be the
same application but a different run. I think one of the ideas behind
introducing the restartable session concept in DRMAA 2.0 was that in
DRMAA 1.0 you had to (in theory) keep the application running as long
as you had some job in the system.
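
A minimal Python sketch of that sequential, restartable usage. It assumes a hypothetical library-side store that writes the job IDs of a named session to a JSON file; save_session, load_session and the file layout are illustrative only and not part of DRMAA.

```python
import json
import os
import tempfile


def save_session(store_path: str, session_name: str, job_ids: list) -> None:
    """Persist the job IDs of one session; other sessions in the store are kept."""
    sessions = {}
    if os.path.exists(store_path):
        with open(store_path) as store:
            sessions = json.load(store)
    sessions[session_name] = list(job_ids)
    # Write to a temporary file and rename it, so a crash cannot leave
    # a half-written store behind.
    directory = os.path.dirname(os.path.abspath(store_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(sessions, tmp)
    os.replace(tmp_path, store_path)


def load_session(store_path: str, session_name: str) -> list:
    """Return the saved job IDs; raises KeyError if the session is not in the store."""
    with open(store_path) as store:
        return json.load(store)[session_name]


# First run: submit jobs, remember them, exit.
#   save_session("sessions.json", "nightly", ["1001", "1002"])
# Later run of the same application: pick the jobs up again.
#   load_session("sessions.json", "nightly")
```

The read-modify-write in save_session is only safe when runs do not overlap, which is exactly the sequential case described here.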

Post by Andre Merzky
I don't see a better solution - just saying. I guess at the end this will
only really work if the DRM system can support the session state's
persistence...
Cheers, Andre.

--
Mariusz

Andre Merzky

2011-06-24 06:49:17 UTC

Hi again,

Post by Mariusz Mamoński

Post by Andre Merzky
FWIW, this will be very hard on the end user. For example, tool developers
which build tools upon DRMAA have no control over how the tools are used,
and how instances are synchronized. This will be particularly difficult as
sessions are supposed to be persistent, and thus are *supposed* to be used
(i.e. opened) in different application instances.

Yes, I agree that this is the most interesting use case.

Best, Andre.

--
Nothing is ever easy...

Daniel Gruber

2011-06-24 07:09:38 UTC

Post by Andre Merzky
Hi Mariusz,
some comments inlined :-)
Cheers, Andre.

Post by Mariusz Mamoński
"The load value MUST be always within the <0;1> range (inclusive). The
value 0 should indicate that machine is idling, while the 1 that all
computing units are used"

Sounds sensible to me, although I have often seen load values >1,
mostly indicating that a machine is overloaded. You may want to
change the MUST into a SHOULD thus?

I disagree! We agreed that the value is "similar to the uptime command".
Load values can indeed be bigger than 1 because they measure
the number of "runnable" processes on average. There is no need to
artificially normalize the value, because the maximum is
unknown. We should take whatever the DRM reports to us, and this
is similar to the uptime command (and by the way also depends on the
number of cores). This is what we agreed on.
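
To illustrate the difference between the two readings, a minimal Python sketch. The helper names per_core_load and local_load_report and the sample numbers are made up for illustration; os.getloadavg() is the standard-library call that returns the 1, 5 and 15 minute load averages on Unix systems.

```python
import os


def per_core_load(load_avg: float, cores: int) -> float:
    """Divide an uptime-style load average by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def local_load_report() -> dict:
    """Raw and per-core 1-minute load of the local machine (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return {"raw": one_minute, "per_core": per_core_load(one_minute, cores)}


# Made-up sample: on average 6 runnable processes on a 4-core machine.
print(per_core_load(6.0, 4))  # 1.5, so even the per-core value exceeds 1
```

Even after dividing by the core count the value is not bounded by 1 on an overloaded machine, so a strict <0;1> range could not be met without clamping.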

Daniel

Mariusz Mamoński

2011-06-24 07:56:51 UTC

Post by Daniel Gruber

Post by Andre Merzky
Hi Mariusz,
some comments inlined :-)
Cheers, Andre.

Post by Mariusz Mamoński
"The load value MUST be always within the <0;1> range (inclusive). The
value 0 should indicate that machine is idling, while the 1 that all
computing units are used"

Sounds sensible to me, although I have often seen load values >1,
mostly indicating that a machine is overloaded. You may want to
change the MUST into a SHOULD thus?

OK, you convinced me. Let's leave this as it is.

--
Mariusz