Discussion:
[DRMAA-WG] Load average interval?
Peter Tröger
2010-03-22 13:05:23 UTC
Hi,

next remaining thing from OGF28:

We support the determination of the machineLoad average in the MonitoringSession interface. At OGF, we could not agree on which of the typical intervals (1/5/15 minutes) we want to use here. Maybe all of them?

Best,
Peter.

Daniel Gruber
2010-03-22 13:41:58 UTC
From the SGE point of view, all three uptime(1) values can be used. The
standard load value is the 5-minute average. We also have a normalized
load, which divides the load by the number of installed processors (in
order to compare the load between different server systems). The
normalized load is again available for all three load values.
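
For illustration, the normalization is just a division (a hypothetical
helper, not part of the SGE or DRMAA API):

    // Hypothetical helper: normalizes a raw load average by the number
    // of installed processors so that values are comparable between
    // machines of different sizes.
    static double normalizedLoad(double rawLoad, int installedProcessors) {
        return rawLoad / installedProcessors;
    }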

I'm voting for having at least all 3 uptime load values.

Cheers

Daniel
Daniel Templeton
2010-03-22 15:02:16 UTC
SGE tends to look at the 5-minute average, although any of them can be
configured. You could solve it the same way we did for SGE -- offer
three attributes: machineLoadShort, machineLoadMed, machineLoadLong.
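
A rough Java-style sketch of what that could look like (the signatures
are illustrative assumptions, not an agreed DRMAA definition):

    // Illustrative only -- not an agreed DRMAA interface.
    // One load attribute per uptime(1) interval, as in SGE.
    public interface MonitoringSession {
        double getMachineLoadShort(String machineName); // 1-minute average
        double getMachineLoadMed(String machineName);   // 5-minute average
        double getMachineLoadLong(String machineName);  // 15-minute average
    }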

Daniel
Peter Tröger
2010-03-23 16:48:11 UTC
Any non-SGE opinion?

And BTW, by using the uptime(1) load semantics, we lose Windows
support. There is no such attribute there; load is measured as a
percentage of non-idle time and has no direct relationship to the
ready-queue length.

Best,
Peter.
Peter Tröger
2010-03-23 20:51:12 UTC
Post by Peter Tröger
Any non-SGE opinion?
Here is mine:

I could find only a single source that explains where the Condor load
average comes from :)

http://www.patentstorm.us/patents/5978829/description.html

Condor provides only the 1-minute load average from the uptime command.

Same holds for Moab:
http://www.clusterresources.com/products/mwm/docs/commands/checknode.shtml

And PBS:
http://wiki.egee-see.org/index.php/Installing_and_configuring_guide_for_MonALISA

And MAUI:
https://psiren.cs.nott.ac.uk/projects/procksi/wiki/JobManagement

I vote for reporting only the 1-minute load average.

/Peter.
Daniel Templeton
2010-03-23 22:00:16 UTC
That's fine with me.

Daniel
Mariusz Mamoński
2010-03-25 14:03:44 UTC
Also fine with me. As we are talking about the monitoring interface, I
propose two more changes to the machine monitoring interface:

1. Have a new data struct called "MachineInfo" with attributes like
Load, PhysMemory, ... and a getMachineInfo(in String machineName)
method in the Monitoring interface. Rationale: the same as for JobInfo
(consistency; fetching all machine attributes at once is more natural
in DRMS APIs than querying for each attribute separately). See the
sketch after this list.

2. Change machineCoresPerSocket to machineCores: if one has
machineSockets, he or she can easily derive machineCoresPerSocket. The
problem with the current API is that if the DRM does not support
"machineSockets" (as far as I checked, only LSF provides this two-level
granularity, @see the Google Doc), we lose the most essential
information: "how many single processing units do we have on a single
machine?"
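
A minimal Java-style sketch of both points (the attribute set and
names are assumptions, not an agreed DRMAA definition):

    // Illustrative sketch only -- not an agreed DRMAA definition.
    public class MachineInfo {
        public String machineName;
        public double load;           // machine load average
        public long   physMemory;     // physical memory in bytes
        public int    machineCores;   // total processing units (point 2)
        public int    machineSockets; // optional; not every DRMS has it
    }

    public interface Monitoring {
        // Fetches all attributes of one machine in a single call (point 1).
        MachineInfo getMachineInfo(String machineName);
    }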

Cheers,
--
Mariusz
Daniel Templeton
2010-03-25 14:44:09 UTC
I would tend to agree that total core count is more useful. SGE also
reports socket count as of 6.2u5, by the way. (That's actually thanks
to our own Daniel Gruber.)

Daniel
Daniel Gruber
2010-03-25 15:50:17 UTC
I would also vote for the total number of cores and sockets :)

We could also think about reporting the number of concurrent threads
that are supported by the hardware (hyperthreading in the case of
Intel, or chip multithreading in the case of Sun T2 processors). This
would spare the user from puzzling out what is meant by a core (is it
a real one, or the hyperthreading/CMT thing?).

If not, we should at least define that a core is really a physical core.

Daniel
Peter Tröger
2010-03-26 00:02:41 UTC
Condor usually reports the number of cores including hyperthreaded
ones, which conforms to the 'concurrent threads' metric Daniel
proposed. To my (negative) surprise, they report nothing else:

http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#16294

If we only look at this case, the corresponding attribute could be
named 'supportedSlots', since we established the understanding of
slots as resources for concurrent job activities / threads /
processes. The sockets attribute would not be implementable in Condor.
The value of the cores attribute could only be guessed
(supportedSlots/2).

But Condor is not our primary use case ;-)

/Peter.
Post by Daniel Gruber
I would also vote for the total amount of cores and sockets :)
We could also think about reporting the amount of concurrent
threads that are supported by the hardware (hyperthreading in
case of Intel or chip-multithreading in case of Sun T2 processors).
This could prevent the user for puzzling out what is meant by
a core (is it a real one, or the hyperthreading/CMT thing).
If not we should at least define that a core is really a physical core.
Daniel
Post by Daniel Templeton
I would tend to agree that total core count is more useful. SGE also
reports socket count as of 6.2u5, by the way. (That's actually thanks
to our own Daniel Gruber.)
Daniel
Post by Mariusz Mamoński
Also for me. As we are talking about monitoring interface i propose
1. Having a new data struct called "MachineInfo" with attributes like
Load, PhysMemory, ... and getMachineInfo(in String machineName) method
in the Monitoring interface. Rationale: the same as for the JobInfo
(consistency issue, fetching all machines attributes at once is more
natural in DRMS APIs then querying for each attribute separately)
2. change machineCoresPerSocket to machinesCores, if one have
machineSockets he or she can easily determine the
machineCoresPerSocket. The problem with the current API is that if the
DRM do not support "machineSockets" (as far i checked only LSF provide
essential information: "how many single processing units do we have on
single machine?"
Cheers,
On 23 March 2010 23:00, Daniel
Post by Daniel Templeton
That's fine with me.
Daniel
Post by Peter Tröger
Post by Peter Tröger
Any non-SGE opinion ?
I could only find one single source that explains the load average
source in Condor :)
http://www.patentstorm.us/patents/5978829/description.html
Condor provides only the 1-minute load average from the uptime command.
http://www.clusterresources.com/products/mwm/docs/commands/checknode.shtml
http://wiki.egee-see.org/index.php/Installing_and_configuring_guide_for_MonALISA
https://psiren.cs.nott.ac.uk/projects/procksi/wiki/JobManagement
I vote for reporting only the 1-minute load average.
/Peter.
Post by Peter Tröger
And BTW, by using the uptime(1) load semantics, we loose Windows
support. There is no such attribute there, load is measured in
percentage of non-idle time, and has no direct relationship to the
ready queue lengths.
Best,
Peter.
Post by Daniel Templeton
SGE tends to look at the 5-minute average, although any can be
configured. You could solve it the same way we did for SGE -- offer
three: machineLoadShort, machineLoadMed, machineLoadLong.
Daniel
Post by Peter Tröger
Hi,
We support the determination of machineLoad average in the
MonitoringSession interface. At OGF, we could not agree on which of
the typical intervals (1/5/15 minutes) we want to use here. Maybe
all of them ?
Best,
Peter.
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
--
drmaa-wg mailing list
drmaa-wg at ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
Andre Merzky
2010-03-26 00:09:56 UTC
Permalink
Post by Peter Tröger
The value of the cores attribute could only be guessed (supportedSlots/2).
Please don't hardcode that number '2': it is only valid for Intel's
Hyper-Threading, and only at this point in time... ;-)

Anyway: if one has to choose, the hardware threads are likely more
useful than the cores, IMHO, although learning both, or even the full
hierarchy (nodes/sockets/cores/threads), would simply be nice...
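
A Java-style sketch of what reporting the full hierarchy could look
like (the type and attribute names are hypothetical):

    // Hypothetical attribute set for the full hardware hierarchy;
    // threadsPerCore generalizes over Hyper-Threading and CMT instead
    // of hardcoding a factor of 2.
    public class MachineTopology {
        public int sockets;         // physical CPU packages
        public int coresPerSocket;  // physical cores per package
        public int threadsPerCore;  // hardware threads per core

        // Total number of concurrent hardware threads on the machine.
        public int totalThreads() {
            return sockets * coresPerSocket * threadsPerCore;
        }
    }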

Best, Andre.
--
Nothing is ever easy.
Daniel Templeton
2010-03-26 14:36:37 UTC
The concept of slots in SGE is only loosely bound to the CPU
architecture. We assume a slot per thread or core, but it's only a
suggestion. Administrators can configure an arbitrary number of slots.
For example, the 1-node test cluster running on my workstation
currently has over 200 slots on a dual-core machine.

Daniel
Mariusz Mamoński
2010-03-26 14:42:39 UTC
Is it common to observe production systems that permit
oversubscription of CPUs? We can always add slots as a machineInfo
attribute in addition to (or instead of) cpus/cores.
--
Mariusz
Daniel Templeton
2010-03-26 14:52:57 UTC
I think slots is a concept that is out of scope for DRMAA. There's
absolutely zero value in reporting slot counts in SGE unless you're also
going to report the queue configurations and resource policies, because
the total number of slots is almost never available for simultaneous use.

Daniel
Mariusz Mamoński
2010-03-26 15:21:53 UTC
I meant the number of slots for "simultaneous use", i.e. the
system/machine capacity counted as the maximum number of
single-process jobs allowed to run concurrently. Sorry, but I'm
slightly confused by the "slots is a concept that is out of scope for
DRMAA" statement (I read DRMAA here as a DRMS API): what do you give
upon submission as an argument of the parallel environment --
cores/CPUs?
--
Mariusz
Daniel Templeton
2010-03-26 19:32:04 UTC
I meant from the monitoring side. Even reporting the number of
simultaneously available slots in SGE isn't particularly useful. To
understand the meaning of the slot count, you have to understand the
configuration of the scheduler, and that is clearly out of bounds for DRMAA.

For the submission of a parallel job, the concept of slots is not really
required. Parallel jobs just need to specify how many slaves there are
and where they should run (i.e. how many per machine). How that relates
to slots in the DRMS is unimportant.
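
For illustration, a submission along those lines could look like the
following sketch (createJobTemplate(), setRemoteCommand() and runJob()
exist in the DRMAA Java binding; the slave-count and distribution
setters are hypothetical, since DRMAA has no agreed names for them):

    // Hypothetical sketch: the job declares its slave count and
    // per-machine distribution directly, without reference to
    // DRMS-specific slot counts. 'session' is an open DRMAA Session.
    JobTemplate jt = session.createJobTemplate();
    jt.setRemoteCommand("mpirun");
    jt.setSlaveCount(16);       // hypothetical attribute
    jt.setSlavesPerMachine(4);  // hypothetical attribute
    session.runJob(jt);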

I get that it would be nice to be able to expose slots in a useful way
via DRMAA, but I'm doubtful that it's possible. Just like queues, the
meaning (or really, the application) of slots is too DRM-specific.

Daniel
Peter Tröger
2010-03-25 23:47:52 UTC
Post by Mariusz Mamoński
1. Having a new data struct called "MachineInfo" with attributes like
Load, PhysMemory, ... and a getMachineInfo(in String machineName)
method in the Monitoring interface.
Sounds OK, as long as we don't want to apply the JobTemplate extension
mechanism for implementation-specific attributes to machine monitoring
as well.

Best,
Peter.