.TH CPUPOWER\-MONITOR "1" "22/02/2011" "" "cpupower Manual"
.SH NAME
cpupower\-monitor \- Report processor frequency and idle statistics
.SH SYNOPSIS
.ft B
.B cpupower monitor
.RB "\-l"

.B cpupower monitor
.RB [ -c ] [ "\-m <mon1>," [ "<mon2>,..." ] ]
.RB [ "\-i seconds" ]
.br
.B cpupower monitor
.RB [ -c ][ "\-m <mon1>," [ "<mon2>,..." ] ]
.RB command
.br
.SH DESCRIPTION
\fBcpupower-monitor \fP reports processor topology, frequency and idle power
state statistics. Either \fBcommand\fP is forked and
statistics are printed upon its completion, or statistics are printed periodically.

\fBcpupower-monitor \fP implements independent processor sleep state and
frequency counters. Some are retrieved from kernel statistics, others are
read directly from hardware registers. Use \-l to get an overview of which
monitors are supported on your system.

.SH OPTIONS
.PP
\-l
.RS 4
List available monitors on your system. Additional details about each monitor
are shown:
.RS 2
.IP \(bu
The name in quotation marks which can be passed to the \-m parameter.
.IP \(bu
The number of different counters the monitor supports in brackets.
.IP \(bu
The time in seconds after which the counters might overflow, due to
implementation constraints.
.IP \(bu
The name and a description of each counter and its processor hierarchy level
coverage in square brackets:
.RS 4
.IP \(bu
[T] \-> Thread
.IP \(bu
[C] \-> Core
.IP \(bu
[P] \-> Processor Package (Socket)
.IP \(bu
[M] \-> Machine/Platform wide counter
.RE
.RE
.RE
.PP
\-m <mon1>,<mon2>,...
.RS 4
Only display specific monitors. Use the monitor string(s) provided by the \-l option.
.RE
.PP
\-i seconds
.RS 4
Measurement interval in seconds.
.RE
.PP
\-c
.RS 4
Schedule the process on every core before starting and ending the measurement.
This may be needed for the Idle_Stats monitor when no MSR based monitor
(which has to run on the core that is measured) is run in parallel.
It wakes the processors up from deeper sleep states and lets the
kernel re-account its cpuidle (C-state) information before the
cpuidle timings are read from sysfs.
.RE
.PP
command
.RS 4
Measure idle and frequency characteristics of an arbitrary command/workload.
The executable \fBcommand\fP is forked and upon its exit, statistics gathered since it was
forked are displayed.
.RE
.PP
\-v
.RS 4
Increase verbosity if the binary was compiled with the DEBUG option set.
.RE

.SH MONITOR DESCRIPTIONS
.SS "Idle_Stats"
Shows statistics of the cpuidle kernel subsystem. Values are retrieved from
/sys/devices/system/cpu/cpu*/cpuidle/state*/.
The kernel updates these values every time an idle state is entered or
left. Therefore the values can be inaccurate when a core has already been in
an idle state for some time when the measurement starts or ends. In the worst
case a core stays in one idle state for the whole measurement and the idle
state usage time exported by the kernel is never updated. In this case
a state residency of 0 percent is shown although it was 100 percent.
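.PP
The following sketch illustrates how a residency could be derived from these
sysfs files. It is not the cpupower implementation: the function names
idle_times and residency_pct are hypothetical, and it assumes each state
directory provides the files "name" and "time" (time in microseconds).
.PP
.nf
```shell
# Print "<state name> <time in us>" for each cpuidle state of one CPU.
# $1: cpuidle directory, e.g. /sys/devices/system/cpu/cpu0/cpuidle
idle_times() {
    for state in "$1"/state*; do
        [ -r "$state/time" ] || continue
        echo "$(cat "$state/name") $(cat "$state/time")"
    done
}

# Residency in percent: time delta (us) over the measured interval (us).
# $1: time before, $2: time after, $3: interval length in microseconds
residency_pct() {
    echo $(( ($2 - $1) * 100 / $3 ))
}
```
.fi
.PP
Calling idle_times once before and once after an interval and passing each
pair of values to residency_pct gives the per-state residency. If a core never
leaves a state during the interval, the delta stays 0, which is the 0 percent
case described above.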

.SS "Mperf"
The name comes from the aperf/mperf (average and maximum) MSR registers,
which are available on recent X86 processors. The monitor shows the average
frequency (including boost frequencies).
Because the mperf timer stops ticking in any idle state on all recent
hardware, it is also used to show C0 (processor is active) and Cx (processor
is in any sleep state) times. These counters do not have the inaccuracy
restrictions the "Idle_Stats" counters may show.
The monitor may work poorly on Linux-2.6.20 through 2.6.29, as the
\fBacpi-cpufreq \fP kernel frequency driver periodically cleared aperf/mperf
registers in those kernels.
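.PP
The arithmetic behind such values can be sketched as follows. This is an
illustration under assumptions, not the cpupower implementation: the function
names avg_khz and c0_pct are hypothetical, the counter deltas are taken as
given, the reference frequency is assumed to be the rate at which mperf ticks
in C0, and the TSC is assumed to keep counting at that rate during idle.
.PP
.nf
```shell
# avg_khz REF_KHZ APERF_DELTA MPERF_DELTA
# Average frequency while not idle: the reference frequency scaled by
# the ratio of actual (aperf) to reference (mperf) cycles.
avg_khz() {
    echo $(( $1 * $2 / $3 ))
}

# c0_pct MPERF_DELTA TSC_DELTA
# Share of the interval spent in C0; the remainder is Cx time.
c0_pct() {
    echo $(( $1 * 100 / $2 ))
}

avg_khz 2000000 150 100   # prints 3000000 (boosted above the 2 GHz reference)
c0_pct 25 100             # prints 25 (75 percent of the interval in Cx)
```
.fi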

.SS "Nehalem" "SandyBridge" "HaswellExtended"
Intel Core and Package sleep state counters.
A thread (hyperthreaded core) may not be able to enter deeper core states if
its sibling is utilized.
The deepest package sleep states may in reality show up as machine/platform wide
sleep states and can only be entered if all cores are idle. Look up Intel
manuals (some are provided in the References section) for further details.
The monitors are named after the CPU family in which the sleep state capabilities
were introduced and may not exactly match the CPU name of the platform.
For example, an IvyBridge processor has sleep state capabilities which were
introduced in the Nehalem and SandyBridge processor families.
Thus on an IvyBridge processor one will get the Nehalem and SandyBridge sleep
state monitors.
The extra HaswellExtended package sleep state capabilities are available only on a
specific Haswell (family 0x45) and probably also on other future processors.

.SS "Fam_12h" "Fam_14h"
AMD laptop and desktop processor (family 12h and 14h) sleep state counters.
The registers are accessed via PCI and therefore can still be read out while
cores have been offlined.

There is one special counter: NBP1 (North Bridge P1).
It always returns 0 or 1, depending on whether the North Bridge P1
power state was entered at least once during the measurement.
Whether the NBP1 state can be entered also depends on graphics power management.
Therefore this counter can be used to verify whether the graphics driver's
power management is working as expected.

.SH EXAMPLES

"cpupower monitor \-l" may show:
.RS 4
Monitor "Mperf" (3 states) \- Might overflow after 922000000 s

   ...

Monitor "Idle_Stats" (3 states) \- Might overflow after 4294967295 s

   ...

.RE
cpupower monitor \-m "Idle_Stats,Mperf" scp /tmp/test /nfs/tmp

Monitor the scp command and show both the Mperf and Idle_Stats counter
statistics, but in swapped order.

Be careful: the typical command to fully utilize one CPU,

cpupower monitor cat /dev/zero >/dev/null

does not work as expected, because the measurement output is redirected to
/dev/null as well. This can be worked around by putting the cat line into
its own tiny shell script. Hit CTRL\-c to terminate the command and get the
measurement output displayed.
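.PP
A minimal sketch of that workaround follows. The script name burn_cpu.sh is a
hypothetical choice for illustration; cpupower does not require any
particular name or location.
.PP
.nf
```shell
# Hypothetical wrapper script name; any path works.
script=burn_cpu.sh

# Write the workload into its own script so that the redirection
# applies to cat only, not to the cpupower monitor output.
cat > "$script" <<'EOF'
#!/bin/sh
exec cat /dev/zero >/dev/null
EOF
chmod +x "$script"
```
.fi
.PP
Assuming the script is in the current directory, it could then be measured
with "cpupower monitor sh burn_cpu.sh" and terminated with CTRL\-c.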

.SH REFERENCES
"BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 14h Processors"
http://support.amd.com/us/Processor_TechDocs/43170.pdf

"Intel® Turbo Boost Technology
in Intel® Core™ Microarchitecture (Nehalem) Based Processors"
http://download.intel.com/design/processor/applnots/320354.pdf

"Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 3B: System Programming Guide"
http://www.intel.com/products/processor/manuals

.SH FILES
.ta
.nf
/dev/cpu/*/msr
/sys/devices/system/cpu/cpu*/cpuidle/state*/.
.fi

.SH "SEE ALSO"
powertop(8), msr(4), vmstat(8)
.PP
.SH AUTHORS
.nf
Written by Thomas Renninger <trenn@suse.de>

Nehalem, SandyBridge monitors and command passing
based on turbostat.8 from Len Brown <len.brown@intel.com>
o the card with FLUSH_CACHE operation.
And because busy end interrupt may be mistakenly cleared while not yet
processed, this mmc request may never complete.  As a result, mmcqd task
may be stuck forever.

Here is an instance caught by lockup detector which shows that mmcqd task
was hung while waiting for mmc_flush_cache command to complete:

..
[  240.251595] INFO: task mmcqd/1:52 blocked for more than 120 seconds.
[  240.257973]       Not tainted 4.1.13-00510-g9d91424 #2
[  240.263109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.270955] mmcqd/1         D c047504c     0    52      2 0x00000000
[  240.277359] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98)
[  240.284418] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188)
[  240.292191] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170)
[  240.300491] [<c0476040>] (wait_for_common) from [<c02efc1c>] (mmc_wait_for_req_done+0x4c/0x13c)
[  240.309224] [<c02efc1c>] (mmc_wait_for_req_done) from [<c02efd90>] (mmc_wait_for_cmd+0x64/0x84)
[  240.317953] [<c02efd90>] (mmc_wait_for_cmd) from [<c02f5b14>] (__mmc_switch+0xa4/0x2a8)
[  240.325964] [<c02f5b14>] (__mmc_switch) from [<c02f5d40>] (mmc_switch+0x28/0x30)
[  240.333389] [<c02f5d40>] (mmc_switch) from [<c02f0984>] (mmc_flush_cache+0x54/0x80)
[  240.341073] [<c02f0984>] (mmc_flush_cache) from [<c02ff0c4>] (mmc_blk_issue_rq+0x114/0x4e8)
[  240.349459] [<c02ff0c4>] (mmc_blk_issue_rq) from [<c03008d4>] (mmc_queue_thread+0xc0/0x180)
[  240.357844] [<c03008d4>] (mmc_queue_thread) from [<c003cf90>] (kthread+0xdc/0xf4)
[  240.365339] [<c003cf90>] (kthread) from [<c0010068>] (ret_from_fork+0x14/0x2c)
..
..
[  240.664311] INFO: task partprobe:564 blocked for more than 120 seconds.
[  240.670943]       Not tainted 4.1.13-00510-g9d91424 #2
[  240.676078] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.683922] partprobe       D c047504c     0   564    486 0x00000000
[  240.690318] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98)
[  240.697396] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188)
[  240.705149] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170)
[  240.713446] [<c0476040>] (wait_for_common) from [<c01f3300>] (submit_bio_wait+0x58/0x64)
[  240.721571] [<c01f3300>] (submit_bio_wait) from [<c01fbbd8>] (blkdev_issue_flush+0x60/0x88)
[  240.729957] [<c01fbbd8>] (blkdev_issue_flush) from [<c010ff84>] (blkdev_fsync+0x34/0x44)
[  240.738083] [<c010ff84>] (blkdev_fsync) from [<c0109594>] (do_fsync+0x3c/0x64)
[  240.745319] [<c0109594>] (do_fsync) from [<c000ffc0>] (ret_fast_syscall+0x0/0x3c)
..

Here is the detailed sequence showing when this issue may happen:

1) At probe time, mmci device is initialized and card busy detection based
on DAT[0] monitoring is enabled.

2) Later during run time, since the card reported support for internal
caches, an MMC_SWITCH command is sent to the eMMC device with FLUSH_CACHE
operation. On receiving this command, the eMMC may enter busy state (for a
relatively short time in the case of the dead-lock).

3) Then mmci interrupt is raised and mmci_irq() is called:

MMCISTATUS register is read and is equal to 0x01000440. So the following
status bits are set:
- MCI_CMDRESPEND (= 6)
- MCI_DATABLOCKEND (= 10)
- MCI_ST_CARDBUSY (= 24)

Since MMCIMASK0 register is 0x3FF, status variable is set to 0x00000040 and
BIT MCI_CMDRESPEND is cleared by writing MMCICLEAR register.

Then mmci_cmd_irq() is called. Considering the following conditions:
- host->busy_status is 0,
- this is a "busy response",
- reading again MMCISTATUS register gives 0x1000400,
MMCIMASK0 is updated to unmask MCI_ST_BUSYEND bit.

Thus, MMCIMASK0 is set to 0x010003FF and host->busy_status is set to wait
for busy end completion.

Back again in status loop of mmci_irq(), we quickly go through
mmci_data_irq() as there are no data in that case.  And we finally go
through following test at the end of while(status) loop:

/*
 * Don't poll for busy completion in irq context.
 */
if (host->variant->busy_detect && host->busy_status)
	status &= ~host->variant->busy_detect_flag;

Because status variable is not yet null (is equal to 0x40), we do not leave
interrupt context yet but we loop again into while(status) loop. So we run
across following steps:

a) MMCISTATUS register is read again and this time is equal to 0x01000400.
So the following bits are set:
- MCI_DATABLOCKEND (= 10)
- MCI_ST_CARDBUSY (= 24)

Since MMCIMASK0 register is equal to 0x010003FF:

b) status variable is set to 0x01000000.
c) MCI_ST_CARDBUSY bit is cleared by writing MMCICLEAR register.

Then, mmci_cmd_irq() is called one more time. Since host->busy_status is
set and MCI_ST_CARDBUSY is set in the status variable, we just return from
this function.

Back again in mmci_irq(), status variable is set to 0 and we finally leave
the while(status) loop. As a result we leave interrupt context, waiting for
busy end interrupt event.
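
The register arithmetic in the walk-through above can be checked with a few
lines of shell. This is only an illustrative sketch: the bit positions and
register values are the ones quoted in the text, while the helper name
mask_status is hypothetical and not part of the mmci driver.

```shell
# Bit positions named in the text above.
MCI_CMDRESPEND=$(( 1 << 6 ))
MCI_DATABLOCKEND=$(( 1 << 10 ))
MCI_ST_CARDBUSY=$(( 1 << 24 ))

# mask_status <MMCISTATUS> <MMCIMASK0>
# Print the status bits that remain after applying the interrupt mask.
mask_status() {
    printf '0x%08X' $(( $1 & $2 ))
    echo
}

# First pass: all three bits are set in MMCISTATUS (0x01000440).
first=$(( MCI_CMDRESPEND | MCI_DATABLOCKEND | MCI_ST_CARDBUSY ))
mask_status "$first" 0x3FF            # prints 0x00000040

# Second pass: busy end is unmasked, so the card busy bit gets through.
mask_status 0x01000400 0x010003FF     # prints 0x01000000
```

With the initial mask 0x3FF only MCI_CMDRESPEND survives, and once the mask
is 0x010003FF the card busy bit is what reaches the handler, which is the
bit that gets cleared in step c).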

Now, consider that busy end completion is raised IN BETWEEN steps 3.a) and
3.c). In such a case, we may mistakenly clear the busy end interrupt at step
3.c) while it has not yet been processed. As a result, the mmc command
waits forever for a busy end completion that will never happen.

To fix the problem, this patch implements the following changes:

Considering that the mmci seems to be triggering the IRQ on both edges
while monitoring DAT0 for busy completion and that same status bit is used
to monitor start and end of busy detection, special care must be taken to
make sure that both start and end interrupts are always cleared one after
the other.

1) Clearing of the card busy bit is moved into the mmci_cmd_irq() function,
where unmasking of the busy end bit is effectively handled.
2) Just before unmasking the busy end event, the busy start event is cleared
by writing the card busy bit in the MMCICLEAR register.
3) Finally, once we are no longer busy with a command, the busy end event is
cleared by writing the card busy bit in the MMCICLEAR register again.

This patch has been tested with the ST Accordo5 machine, which is not yet
supported upstream but relies on the mmci driver.

Signed-off-by: Sarang Mairal <sarang.mairal@garmin.com>
Signed-off-by: Jean-Nicolas Graux <jean-nicolas.graux@st.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>