RE: Problems with EDAC coexisting with BIOS

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




>-----Original Message-----
>From: Doug Thompson [mailto:[email protected]]
>Sent: Friday, April 21, 2006 2:43 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>[email protected]; Doug Thompson;
[email protected]
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark"  wrote:
>>
>> >-----Original Message-----
>> >From: Doug Thompson [mailto:[email protected]]
>> >Sent: Friday, April 21, 2006 1:57 PM
>> >To: Gross, Mark
>> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>> >[email protected]; [email protected]
>> >Subject: Re: Problems with EDAC coexisting with BIOS
>> >
>> >Mark thanks for the informaton on this.
>> >
>> >Now the e752x_edac.c driver contains no direct calls to panic within
>> >itself. The edac_mc.c 'core' piece does, but only calls that if an
UE
>> >error is found and panic on UE is enabled. (The other is a PCI
parity
>> >panic, but doesn't effect this path). It might be possible that
since
>> >the hidden register was now hidden, the retrieval function returns
some
>> >garbage which falsely triggers the panic by the core.
>>
>> The problem is that once the BIOS SMI re-hides the device reading
from
>> the chipset error registers returns 0xFFFFFFFF.
>
>Ok so the panic is caused by the driver because the UE bit is a one
>(somewhere in the 32-bit field), and then informs the CORE driver of
the
>false positive UE event, which does the panic trigger. I follow the
>logic now.
>
>The question then comes up: how to determine if the system has the
>hidden register feature.

For the E752x chipsets, if the Dev0:Fun1 is hidden after the BIOS hands
control to the OS, I think the driver should simply honor this and not
un-hide it.

The test is to check if bit 5 of the DEVPRES1 resister is set.  If set
the device is available, otherwise its hidden.  (see section 3.5.41 of
e7520 specification
ftp://download.intel.com/design/chipsets/datashts/30300602.pdf )

>
>You mentioned that the probe should be refactored to look for this all
>ones return value as such an indicator, is that correct?

Yes, basically we should consider removing the un-hide code in
e752x_probe1, and handle the case where the dev0:fun1 is left hidden by
the BIOS as an error condition at driver probe time.

>
>But there is a window that the register is unhidden, the probe runs,
>detects the hardware it is looking at and registers itself. Then later
>the window closes, the register is hidden again and the panic can
>reoccurr as above. Thus, a check needs to be made in the poll operation
>for the all ones return value, notify the log of this situtation,
>shutdown future polling of this hardware and basicly become a no-op
poll
>operation.

You can never predict when a SMI will bubble through the system.  Even
if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
do you deal with the race between the BIOS SMI based handling and the
driver?  Who will end up reading (and clearing) the error registers
first?  There is no good way to share today.


>
>Does that fix cover the issue as you have discovered it? Or is there
>more, or another fix you see?

No, you should return an error from e752x_probe1 if bit5 is cleared
already.  To do anything else you are assuming that you are privileged
to have more knowledge about what the BIOS is doing than you should.

I'll send the patch to e752x_edac.c this weekend.

--mgross

>
>doug t
>
>
>
>>
>> >
>> >I would like to see the panic output and stack trace to see where
that
>> >panic came from. It might have come from the PCI device access
subsytem
>> >when it was trying to access the now hidden register.
>> >
>> >can you post that panice information?
>>
>> Yes, but the edac_mc.c call to panic doesn't drop any data to the
>> console.
>> The following is the output from an instrumentation we did to the
EDAC
>> that went into RHEL4-U3 :
>>
>> INIT: version 2.85 booting
>>                 Welcome to Red Hat Enterprise Linux AS
>>                 Press 'I' to enter interactive startup.
>> Starting udev:  [  OK  ]
>> Initializing hardware...  storage network audioIn probe1, before
write
>> E752X_DEVPRES1 = 0x10
>> In probe1, after write
>> E752X_DEVPRES1 = 0x30
>>
>> Snip .....
>>
>> Configuring kernel parameters:  [  OK  ]
>> Setting clock  (localtime): Sun Mar 12 00:52:17 PST 2006 [  OK  ]
>> Setting hostname localhost.localdomain:  [  OK  ]
>> Checking root filesystem
>> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
>> /: clean, 365954/8830976 files, 2357630/17657443 blocks
>> [  OK  ]
>> Remounting root filesystem in read-write mode:  [  OK  ]
>> No devices found
>> No Software RAID disks
>> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
>> ment: E752X_DEVPRES1 = 0x10
>> Kernel panic - not syncing: MC0: Uncorrected Error
>>
>> that's all you get, no stack trace.
>>
>> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call
to
>> the following debug function we instrumented into the EDAC driver
code.
>>
>> static void print_error_info(struct pci_dev *pdev)
>> {
>>   u8 stat8;
>>   pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
>>   printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
>> }
>>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux