That time when one of my HP-UX servers lost half of its RAM (and how to connect to an HP iLO 2 with modern OpenSSH (7.6+))

Published: 06-06-2018 | Author: Remy van Elst



One of my favorite sayings is: 'Hardware is stupid, move everything to the cloud!'. The cloud is just someone else's computer, but at least I'm not responsible for the hardware anymore, since hardware breaks. When a VM breaks, because you use configuration management and version control, you just roll out a new one. We all know that's not true, but still, the thought of it is nice. A man can have hopes and dreams, even if the harsh reality shoots them down every time.

Last week one of the HP-UX machines had a failing disk and this week it's back with a whole new issue. After it was rebooted (due to issues with the services running on it), the Event Monitoring Service (EMS) sent an email regarding RAM issues, and after manual checking it seems the machine lost half of its RAM.

It should have 16 GB and now it only has 8 GB. You might imagine my surprise. This post goes into my troubleshooting, since I was not able to go to the machine, shut it down and check if the RAM was still there or note part numbers. I'll cover the use of cstm (Support Tool Manager), how to connect to the HP iLO (out of band access) with modern OpenSSH (7.6) and the steps I took to gather information on what might have happened.

This machine is not under monitoring yet, therefore it is regularly visited and checked manually. You could call that monitoring as well, but I'd rather have Icinga (or any other tool) doing the work for me. An intern doing said things is not considered a tool, sadly (this is a joke). The machine furthermore is part of an older setup (10 years) and is being replaced, so there is not a lot of budget or time allocated. I like to learn new things, so I seized the opportunity to expand my UNIX knowledge, allocating some of my own time to research (and hopefully fix) these issues. Read my other HP-UX articles as well if you like that kind of stuff.

Now that you know some background, let's dive in.

DRAM failure on DIMM XX, deallocate rank

It all started with this nice email from the HP-UX system:

    Subject: hpux09: Event Monitor Notification

    >------------ Event Monitoring Service Event Notification ------------<

    Notification Time: Fri Jun  1 10:59:36 2018

    hpux09 sent Event Monitor notification information:

    /system/events/ipmi_fpl/ipmi_fpl is >= 3.
    Its current value is MAJORWARNING(3).

    Event data from monitor:

    Event Time..........: Fri Jun  1 10:59:36 2018
    Severity............: MAJORWARNING
    Monitor.............: fpl_em
    Event #.............: 890
    System..............: hpux09

    Summary:
         DRAM failure on DIMM XX, deallocate rank

    Description of Error:
    SFW has detected that a DRAM is failing on the DIMM specified by the physical location. The rank the failing DIMM is part of will be deallocated.

    Probable Cause / Recommended Action:
    SFW detected a failing DIMM
    Replace the DIMM flagged by SFW

    Additional Event Data:
         System IP Address...: 198.51.100.30
         Event Id............: 0x5b110af800000000
         Monitor Version.....: A.01.00
         Event Class.........: System
         Client Configuration File...........:
         /var/stm/config/tools/monitor/default_fpl_em.clcfg
         Client Configuration File Version...: A.01.00
              Qualification criteria met.
                   Number of events..: 1
         Associated OS error log entry id(s):
              None
         Additional System Data:
              System Model Number.............: ia64 hp server rx3600
              EMS Version.....................: A.04.20.31.02
              STM Version.....................: D.04.00
              System Serial Number............: DEH48456BK
         Latest information on this event:
              http://docs.hp.com/hpux/content/hardware/ems/fpl_em.htm#890

    v-v-v-v-v-v-v-v-v-v-v-v-v    D  E  T  A  I  L  S    v-v-v-v-v-v-v-v-v-v-v-v-v

    IPMI event hex: 0x7a800fa000e00250 0xffffffff010cff74
    Time Stamp: Fri Jun  1 08:44:24 2018
    Event keyword: MEM_CHIPSPARE_DEALLOC_RANK
    Alert level name: Warning
    Reporting vers: 1
    Data field type: Reserved
    Decoded data field:
    Reporting entity ID: 0 ( Cab 0 Cell 0 CPU 0 )
    Reporting entity Full Name: System Firmware
    IPMI Event ID : 4000 (0xfa0)

    >---------- End Event Monitoring Service Event Notification ----------<

The HP-UX error message is not very helpful, since no actual DIMM location is given:

    DRAM failure on DIMM XX, deallocate rank
    SFW has detected that a DRAM is failing on the DIMM specified by the physical location. The rank the failing DIMM is part of will be deallocated.

The machine has no DIMM slot XX. The purchase order stated that the machine came with 16 GB of RAM installed. It now only reported 8 GB of memory.

My first thought was to search the part number and purchase replacement modules (or search our own stash of "old stuff"), then follow my regular procedure for RAM issues, only executing the next step if the previous action did not resolve the issue. I've had my fair share of Dell hardware issues (looking at you, noncritical RAID controller error after upgrading OpenManage...) so my procedure has proven itself.

Since this machine is not under a support contract anymore, steps 1 and 5 are off the table. HP requires an active support contract to get firmware updates. I do not have access to replacement DIMMs and I was not able to go to the machine to do a DIMM swap (or note part numbers).

The only thing left to do was to dig into the issue and gather as much logging and information as possible, to prepare for a visit to the machine.

However, first I ended up researching a term unfamiliar to me, Memory Ranks.

Memory Ranking

I had never heard of the term Rank in RAM context, but Wikipedia came to the rescue:

Sometimes memory modules are designed with two or more independent sets of DRAM chips connected to the same address and data buses; each such set is called a rank.

There is even an article on Memory Ranking:

A memory rank is a set of DRAM chips connected to the same chip select, which are therefore accessed simultaneously.

That still didn't tell me much, however correct it might be. This site had a very helpful and practical explanation:

The term "rank" simply refers to a 64-bit chunk of data. In its simplest form, a DIMM with DRAM chips on just one side would contain a single 64-bit chunk of data and would be called a single-rank (1R) module. DIMMs with chips on both sides often contain at least two 64-bit chunks of data and are referred to as dual-rank (2R) modules. Some DIMMs can have DRAM chips on both sides but are configured so that they contain two 64-bit data chunks on each side (four in total) and are referred to as quad-rank (4R) modules. Quad-rank DIMMs run at a maximum PC3-8500 (DDR3-1066) speed in current architecture.
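To make the 64-bit figure a bit more concrete: the number of DRAM chips in one rank is the rank's data width divided by the width of a single chip. The sketch below does that arithmetic. The chip widths (x4, x8) and the 72-bit width of ECC modules are general DRAM conventions rather than something taken from this server's documentation, and `chips_per_rank` is a made-up helper name.

```shell
# chips_per_rank RANK_WIDTH_BITS CHIP_WIDTH_BITS
# Hypothetical helper: how many DRAM chips together form one rank.
chips_per_rank() {
    echo $(( $1 / $2 ))
}

chips_per_rank 64 8   # non-ECC rank built from x8 chips, prints 8
chips_per_rank 72 4   # ECC rank (64 data + 8 check bits) from x4 chips, prints 18
```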

Now that I had a better understanding of what a memory rank is, my suspicion was that there is one failed memory module and the rest of the modules (after that failed module) are not loaded anymore.

Support Tool Manager

Using cstm I hoped to find the part number of the DIMMs so I could at least order a few replacement modules. According to multiple HPE forum posts, cstm should report the part number with this command:

    echo "selclass qualifier memory;info;wait;infolog" | /usr/sbin/cstm

In my case, the output did not contain the part number:

    Running Command File (/usr/sbin/stm/ui/config/.stmrc).

    -- Information --
    Support Tools Manager

    Version D.04.00
    Product Number B4708AA
    (C) Copyright Hewlett Packard Co. 1995-2008
    All Rights Reserved

    Use of this program is subject to the licensing restrictions described
    in "Help-->On Version".  HP shall not be liable for any damages resulting
    from misuse or unauthorized use of this program.

    cstm>selclass qualifier memory;info;wait;infolog
    -- Updating Map --
    Updating Map...
    -- Converting a (4036) byte raw log file to text. --
    Preparing the Information Tool Log for IPF_MEMORY on path memory File ...
    .... hpux09  :  198.51.100.30 ....
    -- Information Tool Log for IPF_MEMORY on path memory --
    Log creation time: Fri Jun  1 12:13:01 2018
    Hardware path: memory

    Basic Memory Description
       Module Type: MEMORY
       Page Size: 4096 Bytes
       Total Physical Memory: N/A
       Total Configured Memory: 8192 MB
       Total Deconfigured Memory: N/A

    Memory Board Inventory
       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 0 DIMM 0A          2048         Ext 0 DIMM 0B          2048
       Ext 0 DIMM 0C          2048         Ext 0 DIMM 0D          2048
       Ext 0 DIMM 1A          ----         Ext 0 DIMM 1B          ----
       Ext 0 DIMM 1C          ----         Ext 0 DIMM 1D          ----
       Ext 0 DIMM 2A          ----         Ext 0 DIMM 2B          ----
       Ext 0 DIMM 2C          ----         Ext 0 DIMM 2D          ----
       Ext 0 Total: 8192 (MB)
       ===========================================================================
       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 1 DIMM 0A          ----         Ext 1 DIMM 0B          ----
       Ext 1 DIMM 0C          ----         Ext 1 DIMM 0D          ----
       Ext 1 DIMM 1A          ----         Ext 1 DIMM 1B          ----
       Ext 1 DIMM 1C          ----         Ext 1 DIMM 1D          ----
       Ext 1 DIMM 2A          ----         Ext 1 DIMM 2B          ----
       Ext 1 DIMM 2C          ----         Ext 1 DIMM 2D          ----
       Ext 1 Total: 0 (MB)
       ===========================================================================

    Memory Error Log Summary
       The memory error log is empty.

    Page Deallocation Table (PDT)
       The Page Deallocation Table is empty.
       PDT Entries Used: 0
       PDT Entries Free: 100
       PDT Total Size: 100

    -- Information Tool Log for IPF_MEMORY on path memory --
    View   - To View the file.
    Print  - To Print the file.
    SaveAs - To Save the file.
    Enter Done, Help, Print, SaveAs, or View: [Done] #.

It did give me a better idea of the physical memory layout.
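When the inventory is long, a small filter makes the populated slots easier to spot. The following is a minimal sketch that assumes the cstm output was saved to a file and that the inventory lines look like the ones in this post; `populated_dimms` and `total_mb` are hypothetical helper names, not cstm features.

```shell
# populated_dimms: read a cstm memory infolog on stdin and print
# "Ext N DIMM xy size" for every slot that reports a numeric size.
# Slots shown as '----' and the "Ext N Total:" lines are skipped.
populated_dimms() {
    awk '{
        for (i = 1; i + 4 <= NF; i++)
            if ($i == "Ext" && $(i+2) == "DIMM" && $(i+4) ~ /^[0-9]+$/)
                print $i, $(i+1), $(i+2), $(i+3), $(i+4)
    }'
}

# total_mb: sum the sizes (in MB) of all populated slots.
total_mb() {
    populated_dimms | awk '{ sum += $5 } END { print sum + 0 }'
}

# Usage, assuming the infolog was saved to /tmp/mem.log first:
#   populated_dimms < /tmp/mem.log
#   total_mb < /tmp/mem.log
```

Comparing the total against the amount of memory the machine is supposed to have is a quick way to notice a deconfigured bank.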

According to this post, the output with part number should look like this:

    DIMM  Location       Size(MB)  State   Serial Num       Part Num
    -------------------- --------  ------- ---------------- ------------------
    Cab 0 Cell 0 DIMM 0A 2048      Config  PRY07064US       A9846-60301
    Cab 0 Cell 0 DIMM 0B 2048      Config  PRY07063JF       A9846-60301
    Cab 0 Cell 0 DIMM 1A 2048      Config  PRY06372A2       A9846-60301
    -------------------- -------- ------- ---------------- ------------------

But, as can be seen, there are no part numbers in my output. I guess it's a version difference. I found another command to get ALL the hardware in the machine:

echo "selall;infolog;wait"|cstm

But apart from a ton of output, it did not contain any part numbers. The post did reference the following:

    You can get the memory dimm part number from cstm and also from GSP/MP
    Run the following command..
    GSP>cm
    GSP:CM>df ==> Select A for selecting all and D to dump.
    From this out put you would get the exact part number for the dimm and also other HW connected to this box.

The MP referred to here is the "HP Integrated Lights Out Management Processor", known as the iLO for short. Dell calls theirs iDRAC (integrated Dell Remote Access Controller) and on a SuperMicro server it's just called IPMI or OOB (out of band access). It provides a way to power on/off and troubleshoot the server when you cannot access it, often including a remote console.

Since I was not able to go to the machine and reboot into some kind of BIOS or iLO console, I had to resort to connecting via the web or SSH. The web interface was useless for gathering RAM information, so SSH was my last resort.

SSH with modern OpenSSH (7.6) to an HP iLO2

With good hope I connected to the iLO IP from my Ubuntu 18.04 box, only to be greeted by a happy little error message:

    $ ssh Admin@192.0.2.30
    Unable to negotiate with 192.0.2.30 port 22: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1

Time to configure some old settings. In my ~/.ssh/config file I started with the following:

    Host hpux09-ilo
      HostName 192.0.2.30
      KexAlgorithms diffie-hellman-group1-sha1

But of course, just a KexAlgorithms setting is not enough:

    $ ssh Admin@hpux09-ilo
    Unable to negotiate with 192.0.2.30 port 22: no matching cipher found. Their offer: aes128-cbc,3des-cbc

Let's add those ciphers to my ~/.ssh/config:

    Host hpux09-ilo
      HostName 192.0.2.30
      KexAlgorithms diffie-hellman-group1-sha1
      Ciphers aes128-cbc,3des-cbc

We know that the setting did something, because now it just fails with no helpful error:

    $ ssh Admin@hpux09-ilo
    Received disconnect from 192.0.2.30 port 22:11:  Client Disconnect
    Disconnected from 192.0.2.30 port 22

Lucky for me, the iLO 2 was already horribly old and insecure in 2013, as this post shows. Back then, with OpenSSH 6.2, there were problems connecting to the iLO. I'm on OpenSSH 7.6, so let's hope that their fix works for me as well.

In a firmware update for the iLO 2 some of these bugs are fixed, but not all of them. Quoting Oscar A. Perez (whose LinkedIn has listed "Senior Embedded System Engineer, 100% committed to make Embedded Systems reliable, safe and secure" for 15 years, so I guess he is legit), it will be hard to fix in the future due to the limited iLO 2 memory:

I had to make lots of changes to the mpSSH server code to get it to work with the new OpenSSH 6.2p1. I hope this is the last time we have to make changes like this one. iLO2 memory is very limited and already full so, we won't be able to spin new firmware releases, every time the OpenSSH folks decide to increase the size of the payload during Key Exchange.

Further down in the post I did find the correct OpenSSH options to connect. I had missed the HostKeyAlgorithms and the MACs. The complete, working configuration in my ~/.ssh/config file looks like this:

    Host hpux09-ilo
      HostName 192.0.2.30
      HostKeyAlgorithms ssh-rsa,ssh-dss
      KexAlgorithms diffie-hellman-group1-sha1
      Ciphers aes128-cbc,3des-cbc
      MACs hmac-md5,hmac-sha1

A one-liner with these options:

ssh -o HostKeyAlgorithms=ssh-rsa,ssh-dss -o KexAlgorithms=diffie-hellman-group1-sha1 -o Ciphers=aes128-cbc,3des-cbc -o MACs=hmac-md5,hmac-sha1 Admin@hpux09-ilo

Do note that a better solution here is to upgrade the hardware and get it under a support contract.
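Before relying on these legacy algorithms it can help to check whether the local client still ships them at all, since an OpenSSH build may leave old algorithms out. `ssh -Q` lists the key exchange methods, ciphers, MACs and key types a client supports. The sketch below compares such a list with what the iLO 2 asks for; `missing_algos` is a made-up helper name, not part of OpenSSH.

```shell
# missing_algos WANTED
# WANTED is a comma-separated algorithm list. The algorithms the client
# supports are read from stdin, one per line (the format of `ssh -Q`).
# Prints every wanted algorithm that the client does not support.
missing_algos() {
    supported=$(cat)
    echo "$1" | tr ',' '\n' | while read -r algo; do
        echo "$supported" | grep -qxF -- "$algo" || echo "$algo"
    done
}

# Usage on the machine you connect from:
#   ssh -Q kex    | missing_algos diffie-hellman-group1-sha1
#   ssh -Q cipher | missing_algos aes128-cbc,3des-cbc
#   ssh -Q mac    | missing_algos hmac-md5,hmac-sha1
#   ssh -Q key    | missing_algos ssh-rsa,ssh-dss
```

No output means everything in the list is available. Anything printed cannot be enabled through ~/.ssh/config on that client, so an older client or a jump host would be needed instead.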

iLO Management Processor hardware information

Logging in gives me a few options to work with:

    Admin@192.0.2.30's password:

                   Hewlett-Packard Integrity Integrated Lights-Out 2

        (c) Copyright Hewlett-Packard Company 1999-2008.  All Rights Reserved.

                               MP Host Name: mphpux09

                                  Revision F.02.23

       MP MAIN MENU:

             CO: Console
            VFP: Virtual Front Panel
             CM: Command Menu
          SMCLP: Server Management Command Line Protocol
             CL: Console Log
             SL: Show Event Logs
             HE: Main Help Menu
              X: Exit Connection

    [mphpux09] MP>

The forum post stated to go into the Command Menu:

[mphpux09] MP> cm

(Use Ctrl-B to return to MP main menu.)

Then to enter the following command:

[mphpux09] MP:CM> df -nc -a

The output contains a long, long list of hardware. You can find it at the bottom of this article. We are interested in the RAM parts. To my pleasant surprise it did list the actual RAM in the machine, the 16 GB, including the part numbers. The operating system does not see the Ext1 DIMMs, but the iLO does:

Ext0

The cstm output showed me that Ext0 is filled with 4 DIMMs of 2 GB each:

    Memory Board Inventory
       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 0 DIMM 0A          2048         Ext 0 DIMM 0B          2048
       Ext 0 DIMM 0C          2048         Ext 0 DIMM 0D          2048
       Ext 0 DIMM 1A          ----         Ext 0 DIMM 1B          ----
       Ext 0 DIMM 1C          ----         Ext 0 DIMM 1D          ----
       Ext 0 DIMM 2A          ----         Ext 0 DIMM 2B          ----
       Ext 0 DIMM 2C          ----         Ext 0 DIMM 2D          ----
       Ext 0 Total: 8192 (MB)

The iLO confirms that:

    PRODUCT INFO:

    FRU Entry #  16 :
    FRU NAME                : MemExt0 DIMM0A
    FRU ID                  : 0128
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0xCE00000000000000
    JEDEC Mfg Location      : 0x01
    JEDEC Mfg Part #        : M3 93T5750CZ3-CD5
    JEDEC Mfg Revision Code : 0x3343
    JEDEC Mfg Year          : 0x06
    JEDEC Mfg Week          : 0x40
    JEDEC Mfg Serial #      : 0x711D209E
    Mfg Unique Serial #     : 0x00CE010640711D209E

    FRU Entry #  17 :
    FRU NAME                : MemExt0 DIMM0B
    FRU ID                  : 0136
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0xCE00000000000000
    JEDEC Mfg Location      : 0x01
    JEDEC Mfg Part #        : M3 93T5750CZ3-CD5
    JEDEC Mfg Revision Code : 0x3343
    JEDEC Mfg Year          : 0x06
    JEDEC Mfg Week          : 0x40
    JEDEC Mfg Serial #      : 0x711D20AA
    Mfg Unique Serial #     : 0x00CE010640711D20AA

    FRU Entry #  18 :
    FRU NAME                : MemExt0 DIMM0C
    FRU ID                  : 0144
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0xCE00000000000000
    JEDEC Mfg Location      : 0x01
    JEDEC Mfg Part #        : M3 93T5750CZ3-CD5
    JEDEC Mfg Revision Code : 0x3343
    JEDEC Mfg Year          : 0x06
    JEDEC Mfg Week          : 0x40
    JEDEC Mfg Serial #      : 0x711D20A2
    Mfg Unique Serial #     : 0x00CE010640711D20A2

    FRU Entry #  19 :
    FRU NAME                : MemExt0 DIMM0D
    FRU ID                  : 0152
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0xCE00000000000000
    JEDEC Mfg Location      : 0x01
    JEDEC Mfg Part #        : M3 93T5750CZ3-CD5
    JEDEC Mfg Revision Code : 0x3343
    JEDEC Mfg Year          : 0x06
    JEDEC Mfg Week          : 0x40
    JEDEC Mfg Serial #      : 0x711D20A8
    Mfg Unique Serial #     : 0x00CE010640711D20A8

I cannot easily find a replacement for this part number, but it comes up in an HP forum post as a Samsung 2GB module.

Ext1

Ext1 according to the operating system is empty:

       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 1 DIMM 0A          ----         Ext 1 DIMM 0B          ----
       Ext 1 DIMM 0C          ----         Ext 1 DIMM 0D          ----
       Ext 1 DIMM 1A          ----         Ext 1 DIMM 1B          ----
       Ext 1 DIMM 1C          ----         Ext 1 DIMM 1D          ----
       Ext 1 DIMM 2A          ----         Ext 1 DIMM 2B          ----
       Ext 1 DIMM 2C          ----         Ext 1 DIMM 2D          ----
       Ext 1 Total: 0 (MB)

The iLO thinks differently:

    FRU Entry #  20 :
    FRU NAME                : MemExt1 DIMM0A
    FRU ID                  : 0160
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0x2C00000000000000
    JEDEC Mfg Location      : 0x0C
    JEDEC Mfg Part #        : 36HTF25672PY-667D1
    JEDEC Mfg Revision Code : 0x0100
    JEDEC Mfg Year          : 0x08
    JEDEC Mfg Week          : 0x36
    JEDEC Mfg Serial #      : 0xD925C4D3
    Mfg Unique Serial #     : 0x002C0C0836D925C4D3

    FRU Entry #  21 :
    FRU NAME                : MemExt1 DIMM0B
    FRU ID                  : 0168
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0x2C00000000000000
    JEDEC Mfg Location      : 0x0C
    JEDEC Mfg Part #        : 36HTF25672PY-667D1
    JEDEC Mfg Revision Code : 0x0100
    JEDEC Mfg Year          : 0x08
    JEDEC Mfg Week          : 0x36
    JEDEC Mfg Serial #      : 0xD72FB723
    Mfg Unique Serial #     : 0x002C0C0836D72FB723

    FRU Entry #  22 :
    FRU NAME                : MemExt1 DIMM0C
    FRU ID                  : 0176
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0x2C00000000000000
    JEDEC Mfg Location      : 0x0C
    JEDEC Mfg Part #        : 36HTF25672PY-667D1
    JEDEC Mfg Revision Code : 0x0100
    JEDEC Mfg Year          : 0x08
    JEDEC Mfg Week          : 0x36
    JEDEC Mfg Serial #      : 0xD925C4D9
    Mfg Unique Serial #     : 0x002C0C0836D925C4D9

    FRU Entry #  23 :
    FRU NAME                : MemExt1 DIMM0D
    FRU ID                  : 0184
    JEDEC SPD Rev           : 0x12
    JEDEC Mfg ID            : 0x2C00000000000000
    JEDEC Mfg Location      : 0x0C
    JEDEC Mfg Part #        : 36HTF25672PY-667D1
    JEDEC Mfg Revision Code : 0x0100
    JEDEC Mfg Year          : 0x08
    JEDEC Mfg Week          : 0x36
    JEDEC Mfg Serial #      : 0xD925C4CB
    Mfg Unique Serial #     : 0x002C0C0836D925C4CB

So the DIMMs are still in the system, and of all places, Amazon sells these DIMMs. The type is 2GB DDR2 PC2-5300 667MHz 240pin ECC, which is exactly what I need to order a replacement or look into our hardware stash. It took me a good hour, but the part number and some more type information have been found.
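Picking the part numbers out of a long FRU dump by eye gets tedious. As a rough sketch, assuming the `df` output was captured to a file (for example by running the SSH session through `tee`) and has one field per line as the iLO prints it, the slot names and part numbers can be pulled out with awk. `dimm_parts` is a hypothetical helper name.

```shell
# dimm_parts: read a saved FRU dump on stdin and print
# "FRU name<TAB>JEDEC part number" for every entry that has a part number.
dimm_parts() {
    awk -F ' *: *' '
        /^[ \t]*FRU NAME/         { name = $2; sub(/ +$/, "", name) }
        /^[ \t]*JEDEC Mfg Part #/ { part = $2; sub(/ +$/, "", part)
                                    print name "\t" part }
    '
}

# Usage, assuming the dump was saved as ilo-df.log:
#   dimm_parts < ilo-df.log
#   dimm_parts < ilo-df.log | cut -f2 | sort | uniq -c   # count per part number
```

Entries without a JEDEC part number, such as the power supplies, are simply not printed.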

I still need to figure out which specific DIMM broke, but that is not in the above output.

iLO Event Log

In the iLO menu, I also saw SL: Show Event Logs. Maybe that will tell me specifically which DIMM could be the culprit.

    [mphpux09] MP> sl

    Event Log Viewer Menu:

           Log Name            Entries    % Full      Latest Timestamped Entry
    ---------------------------------------------------------------------------
       E - System Event           32         4 %      01 Jun 2018 08:47:06
       F - Forward Progress     4000       100 %
       B - Current Boot          130        43 %
       P - Previous Boot         130        43 %
       I - iLO Event             316        63 %      01 Jun 2018 11:14:55
       C - Clear All Logs
       L - Live Events

Let's view the System Event log, using option E:

    Enter menu item or [Ctrl-B] to Quit: E

            Log Name            Entries    % Full      Latest Timestamped Entry
    ---------------------------------------------------------------------------
       E - System Event           32         4 %      01 Jun 2018 08:47:06

    Event Log Navigation Help:

       +       View next block     (forward in time,  e.g. from 3 to 4)
       -       View previous block (backward in time, e.g. from 3 to 2)
       <CR>    Continue to the next or previous block
       D       Dump the entire log
       F       First entry
       L       Last entry
       J       Jump to entry number
       H       View mode configuration - Hex
       K       View mode configuration - Keyword
       T       View mode configuration - Text
       A       Alert Level Filter options
       U       Alert Level Unfiltered
       ?       Display this Help menu
       Q       Quit and return to the Event Log Viewer Menu
       Ctrl-B  Exit command, and return to the MP Main Menu

Just give me everything, D it is:

    MP:SL (+,-,<CR>,D, F, L, J, H, K, T, A, U, ? for Help, Q or Ctrl-B to Quit) >d

       Confirm? (Y/[N]): y

    #  Location|Alert| Encoded Field    |  Data Field    |   Keyword / Timestamp
    -------------------------------------------------------------------------------
    0     BMC      2  0x205A9918E5020010 FFFF0103FCC00300 TIME_SET
                                                          02 Mar 2018 09:27:01
    1     BMC      2  0x205A991B79020020 FFFF0103FCC00300 TIME_SET
                                                          02 Mar 2018 09:38:01
    2     BMC      2  0x205ABBA1FC020030 FFFF0103FCC00300 TIME_SET
                                                          28 Mar 2018 14:09:00
    3     BMC      2  0x205ABBA490020040 FFFF0103FCC00300 TIME_SET
                                                          28 Mar 2018 14:20:00
    4     BMC      2  0x205B0FEF26020050 FFFF006F04140300 POWER_BUTTON_PRESSED
                                                          31 May 2018 12:48:38
    5     BMC      2  0x205B0FEF28020060 040EA37004120300 CHASSIS_CONTROL_REQUEST
                                                          31 May 2018 12:48:40
    6     HPUX 2   2  0x54801C3002E00070 00000000001A100C HP-UX_OS_NORMAL_SHUTDOWN
                                                          31 May 2018 12:48:54
    7     BMC      2  0x205B0FEF38020090 FFFF006F04140300 POWER_BUTTON_PRESSED
                                                          31 May 2018 12:48:56
    8     BMC      2  0x205B0FEF3B0200A0 040EA37004120300 CHASSIS_CONTROL_REQUEST
                                                          31 May 2018 12:48:59
    9     BMC      2  0x205B0FEF420200B0 FFFF056FFA220300 ACPI_SOFT_OFF
                                                          31 May 2018 12:49:06
    10    BMC      2  0x205B0FEF430200C0 FA00A370FA120300 CHASSIS_CONTROL_REQUEST
                                                          31 May 2018 12:49:07
    11    BMC      2  0x205B0FEF440200D0 FFFF000943090300 POWER_UNIT_DISABLED
                                                          31 May 2018 12:49:08
    12    BMC      2  0x205B0FEF550200E0 FFFF006F04140300 POWER_BUTTON_PRESSED
                                                          31 May 2018 12:49:25
    13    BMC      2  0x205B0FEF560200F0 FFFF027000120300 SOFT_RESET
                                                          31 May 2018 12:49:26
    14    BMC      2  0x205B0FEF56020100 FFFF010943090300 POWER_UNIT_ENABLED
                                                          31 May 2018 12:49:26
    15    BMC      2  0x205B0FEF57020110 FFFF006FFA220300 ACPI_ON
                                                          31 May 2018 12:49:27
    16    BMC      2  0x205B0FEF57020120 0401A37004120300 CHASSIS_CONTROL_REQUEST
                                                          31 May 2018 12:49:27
    17    BMC      2  0x205B0FEF64020130 FFFF027000120300 SOFT_RESET
                                                          31 May 2018 12:49:40
    18    SFW      2  0xC15B0FEF71020140 FFFF000A001D0300 CPU_START_BOOT
                                                          31 May 2018 12:49:53
    19    SFW  0   2  0x5480006300E00150 0000000000000000 BOOT_START
                                                          31 May 2018 12:49:53
    20    SFW  0  *3  0x7A800FA000E00170 FFFFFFFF010CFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                          31 May 2018 12:50:01
    21    SFW  0  *3  0x7A800FA000E00190 FFFFFFFF010BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                          31 May 2018 12:50:01
    22    SFW  0   2  0x40801CBB00E001B0 0000000000000000 BOOT_SWITCH_INSECURE_MODE
                                                          31 May 2018 12:50:24
    23    HPUX 0   2  0x54801C2F00E001D0 0000000000001001 HP-UX_BOOT_COMPLETE
                                                          31 May 2018 12:52:39
    24    HPUX 0   2  0x54801C3000E001F0 00000000001A100C HP-UX_OS_NORMAL_SHUTDOWN
                                                          01 Jun 2018 08:44:01
    25    BMC      2  0x205B110757020210 FFFF027000120300 SOFT_RESET
                                                          01 Jun 2018 08:44:07
    26    SFW      2  0xC15B110760020220 FFFF000A001D0300 CPU_START_BOOT
                                                          01 Jun 2018 08:44:16
    27    SFW  0   2  0x5480006300E00230 0000000000000000 BOOT_START
                                                          01 Jun 2018 08:44:16
    28    SFW  0  *3  0x7A800FA000E00250 FFFFFFFF010CFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                          01 Jun 2018 08:44:24
    29    SFW  0  *3  0x7A800FA000E00270 FFFFFFFF010BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                          01 Jun 2018 08:44:24
    30    SFW  0   2  0x40801CBB00E00290 0000000000000000 BOOT_SWITCH_INSECURE_MODE
                                                          01 Jun 2018 08:44:48
    31    HPUX 0   2  0x54801C2F00E002B0 0000000000001001 HP-UX_BOOT_COMPLETE
                                                          01 Jun 2018 08:47:06

       -> This is the last entry in the selected log.

    MP:SL (+,-,<CR>,D, F, L, J, H, K, T, A, U, ? for Help, Q or Ctrl-B to Quit) >

Filtering out all the other log entries (the reboots were expected) and focusing on the DIMM parts:

    28    SFW  0  *3  0x7A800FA000E00250 FFFFFFFF010CFF74 MEM_CHIPSPARE_DEALLOC_RANK
    29    SFW  0  *3  0x7A800FA000E00270 FFFFFFFF010BFF74 MEM_CHIPSPARE_DEALLOC_RANK
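The same filtering can be done mechanically once the dump is saved to a file. A minimal sketch, assuming the dump keeps the layout shown here (keyword as the last column of the entry line, timestamp on the following line); `mem_events` is a made-up name.

```shell
# mem_events: read an event log dump on stdin and print the unique
# "data-field keyword" pairs of all memory (MEM_*) events.
# The same event repeated on every boot is listed only once.
mem_events() {
    awk '$NF ~ /^MEM_/ { print $(NF-1), $NF }' | sort -u
}

# Usage, assuming the dump was saved as ilo-sel.log:
#   mem_events < ilo-sel.log
```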

Not much help, no clear DIMM location yet. But I did search around and found this HP support page titled "HP Integrity rx3600 Servers - BOOT DECONFIG CPU Can Be Caused by Memory Dimm Failure". In the log output there are these lines, looking a lot like the above output:

    SFW  0   0  0x040000E500E00000 FFFFFFFF000AFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF000BFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF001AFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF001BFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF010AFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF010BFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF011AFF74 MEM_SPD_2G_DIMM_FOUND
    SFW  0   0  0x040000E500E00000 FFFFFFFF011BFF74 MEM_SPD_2G_DIMM_FOUND

In this part I do see a pattern in the Data Field column: the only part that changes (000A, 000B, 001A, 001B, 010A, 010B, 011A, 011B) looks an awful lot like a memory location.

Comparing that to the DIMM layout output from earlier:

       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 0 DIMM 0A          2048         Ext 0 DIMM 0B          2048
       Ext 0 DIMM 0C          2048         Ext 0 DIMM 0D          2048

       DIMM Location          Size(MB)     DIMM Location          Size(MB)
       --------------------   --------     --------------------   --------
       Ext 1 DIMM 0A          ----         Ext 1 DIMM 0B          ----
       Ext 1 DIMM 0C          ----         Ext 1 DIMM 0D          ----

Not exactly the same but good enough, since it's a different server (the above output seems to be for the 8-slot memory carrier board). This post confirms my suspicion that the hex values correspond to the DIMM slots.
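If that reading is right, a data field can be translated into a slot name mechanically. The sketch below is only that: it assumes the layout inferred from the logs above (hex digit 10 is the memory extender, digits 11 and 12 are the DIMM slot), which is a guess and not something taken from HP documentation. `decode_dimm` is a made-up helper name.

```shell
# decode_dimm DATA_FIELD
# Assumed layout (inferred, not documented): in a 16-digit data field such as
# FFFFFFFF010CFF74, digit 10 is the memory extender and digits 11-12 the slot.
decode_dimm() {
    ext=$(printf '%s' "$1" | cut -c10)
    slot=$(printf '%s' "$1" | cut -c11-12)
    echo "Ext $ext DIMM $slot"
}

decode_dimm FFFFFFFF010CFF74   # prints: Ext 1 DIMM 0C
decode_dimm FFFFFFFF010BFF74   # prints: Ext 1 DIMM 0B
```

Under that assumption both deallocation events point at the extender the operating system reports as empty.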

HP Integrity rx3600 Server User Service Guide

The HP Integrity rx3600 Server User Service Guide, chapter 5 Troubleshooting, subsection CPU, Memory and SBA, subsection Troubleshooting rx3600 memory has a picture of the 24 slot memory carrier board.

I know this server has two 24-slot memory carrier boards because the full output of the hardware list states four "12 DIMM Memory Extender" components.

The service guide also lists the error message:

Furthermore, the service manual states this:

* Troubleshooting rx3600 Memory

The memory controller logic in the zx2 chip supports three versions of memory expanders. An eight DIMM memory carrier provides two memory boards that hold two or four memory DIMMs in both memory cells 0 and 1. A 24 DIMM memory carrier provides two 12-DIMM memory boards that hold four, eight, or twelve DIMMs in both memory cells 0 and 1.

All three versions of memory expanders must have their memory DIMMs installed in groups of four, known as a quad. DIMM quads of different sizes can be installed in any physical rank on all versions of memory expanders, but they must be grouped by their size.

* Memory Subsystem Behaviors

You must replace DIMMs or memory carriers when a threshold is reached for multiple double-byte errors from one or more DIMMs on the same board. When any uncorrectable memory error (more than 2 bytes) or when no quad of like memory DIMMs is loaded in rank 0 of side 0, you must replace the DIMMs. All other DIMM errors are corrected by zx2 and reported to the Page Deallocation Table (PDT) and the diagnostic LED panel.

* Memory DIMM Load Order

For a minimally loaded server, four equal-size memory DIMMs must be installed in slots 0A, 0B, 0C, and 0D on the same side of the 24/48 slot memory expander; and in the 0A and 0B slots on both 0 and 1 sides of the 8 slot memory expander.

The first quad of DIMMs are always loaded into rank 0's slots for side 0 then in the rank 0's slots for side 1. The next quad of DIMMs are loaded into rank 1's slots for side 0, then for side 1, and so on, until all ranks' slots for both sides are full.

Best memory subsystem performance result when both memory sides 0 and 1 have the same number of DIMM quads in them.
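The load order from the quoted section can be written out as a small loop. This is just an illustration of the rule as quoted (rank 0 on side 0, rank 0 on side 1, rank 1 on side 0, and so on) for a carrier with three ranks per side; `load_order` is a made-up name, and treating "side" as the Ext number from the cstm output is an assumption.

```shell
# load_order RANKS: print the order in which quads are loaded on a carrier
# with RANKS ranks per side (3 for the 24-slot carrier).
load_order() {
    rank=0
    while [ "$rank" -lt "$1" ]; do
        for side in 0 1; do
            echo "side $side: ${rank}A ${rank}B ${rank}C ${rank}D"
        done
        rank=$((rank + 1))
    done
}

load_order 3
```

With eight DIMMs (two quads) only the first two lines apply, which matches the FRU list from the iLO: DIMM0A to DIMM0D on both MemExt0 and MemExt1.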

In chapter 6, "Removing and replacing server components", there are 11 pages (185-196), with pictures and example configurations, on replacing the DIMMs. It also has memory loading guidelines:

Use the following rules and guidelines when installing memory:

(Nested lists in Markdown are fun)

The guide even has a list of customer-replaceable parts, including HP and replacement part numbers. For my 2 GB memory module, it would be AD328A, which is for sale in lots of places.

Conclusion

Combining all the knowledge and logging, my best guess is that the following slots have issues:

Replacement DIMMs are ordered and on their way, soon to be installed in the correct order. Let's hope that the machine gets the other half of its RAM back and the problem is fixed.

Reference, complete output of MP:CM> df -nc -a

FRU Entry #   0 : FRU NAME: Core I/O Board ID:0000CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 7462080 Manufacturer       : CELESTICA  Product Name       : Core IO Board with VGA           S/N                : MYL011Y014       Part Number        : AB463-60003 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4848 Custom Info        : A5 Custom Info        : 0PRODUCT INFO:FRU Entry #   1 :FRU NAME: Mem Extender 0 ID:0001CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6729120 Manufacturer       : CELESTICA  Product Name       : 12 DIMM Memory Extender          S/N                : TH9842X51H       Part Number        : AB463-60112 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4721 Custom Info        : A2 Custom Info        : 0PRODUCT INFO:FRU Entry #   2 :FRU NAME: Mem Extender 1 ID:0002CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6729120 Manufacturer       : CELESTICA  Product Name       : 12 DIMM Memory Extender          S/N                : TH9842X51J       Part Number        : AB463-60112 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4721 Custom Info        : A2 Custom Info        : 0PRODUCT INFO:FRU Entry #   3 :FRU NAME: Power Supply 0 ID:0003CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6662537 Manufacturer       : C&D        Product Name       : BULK POWER SUPPLY                S/N                : A804070ET3       Part Number        : 0957-2198   Fru File ID        : 10 Custom Info        : 00000000 Custom Info        : 0804 Custom Info        : 07 Custom Info        : 0PRODUCT INFO:FRU Entry #   4 :FRU NAME: Power Supply 1 ID:0004CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6675416 Manufacturer       : muRata-ps  Product Name       : BULK POWER SUPPLY                S/N                : A8371000F7       Part Number        : 0957-2198   Fru File ID        : 10 Custom Info        : 00000000 Custom Info        : 0837 Custom Info        : 10 Custom Info        : 0PRODUCT INFO:FRU 
Entry #   5 :FRU NAME: I/O Assembly ID:0005CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6655680 Manufacturer       : CELESTICA  Product Name       : 10 Slot PCI-E 1.1 IOBP           S/N                : TH9834955H       Part Number        : AB463-60028 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4832 Custom Info        : A2 Custom Info        : 0PRODUCT INFO:FRU Entry #   6 :FRU NAME: Display Board ID:0006CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6703200 Manufacturer       : CELESTICA  Product Name       : DVD/Display Board                S/N                : TH983910DA       Part Number        : AB463-60020 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4814 Custom Info        : A3 Custom Info        : 0PRODUCT INFO:FRU Entry #   7 :FRU NAME: Disk Backplane ID:0007CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6668640 Manufacturer       : CELESTICA  Product Name       : 8 Disk Drive SAS Backplane       S/N                : TH9835V650       Part Number        : AB463-60006 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4814 Custom Info        : A3 Custom Info        : 0PRODUCT INFO:FRU Entry #   8 :FRU NAME: ProcessorCarrier ID:0009CHASSIS INFO: Type:Rack Mount Chassis Part Number        :             Serial Number      :             BOARD INFO: Mfg Date/Time      : 6732745 Manufacturer       : JABIL      Product Name       : 2 Socket CPU Carrier             S/N                : MYJ84203PH       Part Number        : AB463-60113 Fru File ID        : 10 Custom Info        : C        Custom Info        : 4818 Custom Info        : A3 Custom Info        : 0PRODUCT INFO: Manufacturer       : hp Product Name       : server rx3600                    Part/Model         :             Version            :        S/N                :                      Asset Tag          :                                  FRU File ID        : 11 Custom Info        : 411FRU Entry #   9 :FRU NAME: 
Interconnect Bd ID:0010CHASSIS INFO: Type:Rack Mount Chassis Part Number        :             Serial Number      :             BOARD INFO: Mfg Date/Time      : 6694560 Manufacturer       : CELESTICA  Product Name       : SAS Interconnect Board           S/N                : TH9838Z061       Part Number        : AB463-60025 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4814 Custom Info        : A2 Custom Info        : 0PRODUCT INFO: Manufacturer       : hp Product Name       : server rx3600 Part/Model         : AB596A Version            :        S/N                : DEH48456BK Asset Tag          :                                  FRU File ID        : 11 Custom Info        : 411FRU Entry #  10 :FRU NAME: Hot-Plug Board ID:0011CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 6658560 Manufacturer       : CELESTICA  Product Name       : PCI Hot Plug Control Board       S/N                : TH983515H7       Part Number        : AB463-60002 Fru File ID        : 10 Custom Info        : A        Custom Info        : 4711 Custom Info        : A2 Custom Info        : 0PRODUCT INFO:FRU Entry #  11 :FRU NAME: I/O Power Module ID:0015CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 2105376 Manufacturer       : C&D        Product Name       : PCI POWER BOARD                  S/N                : 9080708250F8     Part Number        : 0950-4677   Fru File ID        : 10 Custom Info        : X1       Custom Info        : 0825 Custom Info        : A1 Custom Info        : 0PRODUCT INFO:FRU Entry #  12 :FRU NAME: Processor 0 ID:0032PROCESSOR DATA  S-spec/QDF:  LAB5 Sample/Prod: 01CORE DATA  Arch Revision                :  00 Core Family                  :  20 Core Model                   :  01 Core Stepping                :  01 Max Core Frequency           : 1666 MHZ Max SysBus Frequency         :  333 MHZ Core Voltage                 : 1150 mV Core Voltage Tolerance,High  :   64 mV Core Voltage Tolerance,Low   :   3F mVCACHE DATA  Cache Size                   :  
 18 MBPACKAGE DATA  Package Revision             : NE Substrate Revision: 01PROC PART NUMBER DATA  Part Number                  : 80567KF Electronic Signature         : 0003AB71B8C89784THERMAL REF DATA  Upper Temp Ref               :  92 C Calibr Offset                :  18 CFEATURES DATA  IA-32 Proc Core Feature Flags:  FFFB8743 IA-64 Proc Core Feature Flags:  1B81806300000000 Package Feature Flags        :  3F010000 Devices on TAP Chain         :  2FRU Entry #  13 :FRU NAME: Processor 1 ID:0033PROCESSOR DATA  S-spec/QDF:  LAB5 Sample/Prod: 01CORE DATA  Arch Revision                :  00 Core Family                  :  20 Core Model                   :  01 Core Stepping                :  01 Max Core Frequency           : 1666 MHZ Max SysBus Frequency         :  333 MHZ Core Voltage                 : 1150 mV Core Voltage Tolerance,High  :   64 mV Core Voltage Tolerance,Low   :   3F mVCACHE DATA  Cache Size                   :   18 MBPACKAGE DATA  Package Revision             : NE Substrate Revision: 01PROC PART NUMBER DATA  Part Number                  : 80567KF Electronic Signature         : 0001E4BA8B93CDD5THERMAL REF DATA  Upper Temp Ref               :  92 C Calibr Offset                :  17 CFEATURES DATA  IA-32 Proc Core Feature Flags:  FFFB8743 IA-64 Proc Core Feature Flags:  1B81806300000000 Package Feature Flags        :  3F010000 Devices on TAP Chain         :  2FRU Entry #  14 :FRU NAME: Processor 0 RAM ID:0036CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 0 Manufacturer       :  Product Name       : MTV_A1_1618      S/N                : PR1084441T   Part Number        : AD391-2100C Fru File ID        : B Custom Info        :  Custom Info        : 4733 Custom Info        :    Custom Info        : 2H64014D2b82     Custom Info        : 1PRODUCT INFO:FRU Entry #  15 :FRU NAME: Processor 1 RAM ID:0037CHASSIS INFO:BOARD INFO: Mfg Date/Time      : 0 Manufacturer       :  Product Name       : MTV_A1_1618      S/N                : PR108443XV   Part Number        
: AD391-2100C Fru File ID        : B Custom Info        :  Custom Info        : 4733 Custom Info        :    Custom Info        : 2H64014D2b82     Custom Info        : 1PRODUCT INFO:FRU Entry #  16 :FRU NAME                : MemExt0 DIMM0AFRU ID                  : 0128JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0xCE00000000000000JEDEC Mfg Location      : 0x01JEDEC Mfg Part #        : M3 93T5750CZ3-CD5 JEDEC Mfg Revision Code : 0x3343JEDEC Mfg Year          : 0x06JEDEC Mfg Week          : 0x40JEDEC Mfg Serial #      : 0x711D209EMfg Unique Serial #     : 0x00CE010640711D209EFRU Entry #  17 :FRU NAME                : MemExt0 DIMM0BFRU ID                  : 0136JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0xCE00000000000000JEDEC Mfg Location      : 0x01JEDEC Mfg Part #        : M3 93T5750CZ3-CD5 JEDEC Mfg Revision Code : 0x3343JEDEC Mfg Year          : 0x06JEDEC Mfg Week          : 0x40JEDEC Mfg Serial #      : 0x711D20AAMfg Unique Serial #     : 0x00CE010640711D20AAFRU Entry #  18 :FRU NAME                : MemExt0 DIMM0CFRU ID                  : 0144JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0xCE00000000000000JEDEC Mfg Location      : 0x01JEDEC Mfg Part #        : M3 93T5750CZ3-CD5 JEDEC Mfg Revision Code : 0x3343JEDEC Mfg Year          : 0x06JEDEC Mfg Week          : 0x40JEDEC Mfg Serial #      : 0x711D20A2Mfg Unique Serial #     : 0x00CE010640711D20A2FRU Entry #  19 :FRU NAME                : MemExt0 DIMM0DFRU ID                  : 0152JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0xCE00000000000000JEDEC Mfg Location      : 0x01JEDEC Mfg Part #        : M3 93T5750CZ3-CD5 JEDEC Mfg Revision Code : 0x3343JEDEC Mfg Year          : 0x06JEDEC Mfg Week          : 0x40JEDEC Mfg Serial #      : 0x711D20A8Mfg Unique Serial #     : 0x00CE010640711D20A8FRU Entry #  20 :FRU NAME                : MemExt1 DIMM0AFRU ID                  : 0160JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0x2C00000000000000JEDEC Mfg Location   
   : 0x0CJEDEC Mfg Part #        : 36HTF25672PY-667D1JEDEC Mfg Revision Code : 0x0100JEDEC Mfg Year          : 0x08JEDEC Mfg Week          : 0x36JEDEC Mfg Serial #      : 0xD925C4D3Mfg Unique Serial #     : 0x002C0C0836D925C4D3FRU Entry #  21 :FRU NAME                : MemExt1 DIMM0BFRU ID                  : 0168JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0x2C00000000000000JEDEC Mfg Location      : 0x0CJEDEC Mfg Part #        : 36HTF25672PY-667D1JEDEC Mfg Revision Code : 0x0100JEDEC Mfg Year          : 0x08JEDEC Mfg Week          : 0x36JEDEC Mfg Serial #      : 0xD72FB723Mfg Unique Serial #     : 0x002C0C0836D72FB723FRU Entry #  22 :FRU NAME                : MemExt1 DIMM0CFRU ID                  : 0176JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0x2C00000000000000JEDEC Mfg Location      : 0x0CJEDEC Mfg Part #        : 36HTF25672PY-667D1JEDEC Mfg Revision Code : 0x0100JEDEC Mfg Year          : 0x08JEDEC Mfg Week          : 0x36JEDEC Mfg Serial #      : 0xD925C4D9Mfg Unique Serial #     : 0x002C0C0836D925C4D9FRU Entry #  23 :FRU NAME                : MemExt1 DIMM0DFRU ID                  : 0184JEDEC SPD Rev           : 0x12JEDEC Mfg ID            : 0x2C00000000000000JEDEC Mfg Location      : 0x0CJEDEC Mfg Part #        : 36HTF25672PY-667D1JEDEC Mfg Revision Code : 0x0100JEDEC Mfg Year          : 0x08JEDEC Mfg Week          : 0x36JEDEC Mfg Serial #      : 0xD925C4CBMfg Unique Serial #     : 0x002C0C0836D925C4CB   -> This is the last entry in the selected list.-> Command successful.[mphpux09] MP:CM> sysrevSYSREVCurrent firmware revisions MP FW     : F.02.23 BMC FW    : 05.25 EFI FW    : ROM A 07.12, ROM B 07.12 System FW : ROM A 04.03, ROM B 04.03, Boot ROM A PDH FW    : 50.07 DHPC FW   : 01.23 UCIO FW   : 03.0b PRS FW    : 00.08 UpSeqRev: 0c, DownSeqRev: 08 HFC FW    : 00.04 SetRev: 00
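A dump like this is tedious to read by eye. As a rough sketch, the Python below pulls the DIMM entries out of saved MP output and groups the slots per memory extender. It assumes the output was saved with one `key : value` field per line, as the MP prints it on the console; the function names and the shortened `SAMPLE` text (two entries with values taken from the output above) are only there for illustration.

```python
import re
from collections import defaultdict

# Shortened sample: two DIMM entries, values copied from the FRU dump.
SAMPLE = """\
FRU NAME                : MemExt0 DIMM0A
FRU ID                  : 0128
JEDEC Mfg Part #        : M3 93T5750CZ3-CD5
JEDEC Mfg Serial #      : 0x711D209E
FRU NAME                : MemExt1 DIMM0A
FRU ID                  : 0160
JEDEC Mfg Part #        : 36HTF25672PY-667D1
JEDEC Mfg Serial #      : 0xD925C4D3
"""

# "key : value", split on the first colon.
FIELD = re.compile(r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.*?)\s*$")
# FRU names of DIMMs look like "MemExt0 DIMM0A".
DIMM = re.compile(r"^MemExt(?P<ext>\d+) DIMM(?P<slot>\d+[A-D])$")


def parse_dimms(text):
    """Return one dict per DIMM entry found in the FRU dump."""
    dimms = []
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match["key"], match["value"]
        if key == "FRU NAME":
            # A new FRU entry starts; keep it only if it is a DIMM.
            name = DIMM.match(value)
            current = None
            if name:
                current = {"extender": int(name["ext"]), "slot": name["slot"]}
                dimms.append(current)
        elif current is not None:
            current[key] = value
    return dimms


def slots_by_extender(dimms):
    """Group the detected slot names per memory extender number."""
    grouped = defaultdict(list)
    for dimm in dimms:
        grouped[dimm["extender"]].append(dimm["slot"])
    return dict(grouped)


if __name__ == "__main__":
    for dimm in parse_dimms(SAMPLE):
        print(dimm["extender"], dimm["slot"], dimm["JEDEC Mfg Part #"])
```

Comparing the grouped slots against what should be installed makes it easier to see which DIMMs the MP can still read, without scrolling through the whole dump.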
Tags: blog, cstm, ems, hardware, hp, hp-ux, ilo, itanium, raid, ram, ssh, unix