UCSM 1.4 : Maintenance Policies and Schedules

Strange as it may seem with all of the great new features in UCSM 1.4, this is one of my favorites.

To understand the impact, first look at the way disruptive changes were handled prior to this release. When changing a configuration setting on a service profile, an updating service profile template, or many policies, if the change would cause a disruption to running service profiles (i.e. require a reboot), you had two options: yes or no. When modifying a single profile, this wasn’t a big issue. You could simply make the change when you were also ready to accommodate a reboot of that particular profile. Where it became troublesome was when you wanted to modify an updating service profile template or a policy that affected many service profiles – your only real choice was to reboot them all simultaneously, or to modify each profile individually. Obviously, for large deployments using templates and policies (the real strength of UCS), this wasn’t ideal.

With UCSM 1.4, we now have the concept of a Maintenance Policy.   The screenshot below is taken from the Servers tab:

Creating a Maintenance Policy allows the administrator to define the manner in which a service profile (or template) should behave when disruptive changes are applied.   First, there’s the old way:

A policy of “Immediate” means that when a disruptive change is made, the affected service profiles are immediately rebooted without confirmation.   A normal “soft” reboot occurs, whereby a standard ACPI power-button press is sent to the physical compute node – assuming that the operating system traps for this, the OS should gracefully shut down and the node will reboot.

A much safer option is to use the “user-ack” policy option:

When this option is selected, disruptive changes are staged to each affected service profile, but the profile is not immediately rebooted.   Instead, each profile will show the pending changes in its status field, and will wait for the administrator to manually acknowledge the changes when it is acceptable to reboot the node.
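If you’d rather script this than click through the GUI, here’s a minimal sketch of creating a user-ack maintenance policy through the UCSM XML API from Python. The aaaLogin/aaaLogout and configConfMo methods are part of the documented XML API, but the managed-object class name (lsmaintMaintPolicy), the attribute (uptimeDisr), and the DN pattern are from memory – treat them as assumptions and verify against the XML API reference for your release.

```python
# Minimal sketch: create a "user-ack" maintenance policy via the UCSM XML API.
# ASSUMPTIONS (verify against your release's XML API reference): the class is
# lsmaintMaintPolicy, the reboot-behavior attribute is uptimeDisr, and the
# policy lives under org-root as "maint-<name>".
import requests
import xml.etree.ElementTree as ET

UCSM_URL = "https://ucsm.example.com/nuova"   # hypothetical UCSM address

def xml_post(body: str) -> ET.Element:
    # The UCSM XML API accepts XML documents POSTed to /nuova.
    resp = requests.post(UCSM_URL, data=body, verify=False)  # lab use; no cert check
    resp.raise_for_status()
    return ET.fromstring(resp.text)

# Log in and grab a session cookie.
login = xml_post('<aaaLogin inName="admin" inPassword="password" />')
cookie = login.attrib["outCookie"]

# Create (or update) a maintenance policy named "user-ack-policy".
dn = "org-root/maint-user-ack-policy"
xml_post(
    f'<configConfMo cookie="{cookie}" dn="{dn}" inHierarchical="false">'
    f'<inConfig><lsmaintMaintPolicy dn="{dn}" uptimeDisr="user-ack" /></inConfig>'
    f'</configConfMo>'
)

# Clean up the session.
xml_post(f'<aaaLogout inCookie="{cookie}" />')
```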

The most interesting new option is the “timer-automatic” setting. This setting allows the maintenance policy to reference another new object, the Schedule.

Schedules allow you to define one-time or recurring time periods during which one or more of the affected nodes may be rebooted without administrator intervention. Note that the Schedules top-level object is located within the Servers tab:

The only schedule created automatically by UCSM is the “default” schedule, which has a single recurring entry that, each day at midnight, reboots all service profiles referencing a “timer-automatic” maintenance policy associated with the “default” schedule. This “default” schedule can be modified, of course.

Creating a user-defined schedule provides the ability to control when – and how many – profiles are rebooted to apply disruptive changes.

The One Time Occurrence option sets a single date and time when this schedule will be in effect.  For example, if you wanted all affected profiles to be rebooted on January 18th at midnight, you could create an entry such as the following.

Once the date and time have been selected, the other options for the occurrence can be selected.

Max Duration specifies how long this occurrence can run. Depending on the other options selected below it, it is possible that not all service profiles will be rebooted in the time allotted. If that is the case, changes to those profiles will not take effect.

Max Number Of Tasks specifies how many total profiles could be rebooted by this occurrence.

Max Number Of Concurrent Tasks controls how many profiles can be rebooted simultaneously.   If, for example, this schedule will be used on a large cluster of service profiles where workload can be sustained even while 5 nodes are unavailable, set this value to 5 and the reboots will occur in groups of that size.

Minimum Interval Between Tasks allows you to set a delay between each reboot. This can be set to ensure that each rebooted node is given time to fully boot before the next node is taken down.
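To make the interaction of these four settings concrete, here’s a small Python sketch – purely an illustration, not UCSM code – that simulates a maintenance window with a maximum duration, a total task cap, a concurrency cap, and a minimum interval. The ten-minute reboot time and the batch-oriented handling of the interval are simplifying assumptions of the model.

```python
# Illustrative simulation (not UCSM code) of how an occurrence's limits interact.
# Assumptions: each reboot takes ~10 minutes, and the minimum interval is applied
# between batches of concurrent reboots.
def simulate_occurrence(profiles, max_duration_min, max_tasks,
                        max_concurrent, min_interval_min, reboot_time_min=10):
    rebooted, pending = [], list(profiles)
    clock = 0
    while pending and len(rebooted) < max_tasks:
        # A batch must be able to start and finish inside the window (Max Duration).
        if clock + reboot_time_min > max_duration_min:
            break
        # Respect both the concurrency cap and the total task cap.
        batch = pending[:max_concurrent][:max_tasks - len(rebooted)]
        del pending[:len(batch)]
        rebooted.extend(batch)
        print(f"t={clock:4d} min: rebooting {batch}")
        clock += max(reboot_time_min, min_interval_min)
    if pending:
        print(f"Window closed; changes still pending on: {pending}")
    return rebooted

# 12 profiles, a 2-hour window, at most 10 reboots total, 5 at a time, 30 minutes apart.
profiles = [f"esx-{i:02d}" for i in range(1, 13)]
simulate_occurrence(profiles, max_duration_min=120, max_tasks=10,
                    max_concurrent=5, min_interval_min=30)
```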

The Recurring Occurrence option provides for the creation of a schedule that will run every day, or on a specific day, to apply disruptive changes.

This option has the same per-task options as the previous example.

Once you have created your maintenance policy and schedule (if necessary), the service profile or service profile template must reference the maintenance policy in order for it to have any effect.  After selecting your service profile or template, the Actions window has an option to Change Maintenance Policy.

You may then select the Maintenance Policy you wish to use, or create a new one.

The service profile properties will now show that a maintenance policy has been selected.

In this example, a policy requiring user acknowledgement has been chosen.   Now if any disruptive changes are made, the service profile will not reset until manually acknowledged by an administrator.   Any time profiles are awaiting acknowledgement, a warning “Pending Activities” will be shown on the UCSM status bar.

Within the profile properties, a description of the pending changes will be displayed along with the “Reboot Now” option.

I hope this description of the new maintenance policies and schedules options was helpful.  I’m very excited by all the new features rolling into UCS – it was a great system before, and it’s only getting better!

Small bugfix in UCSM 1.4(1j)

After upgrading one of my lab systems to 1.4(1i) (released December 20, 2010), all of the fans in my chassis showed as failed.   Since each fan module contains two separately monitored fans, this resulted in 24 total warnings in my system (8 x fan module, 16 x fans) – annoying, but cosmetic only.

UCSM 1.4(1j) was released just a few weeks later (January 7, 2011) with a number of small bug fixes listed in the release notes, but nothing about my fan issue.   However, after updating my IO Modules to the new 1.4(1j) code, the errors disappeared.   This makes sense, since the IO Modules contain the Chassis Management Controller which is responsible for monitoring all of the chassis components.

So, thanks to Cisco for fixing this small but annoying bug!

UCSM 1.4 : Where to find firmware now

Prior to UCSM 1.4, all UCS firmware was delivered as a single bundle – this included UCSM itself, the code for the Fabric Interconnects, IO Modules, blades, mezzanine cards, etc.   With UCSM 1.4, code is now delivered in three different packages.   This makes it easier for Cisco to release support for new blades, mezzanine cards, etc, without having to release a new version of UCSM.

First, the old way:

Note the path to the software – you’d navigate to the Fabric Interconnect and select “Complete Software Bundle”. As of December 31, 2010, the last version posted here is 1.3(1p) – even though 1.4 has been released. This is due to the new way code is distributed. Instead of going to a specific piece of hardware, navigate to Products/Unified Computing and review the options listed:

The three new categories are “Cisco UCS Infrastructure Software”, “Cisco UCS Manager Server Software”, and “Cisco UCS Manager Capability Catalog Software”.

The “Infrastructure Software” category contains UCSM, and the firmware/software for the Fabric Interconnects, IO Modules, and FEX modules (for C-series attachment).

“Cisco UCS Manager Server Software” has two sub-categories, one for B-series blades and one for C-series rack-mount servers.

Finally, the “UCS Manager Capability Catalog Software” category contains a small file that describes (to UCSM) all of the components of a UCS system for inventory, categorization, etc.   If Cisco were to release, say, new fan modules that had different specifications than the existing ones, only this file would need to be updated instead of a full system-wide upgrade.

I hope this helps when going looking for the latest code for your UCS system!

UCSM 1.4 : Direct upload of firmware bundles

Ok, so this one isn’t earth-shattering, but I thought it was worth mentioning.

Prior to UCSM 1.4, the only way to transfer firmware bundles to UCSM was via an external server – FTP, TFTP, SCP, or SFTP. In most shops, this isn’t a big deal – you likely already have a utility server of some type available on your management network(s) for other similar tasks. In some scenarios (especially greenfield deployments), though, you may not have ready access to such a server, or for other reasons may not want to put your UCS code there.

With 1.4, you can now upload firmware directly from the UCSM client.   When selecting the Download Firmware option in Firmware Management,

You are now presented with the option to either upload a file from your local workstation,

or use the traditional method of transferring the file from a remote server.

Again, not a huge deal, but definitely a nice convenience enhancement.

UCSM 1.4 : Direct attach appliance/storage ports!

One of the most often requested features in the early days of UCS was the ability to directly attach 10GE storage devices (both Ethernet and FCoE based) to the UCS Fabric Interconnects.

Up until UCSM 1.4, only two types of Ethernet port configurations existed in UCS – Server Ports (those connected to IO Modules in the chassis) and Uplink Ports (those connected to the upstream Ethernet switches). Because UCS treated all Uplink ports equally, you could not connect an end device such as a storage array or server to those ports in a supported manner. There were, of course, clever customers who found ways to do it – but it wasn’t the “right” or optimal way to do it.

Especially within the SMB market, many customers may not have existing 10G Ethernet infrastructures outside of UCS, or FC switches to connect storage to.   For these customers, UCS could often provide a “data center in a box”, with the exception of storage connectivity.   For Ethernet-based storage, all storage arrays had to be connected to some external Ethernet switch, while FC arrays had to be connected to a FC switch.   Adding a 10G Ethernet or FC switch just for a few ports didn’t make a lot of financial sense, especially if those customers didn’t have any additional need for those devices beyond UCS.

With UCSM 1.4, all of that changes.   Of course, the previous method of connecting to upstream Ethernet and FC switches still exists, and will still be the proper topology for many customers.  Now, however, a new set of options has been opened.

Take a look at some of the new port types available in UCSM 1.4 :

New in 1.4 are the Appliance, FCoE Storage, Monitoring Ethernet, Monitoring FC, and Storage FC port types.

I’ll cover the Monitoring types in a later post.

On the Ethernet side of things, the Appliance and FCoE Storage port types allow for the direct connection of Ethernet storage devices to the Fabric Interconnects.

The Appliance port is intended for connecting Ethernet-based storage arrays (such as those serving iSCSI or NFS services) directly to the Fabric Interconnect. If you recall from previous posts, in the default deployment mode (Ethernet Host Virtualizer), UCS selects one Uplink port to accept all broadcast and multicast traffic from the upstream switches. By adding this Appliance port type, you can ensure that any port configured as an Appliance Port will not be selected to receive broadcast/multicast traffic from the Ethernet fabric; it also provides the ability to configure VLAN support on the port independently of the other Uplink ports.

The FCoE Storage Port type provides similar functionality to the Appliance Port type, while extending FCoE protocol support beyond the Fabric Interconnect. Note that this is not intended for an FCoE connection to another FCF (FCoE Forwarder) such as a Nexus 5000. Only direct connection of FCoE storage devices (such as those produced by NetApp and EMC) is supported. When an Ethernet port is configured as an FCoE Storage Port, traffic is expected to arrive without a VLAN tag. The Ethernet headers will be stripped away and a VSAN tag will be added to the FC frame. As with the previous FC port configuration, only one VSAN is supported per FCoE Storage Port. Think of these ports like an Ethernet “access” port – the traffic is expected to arrive un-tagged, and the switching device (in this case, the Fabric Interconnect) will tag the frames with a VSAN to keep track of them internally. When the frames are eventually delivered to the destination (typically the CNA on the blade), the VSAN tag will be removed before delivery. Again, it’s very similar to traffic flowing through a traditional Ethernet switch, access port to access port. Even though both the sending and receiving devices expect un-tagged traffic, it’s still tagged internally within the switch while in transit.
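As a purely conceptual illustration of that access-port analogy (this is not Fabric Interconnect code), here’s a tiny Python model of an FCoE Storage Port: the frame arrives un-tagged, picks up the port’s single configured VSAN internally, and has the tag stripped again before delivery.

```python
# Conceptual model only - illustrates the "access port" behavior described
# above, not actual Fabric Interconnect code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FcFrame:
    src: str                    # source FC endpoint (e.g. the storage array)
    dst: str                    # destination FC endpoint (e.g. a blade's vHBA)
    vsan: Optional[int] = None  # un-tagged on the wire at an "access"-style port

def ingress_fcoe_storage_port(frame: FcFrame, port_vsan: int) -> FcFrame:
    # Traffic is expected to arrive without a tag; the Fabric Interconnect
    # adds the port's single configured VSAN to track the frame internally.
    assert frame.vsan is None, "FCoE Storage Ports expect un-tagged traffic"
    frame.vsan = port_vsan
    return frame

def egress_to_destination(frame: FcFrame) -> FcFrame:
    # The internal VSAN tag is removed before delivery to the destination
    # (typically the CNA on the blade).
    frame.vsan = None
    return frame

frame = ingress_fcoe_storage_port(FcFrame(src="array-0a", dst="blade1-vhba0"), port_vsan=100)
frame = egress_to_destination(frame)
```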

The Storage FC Port type allows for the direct attachment of an FC storage device to one of the native FC ports on the Fabric Interconnect expansion modules. As with the FCoE Storage Port type, the FC frames arriving on these ports are expected to be un-tagged – so no connection to an MDS FC switch, etc. Each Storage FC Port is assigned a VSAN number to keep the traffic separated within the UCS Unified Fabric. When used in this way, the Fabric Interconnect does not provide any FC zoning configuration capabilities – all devices within a particular VSAN will be allowed, at least at the FC switching layer (FC2), to communicate with each other. The expectation is that the devices themselves, through techniques such as LUN masking, will provide the access control. This is acceptable for small implementations, but does not scale well for larger or more enterprise-like configurations. In those situations, an external FC switch should be used either for connectivity or to provide zoning information – the so-called “hybrid model”. I’ll cover the hybrid model in a later post.

What’s cool in UCSM 1.4?

Since so many other great bloggers announced earlier this month that Cisco had released UCS Manager 1.4 (codenamed ‘Balboa’), I didn’t see any reason to wade into the fray with yet another summary of the release notes. For one such excellent summary, see Steve Chambers’ post here: http://viewyonder.com/2010/12/20/ciscoucs-1-4-is-here/

Instead I thought it might be useful, especially for those new to UCS, to do a series of posts on the new features (there’s a ton of them!) and what they really mean to an existing or potential UCS shop. I’m really excited by this release, as there are so many cool new things that really cement UCS as a top-notch architecture. So many of my wish-list items – and so many of the features I’ve heard customers asking for – have been delivered in 1.4, so I’m sure this upgrade is going to make a lot of people very happy.

So, I have a handful of features that I plan to detail over the next few days and weeks, but I’d like to know – what features are you most curious about?  What features perhaps do you not see the value of?   Your comments will help me prioritize my posts!

UCSM 1.3(1c) Released!

Cisco has released UCS Manager version 1.3(1c).   This is the first public release in the 1.3 line, also known as “Aptos+”.

Release notes are here: http://www.cisco.com/en/US/docs/unified_computing/ucs/release/notes/ucs_22863.html

Haven’t gotten a chance to play with the new version yet, but there are some significant enhancements.    Among them…

  • 1 GE support on UCS6120 and UCS6140 Fabric Interconnects
    • On the 6120, you can now use 1GE transceivers in the first 8 physical ports.
    • On the 6140, you can now use 1GE transceivers in the first 16 physical ports.
    • Watch for a post soon on why I think this is a bad idea.  🙂
  • Support for the new, 2nd generation mezzanine cards
    • Both Emulex and Qlogic have produced a 2nd generation mezzanine card, using a single-chip design which should lower power consumption
      • Be warned that these new mezzanine cards won’t support the “Fabric Failover” feature as supported by the first generation CNAs, or by the VIC (Palo) adapter
      • These aren’t shipping quite yet, but will be soon
    • A Broadcom BCM57711 mezzanine adapter
      • This will compete with the Intel based, 10GE mezzanine adapters that UCS has had until now
      • The Broadcom card supports TOE (TCP Offload Engine) and iSCSI Offload, but not iSCSI boot
    • An updated Intel mezzanine adapter, based on the Niantic chipset
  • Support for the B440-M1 blade
    • The B440 blade will be available in a 2 or 4 processor configuration, using the Intel Xeon 7500 processors
    • Up to 4 SFF hard drives
    • 32 DIMM slots, for up to 256GB of memory
    • 2 Mezzanine slots
    • Full-width form factor
  • SSD hard drive support in B200-M2, B250-M2, and B440-M1 blades
    • First drive available is a Samsung 100GB SSD
  • Improved SNMP support
  • Ability to configure more BIOS options, such as virtualization options, through the service profiles
    • This is a big step towards making UCS blades honestly and truly stateless
    • Previously, I’d recommended that UCS customers configure each blade’s BIOS options to support virtualization when they received them, whether or not they were going to use ESX/etc on all of the blades.  This way they didn’t have to worry about setting them again when moving service profiles
  • Support for heterogeneous mezzanine adapters in full-width blades
  • Increased the supported limit of chassis to 14.
  • Increased the limit of VLANs in UCSM to 512
    • There’s been some discussion around this lately, particularly in the service provider space. Many service providers need many more VLANs than this for their architectures.
    • I’ve seen reference to a workaround using ESX, Nexus 1000V, private VLANs, and a promiscuous VLAN through the Fabric Interconnect into an upstream switch, but I’m still trying to get my head around that one.  🙂
  • Ability to cap power levels per blade
    • Will have to wait until I get a chance to test out the code level to see what kinds of options are available here

Looking forward to seeing customer reaction to the new features.

UCS Manager 1.2(1) Released

As a full UCS bundle (including all code – from the lowliest baseboard management controller to the UCS Manager in all its process-preserving glory), Cisco has released version 1.2(1).

Full release notes are available here.

To summarize, this release adds support for the soon-to-be-shipping “M2” versions of the UCS blades, which support the Intel Xeon 5600-series (Westmere) processors, including 6-core versions of the Nehalem lineage. There are also numerous bug fixes (expected in this generation of product), including many on my list of “slightly annoying but still ought to be fixed” bugs.

MAC forwarding table aging on UCS 6100 Fabric Interconnects

I was recently forwarded some information on the MAC table aging process in the UCS 6100 Fabric Interconnects that I thought was very valuable to share.

Prior to this information, I was under the impression (and various documentation had confirmed) that the Fabric Interconnect never ages MAC addresses – in other words, it understands where all the MAC addresses are within the chassis/blades, and therefore has no need to age out addresses. In the preferred Ethernet Host Virtualizer mode, it also doesn’t learn any addresses from the uplinks, so again there would be no need to age a MAC address.

So what about VMware and the virtual MAC addresses that live behind the physical NICs on the blades?

Well, as it turns out, the Fabric Interconnects do age addresses, just not those assigned by UCS Manager to a physical NIC (or a vNIC on a Virtual Interface Card – aka Palo).

On UCS code releases prior to 1.1, learned addresses age out in 7200 seconds (120 minutes), and this timer is not configurable.

On UCS code releases of 1.1 and later, learned addresses age out in 7200 seconds (120 minutes) by default, but can be adjusted in the LAN Uplinks Manager within UCS Manager.

Why do we care? Well, it’s possible that if a VM (from which we’ve learned an address) has gone silent for whatever reason, we may end up purging its address from the forwarding table after 120 minutes… which will mean it’s unreachable from the outside world, since we’ll drop any frame that arrives on an uplink for an unknown unicast MAC address. Only if the VM generates some outbound traffic will we re-learn the address and be able to accept traffic on the uplinks for it.
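Here’s a small conceptual sketch (again, not Fabric Interconnect code) of that behavior: a chatty VM keeps refreshing its entry, the silent VM’s entry ages out after 7200 seconds, and frames arriving on an uplink for the aged-out address are then dropped.

```python
# Conceptual sketch of the behavior described above (not Fabric Interconnect code).
AGING_SECONDS = 7200   # 120 minutes

class MacTable:
    def __init__(self):
        self.entries = {}            # mac -> time the VM last sent traffic

    def learn(self, mac, now):
        self.entries[mac] = now      # any outbound frame from the VM refreshes it

    def expire(self, now):
        self.entries = {m: t for m, t in self.entries.items()
                        if now - t < AGING_SECONDS}

    def uplink_frame_deliverable(self, mac, now):
        # In EHV mode, a frame arriving on an uplink for an unknown unicast
        # MAC is dropped rather than flooded toward the servers.
        self.expire(now)
        return mac in self.entries

table = MacTable()
table.learn("00:25:b5:00:00:01", now=0)        # chatty VM
table.learn("00:25:b5:00:00:02", now=0)        # silent VM
table.learn("00:25:b5:00:00:01", now=7000)     # chatty VM keeps talking

print(table.uplink_frame_deliverable("00:25:b5:00:00:01", now=7500))  # True
print(table.uplink_frame_deliverable("00:25:b5:00:00:02", now=7500))  # False - aged out
```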

So if you have silent VMs and have trouble reaching them from the outside world, you’ll want to upgrade to the latest UCS code release and adjust the MAC aging timeout to something very high (or possibly never).

Moving UCS Service Profile between UCS Clusters

@SlackerUK on Twitter asked about moving Service Profiles between UCS clusters.

In short, it’s not currently possible with UCS Manager without a bit of manual work.

First, create a “logical” backup from UCS Manager. This will create an XML file containing all of the logical configuration of UCS Manager, including your service profiles. Find the service profile you want, and remove everything else from the backup. You can then import that XML file into another UCS Manager instance. Be aware that everything comes along in that XML, including identifiers – so make sure you’re OK with that, or remove the original service profile to eliminate duplicates.
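If you want to script the “remove everything else” step, here’s a rough sketch using Python’s ElementTree. It assumes service profiles appear in the logical backup as lsServer elements with a name attribute – that class name and the file names are assumptions, so inspect your own backup first, since element names and nesting can vary by release.

```python
# Rough sketch: strip a UCSM logical-configuration backup down to one service
# profile. ASSUMPTION: service profiles appear as <lsServer name="..."> elements;
# verify against your own backup, as element names/nesting may vary by release.
import xml.etree.ElementTree as ET

KEEP_PROFILE = "esx-host-01"                 # hypothetical profile name
BACKUP_FILE = "ucsm-logical-backup.xml"      # hypothetical file names
OUTPUT_FILE = "single-profile-import.xml"

tree = ET.parse(BACKUP_FILE)
root = tree.getroot()

# Walk the tree and drop every service profile except the one we want to keep.
for parent in root.iter():
    for child in list(parent):
        if child.tag == "lsServer" and child.get("name") != KEEP_PROFILE:
            parent.remove(child)

tree.write(OUTPUT_FILE)
```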

If you’re using BMC Bladelogic for Cisco UCS, it *does* have the capability to move service profiles between clusters.