AbrisTech: 2011

Friday, September 30, 2011

Solaris10 u10 - lucreate failures

After several successful upgrades to the recently released Solaris 10 u10 I ran into an issue:
lucreate failed to prepare the alternate boot environment with an error like:

...
Mounting ABE .
ERROR: mount: /zones/myzone-dataset1/legacy: No such file or directory
ERROR: cannot mount mount point 
 ...

and several warnings like:

WARNING: Directory zone lies on a filesystem shared between BEs, remapping path to .

Hmm, OK. I don't need these filesystems mounted, but I can safely mount them (at least temporarily) if that solves the problem. I set a new mountpoint and lucreate finished successfully, with warnings only.
The box is not in a critical environment, so the warnings were ignored - BIG MISTAKE -

!!! Do not ignore WARNINGS during lucreate !!!
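One way to make the warnings hard to miss is to keep the lucreate output in a file and scan it before going any further. The sketch below is only an illustration: scan_lu_log is a hypothetical helper (not part of Live Upgrade), and it assumes that problem lines start with "WARNING:" or "ERROR:" as in the messages quoted above.

```shell
#!/bin/bash
# Hypothetical helper, not part of Live Upgrade.
# Prints every line of the given log that starts with WARNING: or ERROR:
# and returns non-zero if at least one such line was found.
scan_lu_log() {
    local log="$1"
    if grep -E '^(WARNING|ERROR):' "$log"; then
        return 1
    fi
    return 0
}

# Possible use on the Solaris host (log path is an assumption):
#   lucreate -n sol10u10 2>&1 | tee /tmp/lucreate.log
#   scan_lu_log /tmp/lucreate.log || echo "Review these messages before luupgrade / luactivate"
```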

But anyway, the upgrade finished successfully, the new BE was activated, init 6 ...

The server started, but two zones failed to start ... (other zones on the server booted without issues)
An attempt to boot an affected zone resulted in multiple complaints about filesystems that are not "legacy" mounted in the global zone ...
Hmm ... Looking at the output of zonecfg -z myzone export ( or info ), I see a bunch of

  add fs
  set dir=....

in addition to the correctly defined

 add dataset
 set name=...

I fixed the zone config by removing all the fs records that shouldn't be there. Another boot attempt showed that the system was trying to boot the zone from zonepath=/zones/myzone-sol10u10 ( instead of /zones/myzone )

I checked the real state of the filesystems and tried to fix the zone config again,
but: "Zone myzone already installed; set zonepath not allowed."

Not allowed, but it can be done by editing /etc/zones/myzone.xml and /etc/zones/index ( Don't forget to back up the current files ... )
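A minimal sketch of the backup step, assuming timestamped copies next to the originals are good enough. backup_files is a hypothetical helper name, and the ".bak" naming is just my choice for the illustration.

```shell
#!/bin/bash
# Hypothetical helper: copy each given file to <file>.<timestamp>.bak,
# preserving mode and timestamps, and print the name of every copy.
# Stops with a non-zero status on the first file that cannot be copied.
backup_files() {
    local stamp f
    stamp=$(date +%Y%m%d%H%M%S)
    for f in "$@"; do
        cp -p "$f" "$f.$stamp.bak" || return 1
        echo "$f.$stamp.bak"
    done
}

# On the affected host this would be called before editing anything:
#   backup_files /etc/zones/myzone.xml /etc/zones/index
```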

It looks much better now - all zones are up and running ...

But lucreate is still broken and fails on any attempt to create a new BE.
Looks like a bug in Live Upgrade. A search turns up the same issue in this thread. Currently there are no updates for patches 121431 (x86) and 121430 (SPARC); I'm double-checking and filing a bug.

Update:
After a long conversation with Oracle I was able to confirm that there is a bug in the current LU suite ( Patch 121431-67 ). The solution is simple - downgrade LU to 121431-58.
If the old version of LU was not backed up, just install the original one from the Solaris media.
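Before and after the downgrade it helps to confirm which revision of the LU patch is actually installed. The sketch below assumes that showrev -p prints one "Patch: ID-REV ..." line per installed patch; lu_patch_rev is a hypothetical helper that only parses such text and does not touch the system.

```shell
#!/bin/bash
# Hypothetical helper: read `showrev -p` style text on stdin and print the
# highest installed revision of the given patch ID (nothing if not installed).
# Assumes lines of the form "Patch: 121431-67 Obsoletes: ...".
lu_patch_rev() {
    local id="$1"
    sed -n "s/^Patch: ${id}-\([0-9][0-9]*\) .*/\1/p" | sort -n | tail -1
}

# Possible use on the host (x86 patch ID from the post):
#   showrev -p | lu_patch_rev 121431
# A revision that was applied as a patch can normally be backed out with
#   patchrm 121431-67
```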

Tuesday, September 20, 2011

Dell PERC controllers and Solaris

By default Solaris doesn't include tools for monitoring and managing Dell RAID adapters, but most of these cards ( PERC H700, 6/i ... ) are re-branded LSI controllers.
Even if the adapter is used in a minimalistic config ( almost JBOD ) and RAID functionality is delegated to ZFS, I'd prefer to have at least some visibility into the state of the card ( battery, memory ... )

Solaris 10 uses the mega_sas ( LSI ) driver, so for configuration, monitoring, etc ... you can safely use the MegaCli utility, which can be downloaded from the LSI support site.

Not sure if it is officially supported by Dell or Oracle, but it works - personally tested on the H700 and 6/i - just make sure that you run it with root privileges.
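As a rough illustration of what can be done with the MegaCli output, the sketch below filters the logical drive report for drives that are not in the Optimal state. ld_not_optimal is a hypothetical helper; it assumes the report contains lines of the form "State : Optimal" and that the binary is available as MegaCli in the PATH, which may differ on your system.

```shell
#!/bin/bash
# Hypothetical helper: read `MegaCli -LDInfo -Lall -aALL` output on stdin,
# print every "State : ..." line that is not Optimal and return non-zero
# if at least one such line was found.
ld_not_optimal() {
    local bad
    bad=$(grep -E '^State[[:space:]]*:' | grep -v -E ':[[:space:]]*Optimal[[:space:]]*$' || true)
    if [ -n "$bad" ]; then
        echo "$bad"
        return 1
    fi
    return 0
}

# Possible use as root on the host:
#   MegaCli -LDInfo -Lall -aALL | ld_not_optimal || echo "check the array"
# Other reports worth a look:
#   MegaCli -AdpAllInfo -aALL
#   MegaCli -AdpBbuCmd -GetBbuStatus -aALL
#   MegaCli -PDList -aALL
```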

As a monitoring tool, raid-monitor can be used with Xymon. It generates an alert if the current state differs from a previously generated "good" reference file.
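The reference-file idea itself is simple and can be sketched in a few lines. This is not the code of raid-monitor, just an illustration of the approach: check_against_reference is a hypothetical function, and storing the first observed state as the "good" reference is my assumption.

```shell
#!/bin/bash
# Hypothetical sketch of a reference-file check (not raid-monitor's code).
# Reads the current state on stdin and compares it with the reference file.
# Prints "green" if they match, "red" followed by the diff otherwise.
# If the reference does not exist yet, the current state becomes the reference.
check_against_reference() {
    local ref="$1" cur
    cur=$(mktemp) || return 2
    cat > "$cur"
    if [ ! -f "$ref" ]; then
        cp "$cur" "$ref"
        echo green
    elif cmp -s "$cur" "$ref"; then
        echo green
    else
        echo red
        diff "$ref" "$cur" || true
    fi
    rm -f "$cur"
}

# Possible use (reference path is an assumption):
#   MegaCli -LDInfo -Lall -aALL | check_against_reference /var/tmp/raid.reference
```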

Friday, September 16, 2011

Solaris 10 8/11 is released

Solaris 10 u10 ( 8/11 ) is released and available for download.
Notes on the upgrade ( from u9 ):


bash-3.00# lofiadm -a /export/home/iso/sol-10-u10-ga2-x86-dvd.iso 
/dev/lofi/1
bash-3.00# mount -F hsfs /dev/lofi/1 /mnt/
bash-3.00# /mnt/Solaris_10/Tools/Installers/liveupgrade20  # Only if upgrading from an older Solaris release where Live Upgrade 2.0 is not installed
bash-3.00# lucreate -n sol10u10
bash-3.00# echo "auto_reg=disable" > /tmp/sysidcfg
bash-3.00# luupgrade -u -n sol10u10 -s /mnt -k  /tmp/sysidcfg 
...
...
INFORMATION: The file /var/sadm/system/logs/upgrade_log on boot 
environment sol10u10 contains a log of the upgrade operation.
INFORMATION: The file /var/sadm/system/data/upgrade_cleanup on boot 
environment sol10u10 contains a log of cleanup operations required.
INFORMATION: Review the files listed above. Remember that all of the files 
are located on boot environment sol10u10. Before you activate boot 
environment sol10u10, determine if any additional system maintenance is 
required or if additional media of the software distribution must be 
installed.
The Solaris upgrade of the boot environment sol10u10 is complete.
...


bash-3.00# lumount sol10u10 /a

Review the log files: /a/var/sadm/system/logs/upgrade_log and /a/var/sadm/system/data/upgrade_cleanup


bash-3.00# luumount /a
bash-3.00# luactivate  sol10u10
bash-3.00#  init 6

Review results of the upgrade


bash-3.2# lustatus  | egrep sol10u10\|Name\|Env\|--
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
sol10u10                   yes      yes    yes       no     -         
bash-3.2# uname -svr
SunOS 5.10 Generic_147441-01

If everything looks OK ( and in my case there were no issues ), proceed with the upgrade of zpool and zfs and, in the case of a mirrored boot pool, don't forget to update GRUB on the second disk.
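For reference, the follow-up steps could look roughly like the commands printed by the sketch below. The pool name rpool and the slice c0t1d0s0 are assumptions for illustration only, and post_upgrade_cmds is a hypothetical helper that merely prints the commands so they can be reviewed first; keep in mind that upgrading the system pool is one-way.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the post-upgrade commands for a given
# root pool and for the root slice of the second disk in an x86 mirrored pool.
# Upgrading the pool is irreversible, so review the list before running it.
post_upgrade_cmds() {
    local pool="$1" slice="$2"
    echo "zpool upgrade $pool"
    echo "zfs upgrade -r $pool"
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/$slice"
}

# Example with assumed names:
#   post_upgrade_cmds rpool c0t1d0s0
```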

Changes since u9 ( zpool v22 , zfs v4 ):

zpool:

VER  DESCRIPTION
---  --------------------------------------------------------

 23  Slim ZIL
 24  System attributes
 25  Improved scrub stats
 26  Improved snapshot deletion performance
 27  Improved snapshot creation performance
 28  Multiple vdev replacements
 29  RAID-Z/mirror hybrid allocator
bash-3.2# zfs upgrade -v
The following filesystem versions are supported:

zfs

VER  DESCRIPTION
---  --------------------------------------------------------
 5   System attributes

!!! If the system zpool is upgraded to the new version, there will be no way to boot into the old environment !!!

Update:
There are potential issues when upgrading a host with multiple zones; see the details in the post "Solaris10 u10 - lucreate failures".

Tuesday, September 13, 2011

Xymon - monitoring from the cloud

In this post I'm going to deploy Xymon in the Amazon cloud ( AWS ) for off-site monitoring.

Running an “external” monitor in the cloud is an efficient alternative to third-party services ( search for "External Website Monitoring" ). Easy installation, a small footprint (no database) and the flexibility of Xymon make it a very attractive instrument for such a project.

Below are just some notes for a minimal setup:

Log on to AWS and launch the smallest instance ( t1.micro ) using the Basic 32-bit Amazon Linux AMI. Make sure that SSH and HTTP ( and/or HTTPS ) connections are permitted in the security group.


ssh -i YourKey.pem ec2-user@ec2-XX.compute-1.amazonaws.com

[ec2-user@mon ~]$ sudo -i
[root@mon ~]# yum -y update
…
 ( reboot if needed )
... 
[root@mon ~]# useradd -m xymon

Now let’s add all packages we need for the build


[root@mon ~]# yum -y  install subversion fping gcc gcc-c++ openssl-devel make \
              binutils rrdtool rrdtool-devel  pcre-devel httpd cyrus-sasl-devel \
              ncurses-devel

Get the source code from the repository ( or download an archive from xymon.com ).


[root@mon ~]# mkdir src
[root@mon ~]# cd src
[root@mon ~]# svn co https://xymon.svn.sourceforge.net/svnroot/xymon/branches/4.3.5
[root@mon ~]# cd 4.3.5
[root@mon ~]# ./configure
...
I found fping in /usr/sbin/fping
Do you want to use it [Y/n] ?
Y
…
Do you want to be able to test SSL-enabled services (y) ?
Y
…
What group-ID does your webserver use [nobody] ?
apache
…
[root@mon ~]# make &&  make install

OK, the application is installed and can be started, but for now it will only check localhost and report to log files.

Let's prepare the front end - the Apache web server.

Create a web user and restrict access to /xymon/


[root@mon ~]# htpasswd -c /etc/httpd/xymonpasswd  admin
[root@mon ~]# cp ~xymon/server/etc/xymon-apache.conf /etc/httpd/conf.d/xymon-apache.conf
[root@mon ~]# sed -i 's|/home/xymon/server/etc/xymonpasswd|/etc/httpd/xymonpasswd|g' /etc/httpd/conf.d/xymon-apache.conf
[root@mon ~]# sed -i 's/AuthGroupFile/#AuthGroupFile/g' /etc/httpd/conf.d/xymon-apache.conf

Review the /etc/httpd/conf.d/xymon-apache.conf ( and httpd.conf ) files, and start/restart the apache service.
Then sudo to xymon and add monitoring targets to ~/server/etc/hosts.cfg ( read the manpage )

As an example we can test some Google sites. In the future, the connection to google.com could be used as an "always up" service: adding a dependency on it helps avoid noise from hiccups on the AWS network.


group-compress Web services
0.0.0.0  www.google.com                   # http://www.google.com
0.0.0.0  encrypted.google.com            # https://encrypted.google.com/

group-compress DNS
8.8.8.8  google-public-dns-a.google.com  # dns=A:www.google.com,MX:google.com

group-compress Local
127.0.0.1   localhost      # bbd http://localhost/

More sophisticated examples are available on http://xymon.com

Now start the Xymon server ( as user xymon ) with
~/server/xymon.sh start and check your page http://ec2-XX.compute-1.amazonaws.com/xymon/

Now - the tricky part.
The web interface is nice, trends, etc …, but what about alert notifications?
It's easy to add a record to ~xymon/server/etc/alerts.cfg, but most likely e-mails from an AWS host will be delivered to a spam folder …
One solution is to use the Amazon email service;
another is to use any public e-mail provider that supports SMTP authentication.
For example:
- create a new mail account on mail.google.com
- compile mutt on your virtual server ( the one from the AWS yum repositories won’t work ... )
-- get the recent source from http://www.mutt.org/download.html


./configure --enable-imap --enable-smtp --with-sasl --with-ssl && make && make install

Create .muttrc in ~xymon with the following contents:


# SENDING MAIL

set copy=yes
set smtp_url="smtp://NEW.EMAIL@smtp.gmail.com:587/"
set smtp_pass="EMAIL.PASS"
set from="NEW.EMAIL@gmail.com"
set realname="Xymon in the Cloud"

# RECEIVING MAIL
set imap_user = "NEW.EMAIL@gmail.com"
set imap_pass = "EMAIL.PASS"
set folder = "imaps://imap.gmail.com:993"
set spoolfile="imaps://imap.gmail.com/INBOX"
set postponed="imaps://imap.gmail.com/Drafts"
set record="imaps://imap.gmail.com/Sent"
set message_cachedir=~/.mutt/cache/bodies
set certificate_file=~/.mutt/certificates
set move = no

Verify that it really works:


date | mutt -s test  you_real_address@provider.com

And create a script for alert notifications like:


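[xymon@mon ~]$  cat ~xymon/bin/m.sh
#!/bin/bash
# Alert script: RECOVERED, BBCOLORLEVEL, BBHOSTSVC, BBALPHAMSG and RCPT
# are set in the environment by Xymon when the script is called.

# Quote the variables so that empty values or spaces don't break the test
if [ "${RECOVERED}" = "1" ]
        then
                export BBCOLORLEVEL="RECOVERED"
                export BBCOLOR="green"
        else
                export BBCOLOR="$BBCOLORLEVEL"
fi

S="$BBHOSTSVC:$BBCOLOR"

# Quoting keeps the line breaks of the message and a subject with spaces intact
echo "$BBALPHAMSG" | mutt -s "$S" "$RCPT"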

Finally, add alert rules ( ~xymon/server/etc/alerts.cfg )


HOST=* COLOR=red
     SCRIPT /home/xymon/bin/m.sh you_real_address@provider.com FORMAT=TEXT REPEAT=3h RECOVERED

DONE

Long story, many steps, but in reality it should take less than an hour to have basic monitoring running.
The operational cost of this setup will definitely be lower than that of comparable services from “remote site monitoring” providers.

PS. Before real usage, don't forget to switch to HTTPS, review all config files ...
Subscribe to the Xymon mailing list ( http://xymon.com/xymon/help/known-issues.html ) for friendly support; ask for help and give help to others.

Saturday, September 10, 2011

Using xymon to monitor status of cfengine

During the deployment of cfengine one of my biggest concerns was how to make sure that it is working as expected. Obviously there are multiple elements in the engine itself that can alert on or, even better, fix many issues.

The free version doesn't provide reporting, trends ... - no visualization - but it has enough to build external analyzers and reporting on top of it. External tests will also give you a bit more confidence that everything is OK (or that something is wrong).

As the main monitoring platform I'm using Xymon, and its functionality can be easily extended.

Initially I'd like to ensure that all agents are alive and really talking to the server(s).
In my case I expect a connection to be established approximately every 5 min (default behavior), so the "last seen" value should be less than 5 min + "splaytime".
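The threshold logic can be sketched as follows. This is not the code of the extension, only an illustration of the rule above; lastseen_color is a hypothetical function, and the default values ( 5 min interval, 0 min splaytime ) are assumptions that should be replaced with your own settings.

```shell
#!/bin/bash
# Hypothetical sketch of the "last seen" rule (not the extension's code).
# Arguments: seconds since the agent was last seen,
#            expected interval in minutes (default 5),
#            splaytime in minutes (default 0).
# Prints "green" if the agent was seen within interval + splaytime, else "red".
lastseen_color() {
    local age_s="$1" interval_min="${2:-5}" splay_min="${3:-0}"
    local limit_s=$(( (interval_min + splay_min) * 60 ))
    if [ "$age_s" -lt "$limit_s" ]; then
        echo green
    else
        echo red
    fi
}

# Example: agent seen 200 seconds ago, 5 min interval, 1 min splaytime
#   lastseen_color 200 5 1
```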

The code of the extension is available for download from the Google Code page.

Or check out the most current version from svn:

svn co https://abris.googlecode.com/svn/trunk/xymon-ext/cfengine

requirements:

cfengine3
python 2.6+

Example of a healthy chart, looking from the cfserver:

Significant spikes in the chart indicate that you need to check the status of the suspicious agent.

I'm planning to add more features to this test, so stay tuned.