October 2011 – platformadmin.com

Troubleshooting: sas.servers & hostname aliases

I ran into an interesting problem over the last few days with my SAS® 9.3 deployments on Linux. I had noticed that the sas.servers scripts for my Lev2 and Lev3 deployments were taking several minutes to start the SAS services. I know JBoss takes ages to start, but the SAS services usually start really quickly. I had assumed Remote Services was the culprit but, surprisingly, the problem turned out to be starting the SAS/CONNECT Job Spawner service. Even more surprisingly, the underlying cause was the hostname embedded in the spawner’s log filename: the logical hostname alias I used during deployment was different from the physical hostname of the machine it was deployed on.

To start the troubleshooting process I modified the sas.servers script to add some timing information to the console log messages it generated. Editing the sas.servers script is not necessarily the best option but more on that later. The sas.servers script has a function called logmsg which seems to have been very nicely provided in order to customize its logging. I wanted to make a simple change to add a time stamp so I could see where all the time was being spent. I changed the function from this (reformatted and comments omitted for brevity):

logmsg()
{
    echo "$*"
}

… to this:

logmsg()
{
    DT=`date +%Y-%m-%d:%H:%M:%S`
    echo "$DT: $*"
}

I stopped the services since they were already running:

root@sas93lnx01:~# cd /opt/ebiedieg/Lev3
root@sas93lnx01:/opt/ebiedieg/Lev3# ./sas.servers stop
2011-10-30:21:51:31: Stopping SAS servers

Then checked they were all stopped:

root@sas93lnx01:/opt/ebiedieg/Lev3# ./sas.servers status
2011-10-30:21:51:59: SAS servers status:
2011-10-30:21:51:59: SAS Metadata Server 1 is NOT up
2011-10-30:21:51:59: SAS OLAP Server 1 is NOT up
2011-10-30:21:51:59: SAS Object Spawner 1 is NOT up
2011-10-30:21:51:59: SAS Share Server 1 is NOT up
2011-10-30:21:51:59: SAS CONNECT Spawner 1 is NOT up
2011-10-30:21:51:59: SAS Remote Services 1 is NOT up
2011-10-30:21:51:59: SAS Framework Data Server 1 is NOT up

Once stopped, I started them again so I could see where all the time was spent:

root@sas93lnx01:/opt/ebiedieg/Lev3# ./sas.servers start
2011-10-30:21:54:54: Starting SAS servers
2011-10-30:21:55:03: SAS Metadata Server 1 is UP
2011-10-30:21:55:07: SAS OLAP Server 1 is UP
2011-10-30:21:55:11: SAS Object Spawner 1 is UP
2011-10-30:21:55:11: SAS Share Server 1 is UP
2011-10-30:21:56:12: SAS CONNECT Spawner 1 is NOT up
2011-10-30:21:56:28: SAS Remote Services 1 is UP
2011-10-30:21:56:32: SAS Framework Data Server 1 is UP

Total time to start was 98 seconds. Not too bad really, but it should be quicker than that. Everything seemed to start within a few seconds of each other except for the SAS/CONNECT Job Spawner! It looked like I was wrong to assume it was Remote Services. The SAS/CONNECT Job Spawner appeared to be taking 60 seconds to apparently not start. It had started though. This was confirmed by looking in the process list and the job spawner log file. It all seemed very odd, for two reasons: firstly, the job spawner is usually very fast to start and, secondly, 60 seconds is a bit of a convenient number; it sounded like a timeout.
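Rather than reading the gaps off by eye, the timestamps can be turned into per-step durations. This is only a rough sketch: it assumes the `YYYY-MM-DD:HH:MM:SS:` prefix produced by the modified logmsg function above and that all lines fall on the same day, and the function name step_gaps is made up for illustration.

```shell
# Print the number of seconds between each timestamped line and the one
# before it. Assumes lines look like "YYYY-MM-DD:HH:MM:SS: message" and
# that all lines fall on the same calendar day (no midnight rollover).
step_gaps() {
    awk '
    {
        # Splitting on ":" gives f[1]=date, f[2]=HH, f[3]=MM, f[4]=SS
        split($0, f, ":")
        now = f[2] * 3600 + f[3] * 60 + f[4]
        msg = $0
        sub(/^[^ ]* /, "", msg)      # drop the timestamp prefix
        if (NR > 1) printf "%4ds  %s\n", now - prev, msg
        prev = now
    }'
}
```

Piping the start output through it (for example `./sas.servers start | step_gaps`) should make a one-minute outlier easy to spot next to steps that take only a few seconds.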

The sas.servers script starts each SAS service in turn waiting for the service to become available before starting the next one. It sounded like sas.servers was waiting for some event from the job spawner and gave up after 60 seconds assuming it had not started. Time to dig into the sas.servers script a bit more.

The SAS/CONNECT job spawner is started by the sas.servers script in the start_connect_spawner function. It spawns the spawner (ConnectSpawner.sh start) in the background and then calls the is_atypical_server_up function to wait for it to finish starting. Looking into the is_atypical_server_up function it seems to repeatedly sleep and grep the job spawner log for some trigger text. Effectively:

grep "SAS Job Spawner for Open Systems" /opt/ebiedieg/Lev3/ConnectSpawner/Logs/ConnectSpawner_sas93lnx01.log
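To make the failure mode concrete, here is a minimal sketch of that kind of sleep-and-grep loop. It is not the actual is_atypical_server_up code from sas.servers; the function name, the one-second polling interval, and the argument order are all assumptions for illustration.

```shell
# Wait until PATTERN appears in LOGFILE, or give up after TIMEOUT seconds.
# Returns 0 if the text was found, 1 on timeout. A missing log file is
# treated the same as "text not found yet", which is why a wrong file
# name shows up as a silent timeout rather than an error.
wait_for_log_text() {
    pattern=$1
    logfile=$2
    timeout=${3:-60}     # seconds; 60 matches the delay observed above
    interval=${4:-1}     # seconds between checks (an assumed value)

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if grep -qF -- "$pattern" "$logfile" 2>/dev/null; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}
```

With a loop like this, a service that starts fine but logs to a differently named file will always be reported as not up after the full timeout.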

Now I could see the problem. It was scanning a non-existent log file, ConnectSpawner_sas93lnx01.log, when it should have been scanning ConnectSpawner_sas93lnx03.log, the file the job spawner was actually writing to. The log file name contained the wrong hostname. The name sas93lnx01 was the physical name of the machine, but sas93lnx03 was the host name alias I used when deploying the Lev3 environment. I prefer to use host name aliases in deployments for the benefits they provide in being able to easily move deployments between physical machines and provide disaster recovery options. In this case the Lev3 environment was currently on the same machine as the Lev1 environment, but I knew one day I would move it onto another machine so had used a host name alias sas93lnx03 (a DNS CNAME record) which I could easily redirect later to the appropriate physical machine name (DNS A record).

There was the problem. The SAS/CONNECT job spawner was using the deployment/logical hostname alias whereas the sas.servers script was using the physical hostname (which it got from `uname -n`). With the wrong log file name it could never find the log file and therefore never see the trigger text and would always time out after 60 seconds.
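A quick way to confirm this kind of mismatch is to check which candidate hostnames actually have a log file. A small sketch, assuming the ConnectSpawner_<hostname>.log naming seen above; find_spawner_log is a hypothetical helper, not something shipped with SAS.

```shell
# Report which candidate hostnames have a matching ConnectSpawner log file.
# Usage: find_spawner_log LOGDIR HOSTNAME...
find_spawner_log() {
    logdir=$1
    shift
    for host in "$@"; do
        file="$logdir/ConnectSpawner_${host}.log"
        if [ -f "$file" ]; then
            echo "found:   $file"
        else
            echo "missing: $file"
        fi
    done
}
```

For the deployment described here it could be called as `find_spawner_log /opt/ebiedieg/Lev3/ConnectSpawner/Logs "$(uname -n)" sas93lnx03`.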

There were at least two ways I could fix this problem. The first was to modify the ConnectSpawner.sh script to use the physical host name, rather than deployment host name alias, in the log file name. The second was to modify the sas.servers script to use the deployment hostname alias instead of the physical host name. That was the option I chose. It was the slightly more difficult of the two but I didn’t feel right telling the job spawner to log to a file with the wrong name :)

These were the changes I made to sas.servers to get it to work. Firstly I added an extra line where the SHOSTNAME environment variable is derived, changing this:

SHOSTNAME=`uname -n`

… to this:

SHOSTNAME=`uname -n`
SHOSTNAME_ALIAS="sas93lnx03"

Secondly I modified the start_connect_spawner function to change the name of the log file (as checked by the is_atypical_server_up function) from this:

SLOGNAME="ConnectSpawner_${SHOSTNAME}"

… to this:

SLOGNAME="ConnectSpawner_${SHOSTNAME_ALIAS}"

As it happens, the same problem occurs with the deployment tester server component too so I also edited the start_deployment_testsrv function changing this:

SLOGNAME="DeploymentTesterServer_${SHOSTNAME}"

… to this:

SLOGNAME="DeploymentTesterServer_${SHOSTNAME_ALIAS}"
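One possible variation, if the same edit needs to work on a level that has no alias, is to fall back to the physical hostname when SHOSTNAME_ALIAS is not set. This is a sketch only: log_base_name is a hypothetical helper and not part of sas.servers.

```shell
# Build the log base name for a component, preferring the deployment
# alias (SHOSTNAME_ALIAS, the variable introduced above) and falling
# back to the physical hostname when no alias is set.
# log_base_name is a hypothetical helper, not part of sas.servers.
log_base_name() {
    component=$1
    host=${SHOSTNAME_ALIAS:-$(uname -n)}
    echo "${component}_${host}"
}
```

With SHOSTNAME_ALIAS set to sas93lnx03, `log_base_name ConnectSpawner` prints ConnectSpawner_sas93lnx03; with it unset, the physical hostname is used instead.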

However, there is a potential problem with this method of editing sas.servers directly; I alluded to it earlier in the post. The sas.servers script is a generated script and any edits made to it will get wiped out if/when it is regenerated at a later date. The SAS documentation for the sas.servers script cautions against editing the script. The sas.servers script is created by the generate_boot_scripts.sh script. Next time someone or something runs generate_boot_scripts.sh the changes to sas.servers will be lost.

To protect against this you could modify the templates that are used by generate_boot_scripts.sh. This is what I did. I made all the same changes described above in the single template /opt/ebiedieg/Lev3/Utilities/script_templates/sas.servers.mainlog so they would survive a regeneration. I wouldn’t necessarily recommend this though. It’s not a method documented by SAS Institute and I suspect these templates will most likely get changed at some point during a SAS software upgrade or hotfix. However, it works for me for the time being. Looking back, it probably would have been better to have just changed the ConnectSpawner.sh script to use the wrong hostname ;)
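Since either file could be overwritten without warning, a simple check after any regeneration, upgrade, or hotfix can show whether the customisation is still in place. A minimal sketch; has_alias_patch is a made-up name and it only looks for the SHOSTNAME_ALIAS variable introduced above.

```shell
# Exit status 0 if FILE still mentions SHOSTNAME_ALIAS, 1 otherwise
# (including when the file does not exist).
has_alias_patch() {
    grep -q 'SHOSTNAME_ALIAS' "$1" 2>/dev/null
}
```

For example: `has_alias_patch /opt/ebiedieg/Lev3/sas.servers || echo "sas.servers has lost the alias changes"`.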

With those changes done it was time to stop and start the servers to see any improvements:

root@sas93lnx01:/opt/ebiedieg/Lev3# ./sas.servers start
2011-10-30:23:54:30: Starting SAS servers
2011-10-30:23:54:39: SAS Metadata Server 1 is UP
2011-10-30:23:54:43: SAS OLAP Server 1 is UP
2011-10-30:23:54:47: SAS Object Spawner 1 is UP
2011-10-30:23:54:47: SAS Share Server 1 is UP
2011-10-30:23:54:49: SAS CONNECT Spawner 1 is UP
2011-10-30:23:55:06: SAS Remote Services 1 is UP
2011-10-30:23:55:10: SAS Framework Data Server 1 is UP

This time only 40 seconds (less than half the original time) and the SAS CONNECT Job Spawner is now reported as being up.

As it turns out, I noticed that several of the other SAS services (like the metadata server) use the physical host name in their log filenames. It all works fine but I would have preferred they used the deployment hostname if it were possible. I briefly looked into ways of telling those servers to use the logical hostname alias instead but, since they use the SAS logging framework and the %S{hostname} conversion pattern, I couldn’t see any obvious way other than editing all of the config files and scripts post deployment. If someone knows a good way of consistently and automatically using the deployment hostname (the one provided during deployment and potentially different from the physical hostname) during installation then I’m all ears :)

Metadata Time Travel with SAS 9.3

The way SAS metadata backups are done in SAS® 9.3 has changed significantly. Significantly for the better, I think. There are big changes but they are well worth it.

I’ll warn you now that this is a very long blog post with lots of images. I wanted to show someone how metadata backups are done in SAS 9.3 and also run a few tests for myself so a blog post seemed like a good idea. It turned out to be much longer than I expected though. I’ll split the post up into a short version and a long version. Here’s the short version first …

The Short Version …

I think SAS 9.3 metadata backups are fantastic! They are integrated, configured by default, and continuous. It’s almost worth upgrading to SAS 9.3 just for the metadata backups (let alone all the other enhancements) :)

Integrated: SAS 9.3 metadata backups are performed by the metadata server itself in a background thread. In SAS 9.2 and SAS 9.1.3 the %OMABAKUP macro (and associated backup wizard front end) was used. This macro paused the metadata server, made copies of the metadata repository tables, repository manager, and key config files, then resumed the metadata server. SAS 9.3 metadata backups don’t even require the metadata server to be paused anymore.
Configured by default: With SAS 9.1.3 and SAS 9.2 someone, usually the installer or the administrator, had to consciously decide to configure SAS metadata backups and arrange for them to be scheduled. In the past I had spoken to people who didn’t do metadata backups or assumed their file system backups would be sufficient. With a brand new SAS 9.3 installation, metadata backups will be scheduled by default: daily/nightly at 1am, with weekly reorganizations, and keeping 7 days of prior backup history. Even those sites that didn’t know they should be doing metadata backups should have them without realizing it. You would have to make a conscious decision to disable metadata backups rather than a decision to enable them.
Continuous: This is my favourite. SAS 9.3 is effectively backing up metadata continuously. With SAS 9.2 and 9.1.3 only periodic, usually daily, snapshots of metadata were taken – if disaster struck at the end of the day you had to decide whether it was worth losing the day’s metadata changes to revert to the previous night’s snapshot. In SAS 9.3 the combination of regular metadata snapshots/backups with journals of metadata transactions provides a continuous backup. The journal, previously only used for performance gains, now also acts as a replay log for metadata transactions. This allows roll-forward recovery of metadata to any point in time. It’s like having a “metadata time machine” – you can roll back to any point in the history you have kept, be it 5 days ago or 5 minutes ago.

Some might say that these concepts have been around in other software for years. That may well be the case, but the fact that SAS Institute are now providing this method of metadata recovery in SAS 9.3 is still a stroke of genius in my opinion.

More information on SAS 9.3 metadata backups can be found in the SAS® 9.3 Intelligence Platform: System Administration Guide in the section Backing Up and Recovering the SAS Metadata Server

The Long Version …

If you are only interested in the short version then stop reading here. Otherwise, please read on. You might want to grab a tea/coffee/beer/wine and settle in though. Don’t say I didn’t warn you. ;)

I wanted to test out the roll-forward recovery feature with SAS 9.3 metadata backups in a scenario that would not be recoverable using nightly SAS 9.2 or SAS 9.1.3 metadata backups. I decided to test recovering content that was both newly created and accidentally deleted, on the same day, between 2 consecutive nightly snapshots/backups. With an effective continuous backup in SAS 9.3, roll-forward recovery should allow a restore to the point in time between the content creation and its accidental deletion. The following timeline diagram illustrates the steps taken in this test:

In the diagram above the green boxes represent metadata backups or snapshots (either scheduled or ad-hoc). The yellow boxes represent the metadata journals being all the metadata transactions that follow the prior snapshot. The grey boxes represent the metadata backup directories that contain a snapshot backup and a journal file containing the metadata transactions up to the time of any subsequent snapshot backup.

The sequence of events for this test was as follows:

06Oct2011:01:00: the normal scheduled nightly backup
06Oct2011:19:49: an ad-hoc backup taken late in the day just prior to the test content creation and deletion. This ad-hoc backup was not really necessary. The previous backup would have been sufficient, but it gave me a bit of extra practice with ad-hoc backups!
06Oct2011:19:58: some test content was created – in this case a folder and a stored process.
06Oct2011:20:02: the test content was ‘accidentally’ deleted. With the creation and deletion both occurring between snapshot backups, this would not normally be recoverable without the roll-forward capability (and in the absence of any intermediate snapshot backup or personal metadata export (.spk) backup taken after the content creation but before its deletion).
06Oct2011:20:06: another ad-hoc backup taken just after the test content deletion. This additional ad-hoc backup was also not necessary but it did allow me to verify that roll-forward recovery could still be done with an ad-hoc backup taken after the ‘disaster’.
06Oct2011:20:22: the roll-forward recovery restore. To get the content back, this had to be timed between the content creation (19:58) and its deletion (20:02). Half way in between at 20:00 should do. A restore was done using the 19:49 backup, the most recent one prior to the deletion, rolled-forward to 20:00, just before the deletion occurred at 20:02.
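The choice of which backup to restore from follows a simple rule: take the most recent backup that is not later than the target recovery time. In this test that choice was made by hand in SAS Management Console; the sketch below just spells out the reasoning, assuming timestamps written in the same ISO 8601 style as the roll-forward end time used later (seconds assumed to be :00). The name pick_base_backup is invented for illustration.

```shell
# Read backup timestamps (one per line, ISO 8601 local time such as
# 2011-10-06T19:49:00) and print the most recent one that is not later
# than TARGET. ISO 8601 strings of equal length sort chronologically,
# so a plain string comparison is enough. Exits non-zero if no backup
# is early enough.
pick_base_backup() {
    target=$1
    LC_ALL=C sort | LC_ALL=C awk -v target="$target" '
        $0 <= target { best = $0 }
        END { if (best != "") print best; else exit 1 }'
}
```

Fed the three backup times from the list above (01:00, 19:49, and 20:06 on 6 October) with a target of 2011-10-06T20:00:00, it selects the 19:49 backup.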

The rest of this post shows, with a number of screenshots, some of the new features of SAS 9.3 metadata backups. It also illustrates the process of running the test discussed above via SAS Management Console 9.3.

Exploring SAS 9.3 Metadata Backups

The SAS Management Console 9.3 Metadata Management plug-in provides access to a historical list of metadata backups under the Metadata Utilities/Server Backup plug-in folder. The screenshot below shows a few of the available scheduled nightly 1am backups (the ones with the green tick icons) and a couple of expired backups (with the yellow warning icons).

The table in the screenshot above shows basic summary information for the available backups. More detailed information can be found in the properties dialog for each backup. The properties dialog has a General tab with general information like status, date/time, location, size etc:

… and a Repositories tab showing which metadata repositories are included in the backup:

Metadata backups can be configured by right mouse clicking over the Server Backup plug-in folder and selecting one of the items from the context menu:

Selecting the “Backup Schedule…” context menu item displays a dialog where the backup schedule can be modified. The screenshot below shows the default schedule for backups: nightly backups at 1am every day of the week with a reorganization (reclaiming space for logically deleted metadata) on Mondays (i.e. 1am Monday morning or Sunday night depending on how you think about it).

Selecting the “Backup Configuration…” context menu item displays a dialog (shown below) where you can specify where you want the backups to be stored and how many days of backup history to keep. The defaults (unless modified during installation) are to keep 7 days of backups in the Backups folder underneath the main MetadataServer configuration folder. Depending on how your metadata server has been configured, this might potentially be on the same storage device/volume as your metadata repositories. If so, it would be advisable to move them to a different device/volume to improve your chances of being able to restore from backup if you happen to lose the storage volume the repositories are on. In terms of backup history, I personally feel 7 days is a bit short. If space allows, I prefer to keep around 1-3 months of backups. Usually I compress metadata backups to allow for more history. The option to automatically compress (zip, tar/gz, tar/bzip2 etc) backups during backup, and then expand them during restore, would be a useful enhancement I think. Whilst you probably wouldn’t want to do a full restore from more than a day or two ago, I have on occasion needed to export a metadata package from a 3 week old backup to import/restore partial content. I like to keep a basic Lev9 deployment available for such things. I could imagine another potential enhancement where you might use the Metadata Manager to export an SPK package of metadata as at a historical point in time (derived from a temporary roll-forward recovery) to immediately import (removing the need to use a Lev9 environment or equivalent).

The “Recover from Alternate Location…” context menu item displays a dialog (shown below) where you can specify the location of an unlisted backup to restore from (a backup which is not in the list of historical backups). I haven’t used this yet but I imagine it would be useful in the situation of needing to do a temporary restore into a Lev9 private admin environment for the purposes of exporting a package of metadata for a partial restore/import. I notice there’s no browse button so it looks like a job for copy/paste.

The “Run Backup Now…” context menu item is used to perform a forced ad-hoc backup between normal scheduled backups. You get to provide a comment/reason for the backup which will be logged in the backup history. You can also elect to reorganize the repositories at the same time if you want, reclaiming any space taken up by logically deleted metadata. Bear in mind that reorganizing does involve a read-only pause of the metadata server.

You would use this “Run Backup Now…” action to force an immediate backup or snapshot when required, such as just before doing something potentially dangerous, where you want the option to easily restore. In earlier versions of SAS this was required in order to have the ability to restore. With SAS 9.3 it doesn’t sound like we need to be so strict with this practice, since we now have the option for roll-forward recovery if we hadn’t performed a forced backup (but we did keep track of the time of the significant event). I think I’ll stick with performing explicit ad-hoc backups – if only for recording comments in the backup history log. SAS 9.3 metadata backups don’t need to pause the metadata server, unless reorganizing, so we needn’t be worried about any small windows of unavailability for our users either – all the more reason to back up.

SAS 9.3 Roll-Forward Metadata Recovery Test

This is where I started my testing. At 19:49 just before I was about to do a test of create, delete, and roll-forward restore, I used “Run Backup Now…” to force an ad-hoc backup. This wasn’t really required. I could have done the testing based on the previous night’s backup but with a longer roll-forward. I just wanted to surround the event with a couple of ad-hoc backups to confirm that they wouldn’t have any negative impact on my ability to do the recovery.

After the ad-hoc backup was completed, an information dialog popped up to let me know:

… and the backup was recorded in the backup history list.

I then set about creating some content that I could ‘accidentally’ delete and then attempt to restore. I recorded the times for the various events so I could use them to determine how much roll-forward would be required. I wanted to restore to the point after the creation but before the ‘accidental’ deletion.

At 19:58 I created my test content. This was simply a folder named “Test Restore” containing a stored process named “Test SP” as shown in the SAS Management Console 9.3 Folders tab below:

At 20:02 I then ‘accidentally’ deleted the “Test Restore” folder and all of its contents. 8-O

At 20:06 just after the ‘accidental’ deletion I did another ad-hoc backup. Once again, this wasn’t required. As mentioned before, I just wanted to surround the ‘disaster’ with a couple of ad-hoc backups to confirm that they wouldn’t have any negative impact on my ability to do the recovery.

The two backups immediately before, and after, the creation/deletion can be seen recorded in the backup history list in the screenshot below:

Now it was time to attempt the recovery.

The scenario as described would be unrecoverable if it were SAS 9.1.3 or SAS 9.2. There is no snapshot backup in between the creation and the deletion of the content. None of the available snapshot backups I had, when restored on their own (without roll-forward), would have the deleted content in them. However, by default, SAS 9.3 keeps a log of all metadata transactions since the most recent prior backup in a journal file (unless you configure it otherwise). This allows the transactions to be replayed, or rolled-forward, to any point in time.
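The idea of snapshot plus journal replay can be shown with a toy model. To be clear, this is not how the SAS metadata server stores or replays its journal; the file formats (one object name per line in the snapshot, and "TIMESTAMP add|delete NAME" lines in the journal, with no spaces in names) and the function name roll_forward are invented purely to illustrate the concept.

```shell
# Toy roll-forward: start from a snapshot (one object name per line) and
# replay journal entries of the form "TIMESTAMP add|delete NAME" whose
# timestamp is not later than TARGET. Prints the surviving object names
# in sorted order. The file formats are invented for illustration only.
roll_forward() {
    snapshot=$1
    journal=$2
    target=$3
    LC_ALL=C awk -v target="$target" '
        FILENAME == ARGV[1] { live[$0] = 1; next }   # load the snapshot
        $1 > target         { next }                 # beyond the cut-off
        $2 == "add"         { live[$3] = 1 }
        $2 == "delete"      { delete live[$3] }
        END { for (name in live) print name }
    ' "$snapshot" "$journal" | LC_ALL=C sort
}
```

In this model, a journal that records a folder being added at 19:58 and deleted at 20:02 yields the folder when replayed to 20:00, and omits it when replayed to any time from 20:02 onwards. Restoring a snapshot on its own can never produce the folder, because no snapshot was taken while it existed.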

To perform a roll-forward recovery in my test scenario, I just needed to choose a point in time where the content still existed, somewhere between the creation at 19:58 and the deletion at 20:02. Half way in between at 20:00 seemed like a reasonable choice. I could recover the deleted content by restoring the most recent backup prior to 20:00 (the 19:49 one) and then roll-forward the transactions up to 20:00.

As shown in the screenshot below I right mouse clicked on the 19:49 backup in the backup history list to select the “Recover from This Backup…” context menu item:

The initial “Recover from This Backup” dialog was then displayed:

A few changes were made to this dialog to control how much recovery was required. I provided some comments, ticked the “Roll forward transactions” checkbox and provided the end time of 2011-10-06T20:00:00 as a local time (Brisbane is GMT+10).

At 20:22 I clicked the OK button to start the restore process. A progress dialog was displayed:

… and when the restore was finished an information dialog popped up:

After the restore I checked the backup history list. I was a little surprised when I didn’t see a Restore entry type, but instead a Backup entry type.

After thinking about it for a little while, it kind of makes sense that a restore should involve a new backup and new journal from the point of restore.

Checking in the SAS Management Console 9.3 Folders tab I confirmed that the ‘accidentally’ deleted content had indeed been restored:

I even ran the stored process to double check it still ran successfully.

SAS 9.3 Metadata Backup Storage

It is interesting to see how the backups are stored too. The screenshot below shows the directory structure for the Metadata Server/Backups folder:

Each metadata backup directory is named using the date and time of the backup. It contains a snapshot backup of the metadata repositories, repository manager, and key configuration files, at the time the backup was taken. It also contains a journal file containing a log of the metadata transactions that have occurred since the backup snapshot was taken. This journal file continues to be used and grow until the time of the next backup, when the journal is closed and retained, and a new journal file is started in the new backup directory. The current journal in use will be the one in the most recent backup directory.

Last Thoughts

As I’m sure you’ve worked out by now, I’m most impressed with these enhancements to the SAS metadata backup and restore process in SAS 9.3. Whilst we hope to never need to use metadata backups, they can be business critical when they are required, and flexibility is paramount in limiting metadata loss in those rare situations. It’s very comforting to know that I should be able to recover from a much wider set of recovery scenarios with SAS 9.3, including a very fine level of control over the recovery time point. It really feels like we now have a metadata time machine that allows recovery with the absolute minimum loss of metadata changes right up to the disaster point itself.