Wednesday, March 10, 2010

OCFS2 1.2 - FREQUENTLY ASKED QUESTIONS

CONTENTS



General

Download and Install

Configure

O2CB Cluster Service

Format

Resize

Mount

Oracle RAC

Migrate Data from OCFS (Release 1) to OCFS2

Coreutils

Exporting via NFS

Troubleshooting

Limits

System Files

Heartbeat

Quorum and Fencing

Novell's SLES9 and SLES10

Release 1.2

Upgrade to the Latest Release

Processes

Build RPMs for Hotfix Kernels

Backup Super block

Configuring Cluster Timeouts

Enterprise Linux 5



GENERAL



How do I get started?



Download and install the module and tools rpms.

Create cluster.conf and propagate to all nodes.

Configure and start the O2CB cluster service.

Format the volume.

Mount the volume.

How do I know the version number running?



# cat /proc/fs/ocfs2/version

OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)



How do I configure my system to auto-reboot after a panic?

To auto-reboot system 60 secs after a panic, do:

# echo 60 > /proc/sys/kernel/panic



To enable the above on every reboot, add the following to /etc/sysctl.conf:

kernel.panic = 60



DOWNLOAD AND INSTALL



Where do I get the packages from?

For Oracle Enterprise Linux 4 and 5, use the up2date command as follows:

# up2date --install ocfs2-tools ocfs2console

# up2date --install ocfs2-`uname -r`



For Novell's SLES9, use yast to upgrade to the latest SP3 kernel to get the required modules installed. Also, install the ocfs2-tools and ocfs2console packages.

For Novell's SLES10, install ocfs2-tools and ocfs2console packages. For Red Hat's RHEL4 and RHEL5, download and install the appropriate module package and the two tools packages, ocfs2-tools and ocfs2console. Appropriate module refers to one matching the kernel version, flavor and architecture. Flavor refers to smp, hugemem, etc.



What are the latest versions of the OCFS2 packages?

The latest module package version is 1.2.9-1 for both Enterprise Linux 4 and 5.

The latest tools/console package version is 1.2.7-1 for both Enterprise Linux 4 and 5.



How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?

The package name is comprised of multiple parts separated by '-'.



ocfs2 - Package name

2.6.9-22.0.1.ELsmp - Kernel version and flavor

1.2.1 - Package version

1 - Package subversion

i686 - Architecture

How do I know which package to install on my box?

After one identifies the package name and version to install, one still needs to determine the kernel version, flavor and architecture.

To know the kernel version and flavor, do:

# uname -r

2.6.9-22.0.1.ELsmp



To know the architecture, do:

# rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"

i686



Why can't I use uname -p to determine the kernel architecture?

uname -p does not always provide the exact kernel architecture. A case in point is the RHEL3 kernels on x86_64. Even though Red Hat has two different kernel architectures available for this port, ia32e and x86_64, uname -p identifies both as the generic x86_64.



How do I install the rpms?

First install the tools and console packages:

# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm



Then install the appropriate kernel module package:

# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm



Do I need to install the console?

No, the console is not required but recommended for ease-of-use.



What are the dependencies for installing ocfs2console?

ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or later, pygtk2 (RHEL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3 or later and ocfs2-tools.
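
A quick way to check what is already installed is to query the packages directly (the package names are indicative and may differ slightly across distributions):

# rpm -q e2fsprogs glib2 vte pygtk2 python ocfs2-tools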



What modules are installed with the OCFS2 1.2 package?



ocfs2.ko

ocfs2_dlm.ko

ocfs2_dlmfs.ko

ocfs2_nodemanager.ko

configfs.ko (only Enterprise Linux 4)

debugfs.ko (only Enterprise Linux 4)



The kernel shipped along with Enterprise Linux 5 includes configfs.ko and debugfs.ko.



What tools are installed with the ocfs2-tools 1.2 package?



mkfs.ocfs2

fsck.ocfs2

tunefs.ocfs2

debugfs.ocfs2

mount.ocfs2

mounted.ocfs2

ocfs2cdsl

ocfs2_hb_ctl

o2cb_ctl

o2cb - init service to start/stop the cluster

ocfs2 - init service to mount/umount ocfs2 volumes

ocfs2console - installed with the console package

What is debugfs and is it related to debugfs.ocfs2?

debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It is useful for debugging as it allows kernel space to easily export data to userspace. It is currently being used by OCFS2 to dump the list of filesystem locks and could be used for more in the future. It is bundled with OCFS2 as the various distributions are currently not bundling it. While debugfs and debugfs.ocfs2 are unrelated in general, the latter is used as the front-end for the debugging info provided by the former. For example, refer to the troubleshooting section.

CONFIGURE



How do I populate /etc/ocfs2/cluster.conf?

If you have installed the console, use it to create this configuration file. For details, refer to the user's guide. If you do not have the console installed, check the Appendix in the User's guide for a sample cluster.conf and the details of all the components. Do not forget to copy this file to all the nodes in the cluster. If you ever edit this file on any node, ensure the other nodes are updated as well.
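
For reference, a minimal hand-edited cluster.conf for a two-node cluster looks roughly as follows (the node names, IP addresses and cluster name are illustrative; refer to the User's guide for the authoritative sample):

cluster:
        node_count = 2
        name = mycluster

node:
        ip_port = 7777
        ip_address = 192.168.1.101
        number = 0
        name = node1
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 192.168.1.102
        number = 1
        name = node2
        cluster = mycluster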



Should the IP interconnect be public or private?

Using a private interconnect is recommended. While OCFS2 does not take much bandwidth, it does require the nodes to be alive on the network and sends regular keepalive packets to ensure that they are. To avoid a network delay being interpreted as a node disappearing on the network, which could lead to node self-fencing, a private interconnect is recommended. One could use the same interconnect for Oracle RAC and OCFS2.



What should the node name be and should it be related to the IP address?

The node name needs to match the hostname. The IP address need not be the one associated with that hostname. As in, any valid IP address on that node can be used. OCFS2 will not attempt to match the node name (hostname) with the specified IP address.



How do I modify the IP address, port or any other information specified in cluster.conf?

While one can use ocfs2console to add nodes dynamically to a running cluster, any other modifications require the cluster to be offlined. Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one and copy to the rest, and restart the cluster on all nodes. Always ensure that cluster.conf is the same on all the nodes in the cluster.



How do I add a new node to an online cluster?

You can use the console to add a new node. However, you will need to explicitly add the new node on all the online nodes. That is, adding on one node and propagating to the other nodes is not sufficient. If the operation fails, it will most likely be due to bug#741. In that case, you can use the o2cb_ctl utility on all online nodes as follows:

# o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME



Ensure the node is added both in /etc/ocfs2/cluster.conf and in /config/cluster/CLUSTERNAME/node on all online nodes. You can then simply copy the cluster.conf to the new (still offline) node as well as other offline nodes. At the end, ensure that cluster.conf is consistent on all the nodes.
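
To confirm that the online nodes have picked up the new node, one could list the configfs directory (assuming configfs is mounted at /config on EL4; on EL5 it is /sys/kernel/config):

# ls /config/cluster/CLUSTERNAME/node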

How do I add a new node to an offline cluster?

You can either use the console or use o2cb_ctl or simply hand edit cluster.conf. Then either use the console to propagate it to all nodes or hand copy using scp or any other tool. The o2cb_ctl command to do the same is:

# o2cb_ctl -C -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME



Notice the "-i" argument is not required as the cluster is not online.

O2CB CLUSTER SERVICE



How do I configure the cluster service?



# /etc/init.d/o2cb configure



Enter 'y' if you want the service to load on boot, the name of the cluster (as listed in /etc/ocfs2/cluster.conf) and the cluster timeouts.



How do I start the cluster service?



To load the modules, do:

# /etc/init.d/o2cb load



To Online it, do:

# /etc/init.d/o2cb online [cluster_name]



If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb start [cluster_name]



The cluster name is not required if you have specified the name during configuration.



How do I stop the cluster service?



To offline it, do:

# /etc/init.d/o2cb offline [cluster_name]



To unload the modules, do:

# /etc/init.d/o2cb unload



If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb stop [cluster_name]



The cluster name is not required if you have specified the name during configuration.



How can I learn the status of the cluster?

To learn the status of the cluster, do:

# /etc/init.d/o2cb status



I am unable to get the cluster online. What could be wrong?

Check whether the node name in the cluster.conf exactly matches the hostname. Also, the node itself needs to be one of the nodes listed in cluster.conf for the cluster to come online on that node.



FORMAT



Should I partition a disk before formatting?

Yes, partitioning is recommended even if one is planning to use the entire disk for ocfs2. Apart from the fact that partitioned disks are less likely to be "reused" by mistake, some features like mount-by-label only work with partitioned volumes.

Use fdisk or parted or any other tool for the task.



How do I format a volume?

You could either use the console or use mkfs.ocfs2 directly to format the volume. For console, refer to the user's guide.

# mkfs.ocfs2 -L "oracle_home" /dev/sdX



The above formats the volume with default block and cluster sizes, which are computed based upon the size of the volume.

# mkfs.ocfs2 -b 4k -C 4k -L "oracle_home" -N 8 /dev/sdX



The above formats the volume for 8 nodes with a 4K block size and a 4K cluster size.



What does the number of node slots during format refer to?

The number of node slots specifies the number of nodes that can concurrently mount the volume. This number is specified during format and can be increased using tunefs.ocfs2. This number cannot be decreased.
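
For example, assuming the volume is umounted on all nodes, one could increase the number of node slots to 8 as follows:

# tunefs.ocfs2 -N 8 /dev/sdX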



What should I consider when determining the number of node slots?

OCFS2 allocates system files, like the journal, for each node slot. So as not to waste space, one should specify a number in the ballpark of the actual number of nodes. Also, as this number can be increased later, there is no need to specify a number much larger than the number of nodes one expects to mount the volume on.



Does the number of node slots have to be the same for all volumes?

No. This number can be different for each volume.



What block size should I use?

A block size is the smallest unit of space addressable by the file system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The block size cannot be changed after the format. For most volume sizes, a 4K size is recommended. On the other hand, a 512-byte block size is never recommended.



What cluster size should I use?

A cluster size is the smallest unit of space allocated to a file to hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. For Oracle home, use 4K cluster size. For database volumes, use any value equal to or larger than the database blocksize. This ensures that the entire Oracle data block will be contiguous on disk. Earlier, we used to recommend 128K cluster size for the database volumes. The only problem with that value was that it could lead to space wastage if the volume was used to store many small files. The new recommendation gives a hard lower limit and allows users to pick any larger value.



Any advantage of labelling the volumes?

As the disk name (/dev/sdX) for a particular device can be different on different nodes in a shared disk environment, labelling becomes a must for easy identification. You could also use labels to identify volumes during mount.

# mount -L "label" /dir



The volume label is changeable using the tunefs.ocfs2 utility.
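
For example, assuming the volume is umounted on all nodes, one could relabel it as follows:

# tunefs.ocfs2 -L "new_label" /dev/sdX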



RESIZE



Can OCFS2 file systems be grown in size?

Yes, you can grow an OCFS2 file system using tunefs.ocfs2. It should be noted that the tool will only resize the file system and not the underlying partition. You can use fdisk(8) (or any appropriate tool for your disk array) to resize the partition.



What do I need to know to use fdisk(8) to resize the partition?

To grow a partition using fdisk(8), you will have to delete it and recreate it with a larger size. When recreating it, ensure you specify the same starting disk cylinder as before and an ending disk cylinder that is greater than the existing one. Otherwise, not only will the resize operation fail, but you may lose your entire file system. Backup your data before performing this task.



Short of reboot, how do I get the other nodes in the cluster to see the resized partition?

Use blockdev(8) to rescan the partition table of the device on the other nodes in the cluster.

# blockdev --rereadpt /dev/sdX



What is the tunefs.ocfs2 syntax for resizing the file system?

To grow a file system to the end of the resized partition, do:

# tunefs.ocfs2 -S /dev/sdX



For more, refer to the tunefs.ocfs2 manpage.



Can the OCFS2 file system be grown while the file system is in use?

No. tunefs.ocfs2 1.2.2 only allows offline resize, that is, the file system cannot be mounted on any node in the cluster. Online resize capability will be added in a later release.



Can the OCFS2 file system be shrunk in size?

No. We have no current plans on providing this functionality. However, if you find this feature useful, file an enhancement request on bugzilla listing your reasons for the same.



MOUNT



How do I mount the volume?

You could either use the console or use mount directly. For console, refer to the user's guide.

# mount -t ocfs2 /dev/sdX /dir



The above command will mount device /dev/sdX on directory /dir.



How do I mount by label?

To mount by label do:

# mount -L "label" /dir



What entry do I add to /etc/fstab to mount an ocfs2 volume?

Add the following:

/dev/sdX /dir ocfs2 _netdev 0 0



The _netdev option indicates that the device needs to be mounted after the network is up.



What do I need to do to mount OCFS2 volumes on boot?



Enable o2cb service using:

# chkconfig --add o2cb



Enable ocfs2 service using:

# chkconfig --add ocfs2



Configure o2cb to load on boot using:

# /etc/init.d/o2cb configure



Add entries into /etc/fstab as follows:

/dev/sdX /dir ocfs2 _netdev 0 0



How do I know my volume is mounted?



Enter mount without arguments, or,

# mount



List /etc/mtab, or,

# cat /etc/mtab



List /proc/mounts, or,

# cat /proc/mounts



Run ocfs2 service.

# /etc/init.d/ocfs2 status



The mount command reads /etc/mtab to show this information.



What are the /config and /dlm mountpoints for?

OCFS2 comes bundled with two in-memory filesystems configfs and ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the in-kernel node manager the list of nodes in the cluster and to the in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is used by ocfs2 tools to communicate with the in-kernel dlm to take and release clusterwide locks on resources.



Why does it take so much time to mount the volume?

It takes around 5 secs for a volume to mount. It does so to let the heartbeat thread stabilize. In a later release, we plan to add support for a global heartbeat, which will make most mounts instant.



Why does it take so much time to umount the volume?

During umount, the dlm has to migrate all the mastered lockres' to another node in the cluster. In 1.2, the lockres migration is a synchronous operation. We are looking into making it asynchronous so as to reduce the time it takes to migrate the lockres'. (While we have improved this performance in 1.2.5, the task of asynchronously migrating lockres' has been pushed to the 1.4 time frame.) To find the number of lockres in all dlm domains, do:

# cat /proc/fs/ocfs2_dlm/*/stat

local=60624, remote=1, unknown=0, key=0x8619a8da



local refers to locally mastered lockres'.



ORACLE RAC



Any special flags to run Oracle RAC?

OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry (OCR), Data files, Redo logs, Archive logs and Control files must be mounted with the datavolume and nointr mount options. The datavolume option ensures that the Oracle processes open these files with the o_direct flag. The nointr option ensures that the I/Os are not interrupted by signals.

# mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db



What about the volume containing Oracle home?

Oracle home volume should be mounted normally, that is, without the datavolume and nointr mount options. These mount options are only relevant for Oracle files listed above.

# mount -t ocfs2 /dev/sdb1 /software/orahome



Also, as OCFS2 does not currently support shared writeable mmap, the health check (GIMH) file $ORACLE_HOME/dbs/hc_ORACLESID.dat and the ASM file $ASM_HOME/dbs/ab_ORACLESID.dat should be symlinked to the local filesystem. We expect to support shared writeable mmap in the OCFS2 1.4 release.
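
A possible way to relocate the health check file (the local directory is illustrative and $ORACLE_SID stands for the instance name):

# mv $ORACLE_HOME/dbs/hc_$ORACLE_SID.dat /var/opt/oracle/hc_$ORACLE_SID.dat

# ln -s /var/opt/oracle/hc_$ORACLE_SID.dat $ORACLE_HOME/dbs/hc_$ORACLE_SID.dat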

Does that mean I cannot have my data file and Oracle home on the same volume?

Yes. The volume containing the Oracle data files, redo-logs, etc. should never be on the same volume as the distribution (including the trace logs like, alert.log).



Any other information I should be aware of?

The 1.2.3 release of OCFS2 does not update the modification time on the inode across the cluster for non-extending writes. However, the time will be locally updated in the cached inodes. This leads to one observing different times (ls -l) for the same file on different nodes on the cluster.

While this does not affect most uses of the filesystem, as one invariably changes the file size during a write, the one usage where this is most commonly experienced is with Oracle datafiles and redologs. This is because Oracle rarely resizes these files and thus almost all writes are non-extending.

In OCFS2 1.4, we intend to fix this by updating modification times for all writes while providing an opt-out mount option (nocmtime) for users who would prefer to avoid the performance overhead associated with this feature.



MIGRATE DATA FROM OCFS (RELEASE 1) TO OCFS2



Can I mount OCFS volumes as OCFS2?

No. OCFS and OCFS2 are not on-disk compatible. We had to break the compatibility in order to add many of the new features. At the same time, we have added enough flexibility in the new disk layout so as to maintain backward compatibility in the future.



Can OCFS volumes and OCFS2 volumes be mounted on the same machine simultaneously?

No. OCFS only works on 2.4 linux kernels (Red Hat's AS2.1/EL3 and SuSE's SLES8). OCFS2, on the other hand, only works on the 2.6 kernels (RHEL4, SLES9 and SLES10).



Can I access my OCFS volume on 2.6 kernels (SLES9/SLES10/RHEL4)?

Yes, you can access the OCFS volume on 2.6 kernels using FSCat tools, fsls and fscp. These tools can access the OCFS volumes at the device layer, to list and copy the files to another filesystem. FSCat tools are available on oss.oracle.com.



Can I in-place convert my OCFS volume to OCFS2?

No. The on-disk layouts of OCFS and OCFS2 are sufficiently different that it would require a third disk (as a temporary buffer) in order to in-place upgrade the volume. With that in mind, it was decided not to develop such a tool but instead provide tools to copy data from OCFS without one having to mount it.



What is the quickest way to move data from OCFS to OCFS2?

Quickest would mean having to perform the minimal number of copies. If you have a current backup on a non-OCFS volume accessible from the 2.6 kernel install, then all you would need to do is to restore the backup on the OCFS2 volume(s). If you do not have a backup but have a setup in which the system containing the OCFS2 volumes can access the disks containing the OCFS volume, you can use the FSCat tools to extract data from the OCFS volume and copy it onto OCFS2.



COREUTILS



Like with OCFS (Release 1), do I need to use o_direct enabled tools to perform cp, mv, tar, etc.?

No. OCFS2 does not need the o_direct enabled tools. The file system allows processes to open files in both o_direct and buffered mode concurrently.



EXPORTING VIA NFS



Can I export an OCFS2 file system via NFS?

Yes, you can export files on OCFS2 via the standard Linux NFS server. Please note that only NFS version 3 and above will work. In practice, this means clients need to be running a 2.4.x kernel or above.



Is there no solution for the NFS v2 clients?

NFS v2 clients can work if the server exports the volumes with the no_subtree_check option. However, this has some security implications that are documented in the exports manpage.
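
An illustrative /etc/exports entry with this option (the path and network are placeholders):

/u01/ocfs2   192.168.1.0/24(rw,no_subtree_check)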



TROUBLESHOOTING



How do I enable and disable filesystem tracing?

To list all the debug bits along with their statuses, do:

# debugfs.ocfs2 -l



To enable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER allow



To disable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER off



To totally turn off tracing the SUPER bit, as in, turn off tracing even if some other bit is enabled for the same, do:

# debugfs.ocfs2 -l SUPER deny



To enable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow



To disable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny



How do I get a list of filesystem locks and their statuses?

OCFS2 1.0.9+ has this feature. To get this list, do:

Mount debugfs at /debug (EL4) or /sys/kernel/debug (EL5).

# mount -t debugfs debugfs /debug

- OR -

# mount -t debugfs debugfs /sys/kernel/debug



Dump the locks.

# echo "fs_locks"
debugfs.ocfs2 /dev/sdX >/tmp/fslocks



How do I read the fs_locks output?

Let's look at a sample output:

Lockres: M000000000000000006672078b84822 Mode: Protected Read

Flags: Initialized Attached

RO Holders: 0 EX Holders: 0

Pending Action: None Pending Unlock Action: None

Requested Mode: Protected Read Blocking Mode: Invalid



First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.

To get the inode number and generation from lockname, do:

#echo "stat "
debugfs.ocfs2 -n /dev/sdX

Inode: 419616 Mode: 0666 Generation: 2025343010 (0x78b84822)

....



To map the lockname to a directory entry, do:

# echo "locate "
debugfs.ocfs2 -n /dev/sdX

419616 /linux-2.6.15/arch/i386/kernel/semaphore.c



One could also provide the inode number instead of the lockname.

# echo "locate <419616>"
debugfs.ocfs2 -n /dev/sdX

419616 /linux-2.6.15/arch/i386/kernel/semaphore.c



To get a lockname from a directory entry, do:

# echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c"
debugfs.ocfs2 -n /dev/sdX

M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822



The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.



The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.



If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.



The next step would be to query the dlm for the lock resource.



Note: The dlm debugging is still a work in progress.



To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

# echo "stats"
debugfs.ocfs2 -n /dev/sdX
grep UUID:
while read a b ; do echo $b ; done

82DA8137A49A47E4B187F74E09FBBB4B



Then do:

# echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug



For example:

# echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug

# dmesg | tail

struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985

lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no

granted queue:

type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)

converting queue:

blocked queue:



It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.



This is just to give a flavor of dlm debugging.



LIMITS



Is there a limit to the number of subdirectories in a directory?

Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).



Is there a limit to the size of an ocfs2 file system?

Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.



SYSTEM FILES



What are system files?

System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:



# echo "ls -l //"
debugfs.ocfs2 -n /dev/sdX

18 16 1 2 .

18 16 2 2 ..

19 24 10 1 bad_blocks

20 32 18 1 global_inode_alloc

21 20 8 1 slot_map

22 24 9 1 heartbeat

23 28 13 1 global_bitmap

24 28 15 2 orphan_dir:0000

25 32 17 1 extent_alloc:0000

26 28 16 1 inode_alloc:0000

27 24 12 1 journal:0000

28 28 16 1 local_alloc:0000

29 3796 17 1 truncate_log:0000



The first column lists the block number.



Why do some files have numbers at the end?

There are two types of files, global and local. Global files are for all the nodes, while local files, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The number at the end of the system file name is the slot#. To list the slot maps, do:



# echo "slotmap"
debugfs.ocfs2 -n /dev/sdX

Slot# Node#

0 39

1 40

2 41

3 42



HEARTBEAT



How does the disk heartbeat work?

Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.



When is a node deemed dead?

An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=31) loops. Once a node is deemed dead, the surviving node which manages to cluster lock the dead node's journal, recovers it by replaying the journal.



What about self fencing?

A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should, it first cancels that timer before setting up a new one. This way it ensures the system will self fence if for some reason the [o2hb-x] kernel thread is unable to update the timestamp and thus be deemed dead by other nodes in the cluster.



How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?

This parameter value could be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
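
For example, to allow for a 120 sec timeout, add the following line to /etc/sysconfig/o2cb on every node and restart the cluster:

O2CB_HEARTBEAT_THRESHOLD=61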



What should one set O2CB_HEARTBEAT_THRESHOLD to?

It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.



O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)



How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?



# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

7



What if a node umounts a volume?

During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.



I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?

We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2 as we expect the hb thread to be reading from and writing to the hb area at least once every 12 secs (default). This bug has been addressed by Red Hat in RHEL4 U4 (2.6.9-42.EL) and Novell in SLES9 SP3 (2.6.5-7.257). If you wish to use the DEADLINE io scheduler, you could do so by appending "elevator=deadline" to the kernel command line as follows:





For SLES9, edit the command line in /boot/grub/menu.lst.

title Linux 2.6.5-7.244-bigsmp (with deadline)

kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5

vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off

initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp



For RHEL4, edit the command line in /boot/grub/grub.conf:

title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)

root (hd0,0)

kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off

initrd /initrd-2.6.9-22.EL.img



To see the current kernel command line, do:

# cat /proc/cmdline



QUORUM AND FENCING



What is a quorum?

A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.



How does OCFS2's cluster service define a quorum?

The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.

A node has quorum when:



it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.

OR,



it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.



What is fencing?

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.

Due to user reports of nodes hanging during fencing, OCFS2 1.2.5 no longer uses "panic" for fencing. Instead, by default, it uses "machine restart". This should not only prevent nodes from hanging during fencing but also allow for nodes to quickly restart and rejoin the cluster. While this change is internal in nature, we are documenting this so as to make users aware that they are no longer going to see the familiar panic stack trace during fencing. Instead they will see the message "*** ocfs2 is very sorry to be fencing this system by restarting ***" and that too probably only as part of the messages captured on the netdump/netconsole server.

If perchance the user wishes to use panic to fence (maybe to see the familiar oops stack trace or on the advice of customer support to diagnose frequent reboots), one can do so by issuing the following command after the O2CB cluster is online.

# echo 1 > /proc/fs/ocfs2_nodemanager/fence_method



Please note that this change is local to a node.

How does a node decide that it has connectivity with another?

When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for O2CB_IDLE_TIMEOUT_MS. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.



How long does the quorum process take?

First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of O2CB_IDLE_TIMEOUT_MS idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 31 iterations of 2 seconds results in waiting for 33 iterations or 66 seconds. By default, then, a maximum of 96 seconds can pass from the time a network fault occurs until a node fences itself.
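
Using the current defaults, that works out to O2CB_IDLE_TIMEOUT_MS (30 secs) + (31 + 2) heartbeat iterations * 2 secs, that is, 30 + 66 = 96 secs.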



How can one prevent a node from panicking when one shuts down the other node in a 2-node cluster?

This typically means that the network is shutting down before all the OCFS2 volumes are umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shutdown. To check whether the service is enabled, do:

# chkconfig --list ocfs2

ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off



How does one list out the startup and shutdown ordering of the OCFS2 related services?



To list the startup order for runlevel 3 on RHEL4, do:

# cd /etc/rc3.d

# ls S*ocfs2* S*o2cb* S*network*

S10network S24o2cb S25ocfs2



To list the shutdown order on RHEL4, do:

# cd /etc/rc6.d

# ls K*ocfs2* K*o2cb* K*network*

K19ocfs2 K20o2cb K90network



To list the startup order for runlevel 3 on SLES9/SLES10, do:

# cd /etc/init.d/rc3.d

# ls S*ocfs2* S*o2cb* S*network*

S05network S07o2cb S08ocfs2



To list the shutdown order on SLES9/SLES10, do:

# cd /etc/init.d/rc3.d

# ls K*ocfs2* K*o2cb* K*network*

K14ocfs2 K15o2cb K17network



Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any other shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.



NOVELL'S SLES9 and SLES10



Why are OCFS2 packages for SLES9 and SLES10 not made available on oss.oracle.com?

OCFS2 packages for SLES9 and SLES10 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.



What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?

As both Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes reduces. Novell is shipping two SLES9 releases, viz., SP2 and SP3.



The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.

The latest kernel with the SP3 release is 2.6.5-7.283. It ships with OCFS2 1.2.3. Please contact Novell to get the latest OCFS2 modules on SLES9 SP3.

What versions of OCFS2 are available with SLES10?

SLES10 is currently shipping OCFS2 1.2.3. SLES10 SP1 is currently shipping 1.2.5-1.

RELEASE 1.2



What is new in OCFS2 1.2?

OCFS2 1.2 has two new features:

It is endian-safe. With this release, one can mount the same volume concurrently on little-endian architectures x86, x86-64, ia64 and the big endian architecture ppc64.

It supports read-only mounts. The file system uses this feature to automatically remount read-only when it encounters an on-disk corruption (instead of panicking).

Do I need to re-make the volume when upgrading?

No. OCFS2 1.2 is fully on-disk compatible with 1.0.



Do I need to upgrade anything else?

Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2 nor will 1.2 tools work with 1.0 modules.



UPGRADE TO THE LATEST RELEASE



How do I upgrade to the latest release?



Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)





Umount all OCFS2 volumes.

# umount -at ocfs2



Shutdown the cluster and unload the modules.



# /etc/init.d/o2cb offline

# /etc/init.d/o2cb unload



If required, upgrade the tools and console.

# rpm -Uvh ocfs2-tools-1.2.2-1.i386.rpm ocfs2console-1.2.2-1.i386.rpm



Upgrade the module.

# rpm -Uvh ocfs2-2.6.9-42.0.3.ELsmp-1.2.4-2.i686.rpm



Ensure init services ocfs2 and o2cb are enabled.

# chkconfig --add o2cb

# chkconfig --add ocfs2



To check whether the services are enabled, do:

# chkconfig --list o2cb

o2cb 0:off 1:off 2:on 3:on 4:on 5:on 6:off

# chkconfig --list ocfs2

ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off



To update the cluster timeouts, do:

# /etc/init.d/o2cb configure



At this stage, one could either reboot the node or simply restart the cluster and mount the volume.



Can I do a rolling upgrade from 1.2.3 to 1.2.4?

No. The network protocol had to be updated in 1.2.4 to allow for proper reference counting of lockres' across the cluster. This fix was necessary to fix races encountered during lockres purge and migrate. Effectively, one cannot run 1.2.4 on one node while another node is still on an earlier release (1.2.3 or older).



Can I do a rolling upgrade from 1.2.4 to 1.2.5?

No. The network protocol had to be updated in 1.2.5 to ensure all nodes were using the same O2CB timeouts. Effectively, one cannot run 1.2.5 on one node while another node is still on an earlier release. (For the record, the protocol remained the same between 1.2.0 to 1.2.3 before changing in 1.2.4 and 1.2.5.)

Can I do a rolling upgrade from 1.2.6 to 1.2.7 on EL5?

Yes. The network protocol is fully compatible across both releases.



Can I do a rolling upgrade from 1.2.5 to 1.2.7 on EL4?

Yes. However, there is a catch. While the network protocol is fully compatible across the two releases, the default cluster timeouts are not. So if you were using the default timeouts, you will have to specifically set those timeouts on the new nodes using the service o2cb configure command. Use service o2cb status to review the current timeouts.

Users that are not careful with the above are likely to encounter failed mounts on the upgraded node. dmesg will indicate the differing timeout values.

Can I do a rolling upgrade from 1.2.7 to 1.2.8 or 1.2.9 on EL4 and EL5?

Yes. OCFS2 1.2.7, 1.2.8 and 1.2.9 are fully compatible. Users upgrading to 1.2.8/9 from 1.2.5/1.2.6 can expect the same behaviour as described above for upgrading to 1.2.7.

After upgrade I am getting the following error on mount "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".

Do "dmesg
tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value



it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
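
For example, assuming the module name is ocfs2, one could check the installed module version with:

# modinfo ocfs2 | grep -i version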



The cluster fails to load. What do I do?

Check "demsg
tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling

audit(1139964740.184:2): avc: denied { mount } for ...



The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. Change is activated on reboot.



PROCESSES



List and describe all OCFS2 threads?



[o2net]

One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.

[user_dlm]

One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.

[ocfs2_wq]

One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.

[o2hb-14C29A7392]

One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to other nodes that the node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and dlm of any changes in the nodemap.

[ocfs2vote-0]

One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.

[dlm_thread]

One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.

[dlm_reco_thread]

One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.

[dlm_wq]

One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.

[kjournald]

One per mount. Is used as OCFS2 uses JBD for journalling.

[ocfs2cmt-0]

One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.

[ocfs2rec-0]

Is started whenever another node needs to be recovered. This could be either on mount when it discovers a dirty journal or during operation when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.

BUILD RPMS FOR HOTFIX KERNELS



How to build OCFS2 packages for a hotfix kernel?



Download and install all the kernel-devel packages for the hotfix kernel.

Download and untar the OCFS2 source tarball.

# cd /tmp

# wget http://oss.oracle.com/projects/ocfs2/dist/files/source/v1.2/ocfs2-1.2.3.tar.gz

# tar -zxvf ocfs2-1.2.3.tar.gz

# cd ocfs2-1.2.3



Ensure rpmbuild is installed and ~/.rpmmacros contains the proper links.

# cat ~/.rpmmacros

%_topdir /home/jdoe/rpms

%_tmppath /home/jdoe/rpms/tmp

%_sourcedir /home/jdoe/rpms/SOURCES

%_specdir /home/jdoe/rpms/SPECS

%_srcrpmdir /home/jdoe/rpms/SRPMS

%_rpmdir /home/jdoe/rpms/RPMS

%_builddir /home/jdoe/rpms/BUILD



Ensure you have all kernel-*-devel packages installed for the kernel version you wish to build for. If so, then the following command will list it as a possible target.

# ./vendor/rhel4/kernel.guess targets

rhel4_2.6.9-67.EL_rpm

rhel4_2.6.9-67.0.1.EL_rpm

rhel4_2.6.9-55.0.12.EL_rpm



Configure and make.

# ./configure --with-kernel=/usr/src/kernels/2.6.9-67.EL-i686

# make rhel4_2.6.9-67.EL_rpm



The packages will be in %_rpmdir.

Are the self-built packages officially supported by Oracle Support?

No. Oracle Support does not provide support for self-built modules. If you wish official support, contact Oracle via Support or the ocfs2-users mailing list with the link to the hotfix kernel (kernel-devel and kernel-src rpms).



BACKUP SUPER BLOCK



What is a Backup Super block?

A backup super block is a copy of the super block. As the super block is typically located close to the start of the device, it is susceptible to being overwritten, say, by an errant write (dd if=file of=/dev/sdX). Moreover, as the super block stores critical information that is hard to recreate, it becomes important to back up the block and use it when the super block gets corrupted.



Where are the backup super blocks located?

In OCFS2, the super blocks are backed up to blocks at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The actual number of backups depends on the size of the device. It should be noted that the super block is not backed up on devices smaller than 1G.



How does one enable this feature?

mkfs.ocfs2 1.2.3 or later automatically backs up super blocks on devices larger than 1G. One can disable this by using the --no-backup-super option.
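
For example, to format a volume without the backup super blocks, one could do:

# mkfs.ocfs2 -L "myvolume" --no-backup-super /dev/sdX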



How do I detect whether the super blocks are backed up on a device?

# debugfs.ocfs2 -R "stats" /dev/sdX
grep "Feature Compat"

Feature Compat: 1 BackupSuper



How do I backup the super block on a device formatted by an older mkfs.ocfs2?

tunefs.ocfs2 1.2.3 or later can attempt to retroactively backup the super block.

# tunefs.ocfs2 --backup-super /dev/sdX

tunefs.ocfs2 1.2.3

Adding backup superblock for the volume

Proceed (y/N): y

Backed up Superblock.

Wrote Superblock



However, it is quite possible that one or more backup locations are in use by the file system. (tunefs.ocfs2 backs up the block only if all the backup locations are unused.)

# tunefs.ocfs2 --backup-super /dev/sdX

tunefs.ocfs2 1.2.3

tunefs.ocfs2: block 262144 is in use.

tunefs.ocfs2: block 4194304 is in use.

tunefs.ocfs2: Cannot enable backup superblock as backup blocks are in use



If so, use the verify_backup_super script to list out the objects using these blocks.

# ./verify_backup_super /dev/sdX

Locating inodes using blocks 262144 1048576 4194304 on device /dev/sdX

Block# Inode Block Offset

262144 27 65058

1048576 Unused

4194304 4161791 25

Matching inodes to object names

27 //journal:0003

4161791 /src/kernel/linux-2.6.19/drivers/scsi/BusLogic.c



If the object happens to be user created, move that object temporarily to another volume before re-attempting the operation. However, this will not work if one or more blocks are being used by a system file (shown starting with double slashes //), say, a journal.

How do I ask fsck.ocfs2 to use a backup super block?

To recover a volume using the second backup super block, do:

# fsck.ocfs2 -f -r 2 /dev/sdX

[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? n

Checking OCFS2 filesystem in /dev/sdX

label: myvolume

uuid: 4d 1d 1f f3 24 01 4d 3f 82 4c e2 67 0c b2 94 f3

number of blocks: 13107196

bytes per block: 4096

number of clusters: 13107196

bytes per cluster: 4096

max slots: 4



/dev/sdX was run with -f, check forced.

Pass 0a: Checking cluster allocation chains

Pass 0b: Checking inode allocation chains

Pass 0c: Checking extent block allocation chains

Pass 1: Checking inodes and blocks.

Pass 2: Checking directory entries.

Pass 3: Checking directory connectivity.

Pass 4a: checking for orphaned inodes

Pass 4b: Checking inodes link counts.

All passes succeeded.



For more, refer to the man pages.

CONFIGURING CLUSTER TIMEOUTS



List and describe all the configurable timeouts in the O2CB cluster stack?

OCFS2 1.2.5 has 4 different configurable O2CB cluster timeouts:

O2CB_HEARTBEAT_THRESHOLD - The Disk Heartbeat timeout is the number of two second iterations before a node is considered dead. The exact formula used to convert the timeout in seconds to the number of iterations is as follows:

O2CB_HEARTBEAT_THRESHOLD = (((timeout in seconds) / 2) + 1)



For example, to specify a 60 sec timeout, set it to 31. For 120 secs, set it to 61. The current default for this timeout is 60 secs (O2CB_HEARTBEAT_THRESHOLD = 31). In releases 1.2.5 and earlier, it was 12 secs (O2CB_HEARTBEAT_THRESHOLD = 7).

O2CB_IDLE_TIMEOUT_MS - The Network Idle timeout specifies the time in milliseconds before a network connection is considered dead. The current default for this timeout is 30000 ms. In releases 1.2.5 and earlier, it was 10000 ms.

O2CB_KEEPALIVE_DELAY_MS - The Network Keepalive specifies the maximum delay in milliseconds before a keepalive packet is sent. As in, a keepalive packet is sent if a network connection between two nodes is silent for this duration. If the other node is alive and connected, it is expected to respond. The current default for this timeout is 2000 ms. In releases 1.2.5 and earlier, it was 5000 ms.

O2CB_RECONNECT_DELAY_MS - The Network Reconnect specifies the minimum delay in milliseconds between connection attempts. The default has always been 2000 ms.

What are the recommended timeout values?

As timeout values depend on the hardware being used, there is no one set of recommended values. For example, users of multipath I/O should set the disk heartbeat threshold to at least 60 secs, if not 120 secs. Similarly, users of network bonding should set the network idle timeout to at least 30 secs, if not 60 secs.

What are the current defaults for the cluster timeouts?

The timeouts were updated in the 1.2.6 release to the following:

O2CB_HEARTBEAT_THRESHOLD = 31

O2CB_IDLE_TIMEOUT_MS = 30000

O2CB_KEEPALIVE_DELAY_MS = 2000

O2CB_RECONNECT_DELAY_MS = 2000







Can one change these timeout values in a round robin fashion?

No. The o2net handshake protocol ensures that all the timeout values for both the nodes are consistent and fails if any value differs. This failed connection results in a failed mount, the reason for which is always listed in dmesg.

How does one set these O2CB timeouts?

Umount all OCFS2 volumes and shutdown the O2CB cluster. If not already done, upgrade to OCFS2 1.2.5+ and Tools 1.2.4+. Then use o2cb configure to set the new values. Do the same on all nodes. Start mounting volumes only after the timeouts have been set on all nodes.

# service o2cb configure

Configuring the O2CB driver.



This will configure the on-boot properties of the O2CB driver.

The following questions will determine whether the driver is loaded on

boot. The current values will be shown in brackets ('[]'). Hitting <ENTER>

without typing an answer will keep that current value. Ctrl-C

will abort.



Load O2CB driver on boot (y/n) [n]: y

Cluster to start on boot (Enter "none" to clear) []: mycluster

Specify heartbeat dead threshold (>=7) [7]: 31

Specify network idle timeout in ms (>=5000) [10000]: 30000

Specify network keepalive delay in ms (>=1000) [5000]: 2000

Specify network reconnect delay in ms (>=2000) [2000]: 2000

Writing O2CB configuration: OK

Starting O2CB cluster mycluster: OK



How to find the O2CB timeout values in effect?

# /etc/init.d/o2cb status

Module "configfs": Loaded

Filesystem "configfs": Mounted

Module "ocfs2_nodemanager": Loaded

Module "ocfs2_dlm": Loaded

Module "ocfs2_dlmfs": Loaded

Filesystem "ocfs2_dlmfs": Mounted

Checking O2CB cluster mycluster: Online

Heartbeat dead threshold: 31

Network idle timeout: 30000

Network keepalive delay: 2000

Network reconnect delay: 2000

Checking O2CB heartbeat: Not active



Where are the O2CB timeout values stored?

# cat /etc/sysconfig/o2cb

#

# This is a configuration file for automatic startup of the O2CB

# driver. It is generated by running /etc/init.d/o2cb configure.

# Please use that method to modify this file

#



# O2CB_ENABLED: 'true' means to load the driver on boot.

O2CB_ENABLED=true



# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.

O2CB_BOOTCLUSTER=mycluster



# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.

O2CB_HEARTBEAT_THRESHOLD=31



# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.

O2CB_IDLE_TIMEOUT_MS=30000



# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent

O2CB_KEEPALIVE_DELAY_MS=2000



# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts

O2CB_RECONNECT_DELAY_MS=2000





ENTERPRISE LINUX 5



What are the changes in EL5 as compared to EL4 as it pertains to OCFS2?

The in-memory filesystems, configfs and debugfs, have different mountpoints. configfs is mounted at /sys/kernel/config instead of /config, while debugfs is mounted at /sys/kernel/debug instead of /debug. (dlmfs still mounts at the old mountpoint /dlm.)
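
If one needs to mount these file systems manually on EL5 (the o2cb init service normally handles configfs), the commands would look like:

# mount -t configfs configfs /sys/kernel/config

# mount -t debugfs debugfs /sys/kernel/debug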

Thursday, March 4, 2010

Booting the Failsafe Archive on a SPARC Based System

Booting a system from a root (/) file system image that is a boot archive, and then remounting this file system on the actual root device can sometimes result in a boot archive and root file system that do not match, or are inconsistent. Under these conditions, the proper operation and integrity of the system is compromised. After the root (/) file system is mounted, and before relinquishing the in-memory file system, the system performs a consistency verification against the two file systems. If an inconsistency is detected, the normal boot sequence is suspended and the system reverts to failsafe mode.
Also, if a system failure, a power failure, or a kernel panic occurs immediately following a kernel file update, the boot archives and the root (/) file system might not be synchronized. Although the system might still boot with the inconsistent boot archives, it is recommended that you boot the failsafe archive to update the boot archives. You can also use the bootadm command to manually update the boot archives. For more information, see Using the bootadm Command to Manage the Boot Archives.
The failsafe archive can be booted for recovery purposes or to update the boot archive on both the SPARC and x86 platforms.
On the SPARC platform the failsafe archive is:
/platform/`uname -m`/failsafe
You would boot the failsafe archive by using the following syntax:

ok boot -F failsafe


Failsafe booting is also supported on systems that are booted from ZFS. When booting from a ZFS-rooted BE, each BE has its own failsafe archive. The failsafe archive is located where the root (/) file system is located, as is the case with a UFS-rooted BE. The default failsafe archive is the archive that is in the default bootable file system. The default bootable file system (dataset) is indicated by the value of the pool's bootfs property.

For information about booting an x86 based failsafe archive, see Booting the Failsafe Archive on an x86 Based System.

Another method that can be used to update the boot archives is to clear the boot-archive service. However, the preferred methods for updating the boot archives are to boot the failsafe archive or use the bootadm command. For more information, see How to Update an Inconsistent Boot Archive by Clearing the boot-archive Service.

How to Boot the Failsafe Archive on a SPARC Based System

Use this procedure to boot the failsafe archive on a SPARC based system. If the system does not boot after the boot archive is updated, you might need to boot the system in single-user mode. For more information, see SPARC: How to Boot a System to Run Level S (Single-User Level).

--------------------------------------------------------------------------------
Note – This procedure also includes instructions for booting the failsafe archive for a specific ZFS dataset.
--------------------------------------------------------------------------------

Become superuser or assume an equivalent role.
Roles contain authorizations and privileged commands. For more information about roles, see Configuring RBAC (Task Map) in System Administration Guide: Security Services.
Bring the system to the ok prompt:

# init 0

Boot the failsafe archive.
To boot the default failsafe archive, type:

ok boot -F failsafe

To boot the failsafe archive of a specific ZFS dataset:

ok boot -F failsafe -Z dataset

For example:

ok boot -F failsafe -Z rpool/ROOT/zfsBE2
--------------------------------------------------------------------------------
Note – To determine the name of the dataset to boot, first use the boot -L command to display a list of the available BEs on the system. For more information, see SPARC: How to List Available Bootable Datasets Within a ZFS Root Pool.
--------------------------------------------------------------------------------
If an inconsistent boot archive is detected, a message is displayed.
To update the boot archive, type y and press Return.

An out of sync boot archive was detected on rpool.
The boot archive is a cache of files used during boot
and should be kept in sync to ensure proper system operation.
Do you wish to automatically update this boot archive? [y,n,?] y
If the archive was updated successfully, a message is displayed:

The boot archive on rpool was updated successfully.
--------------------------------------------------------------------------------
Example 12–7 SPARC: Booting the Failsafe Archive
This example shows how to boot the failsafe archive on a SPARC based system. If no device is specified, the failsafe archive for the default boot device is booted.

ok boot -F failsafe
Resetting ...

screen not found.
Can't open input device.
Keyboard not present.  Using ttya for input and output.

Sun Enterprise 220R (2 X UltraSPARC-II 450MHz), No Keyboard
OpenBoot 3.23, 1024 MB memory installed, Serial #13116682.
Ethernet address 8:0:20:c8:25:a, Host ID: 80c8250a.

Rebooting with command: boot -F failsafe
Boot device: /pci@1f,4000/scsi@3/disk@1,0:a  File and args: -F failsafe
SunOS Release 5.10
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Configuring /dev
Searching for installed OS instances...

An out of sync boot archive was detected on /dev/dsk/c0t1d0s0.
The boot archive is a cache of files used during boot and
should be kept in sync to ensure proper system operation.

Do you wish to automatically update this boot archive? [y,n,?] y
Updating boot archive on /dev/dsk/c0t1d0s0.
The boot archive on /dev/dsk/c0t1d0s0 was updated successfully.

Solaris 5.10 was found on /dev/dsk/c0t1d0s0.
Do you wish to have it mounted read-write on /a? [y,n,?] n
Starting shell.
#
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Example 12–8 SPARC: Booting the Failsafe Archive for a Specified ZFS Dataset
This example shows how to boot the failsafe archive of a ZFS dataset. Note that the boot -L command is first used to display a list of available boot environments. This command must be run at the ok prompt.

ok boot -L
Rebooting with command: boot -L
Boot device: /pci@1f,4000/scsi@3/disk@1,0  File and args: -L
1 zfsBE2
Select environment to boot: [ 1 - 1 ]: 1
To boot the selected entry, invoke:
boot [<root-device>] -Z rpool/ROOT/zfsBE2

Program terminated
{0} ok

Resetting ...

screen not found.
Can't open input device.
Keyboard not present.  Using ttya for input and output.

Sun Enterprise 220R (2 X UltraSPARC-II 450MHz), No Keyboard
OpenBoot 3.23, 1024 MB memory installed, Serial #13116682.
Ethernet address 8:0:20:c8:25:a, Host ID: 80c8250a.

{0} ok boot -F failsafe -Z rpool/ROOT/zfsBE2
Boot device: /pci@1f,4000/scsi@3/disk@1,0  File and args: -F failsafe -Z rpool/ROOT/zfsBE2
SunOS Release 5.10
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Configuring /dev
Searching for installed OS instances...

ROOT/zfsBE2 was found on rpool.
Do you wish to have it mounted read-write on /a? [y,n,?] y
mounting rpool on /a

Starting shell.
# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rpool  16.8G  6.26G  10.5G    37%  ONLINE  /a
# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c0t1d0s0  ONLINE       0     0     0

errors: No known data errors
# df -h
Filesystem             size   used  avail capacity  Mounted on
/ramdisk-root:a        163M   153M     0K   100%    /
/devices                 0K     0K     0K     0%    /devices
/dev                     0K     0K     0K     0%    /dev
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   601M   344K   601M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
swap                   602M   1.4M   601M     1%    /tmp
/tmp/root/etc          602M   1.4M   601M     1%    /.tmp_proto/root/etc
fd                       0K     0K     0K     0%    /dev/fd
rpool/ROOT/zfsBE2       16G   5.7G   9.8G    37%    /a
rpool/export            16G    20K   9.8G     1%    /a/export
rpool/export/home       16G    18K   9.8G     1%    /a/export/home
rpool

Wednesday, March 3, 2010

RM6 (Raid Manager6) commands for A1000 Storage Array

Log in to the server to which the A1000 storage array is connected. Run the commands below as root.

To perform a health check


cd /usr/sbin/osa
# ./healthck -a

Health Check Summary Information
gb029_001: Dead LUN at Drive [1,2];[2,2];[1,3] -- FAILED DISKS TO BE REPLACED
gb029_001: Battery Alert
healthck succeeded!
solarisserver:[/usr/sbin/osa]

Drive Information for controller named gb029_001

# /usr/sbin/osa/drivutil -i gb029_001

Drive Information for gb029_001


Location Capacity Status Vendor Product Firmware Serial
(MB) ID Version Number
[1,0] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 01X18729
[2,0] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 01X18483
[1,1] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 01X18023
[2,1] 34732 Optimal SEAGATE ST336607LSUN36G 0707 3JAY3EAD00
[1,2] 0 Failed FUJITSU MAN3367M SUN36G 1804 01X07505
[2,2] 0 Failed
[1,3] 0 Failed FUJITSU MAN3367M SUN36G 1804 01X10350
[2,3] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 01X08112
[1,4] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 03X19438
[2,4] 34732 Optimal FUJITSU MAJ3364M SUN36G 0804 02M16646
[1,5] 34732 Optimal FUJITSU MAN3367M SUN36G 1804 01X08928
[2,5] 34732 Spare[2,2] FUJITSU MAN3367M SUN36G 1804 01X10820

To display which drives belong to which drive groups

# ./drivutil -d gb029_001

Drives in Group for gb029_001
Group Drive List [Channel,Id]
Hot Spare   [2,5];
Group 1     [1,0]; [2,0]; [1,1]; [2,1]; [1,2]; [2,2]; [1,3]; [2,3]; [1,4]; [2,4]; [1,5];

drivutil succeeded!
solarisserver:[/usr/sbin/osa]#

To display a list of LUNs and their status in a specific RAID module
# ./drivutil -l gb029_001

Logical Unit Information for gb029_001
LUN  Group  Device   RAID    Capacity  Status
            Name     Level     (MB)
 0     1    c3t0d0     5      346927   Dead
drivutil succeeded!
solarisserver:[/usr/sbin/osa]

To verify the status of lun 0


# ./drivutil -p 0 c3t0d0 gb029_001

unit 0: dead
drivutil succeeded!

To display the drive group information

# ./drivutil -I c3t0d0

Group Information for gb029_001
Group       No. of  RAID   No. of  Total      Remaining
            LUNs    Level  Drives  Space(MB)  Space(MB)
Hot Spare     -       -      1        -           -
1             1       5     11     346928         1


To display a list of the luns, their size and firmware info in a RAID Module

# ./raidutil -c c3t0d0 -i

LUNs found on c3t0d0.
  LUN 0    RAID 5    346927 MB

Vendor ID           Symbios
ProductID           StorEDGE A1000
Product Revision    0301
Boot Level          03.01.04.00
Boot Level Date     04/05/01
Firmware Level      03.01.04.71
Firmware Date       09/25/01
raidutil succeeded!
solarisserver:[/usr/sbin/osa]

To delete an existing LUN

# ./raidutil -c c3t0d0 -D 0      (or, to delete all LUNs:  # ./raidutil -c c3t0d0 -D ALL)

To confirm the previous step

# ./raidutil -c c3t0d0 -B

To create a MAX size raid 5 LUN using specific drives

./raidutil -c c3t0d0 -n 0 -l 5 -s 0 -g 10,20,11,21,12,22,13,23,14,24,15

-n = LUN number
-l = RAID level (5 here)
-s = size in MB (0 indicates a RAID 5 LUN of maximum size; you can also give the size explicitly, e.g. -s 346927)
-g = drive list, given as [channel,id] pairs with the comma dropped (10 = drive [1,0], 20 = drive [2,0], and so on)
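As a smaller, hypothetical example (the LUN number, size and drive list here are made up, not taken from the configuration above), a 10 GB RAID 5 LUN on three drives could be created with:

# ./raidutil -c c3t0d0 -n 1 -l 5 -s 10000 -g 10,20,11

Here 10, 20 and 11 are the drives at [1,0], [2,0] and [1,1], matching the [channel,id] pairs reported by drivutil -d.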

To set up drive [2,5] as a hot spare

# ./raidutil -h 25

To check the raid creation progress

./drivutil -p 0 c3t0d0
Once it is done, run cfgadm to configure the controller (c3 in this example):

# cfgadm -c configure c3

devfsadm -Cv
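To confirm that the new LUN is now visible to Solaris (a generic check, not an RM6 command), list the disks with format; the A1000 LUN should appear as the c3t0d0 device used above:

# echo | format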

Sunday, February 14, 2010

Uninstall the Veritas Cluster Server (VCS)

Below is a general outline for uninstalling Veritas Cluster Server without removing Veritas Volume Manager or the Veritas File System. The steps below need to be performed on both cluster nodes; perform them on one node at a time.

Make a backup of the VCS configuration file, to reference for file systems, shares and apps if needed

o cp /etc/VRTSvcs/conf/config/main.cf $HOME/main.cf

Stop VCS, but keep all applications running

o hastop -all -force

Disable VCS startup

o mv /etc/rc3.d/S99vcs /etc/rc3.d/s99vcs
o mv /etc/rc2.d/S92gab /etc/rc2.d/s92gab
o rm /etc/llttab
o rm /etc/gabtab
o rm /etc/llthosts
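Optionally, if GAB and LLT are still configured after stopping VCS, they can be unconfigured before the reboot; this is only a sketch, and in many environments the later reboot takes care of it anyway:

# gabconfig -U        (unconfigure GAB)
# lltconfig -U        (unconfigure LLT)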

Add all file systems to /etc/vfstab
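For example, a VxFS file system that VCS previously mounted might get an /etc/vfstab entry like the one below; the disk group, volume and mount point names are placeholders:

/dev/vx/dsk/appdg/appvol  /dev/vx/rdsk/appdg/appvol  /app  vxfs  2  yes  -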

Add back the Solaris services that were removed per VCS requirements

o /network/nfs/status
o /network/nfs/server
o /network/nfs/mapid
o Here are the commands used to remove those services:
     svccfg delete -f svc:/network/nfs/status:default
     svccfg delete -f svc:/network/nfs/server:default
     svccfg delete -f svc:/network/nfs/mapid:default
o The service manifest files to import should be located in /var/svc/manifest/network/nfs
o The steps below import the manifests. (You will get a “partial import because default-milestone is already online” error while importing the nfs/server manifest, but the service will come online after the server reboots.)
     svccfg
     svc:> validate /var/svc/manifest/network/nfs/server.xml
     svc:> import /var/svc/manifest/network/nfs/server.xml

     svc:> validate /var/svc/manifest/network/nfs/status.xml
     svc:> import /var/svc/manifest/network/nfs/status.xml

     svc:> validate /var/svc/manifest/network/nfs/mapid.xml
     svc:> import /var/svc/manifest/network/nfs/mapid.xml

Configure NFS shares in /etc/dfs/dfstab
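Entries in /etc/dfs/dfstab are plain share commands, for example (the share path and options here are placeholders):

share -F nfs -o rw -d "application data" /app/export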

Enable the nfs/server service (and all dependency services)
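A typical way to do this, and to confirm that the shares from dfstab are exported, is shown below as a sketch:

# svcadm enable -r svc:/network/nfs/server:default
# shareall                    (shares everything listed in /etc/dfs/dfstab)
# share                       (lists what is currently shared)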

Ensure NFS shares are shared

o May need to log into every server that mounts these shares to verify they are still mounted, and remount if they are not.

Configure IPMP
o Generally, the Primary Interface is “ce0” and the secondary interface is “ce4”, but verify within the main.cf file, looking for the MultiNICA definition stanza.
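A minimal link-based IPMP sketch, assuming ce0 is active and ce4 is standby and using a placeholder host name; whether you use link-based or probe-based IPMP (with test addresses) depends on the environment and the addresses gathered for this work:

# cat /etc/hostname.ce0
myhost netmask + broadcast + group ipmp0 up

# cat /etc/hostname.ce4
group ipmp0 standby up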

Ensure all apps that were managed by VCS (Oracle, SAP, etc) have the appropriate startup/shutdown scripts in system startup/shutdown. The app teams will need to write the startup/shutdown scripts.
o Ensure you link the shutdown scripts into every rc directory except rc3.d (app teams typically state to only put into one rc directory, but that is incorrect for Solaris)
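As a sketch for a hypothetical /etc/init.d/appctl script (the script name and sequence numbers are made up), the kill links go into every rc directory except rc3.d, which gets the start link:

# ln -s /etc/init.d/appctl /etc/rcS.d/K01appctl
# ln -s /etc/init.d/appctl /etc/rc0.d/K01appctl
# ln -s /etc/init.d/appctl /etc/rc1.d/K01appctl
# ln -s /etc/init.d/appctl /etc/rc2.d/K01appctl
# ln -s /etc/init.d/appctl /etc/rc3.d/S99appctl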

Reboot the servers and verify everything comes up appropriately

o IPMP
o File systems
o NFS Shares
o Apps

May need to log into every server that mounts these shares to verify they are still mounted, and remount if they are not.

Once verified, uninstall VCS
o Not VxVM or VxFS as those are still used, just the VCS components. This should be the complete list, but verify:
 VRTSsap
 VRTSvcs
 VRTSvcsag
 VRTSvcsdc
 VRTSvcsmg
 VRTSvcsmn
 VRTSvcsor
 VRTSvcsvr
 VRTSvcsw
 VRTScscw
 VRTScsocw
 VRTSagtfw

Reboot the servers again with VCS uninstalled, ensuring proper startup

May need to log into every server that mounts these shares to verify they are still mounted, and remount if they are not.



Some of the work can be done before the actual scheduled maintenance window:

Backup copy of the VCS configuration file

Add all file systems to /etc/vfstab, though commented out in the event the server is rebooted before the maintenance window

Configure NFS shares in /etc/dfs/dfstab (though cannot add the services back yet)

Obtain the needed IP Addresses to properly configure IPMP

Pre-create the new /etc/hostname.* files, though naming them differently in case the server is rebooted before the maintenance window

Put the app startup/shutdown scripts in place, but name their links so they do not run on server startup/shutdown in the event the server is rebooted before the maintenance window

Thursday, January 28, 2010

Solaris 10 - Increasing Number of Processes Per User

The example below shows how to increase the number of processes per UID on a Solaris 10 system. The hardware used here is an UltraSPARC T2 based system running Solaris 10 with 32 GB RAM.
We needed to increase the number of processes per user beyond the current setting of 30000.

bash-3.00# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
open files (-n) 260000
pipe size (512 bytes, -p) 10
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 29995
virtual memory (kbytes, -v) unlimited

Trying to increase the "max user processes" would fail with the following error:

bash-3.00# ulimit -u 50000
bash: ulimit: max user processes: cannot modify limit: Invalid argument
bash-3.00#

After going through the Solaris 10 Tunable Parameters guide for process sizing, we learned that there are five parameters related to process sizing.

maxusers - Controls the maximum number of processes on the system, the number of quota structures held in the system, and the size of the directory name look-up cache (DNLC).
reserved_procs - Specifies the number of system process slots to be reserved in the process table for processes with a UID of root.
pidmax - Specifies the value of the largest possible process ID. Valid for Solaris 8 and later releases.
max_nprocs - Specifies the maximum number of processes that can be created on a system. Includes system processes and user processes. Any value specified in /etc/system is used in the computation of maxuprc.
maxuprc - Specifies the maximum number of processes that can be created on a system by any one user.
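For reference, and as a rough rule of thumb from the tunable documentation, the defaults are related as follows, which matches the values observed below:

max_nprocs = 10 + 16 * maxusers, capped at pidmax    (min(10 + 16*2048, 30000) = 30000)
maxuprc    = max_nprocs - reserved_procs             (30000 - 5 = 29995)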

Looked at the current values for these parameter:

bash-3.00# echo reserved_procs/D | mdb -k
reserved_procs:
reserved_procs: 5

bash-3.00# echo pidmax/D| mdb -k
pidmax:
pidmax: 30000

bash-3.00# echo maxusers/D | mdb -k
maxusers:
maxusers: 2048
bash-3.00#

bash-3.00# echo max_nprocs/D | mdb -k
max_nprocs:
max_nprocs: 30000
bash-3.00#

bash-3.00# echo maxuprc/D| mdb -k
maxuprc:
maxuprc: 29995

So, in order to raise the maximum per-user processes in this scenario, we needed to change pidmax (the upper cap), maxusers, max_nprocs and maxuprc.
Sample entries in /etc/system (a reboot is required for the changes to take effect):


set pidmax=60000
set maxusers = 4096
set maxuprc = 50000
set max_nprocs = 50000

After making the above entries, we were able to increase the max user processes to 50000.

bash-3.00# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
open files (-n) 260000
pipe size (512 bytes, -p) 10
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 49995
virtual memory (kbytes, -v) unlimited
bash-3.00#

bash-3.00# echo reserved_procs/D |mdb -k
reserved_procs:
reserved_procs: 5
bash-3.00# echo pidmax/D |mdb -k
pidmax:
pidmax: 60000
bash-3.00# echo max_nprocs/D |mdb -k
max_nprocs:
max_nprocs: 50000
bash-3.00# echo maxuprc/D | mdb -k
maxuprc:
maxuprc: 50000
bash-3.00#

Note: If you are operating within the 30000 limit (the default pidmax setting), the approach in the blog entry referred to above works fine. If you want to increase the number of processes beyond 30000, you also need to adjust the other dependent parameters described in this blog entry.