FreeBSD and Multipath

I didn’t find any blog posts or discussions on FreeBSD and multipath (for storage) that weren’t man pages.

That means it is up to me to write about it :)

Hardware

CPU

Machine class:	amd64
CPU Model:	Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
No. of Cores:	16

Memory

Total real memory available:	65511 MB
Logically used memory:		3945 MB
Logically available memory:	61565 MB

Storage

The storage is a large ~90TB enterprise-class Fibre Channel array, a Data Direct Networks S2A9900. Connected to that are two dual-port QLogic 2532 8Gb HBAs. We also have two SSD drives (configured as a RAID1 device) for the ZFS Intent Log.

The storage array is built from 120 1TB, 7200RPM Hitachi SATA drives. It has 12 volumes in total, each composed of 10 drives (1 parity, 1 spare), or ~7TB per volume.

The S2A9900 has two controllers: one is responsible for LUNs 1-6, the other for LUNs 7-12. Every LUN is presented to all four Fibre Channel ports. This got a little messy; sorting out 48 raw disk devices takes some patience and a decent attention span…

Yeah, I did make a few typos here and there; thankfully, creating and clearing disk labels is easy.

# camcontrol devlist|grep lun\ 0
                at scbus0 target 0 lun 0 (pass0,da0)
                at scbus1 target 0 lun 0 (pass6,da6)
                at scbus4 target 0 lun 0 (pass24,da24)
                at scbus5 target 0 lun 0 (pass30,da30)
# camcontrol inquiry da0 -S
108EA1B10001
# camcontrol inquiry da6 -S
108EA1B10001
# camcontrol inquiry da24 -S
108EA1B10001
# camcontrol inquiry da30 -S
108EA1B10001
# gmultipath label -v DDN-v00 /dev/da0 /dev/da6 /dev/da24 /dev/da30
Done.
# gmultipath status
             Name  Status  Components
multipath/DDN-v00     N/A  da0
                           da6
                           da24
                           da30

Now, to do that 11 more times…

Whew, hard work!
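Repeating this by hand for every LUN is error-prone, so here is a minimal sketch of how the grouping could be scripted. It assumes, as in the transcript above, that every path to the same LUN reports the same serial number from camcontrol inquiry -S. The emit_label_cmds name and the sequential DDN-vNN numbering are my own illustration, and the function only prints the gmultipath commands so they can be reviewed before anything is labeled.

```sh
#!/bin/sh
# Sketch only: group disk devices by serial number and PRINT one
# "gmultipath label" command per LUN. Nothing is executed.
# stdin: one "<device> <serial>" pair per line, e.g. "da0 108EA1B10001".
# Assumption: the DDN-vNN names are handed out in first-seen order.
emit_label_cmds() {
    awk '
        !($2 in paths) { order[n++] = $2 }       # remember first-seen order
        { paths[$2] = paths[$2] " /dev/" $1 }    # collect every path to this serial
        END {
            for (i = 0; i < n; i++)
                printf "gmultipath label -v DDN-v%02d%s\n", i, paths[order[i]]
        }'
}

# Possible usage on the storage server (the device list is hypothetical):
#   for d in da0 da6 da24 da30; do
#       echo "$d $(camcontrol inquiry $d -S)"
#   done | emit_label_cmds
```

Printing instead of executing keeps the typo-prone part reviewable; the output can be piped to sh once it looks right.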

Now, to create a simple ZFS pool across all 12 LUNs:

# zpool create zfs multipath/DDN-v00 multipath/DDN-v01 multipath/DDN-v02 multipath/DDN-v03 multipath/DDN-v04 multipath/DDN-v05 multipath/DDN-v06 multipath/DDN-v07 multipath/DDN-v08 multipath/DDN-v09 multipath/DDN-v10 multipath/DDN-v11 log mfid1

# zpool status
  pool: zfs
 state: ONLINE
 scrub: none requested
config:

	NAME                 STATE     READ WRITE CKSUM
	zfs                  ONLINE       0     0     0
	  multipath/DDN-v00  ONLINE       0     0     0
	  multipath/DDN-v01  ONLINE       0     0     0
	  multipath/DDN-v02  ONLINE       0     0     0
	  multipath/DDN-v03  ONLINE       0     0     0
	  multipath/DDN-v04  ONLINE       0     0     0
	  multipath/DDN-v05  ONLINE       0     0     0
	  multipath/DDN-v06  ONLINE       0     0     0
	  multipath/DDN-v07  ONLINE       0     0     0
	  multipath/DDN-v08  ONLINE       0     0     0
	  multipath/DDN-v09  ONLINE       0     0     0
	  multipath/DDN-v10  ONLINE       0     0     0
	  multipath/DDN-v11  ONLINE       0     0     0
	logs                 ONLINE       0     0     0
	  mfid1              ONLINE       0     0     0

errors: No known data errors
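The long zpool create line above does not have to be typed out. A minimal sketch, assuming the multipath devices follow the DDN-vNN naming used here; vdev_list is a hypothetical helper, and the command is only printed for review:

```sh
#!/bin/sh
# Sketch only: build the list of multipath vdev names instead of typing it.
# vdev_list is a hypothetical helper; names follow the DDN-vNN pattern above.
vdev_list() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf ' multipath/DDN-v%02d' "$i"
        i=$((i + 1))
    done
}

# Print the command for review rather than running it.
echo "zpool create zfs$(vdev_list 12) log mfid1"
```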

Results

These results were obtained from two similar servers. The other server uses a Winchester Systems storage array and has 24GB of system memory. The Winchester storage is ~40TB of 2TB SATA disks:

Another RAID array, just for a comparison

I used IOZone for the test (iozone -a). The default iozone test uses file sizes from 64k to 512MB, and since I’m trying to see how the server might actually react in the real world, I’m okay with this (i.e., I fully understand that a LOT of caching is taking place, and I want that for right now).

[Charts: IOZone results for Forward Re-Write, Forward Re-Read, Forward Read, Backwards Read, Random Read, Re-Read, Rec? Re-Write, Write, and Strided Read]

The S2A9900 is a pretty nice device. Although you have to use TELNET (yeesh, couldn’t they spend a few more bucks on a small ARM processor and use ssh?), the controllers have a decent command line environment with HELP pages. What is also nice is that the company provides the documentation for their products for free, with no registration required. Good job!

As far as raw read and write speeds go, that is hard to nail down. I’ve been using IOZone, and when I run it and watch ‘zpool iostat 1’, the ZFS pool stays at a constant 200MB/sec for writes. I’ve seen it pop up higher, like 250MB/sec to 500MB/sec, but 200 seems to be the sustained ceiling. I’ve tested with and without a dedicated log device, with and without gmultipath, and finally with the SSD RAID1 as an L2ARC cache device. All results are nearly identical. Reads are pretty crazy though: with 64GB of system memory, reading a file runs at nearly 1GB/sec.

Stuff and Things

I don’t have a central theme with this post, but I wanted to at least do something (it has been a while).

Packet Filter

Based on Chris’s “Falling in love with pf(4)” google status, I decided to take the plunge and move off of ipfw(4) to pf(4). I’m not at the point where I could write my own filter; however, I do feel I at least understand what is happening here. I also took the time to update all my ports, so I’m even running PHP 5.3.2, the latest WP release and about 600 other installed ports (Yikes, I’ve got a LOT of stuff on this server!).

I’ve always built my own kernel, at least on my home server, so the first thing to do is sync my /usr/src tree:

$ sudo su -
root# csup ~/bin/src-supfile
...
root# vim /usr/src/sys/amd64/conf/BLACKHOLE
# pf
device          pf
device          pflog
# pf's QoS - ALTQ
options         ALTQ
options         ALTQ_CBQ        # Class Based Queuing (CBQ)
options         ALTQ_RED        # Random Early Detection (RED)
options         ALTQ_RIO        # RED In/Out
options         ALTQ_HFSC       # Hierarchical Packet Scheduler (HFSC)
options         ALTQ_PRIQ       # Priority Queuing (PRIQ)
options         ALTQ_NOPCC      # Required for SMP build

root# cd /usr/src ; make -j8 buildkernel && make installkernel && reboot

I use tcsh, a C Shell variant, and I find the AND (&&) operator really useful for chaining commands that must each succeed before the next one runs. This way, if my build fails, the chain aborts and does not proceed with the install and reboot.
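To see the short-circuit behaviour without building a kernel, here is a small sketch written in plain sh (the && operator behaves the same way there). The step and chain functions are stand-ins I made up for illustration; they only print names and return a chosen exit status.

```sh
#!/bin/sh
# Sketch only: "step" is a stand-in for a real command. It prints its
# name and returns the status given as the second argument.
step() {
    echo "$1"
    return "$2"
}

# Same shape as: make buildkernel && make installkernel && reboot
# $1 is the exit status the pretend build should return.
chain() {
    step buildkernel "$1" && step installkernel 0 && step reboot 0
}

chain 0                                  # prints buildkernel, installkernel, reboot
chain 1 || echo "chain stopped early"    # prints buildkernel, then the message
```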

Now that I have an updated kernel with PF enabled, I had to steal Chris’s configuration:

/etc/rc.conf

root# vim /etc/rc.conf
#
# Packet Filter
#
pf_enable="YES"
pf_rules="/etc/pf.conf"
pflog_enable="YES"

#
# Unused, pf replaces all of this
#
#natd_program="/sbin/natd"       # path to natd, if you want a different one.
#natd_enable="YES"                # Enable natd (if firewall_enable == YES).
#natd_interface="em0"               # Public interface or IPaddress to use.
#natd_flags="-u -s -m"                   # Additional flags for natd.
#firewall_enable="YES"
#firewall_script="/usr/local/etc/rc.firewall"
#firewall_logging="YES"

/etc/pf.conf

root# vim /etc/pf.conf
# ----------------------------------------------------------------------------
# "THE BEER-WARE LICENSE" (Revision 42):
# cshumway@titan-project.org wrote this file. As long as you retain this notice you
# can do whatever you want with this stuff. If we meet some day, and you think
# this stuff is worth it, you can buy me a beer in return Christopher Shumway
# ----------------------------------------------------------------------------
#
# pf.conf
ext_if="em0"
int_if="em1"
lan_net="192.168.2.0/24"
open_ports="{ domain, ssh, http, https }"

# options
set skip on lo0
set skip on $int_if
set limit states 25000
set loginterface $ext_if
set state-policy if-bound

# scrub traffic
scrub in all

# NAT
nat on $ext_if from $lan_net to any -> ($ext_if)

# upnp redirection
rdr-anchor "miniupnpd"
anchor "miniupnpd"

# antispoofing
antispoof for $ext_if

# rules start here
block in
pass out on $ext_if keep state
pass in on $ext_if inet proto { tcp, udp } from any to ($ext_if) port $open_ports flags S/SA keep state
pass in on $ext_if inet proto icmp
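
A ruleset with a syntax error can lock you out, so it may be worth parsing the file before loading it. A minimal sketch using pfctl's -n (parse only) and -f (load file) flags; the reload_pf wrapper and the PFCTL override are my own additions for illustration and dry runs.

```sh
#!/bin/sh
# Sketch only: parse the ruleset first, and load it only if it parses.
# PFCTL can be overridden for a dry run; it defaults to the real pfctl.
reload_pf() {
    conf=${1:-/etc/pf.conf}
    pfctl_cmd=${PFCTL:-pfctl}
    "$pfctl_cmd" -n -f "$conf" && "$pfctl_cmd" -f "$conf"
}

# Possible usage as root:
#   reload_pf /etc/pf.conf && pfctl -s rules
```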

Pretty simple, and after a reboot my top process is java, and not natd(8). I can almost feel the internet becoming faster :)

Rock n Roll Owen

Unlike me, Owen still looks like a nice guy with shades on. I look like someone who would drive a black acura and cut you off...

Owen doesn't like the Paparazzi treatment!

We’ve made up, and did a publicity photo-op together

Caralyne’s Garden

Caralyne is skillful with both tending plants, and stapling things.

Dogs

Zoey and Coal sharing the morning sun

Link Aggregation on FreeBSD

I recently configured an NFS/Samba server with FreeBSD’s Link Aggregation protocol. Here is how I set it up.

FreeBSD Configuration

/boot/loader.conf

I recommend adding if_lagg_load="YES" and kern.hz="2000" to /boot/loader.conf.

The OS will automatically load the lagg kernel module when your network configuration loads, but I prefer to load it explicitly.

ispfw_load="YES"
kern.hz="2000"
aio_load="YES"
hw.igb.rxd=4096
hw.igb.txd=4096
if_lagg_load="YES"

Since I am using the igb ethernet device (Intel 82575 and 82576 chipsets), I also raised the number of send and receive descriptors from the default of 256 to the maximum of 4096. Give some thought to this step: increasing the descriptor count allocates more memory per interface. Since there are four interfaces in use in this setup, that is an order of magnitude more than the stock configuration.

/etc/rc.conf

ifconfig_igb0="UP polling"
ifconfig_igb1="UP polling"
ifconfig_igb2="UP polling"
ifconfig_igb3="UP polling"
ifconfig_lagg0="create laggproto lacp laggport igb0 laggport igb1 laggport igb2 laggport igb3 128.115.132.165 netmask 255.255.255.0"

Jumbo Frames

Normally, I set an MTU of 9194 for my igb/82575 ethernet controllers.

Setting an arbitrary MTU above the default of 1500 can cause unexpected behavior and decrease the stability of your environment. This configuration uses the Intel 82575 Quad Port 1000 VT adapter, which has a maximum MTU of 9194. Not all ethernet controllers support the same MTU sizes; for instance, Broadcom chipsets have a max MTU of 9022. Also, verify that your switch supports Jumbo Frames, and that the ports in use are set to the appropriate MTU.

Some Notes from Intel’s 8257[5-6] README:

- Only enable Jumbo Frames if your network infrastructure supports them.
- To enable Jumbo Frames, increase the MTU size on the interface beyond
1500.
- The Jumbo Frames setting on the switch must be set to at least 22 bytes
larger than that of the MTU.
- The maximum MTU setting for Jumbo Frames is 9216. This value coincides
with the maximum Jumbo Frames size of 9234 bytes.
- Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or
loss of link.

Since the Cisco hardware used can be set to a max MTU of 9216, and the switch setting must be at least 22 bytes larger than the interface MTU, our igb interfaces can be set to 9216 - 22 = 9194.
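
The arithmetic behind that number, as a tiny sketch. The 22-byte margin is the one from the Intel README notes above; host_mtu is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch only: the 22-byte margin comes from the Intel README quoted above.
# host_mtu is a hypothetical helper name.
host_mtu() {
    switch_mtu=$1
    echo $((switch_mtu - 22))
}

host_mtu 9216    # prints 9194

# Possible usage (applies the value to one port):
#   ifconfig igb0 mtu "$(host_mtu 9216)"
```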

Of course, the lagg interface won’t let me set an MTU higher than 1500. If anyone has additional information on this, that would be great to know.

Results

[root@laggy ~]> ifconfig lagg0

lagg0: flags=8843 metric 0 mtu 9194
	options=1bb
	ether 00:26:b9:62:ae:c8
	inet 128.115.132.165 netmask 0xffffff00 broadcast 128.115.132.255
	media: Ethernet autoselect
	status: active
	laggproto lacp
	laggport: igb3 flags=1c
	laggport: igb2 flags=1c
	laggport: igb1 flags=1c
	laggport: igb0 flags=1c

[root@laggy ~]> netstat -I lagg0 -w 1

            input        (lagg0)           output
   packets  errs      bytes    packets  errs      bytes colls
     87581     0  127400293          0     0          0     0
     85126     0  123891200          0     0          0     0
     84926     0  124237023          0     0          0     0

[root@laggy ~]> netstat -I igb0 -w 1

            input         (igb0)           output
   packets  errs      bytes    packets  errs      bytes colls
     38118     0   54749173          0     0          0     0
     35833     0   51498282          0     0          0     0

[root@laggy ~]> netstat -I igb1 -w 1

            input         (igb1)           output
   packets  errs      bytes    packets  errs      bytes colls
     12889     0   18315538          0     0          0     0
     16303     0   23159114          0     0          0     0
     27672     0   39275792          0     0          0     0

[root@laggy ~]> netstat -I igb2 -w 1

            input         (igb2)           output
   packets  errs      bytes    packets  errs      bytes colls
     23709     0   35778378          0     0          0     0
     24445     0   36901194          0     0          0     0

[root@laggy ~]> netstat -I igb3 -w 1

            input         (igb3)           output
   packets  errs      bytes    packets  errs      bytes colls
        11     0       1535          0     0          0     0
         1     0         60          0     0          0     0

Cisco Configuration

interface GigabitEthernet 8/41
 description (L5D13) laggy
 switchport                           # Required for L2 Etherchannel
 switchport access vlan 132           # VLAN assignment (optional)
 spanning-tree portfast               # Recommended
 channel-group 4 mode active          # Required assign channel # and mode
                                         see table below
 channel-protocol lacp                # Required assign lacp or pagp
 no shutdown

Verify that the port-channel interface is not shut down, and add a description:

interface Port-channel 4
 description Test Channel 2/22/10 laggy
 switchport
 switchport access vlan 101
 switchport trunk encapsulation dot1q
 no shutdown

This command makes the switch load-balance on the source and destination IP addresses; it applies to all EtherChannels on the switch. Other load balancing options exist.

port-channel load-balance src-dst-ip
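
With src-dst-ip, the member link is chosen from the source and destination addresses, so all traffic between one client and the server stays on a single link. The sketch below is a simplified illustration of that idea, not the switch's real algorithm: I am assuming a hash that XORs the last octet of each address and keeps the low 3 bits (8 buckets), and the addresses are made-up examples from the documentation range.

```sh
#!/bin/sh
# Sketch only, NOT the switch's real algorithm: XOR the last octet of the
# source and destination addresses and keep the low 3 bits (8 buckets).
hash_bucket() {
    src_octet=${1##*.}
    dst_octet=${2##*.}
    echo $(( (src_octet ^ dst_octet) & 7 ))
}

# Hypothetical clients talking to one server:
hash_bucket 192.0.2.10 192.0.2.165    # prints 7
hash_bucket 192.0.2.11 192.0.2.165    # prints 6
hash_bucket 192.0.2.10 192.0.2.165    # prints 7 again: same pair, same bucket
```

A scheme like this only spreads load when there are many distinct address pairs, which may be why one port can sit nearly idle while the others are busy.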

Use the following show commands to verify the condition of the EtherChannel:

show interfaces port-channel <channel number>
show etherchannel port-channel
show etherchannel summary
show Etherchannel load-balance

Switch# show interfaces port-channel 4
Port-channel4 is up, line protocol is up (connected)
  Hardware is EtherChannel, address is 0017.9499.94ac (bia 0017.9499.94ae)
  Description: TEST CHANNEL 2/22/10 laggy
  MTU 9214 bytes, BW 4000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s
  input flow-control is off, output flow-control is on
  Members in this channel: Gi8/41 Gi8/43 Gi8/45 Gi8/47

Switch#show etherchannel port-channel
                Channel-group listing:
                -----------------------
Group: 4
----------
                Port-channels in the group:
                ----------------------
Port-channel: Po4    (Primary Aggregator)

------------
Age of the Port-channel   = 0d:05h:25m:33s
Logical slot/port   = 14/1          Number of ports = 4
Port state          = Port-channel Ag-Inuse
Protocol            =   LACP

Ports in the Port-channel: 

Index   Load   Port     EC state        No of bits
------+------+------+------------------+-----------
  1     11     Gi8/41   Active    2
  2     22     Gi8/43   Active    2
  3     44     Gi8/45   Active    2
  0     88     Gi8/47   Active    2

Time since last port bundled:    0d:04h:47m:26s    Gi8/41
Time since last port Un-bundled: 0d:05h:22m:14s    Gi8/47

Switch#show etherchannel summary
Flags:  D - down        P - bundled in port-channel
        I - stand-alone s - suspended
        H - Hot-standby (LACP only)
        R - Layer3      S - Layer2
        U - in use      f - failed to allocate aggregator

        M - not in use, minimum links not met
        u - unsuitable for bundling
        w - waiting to be aggregated
Number of channel-groups in use: 1
Number of aggregators:           1

Group  Port-channel  Protocol    Ports
------+-------------+-----------+-----------------------------------------------
4      Po4(SU)         LACP      Gi8/41(P)  Gi8/43(P)  Gi8/45(P)  Gi8/47(P)

Switch#show Etherchannel load-balance
EtherChannel Load-Balancing Configuration:
        src-dst-ip
        mpls label-ip

EtherChannel Load-Balancing Addresses Used Per-Protocol:
Non-IP: Source XOR Destination MAC address
  IPv4: Source XOR Destination IP address
  IPv6: Source XOR Destination IP address
  MPLS: Label or IP

FreeBSD 8.0 = A Great NAS Server

I need to share this. When I google for “Samba performance”, I never see real numbers, real configuration files, or real hardware environments. All I read are anecdotal recollections, and that is not good enough. I like numbers, and I’ll let the numbers speak for themselves:

    > netstat -I em0 -w 1
                input          (em0)           output
       packets  errs      bytes    packets  errs      bytes colls
         90166     0   98762637      95363     0    5332847     0
         18131     0   24713156      20042     0    1123684     0
             4     0        310          1     0        178     0
             8     0        518          1     0        178     0
         10153     0   10952920      10696     0     598129     0
         92990     0  102837002      98476     0    5514994     0
         92025     0  102680574      97277     0    5439496     0
         92080     0  101799874      97403     0    5448637     0
         75348     0   90861608      80972     0    4537737     0
         90895     0  100323946      95781     0    5360948     0
         89313     0   97371154      94364     0    5278618     0
         81363     0   89229738      85861     0    4803589     0
             2     0        126          3     0        286     0
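
To put the byte counters in perspective, here is a small sketch that converts one of the per-second samples into megabits per second. It uses decimal units and integer division; to_mbit is a hypothetical helper name, and I am assuming em0 is a gigabit link.

```sh
#!/bin/sh
# Sketch only: convert a bytes-per-second sample from "netstat -w 1"
# into megabits per second (decimal units, integer division).
to_mbit() {
    bytes_per_sec=$1
    echo $((bytes_per_sec * 8 / 1000000))
}

to_mbit 102837002    # prints 822
```

So the busiest sample above works out to roughly 822 Mbit/s of inbound traffic on a single interface.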

I was so shocked that I had to use gstat and zpool iostat to verify the information:

    dT: 1.002s  w: 1.000s  filter: da0
     L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
       35   1476      0      0    0.0   1476 188421   23.7  100.0| da0

    > zpool iostat  1
                   capacity     operations    bandwidth
    pool         used  avail   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    tank        5.68T  4.32T      1     81   250K  10.1M
    tank        5.68T  4.32T      0  1.37K      0   175M
    tank        5.68T  4.32T      0  1.44K      0   184M
    tank        5.68T  4.32T      0  1.44K      0   184M
    tank        5.68T  4.32T      0  1.44K      0   184M
    tank        5.68T  4.32T      0  1.44K      0   184M
    tank        5.68T  4.32T      0  1.44K      0   184M
    tank        5.68T  4.32T      0  1.44K      0   184M

This is all through Samba (3.3.9); there was no local work being done. Unfortunately, I didn’t configure MRTG correctly, so it built a malformed graph while all this happened. Having a picture of all of this would have been nice.

The underlying storage is a SATABoy2 RAID6 array, with a simple “flat” ZFS filesystem (version 13). As cheap as the SATABoys are (and come on, they have a terrible IIS web interface), they can at least keep up with the current load.

I have felt that if you are going to use ZFS, you should let it manage the RAID, and not bother with a hardware RAID controller. While the hardware RAID may be faster, ZFS’s ability to self-correct bad blocks is a great feature despite the performance setback. However, RAID6 is pretty good in itself, and having dual parity should reduce the risk of a bad block being detrimental.

One thing I noticed with Samba is that it doesn’t seem to be a threaded daemon. When I run top(1) -H, there are only 2-3 smbd processes, and one of them is running around 30%. Though I don’t really know how well Samba can scale out, this environment only has about 10 users. I would like to see how Samba reacts if there are a couple hundred active users. Furthermore, how does a native Windows server handle a couple hundred users? It may handle it a little better; however, I don’t think I would enjoy watching NTFS handling a multi-terabyte volume… it would be like watching a stroke victim eat a bowl of soup. I do admit I am biased, and I have little working experience with Windows as a large file server; most of the ones I have worked on are horribly limited and underpowered, and no one seems to care if they perform well or not.

Hardware

CPU information

    Machine class:    amd64
    CPU Model:    Dual Core AMD Opteron(tm) Processor 285
    No. of Cores:    4
    Cores per CPU:

RAM information

    Memory information from dmidecode(8)
    Maximum Capacity: 8 GB
    Number Of Devices: 4
    Maximum Capacity: 8 GB
    Number Of Devices: 4

    INFO: Run `dmidecode -t memory` to see further information.

    System memory summary
    Total real memory available:    8048 MB
    Logically used memory:        2876 MB
    Logically available memory:    5172 MB

    Swap information
    Device          1K-blocks     Used    Avail Capacity
    /dev/da1s1b       8373844      28K     8.0G     0%

Storage information

    Available hard drives:
    cd0:  Removable CD-ROM SCSI-0 device
    cd0: 1.000MB/s transfers
    da2:  Fixed Direct Access SCSI-5 device
    da2: 300.000MB/s transfers
    da2: Command Queueing enabled
    da2: 140009MB (286739329 512 byte sectors: 255H 63S/T 17848C)
    da1:  Fixed Direct Access SCSI-2 device
    da1: 300.000MB/s transfers
    da1: Command Queueing enabled
    da1: 69618MB (142577664 512 byte sectors: 255H 63S/T 8875C)
    da0:  Fixed Direct Access SCSI-5 device
    da0: 200.000MB/s transfers
    da0: Command Queueing enabled
    da0: 10491861MB (21487333120 512 byte sectors: 255H 63S/T 1337524C)

    Raid controllers:
    umass-sim0:
    mpt0:
    vendor='LSI Logic (Was: Symbios Logic, NCR)'
    device='SAS 3000 series, 4-port with 1064 -StorPort'
    isp0:
    vendor='QLogic Corporation'
    device='QLA6322 Fibre Channel Adapter'

    Currently mounted filesystems:
    /dev/da1s1a on /
    devfs on /dev
    tank on /tank
    /dev/ufs/EXPORT on /export

    I/O statistics:
           tty             da0              da1              da2             cpu
     tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
       0    40 63.61 167 10.36  16.53   2  0.03  61.65   0  0.00   1  0  4  0 94
    INFO: Run iostat(8) or gstat(8) to see live statistics.

    Disk usage:
    Filesystem         Size    Used   Avail Capacity  Mounted on
    /dev/da1s1a         58G    3.4G     50G     6%    /
    devfs              1.0K    1.0K      0B   100%    /dev
    tank               9.8T    5.7T    4.1T    58%    /tank
    /dev/ufs/EXPORT    126G    148K    116G     0%    /export

Software

  • FreeBSD 8.0-RELEASE-p1 FreeBSD 8.0-RELEASE-p1 amd64
  • samba-3.3.9 A free SMB and CIFS client and server for UNIX

Samba 3.3.9 Compile-Time Config

> make showconfig
===> The following configuration options are available for samba-3.3.9:
     LDAP=on "With LDAP support"
     ADS=on "With Active Directory support"
     CUPS=off "With CUPS printing support"
     WINBIND=on "With WinBIND support"
     SWAT=off "With SWAT WebGUI"
     ACL_SUPPORT=on "With ACL support"
     AIO_SUPPORT=on "With Asyncronous IO support"
     FAM_SUPPORT=on "With File Alteration Monitor"
     SYSLOG=on "With Syslog support"
     QUOTAS=on "With Disk quota support"
     UTMP=off "With UTMP accounting support"
     PAM_SMBPASS=on "With PAM authentication vs passdb backends"
     DNSUPDATE=off "With dynamic DNS update(require ADS)"
     DNSSD=off "With DNS service discovery support"
     EXP_MODULES=on "With experimental modules"
     POPT=on "With system-wide POPT library"
     MAX_DEBUG=off "With maximum debugging"
     SMBTORTURE=off "With smbtorture"
===> Use 'make config' to modify these settings

System Tuning

The Kernel

I enabled device polling, and took out debugging in the kernel (Sanders, get it! Mmm, I’m hungry…)

diff /usr/src/sys/amd64/conf/GENERIC /usr/src/sys/amd64/conf/SANDERS
    33d32
    < makeoptions    DEBUG=-g        # Build kernel with gdb(1) debug symbols
    78c77
    <
    ---
    > options        DEVICE_POLLING

/boot/loader.conf

    ispfw_load="YES"
    kern.hz="2000"
    aio_load="YES"

/etc/sysctl.conf

    kern.coredump=0
    security.bsd.see_other_uids=0
    security.bsd.see_other_gids=0
    kern.ipc.maxsockbuf=16777216
    kern.ipc.nmbclusters=32768
    kern.ipc.somaxconn=32768
    kern.maxfiles=65536
    kern.maxfilesperproc=32768
    kern.maxvnodes=800000
    net.inet.tcp.delayed_ack=0
    net.inet.tcp.inflight.enable=0
    net.inet.tcp.path_mtu_discovery=0
    net.inet.tcp.recvbuf_auto=1
    net.inet.tcp.recvbuf_inc=524288
    net.inet.tcp.recvbuf_max=16777216
    net.inet.tcp.recvspace=65536
    net.inet.tcp.sendbuf_auto=1
    net.inet.tcp.sendbuf_inc=524288
    net.inet.tcp.sendspace=65536
    net.inet.udp.maxdgram=57344
    net.inet.udp.recvspace=65536
    net.local.stream.recvspace=65536
    net.inet.tcp.sendbuf_max=16777216
    net.inet.tcp.mssdflt=9142
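
After editing the file it is easy to lose track of which values are actually live. Below is a minimal sketch that compares a sysctl.conf-style list against the running values using sysctl -n. The check_sysctls name and the SYSCTL override (handy for trying it without touching a real system) are my own illustration.

```sh
#!/bin/sh
# Sketch only: report tunables whose running value differs from the
# value wanted in a sysctl.conf-style file read from stdin.
# SYSCTL can be overridden for testing; it defaults to the real sysctl.
check_sysctls() {
    while IFS='=' read -r name want; do
        case $name in
            ''|\#*) continue ;;    # skip blank lines and comments
        esac
        have=$("${SYSCTL:-sysctl}" -n "$name" 2>/dev/null) || have=""
        [ "$have" = "$want" ] || echo "$name: want $want, have ${have:-unknown}"
    done
}

# Possible usage:
#   check_sysctls < /etc/sysctl.conf
```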

I want to thank Zilla (see post comments) for the sysctl.conf help.

rc.conf (em0 flags)

    ifconfig_em0="inet xxx.xxx.xxx.xxx  netmask 255.255.255.0 polling tso mtu 9194"

smb.conf

        min receivefile size = 131072
        aio read size = 1
        aio write size = 1
        use sendfile = yes
        lock directory = /var/run/samba/
        keepalive = 300

I’m also using LDAP users and groups. I wasn’t sure if there would be a noticeable performance difference between local users and LDAP users. There doesn’t seem to be one.

We use Active Directory, and since Quest/Vintela still won’t make a FreeBSD client for the Quest Authentication Servers (a sales rep once told me “There are just too many versions of BSD…”), I have to use all the open source utilities like OpenSSL, OpenLDAP Client and Kerberos. I don’t mind having to do it, but it is always nice if you can maintain one standard process across ALL systems, and we have a lot more Linux and Solaris systems than FreeBSD. I’m the odd one.

That aside, I use the latest OpenSSL in FreeBSD 8.0, OpenLDAP 2.4.20, and the built-in version of Heimdal Kerberos.

I get similar performance from NFS; however, most desktop users are on either Windows or OS X, and CIFS seems to be the unifying network storage protocol.

One thing I have yet to really figure out is configuring Samba to use proper NT ACLs. However, if you can live with UNIX style permissions, a setup like this is pretty good at serving out lots and lots of data. Maybe that will be next.

Why you should use disk labels

I recently had a little problem with a new FreeBSD install, and it is one of those times where I sort of appreciate how FreeBSD assigns device handles, yet at the same time hate it :)

The setup is this:
The OS was installed on a mirrored hardware raid device (using the mpt(4) driver), and then I had a large RAID6 array attached via a FC controller (using the isp(4) driver). When I installed the OS, the mpt device was showing up as da0. So I went ahead with the install and rebooted the system, so far so good.

What I didn’t realize was the FC device was not seen yet, so after some fiddling, Jenny and I got the large RAID6 array to show up… unfortunately, the isp card was before the mpt card on the PCI bus:

isp0@pci0:2:1:0: class=0x0c0400 card=0x01321077 chip=0x63221077 rev=0x03 hdr=0x00
vendor = 'QLogic Corporation'
device = 'QLA6322 Fibre Channel Adapter'
class = serial bus
subclass = Fibre Channel
mpt0@pci0:2:3:0: class=0x010000 card=0x30601000 chip=0x00501000 rev=0x02 hdr=0x00
vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
device = 'SAS 3000 series, 4-port with 1064 -StorPort'
class = mass storage
subclass = SCSI

and the RAID6 now became da0, and the OS device now became da1.

Doh!

The system prompted for the / drive, so I had to call out the correct device at the mount> prompt:

mount> ufs:/dev/da1s1a

After that, the system continued to boot into multi-user mode, which caused some very strange console behavior (it acted like the return key was being held down), and my only option was to SSH in as a local user, su to root, and then fix /etc/fstab.

This was not devastating; however, it shows the importance of using disk labels instead of device handles in certain use cases. I haven’t fixed the / mount, but to get a comfort level with using GEOM labels I added another drive to the system and called it EXPORT.

You can assign a permanent label in two ways (that I know of). When you newfs the device, you can specify the -L flag (BTW, -O2 means to use UFS2, and -U enables Soft-Updates; a label set this way shows up under /dev/ufs):

[root@paper ~]> newfs -O2 -U -L EXPORT /dev/da2s1a

OR using glabel, which is what you would have to do for a non-UFS filesystem. Note that it has to be glabel label, which writes the label to the device; glabel create only makes a temporary label that is gone after a reboot:

[root@paper ~]> glabel label EXPORT da2s1a

Now we can see our newly labeled device in action:

[root@paper ~]> ls /dev/label
. .. EXPORT
[root@paper ~]> glabel status
Name Status Components
label/EXPORT N/A da2s1a

To add it to /etc/fstab, you can either edit the file, or append the correctly tab-delimited line like so (printf expands the \t escapes regardless of how the shell’s echo is configured):

[root@paper ~]> printf "/dev/label/EXPORT\t/export\tufs\trw\t2\t2\n" >> /etc/fstab
[root@paper ~]> mkdir /export
[root@paper ~]> mount /export

Hurray!

[root@paper ~]> df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/da1s1a 60931274 4754540 51302234 8% /
devfs 1 1 0 100% /dev
tank 10569645824 107237376 10462408448 1% /tank
/dev/label/EXPORT 132022788 4 121460962 0% /export

[root@paper ~]> mount
/dev/da1s1a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local, multilabel)
tank on /tank (zfs, NFS exported, local)
/dev/label/EXPORT on /export (ufs, local)

This is now a persistent label. To be safe, I’ll have to boot off of a CD/USB drive and modify the root device.
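
For the root device, one possible route (an untested sketch, not something done on this system) is to give the UFS filesystem a volume label with tunefs -L while it is not mounted, and then point /etc/fstab at the matching /dev/ufs entry. The ROOTFS label name and the fstab_line helper are hypothetical.

```sh
#!/bin/sh
# Sketch only: build a tab-delimited /etc/fstab entry.
# Arguments: device, mount point, fs type, options, dump, pass.
fstab_line() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

fstab_line /dev/ufs/ROOTFS / ufs rw 1 1

# From a rescue CD/USB shell, with the root filesystem unmounted
# (ROOTFS is a hypothetical label; use whatever device name the
# root filesystem has at that moment):
#   tunefs -L ROOTFS /dev/da1s1a
# then replace the root entry in /etc/fstab with the line printed above.
```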

64bit nVidia driver for FreeBSD

I’ve always had a vested interest in the nvidia display driver for FreeBSD project, and I’m pretty attached to it. So much so that back in 2001 I started a little petition, got enough attention (and more importantly, a large list of people who signed it), and ever since 2002 FreeBSD users have been able to use high quality nvidia drivers. It wasn’t all me: whoever ran nvidia.netexplorer.org asked me to combine efforts, so I gave them my list, and they continued to market it and work with some folks at nvidia.

It is really nice to see that both the FreeBSD team and nvidia have worked together to do the necessary kernel development and get a 64bit driver. I used to use FreeBSD as my primary desktop at work, and it was great to use the hardware drivers for my displays. What is also nice is people in the nvidia forums are also asking for CUDA drivers on FreeBSD, that would be slick as well.

Digg the story if you want to:
http://digg.com/linux_unix/Official_64bit_NVIDIA_drivers_for_FreeBSD

FreeBSD 8.0 is (un-officially) available

So, it looks like FreeBSD 8.0 has been pre-released; the official date is going to be 11/25, as noted in src/UPDATING:

Updating Information for FreeBSD current users

This file is maintained and copyrighted by M. Warner Losh.  See end of file
for further details.  For commonly done items, please see the COMMON ITEMS:
section later in the file.

Items affecting the ports and packages system can be found in
/usr/ports/UPDATING.  Please read that file before running
portupgrade.

NOTE TO PEOPLE WHO THINK THAT FreeBSD 8.x IS SLOW ON IA64 OR SUN4V:
        For ia64 the INVARIANTS and INVARIANT_SUPPORT kernel options
        were left in the GENERIC kernel because the kernel does not
        work properly without them.  For sun4v all of the normal kernel
        debugging tools present in HEAD were left in place because
        sun4v support still needs work to become production ready.

20091125:
        8.0-RELEASE.
...

Thanks for the warning, and I don’t feel that 8.0 is slow in any way :)

You can now update to FreeBSD 8.0 either by syncing your source with csup:

*default host=cvsup.FreeBSD.org
*default base=/usr
*default prefix=/usr
*default delete use-rel-suffix
*default compress
src-all release=cvs tag=RELENG_8_0

Or with freebsd-update(8):

# freebsd-update -r 8.0-RELEASE upgrade

then

# freebsd-update install

and after the reboot, possibly another round of ‘freebsd-update install’ to finish things up. You can actually upgrade from 7.2 to 8.0, which is pretty impressive since they are separate major releases (and minor release upgrades work just fine as well).

Why would you upgrade to 8.0 over 7.2? Well, Ivan Voras already has a very nice page on the notable features in 8:
http://ivoras.sharanet.org/freebsd/freebsd8.html
In case you want my short list version of that, here are the big highlights for me:

  • Kernel Stuff
    • Kernel memory limit on amd64 increased (this greatly benefits ZFS)
    • Superpages
    • Network stack virtualization, equal cost multipath routing and other really cool network improvements
    • NGROUPS has been increased from 16 to 1024
    • Other kernel improvements like lightweight threads and the new ULE 3.0 scheduler
    • NFS Locking
    • QLogic 8Gb HBA support
    • New AHCI driver
  • Userland Stuff
    • Parallel port builds
    • Jails v2
    • DTrace
    • Clang/LLVM compiler

One of the cool things about FreeBSD is its focus on improving what is there. There have been some really big additions to FreeBSD from time to time, but overall, the goal has been to constantly refine and improve the performance. That is what I’m mostly excited about, the continual refinement of an already robust OS.

There are other features, like Clang/LLVM or DTrace, that I’m excited about, but only because I can’t wait to see how others use them. I myself cannot get a lot of useful information out of DTrace; a kernel developer who knows what they are doing probably can, though, and that helps them out (which sometimes helps me out).

I’ve used the BETA and RC versions of 8.0, so not only was I pleased with the experience, I’m also excited to see its adoption with the new improvements. I’ve seen some PostgreSQL and MySQL benchmarks and there was a clear performance gain between 7 and 8.

Now is also a good time to mention that the FreeBSD Foundation is rounding up this year’s donations.

It’s pretty amazing that FreeBSD is a non-profit group; they do not have a CEO, a marketing department, or a horde of full-time developers… and yet they put out an extremely well-engineered OS (that is the boon of not having a marketing department :) all decisions are driven by community demand and the developers, and not buzzwords like “the cloud”) with a killer network stack and over 22,000 available ports.

PuppetCamp09

This was a very cool conference. I picked up a lot of useful information on both the open source tool, Puppet, and some ideas on infrastructure.

What also made this conference unique is how honest the Puppet team and community were about the project’s strengths and weaknesses. Those who have deployed Puppet on a larger scale (MessageOne and Google) seem to have gone through the same iterations in attempting to scale out their puppetmasters. First comes WEBrick (which is what I’m currently running Puppet with :) ), which is hated by all since it’s a single process/thread web server that can only handle one request at a time. Next is Mongrel, where you have to manage a Mongrel cluster script, feed it lots of memory, and then put an Apache proxy server in front of it. Now people are starting to settle on Passenger/mod_rack, which is what I spent most of yesterday looking into and setting up. This allows Apache to mount a Rails instance, so you don’t actually have to run puppetmasterd. It still requires some decent hardware, and I’m currently running my puppetmaster on a VM with 2GB of memory, so I’ll have to watch out for that. Chris, the one who introduced me to Puppet, said he still uses WEBrick for all of his DB, Tomcat, and Apache servers (I think he said something like 200 systems) and it has been working out nicely. He, like the guys at Google, also doesn’t run puppet as a daemon.

Anyway, the point is, we learned a lot about the project, way more than if a salesperson had come to us and just told us the things Puppet does well, or how it operates on paper (*cough* LANDesk *cough*). It was really awesome to talk with Andrew Pollock and Nigel Kersten from Google. See, I was a little unsure about Puppet in our environment, where we have multi-purpose servers, computer servers, and desktops to manage. It seemed, at first glance, that most of the Puppet users out there have a homogeneous environment, and Andrew (Shafer) had stressed the concept of single-role servers. After talking with them, I felt a lot more comfortable pursuing Puppet across our servers and desktops. Did I mention they were super cool and friendly?

We also learned a lot about the Puppet developers, which had its own interesting advantage. I have a lot of respect for what Luke Kanies has been able to do, and by the end of the conference, he showed significant mastery of what he has done, as well as some humility in admitting what he has not been able to do and why. I was a little put off the first day though, when both he and Andrew came off a little arrogant and crass. It did make me step back and think, “Is this project going to be well managed in the future with personalities like this in charge? Is their answer of ‘don’t do that!’ tongue in cheek, or are they not supportive of a diverse environment?” In the end, I have more respect for the project than ever, and with it still being a young project, I hope they listened to some of the feedback. I also can’t wait to see where it ends up in the next year.

Andrew, the Puppet Andrew, came up to us a lot during the conference. He was fun to talk to; he’s very academic and had a lot of abstract concepts to talk about. Also, he said this was the first conference he has arranged, and I think he did a fantastic job. Jenny commented that this was the first conference she had lasted the entire duration of, so that says a lot about the pacing and content of PuppetCamp. I felt the same way: every session was incredibly engaging, and how Andrew had set up the democratic and chaotic Open Sessions was very impressive. Let’s put it this way: I even got up there and pitched a topic, which is something I would never have done before. Hurray for me stepping outside of my comfort zone!

Warning: side topic!

Now that I’ve had the weekend to google all the cool technologies I was exposed to, I’m also reminded why I really like having a FreeBSD server at my disposal. They had talked about CouchDB, so on a whim I did a

~> cd /usr/ports
/usr/ports> make search name=couchdb
Port: couchdb-0.9.0_1,1
Path: /usr/ports/databases/couchdb
Info: A document database server, accessible via a RESTful JSON API
Maint: till@php.net
B-deps: ca_root_nss-3.11.9_2 curl-7.19.6_1 erlang-lite-r13b01_6,1 gettext-0.17_1 gmake-3.81_3 icu-3.8.1_2 libiconv-1.13.1 libtool-2.2.6a nspr-4.8 perl-5.8.9_3 spidermonkey-1.7.0
R-deps: ca_root_nss-3.11.9_2 curl-7.19.6_1 erlang-lite-r13b01_6,1 gettext-0.17_1 gmake-3.81_3 icu-3.8.1_2 libiconv-1.13.1 libtool-2.2.6a nspr-4.8 perl-5.8.9_3 spidermonkey-1.7.0
WWW: http://couchdb.org/


Port: py26-simplecouchdb-0.9.26
Path: /usr/ports/databases/py-simplecouchdb
Info: Simple Librairy to Allow Python Applicationto Use CouchDB
Maint: wenheping@gmail.com
B-deps: py26-httplib2-0.5.0 py26-py-restclient-1.3.2 py26-setuptools-0.6c9 python26-2.6.2_3
R-deps: py26-httplib2-0.5.0 py26-py-restclient-1.3.2 py26-setuptools-0.6c9 python26-2.6.2_3
WWW: http://code.google.com/p/py-simplecouchdb/
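
The Path: line in that search output is the directory you build from. As a small sketch (the helper name port_paths is just something I made up), you can pull those paths out with sed and feed them to cd:

```shell
#!/bin/sh
# Print the Path: field of each port found in `make search` output (stdin).
port_paths() {
    sed -n 's/^Path:[[:space:]]*//p'
}

# Demo on two lines copied from the search result above; on a real system
# you would run: cd /usr/ports && make search name=couchdb | port_paths
port_paths <<'EOF'
Port: couchdb-0.9.0_1,1
Path: /usr/ports/databases/couchdb
EOF
```

From there it is the usual "cd /usr/ports/databases/couchdb && make install clean".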

I did a ‘make install’, and I had a cool little couchdb up and running. What is also cool is FreeBSD likes to give you very helpful information when you install something. For example, this is what is printed out when you install the CouchDB port:

===> COMPATIBILITY NOTE:
CouchDB is still pre-stable; between 0.8 and 0.9 the database format
changed which breaks BC. In current trunk, the format changed again, so
please double-check in case you are updating an existing installation.

More info:
* http://wiki.apache.org/couchdb/Breaking_changes?action=show&redirect=BreakingChanges
* http://wiki.apache.org/couchdb/BreakingChangesUpdateTrunkTo0Dot9

See, isn’t that helpful? Best of all, I didn’t have to enable additional repositories, or fetch the source and its dependencies manually and then figure out the right configure script flags… FreeBSD makes it easy, and since it automatically combines what you already have with what is required, it’s an incredibly stable build. Removing it is pretty simple as well, just:

> pkg_deinstall -R couchdb
---> Deinstalling 'couchdb-0.9.0_1,1'
---> Deinstalling 'erlang-lite-r13b02,1'
[Updating the pkgdb
in /var/db/pkg ... - 118 packages found (-1 +0) (...) done]
---> Deinstalling 'curl-7.19.6_1'
[Updating the pkgdb
in /var/db/pkg ... - 117 packages found (-1 +0) (...) done]
---> Deinstalling 'ca_root_nss-3.11.9_2'
---> Deinstalling 'spidermonkey-1.7.0'
---> Deinstalling 'nspr-4.8'
[Updating the pkgdb
in /var/db/pkg ... - 116 packages found (-1 +0) (...) done]
---> Deinstalling 'gmake-3.81_3'
[Updating the pkgdb
in /var/db/pkg ... - 115 packages found (-1 +0) (...) done]
---> Deinstalling 'perl-threaded-5.8.9_3'
[Updating the pkgdb
in /var/db/pkg ... - 114 packages found (-1 +0) (...) done]
---> Deinstalling 'gettext-0.17_1'
---> Deinstalling 'libiconv-1.13.1'
---> Deinstalling 'icu-3.8.1_2'
---> Deinstalling 'libtool-2.2.6a'
** Listing the failed packages (-:ignored / *:skipped / !:failed)
! curl-7.19.6_1 (pkg_delete failed)
! ca_root_nss-3.11.9_2 (pkg_delete failed)
! perl-threaded-5.8.9_3 (pkg_delete failed)
! gettext-0.17_1 (pkg_delete failed)
! libiconv-1.13.1 (pkg_delete failed)

This does an upward recursive dependency removal. Also, if a dependency is relied on by another package, it won’t get removed. For example, if perl58 was a dependency of a package, it wouldn’t be removed as long as perl58 is still used by other installed packages. This is smart. So, above, the packages that failed to deinstall were ones that are required dependencies of other installed packages.
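
The rule behind those “failed” entries is simple: a package is left alone while something else still lists it as a requirement. With the pkg_install tools of this era, that record lives (as far as I know) in a +REQUIRED_BY file under /var/db/pkg/<package>/. Here is a minimal sketch of the check; the function name is my own, and it does not verify that the package is actually installed:

```shell
#!/bin/sh
# PKG_DBDIR defaults to the classic pkg_install database location.
PKG_DBDIR="${PKG_DBDIR:-/var/db/pkg}"

# Succeeds (exit 0) when no other installed package depends on $1.
is_leaf_package() {
    req="${PKG_DBDIR}/$1/+REQUIRED_BY"
    # A missing or empty file means nothing requires this package.
    [ ! -s "$req" ]
}

# Usage: pass package names as arguments, e.g. ./leafcheck.sh perl-5.8.9_3
for pkg in "$@"; do
    if is_leaf_package "$pkg"; then
        echo "$pkg: nothing depends on it"
    else
        echo "$pkg: still required by:"
        sed 's/^/  /' "${PKG_DBDIR}/$pkg/+REQUIRED_BY"
    fi
done
```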

Speaking of package management: have you ever installed something that ended up having a few dozen dependencies, and then wanted to uninstall that package with an “rpm -e cba8”, or something equivalent? What about all the other cruft that came along with it? You would have to keep track of each dependency, specify all of them, and hope you don’t break another program. FreeBSD has a few tools to do this. One in particular, portmaster, can remove all ports that were once a dependency but are no longer used:

> portmaster -s
Information for neon28-0.28.4:
Comment:
An HTTP and WebDAV client library for Unix systems
===>>> neon28-0.28.4 is no longer depended on, delete? [n] y
===>>> Delete old and new distfiles for www/neon28
without prompting? [n] y
===>>> Running pkg_delete -f neon28-0.28.4
Information for rubygem-actionwebservice-1.2.6:
...

I ended up removing 4 packages that were no longer used.

CentOS and RHEL are the larger Puppet consumers, but I’m still a big proponent of FreeBSD, and at work it has allowed me to quickly build an Apache + Puppet + RubyPassenger/mod_rack stack with minimal dependencies installed. So the puppet server is still pretty lean, which means updates are smaller and faster. It still surprises me that it’s relatively unknown, even though Netcraft always has it listed among the top domains with the best uptime, and it has been consistently growing over the years. Why do I feel like an AmigaOS fan sometimes?

Hmm, it is sort of weird that this turned into a FreeBSD ports management entry :)

Okay, final word: PuppetCamp09 was Freaking awesome. There were a lot of smart developers and sysadmins there. We even got a very cool git howto, which I found useful. It was very diverse, which is strange for a conference based on one project in particular.

PC-BSD 7.1.1

PC-BSD KDE Desktop

PC-BSD is a nice mesh between FreeBSD and a ready to use Desktop (which uses about 6GB of disk space). It is based on FreeBSD 7.2, so it has all the cool features of the latest release. Best of all, without ANY additional configuration, I was able to:

  • Use the official FreeBSD nVidia driver for hardware acceleration
  • Watch clips on YouTube (with flashplayer)
  • Play back all sorts of media types like MP3s, DivX, MPEG, WMV, QT…
  • Use ZFS
  • Create and edit documents with the latest OpenOffice 3.1
  • Browse the web with Firefox 3.5
  • Create VMs with VirtualBox

Plus, if there wasn’t a PBI package for what I wanted, I could still use FreeBSD’s pkg_add, or cd to /usr/ports and build one. I would say that’s pretty impressive for a commercially supported Unix platform.

Otherwise, it is very much like Fedora or Ubuntu, in that it has an update manager (which updates PBIs and the system), a network manager, helpful tutorials, and for once (for FreeBSD at least) a full-blown X11/Qt graphical installer. FreeBSD has always had a simple ncurses installer, which I like, but it tends to frighten a lot of people who are used to GUI installers.

One strange thing it does is place all PC-BSD binaries in /usr/PCBSD. I guess this is to remain independent and out of the way of the base FreeBSD binaries, as well as /usr/local, which is the normal prefix for all Ports.

To wrap it up: my initial impression of PC-BSD is a positive one. I like how I could use the FreeBSD ports and package system without it conflicting with the PC-BSD packages that were installed. I like the installer, and the storage options at install time (UFS2+SU, UFS2+Journal, encrypted swap…). With any OS, it normally takes a few weeks of use to see its weaknesses, so I’m sure PC-BSD has some issues waiting to pop up. The only one I see right now is that KDE4 is the default GUI, and I prefer Gnome. I could install it, but it wasn’t offered up front, as all of PC-BSD’s install tools are written in Qt. The initial X setup tool was pretty slick, and it worked with my picky laptop.

ZFS updated in FreeBSD 7.2!

FreeBSD 7.x has been using version 6 of ZFS, and originally only 8.0 was going to have the newly updated ZFS version: 13.

Last week the core team MFC’d (Merge From Current) the ZFS updates to 7.2, so I cvsup’d and rebuilt my server’s kernel and world (with a simple “make buildworld && make buildkernel && make installworld && make installkernel”), rebooted, and now I have the latest ZFS version running:

[root@server ~]> zpool upgrade -v
This system is currently running ZFS pool version 13.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.
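
One thing worth keeping in mind: updating the ZFS code does not touch existing pools. They stay at their on-disk version until you run zpool upgrade on them, and once upgraded, an older system can no longer import them. A rough sketch of the comparison follows; the helper names and the pool name “tank” are placeholders of mine:

```shell
#!/bin/sh
# Extract the pool version the running ZFS code supports from
# `zpool upgrade -v` output read on stdin.
zfs_supported_version() {
    sed -n 's/^This system is currently running ZFS pool version \([0-9][0-9]*\)\..*/\1/p'
}

# Succeeds when a pool's on-disk version ($1) is older than what the
# system supports ($2).
pool_needs_upgrade() {
    [ "$1" -lt "$2" ]
}

# Demo using the line printed above; on a real system you would use:
#   zpool upgrade -v | zfs_supported_version
#   zpool get version tank
supported=$(echo "This system is currently running ZFS pool version 13." | zfs_supported_version)

# 6 is the pool version FreeBSD 7.x used before this update.
if pool_needs_upgrade 6 "$supported"; then
    echo "pool at v6, system supports v$supported; consider: zpool upgrade tank"
fi
```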