Archive

Microserver

I finally got a bit of time to write about the new server I built last year. It has been up and running since May 2012. It runs Mountain Lion Server with Open Directory allowing me to manage multiple users and devices on my home network; stores and shares my media collection; handles Time Machine backups; and serves as a web server for some of my small projects.

The Microserver now serves as an offsite backup unit storing nightly zfs snapshots of the data. It seems to work fine for this purpose.

Let's start by listing the requirements I had for the server box. Here they are in no particular order.

  1. Capacity. I want the server to maintain a huge data storage pool. The box has to hold a lot of hard drives. How many hard drives should I plan for? That sort of comes down to the next topic…
  2. Reliability. I want the storage to be resilient to hard drive failures. If I use zfs for storage, I can dedicate some drives to the actual data and some to redundant information that can recover files in case of a drive failure. (It does not work exactly like this, but it’s OK for the analysis.) I estimated that I need space for 6 drives: 4 to hold the data and 2 for redundancy. If I go with 2TB drives, that gives me 8TB of storage. If I use 3TB drives, it comes to 12TB, which should serve me well for quite a while.
  3. Blu-ray. I need space for a Blu-ray drive to watch movies and write an occasional disc.
  4. Finally, I need space for a system drive, because it’s very likely I would not be able to boot from a ZFS drive.
  5. The box has to be small. I have a Mac Pro at home and I do not need another box of similar size sitting around.
  6. It has to be quiet. The box will be sitting in a living room and I do not want to hear it.
  7. I want the box to run MacOS X. I can configure a Linux box, but it would take more time and effort than I want to expend.
  8. It has to be powerful enough to run some computational tasks, like indexing and searching of document collections.
  9. It’s a server, so it does not need a very good graphics card. At the same time I want to be able to plug in a DVI monitor occasionally.

Why did I want to replace the Microserver? With 6 drives mounted inside, it has no space for an optical drive (fail on #3). It uses a very power-efficient CPU, which is fine for a file server but insufficient for any other tasks (fail on #8). It uses an AMD CPU, and OSX for AMD is getting less and less support from the hackintosh community, so AMD compatibility with current OSX versions is falling behind. For example, the OSX kernel runs in 32-bit mode on an AMD CPU, while the latest ZEVO requires a 64-bit kernel. So I cannot update my ZFS setup on the Microserver beyond the ZEVO Developer Edition beta from last summer.

So, why did I not go with a Mac Pro? For three reasons: the Mac Pro is huge (#5), there is not enough space for 6 hard drives in a Mac Pro (#1), and it is expensive. Very expensive. Finally, while researching the Microserver, I got drawn into the experience of building a computer, and I wanted to build one.

This story will have multiple parts. I will cover the hardware parts, the assembly, the system installation, the software, and configuration. I plan to organize the notes I made, write down the reasons for the choices I made while assembling the server, and describe the lessons I learned in the meantime.


One of the 3TB drives started acting weird. I got emails from smartd saying that the drive kept failing the offline tests. I did a scrub; it found a couple of errors. The drive went to WD for replacement, and I got another one the next day.

Two weeks later another drive, a 2TB one, started to misbehave. smartctl showed 300+ “Pending Sectors”. I took the drive offline and wiped it with zeroes using diskutil zeroDisk /dev/diska; the Pending Sectors count went down to 0 with no more errors reported. It looks like the drive fixed itself. I placed the drive back into the pool and resilvered it.
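For reference, here is a minimal sketch of that offline / wipe / return cycle. The pool name tank and the device /dev/disk3 are placeholders (check zpool status and diskutil list for the real names), and after a full wipe a zpool replace, rather than a plain zpool online, may be needed to kick off the resilver:

  zpool status tank                # identify the misbehaving device
  zpool offline tank /dev/disk3    # take the drive out of service
  diskutil zeroDisk /dev/disk3     # overwrite the whole drive with zeroes
  zpool replace tank /dev/disk3    # return the wiped drive; this starts a resilver
  zpool scrub tank                 # verify the pool once the resilver finishes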

In part 2 I considered how the new kernel and some kernel tcp variables affect network transfer speed. In that analysis I evaluated disks shared over AFP and ran a simple file copy from and to the share. Here I compare AFP with NFS.

NFS and SMB vs. AFP. All numbers are in MB/s.

            AFP      NFS      change vs AFP   SMB      change vs AFP
  Write     57.37    25.02    -90.57%         36.71    -24.00%
  Read      63.99    72.99     25.06%         41.50    -28.89%

The table shows that NFS is significantly faster than AFP when reading from the server, but it is also extremely slow when writing to the server. I cannot figure out why this is happening, and I need to run more experiments.

For comparison I also measured the same read/write action while mounting the share via SMB. It is slower than AFP.

I have already looked at the network transfer speed and concluded that increasing the MTU significantly improves the speed. I decided to revisit this question after updating the kernel. The second reason is that I keep having weird issues with the network. My two computers, the Microserver and the Mac Pro, are connected to an Airport Extreme base station. Since I increased the MTU on the Mac Pro, some outgoing http connections stopped responding. For example, Xcode stopped accessing my svn at the 127.0.0.1 address. I lost access to some web pages on the Mac Pro when using my external hostname: e.g., I can access http://my.hostname.net/ but accessing http://my.hostname.net/afolder/ would fail, while the apache log claims that it did serve the page. It looked like the data could not find its way back to me. Also, Wake On Demand stopped working for the Mac Pro. I was not happy with the situation. Update: Apparently, WOD does not work on the Mac Pro even after resetting the network parameters to their default values. Looks like a Lion problem (booting into 10.6.7 I get WOD back).
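A quick way to check whether jumbo frames actually survive the whole path is a don't-fragment ping; 8972 bytes of ICMP payload plus 28 bytes of headers adds up to a 9000-byte packet, and the hostname below is a placeholder:

  sudo ping -D -s 8972 microserver.local   # fails unless every hop accepts 9000-byte frames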

I also realized that my experiments measuring the network speed were somewhat flawed: I ran a copy from a disk on one machine to a disk on another machine, so the result was affected by both the network throughput and the disk speed. This time I copied a 1GB file to and from a RAM disk on the Mac Pro to eliminate the Pro’s drives from the equation.

To make a RAM disk you can use the one-liner

diskutil erasevolume HFS+ "ram disk" \
 `hdiutil attach -nomount ram://2500000`
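The ram://2500000 argument allocates 2,500,000 512-byte blocks, roughly 1.2 GB, which is enough to hold the 1GB test file.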
Table 1. Comparing network speed transfers from a RAM disk on the server to the RAM disk on the client. The values are in MB/s.

            MTU 9000   MTU 1500   9000 vs 1500
  Write     90.5       87.6       0.53%
  Read      81.1       80.7       3.33%

You can see that the improvement from MTU 9000 is less dramatic with the new kernel. The new kernel gave me a 10% speed improvement in local server IO, and I observe a similar increase in speed over the network. It looks like the new kernel is more efficient and gives the CPU more room to breathe.

You can also see that I’m getting almost 90MB/s writes and 80MB/s reads over AFP. The theoretical limit of a gigabit link is somewhere around 125MB/s. I did some experiments with iperf and saw transfer speeds around 114MB/s. So my link is running close to the theoretical maximum. However, AFP does add some noticeable overhead to the transfer.
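For reference, the iperf measurement boils down to something like this, with microserver.local standing in for the server’s address:

  # on the server
  iperf -s
  # on the client, report the results in MB/s
  iperf -c microserver.local -f M -t 30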

I have been reading TCP tuning guides on the web. Most of them suggest tweaking tcp parameters. Specifically, they recommend increasing values for net.inet.tcp.sendspace and net.inet.tcp.recvspace. The default values for these variables are set to 64K (sysctl net.inet.tcp.sendspace). I decided to raise them to 524288.

This time I measure the speed from the RAM disk on the client to the raidz array on the server:

Table 2. Comparing network speed transfers for different values of net.inet.tcp.sendspace and net.inet.tcp.recvspace. The values are in MB/s.

            default   increased   difference
  Write     48.30     57.37        9.64%
  Read      58.36     63.99       18.77%

Unfortunately, you cannot directly compare those numbers with the numbers in the old table – the experimental conditions were different. But you can see the overhead caused by using the physical drive vs. using the RAM disk.

Would changing the tcp constants increase the speed for the higher MTU value? A quick run showed that they do not affect it, but a more detailed analysis would be needed to explore that question. I’ll see if I can run more experiments. However, given the problems created by the higher MTU values, I’m inclined to go back to the default MTU setting and raise the tcp constants.

I have created an /etc/sysctl.conf file with the following lines on both the server and the client:

net.inet.tcp.sendspace=524288
net.inet.tcp.recvspace=524288
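These settings are picked up at boot. To apply them on a running system without a reboot, something like this should work:

  sudo sysctl -w net.inet.tcp.sendspace=524288
  sudo sysctl -w net.inet.tcp.recvspace=524288
  sysctl net.inet.tcp.sendspace net.inet.tcp.recvspace   # verify the new values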

There is a new legacy kernel out for 10.6.8. Let’s see if it makes a difference for my setup.

The numbers are in MB/s.

            old      new      change
  write     84.8     96.7     14.0%
  read      263.1    290.7    10.5%

Here is the bonnie++ output from the ZFS array:

Version 1.03d
       ------Sequential Output------ --Sequential Input- --Random-
       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
  300M 32849  72 81068  40 45718  38 53773  94 160056  68  2092  19
       ------Sequential Create------ --------Random Create--------
       -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   16  8128  88 12142  83  7367  93  7690  93 14005  99  7919  99

The ZFS array became significantly faster with the new kernel.

How fast can I get my data out from the server?

It turns out there is a lot of overhead and some settings can make a difference.

Reading speed from the Microserver back to my Mac Pro. The speed is in MB/s.

                                  MTU = 1500 (default)   MTU = 9000 (custom)
  rsync over ssh                  21.6
  rsync from afp-mounted share    35.8                   45.1
  cp from afp-mounted share       55.1                   61.6
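The three transfer methods boil down to commands of roughly this shape; the user, host, share, path, and file names are placeholders:

  # rsync over ssh
  rsync -av me@microserver.local:/tank/media/test.bin ~/scratch/
  # rsync from the afp-mounted share
  rsync -av /Volumes/media/test.bin ~/scratch/
  # plain cp from the afp-mounted share
  cp /Volumes/media/test.bin ~/scratch/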

Tweaking a simple thing like MTU makes a noticeable difference in speed in a local environment. I also observed that CPU load on the Microserver went down with the higher MTU setting.

Caution. To get the full effect of the change, you have to make sure the other machines on your network have the same MTU setting. And changing the MTU does affect other things. While it looks like my outside network connection was unaffected, I did observe some weird behavior locally. For example, the Xcode svn client on my desktop stopped connecting to the local svn repository using the 127.0.0.1 address. I switched to using the localhost name instead of 127.0.0.1 and everything works again.
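For reference, checking and changing the MTU from the command line looks like this. Here en0 stands in for whatever interface the machine actually uses, and a change made with ifconfig does not survive a reboot:

  ifconfig en0 | grep mtu          # show the current MTU
  sudo ifconfig en0 mtu 9000       # switch to jumbo frames
  sudo ifconfig en0 mtu 1500       # back to the default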

Introduction and experimental questions

I’m setting up a home server/NAS using Snow Leopard and MacZFS. I have an odd mixture of WD disks: 2x30EZRSDTL, 2x20EADS, and 1x20EARS. I want to join them into one large storage array, so, naturally, I am interested in what the optimal configuration would be from a performance/redundancy point of view.

I need long-term archival storage for music, photos, and videos, so I am looking at a raidz2 or possibly raidz setup. Yes, I know that neither is a sufficient replacement for a good backup strategy, but I’m setting up an offsite backup for the most important data. Potentially I can always reconstruct the rest of the information; I just don’t want to lose it to a trivial disk error. I figure that raidz2 is a much more reliable configuration, albeit a slower one. But how much slower?

My first question is: How different are raidz and raidz2 arrays in performance?

I read that giving a whole drive to ZFS improves performance, as ZFS can then make use of the disk write cache. On the other hand, setting the drives up following the MacZFS Quick Start guide (basically creating a small EFI partition on each drive) saves some hassle in getting the final pool to mount and unmount correctly.

My second question is: How different is the performance of the array when ZFS uses whole disks vs. ZFS only using the disk partitions, as recommended by the MacZFS guide?

Considering that my disks are of different sizes, giving ZFS whole drives seemed like wasting good TBs (I would have to sacrifice a TB from each 3TB drive), but I am willing to do that if the performance hit from using partitions turns out to be significant. On the other hand, it would be good to have a few relatively small slices (1TB each 8-)) for things that do not require redundancy but do require support for Spotlight.

My third question is: How different is the performance of a whole disk array vs. an array consisting of disk slices?

Finally, some of my drives, specifically the 20EARS and 30EZRSDTL, are Advanced Format (AF) drives with 4K sectors advertised as 512B sectors. I read about the sector-size problem (the ‘ashiftgate’?) and how wrong-size sectors may decrease performance. I downloaded the MacZFS source, applied the ashift patch suggested by JasonRM on the MacZFS mailing list, and compiled myself a copy of the zpool tool with ashift=12 hardcoded. Now I can use that version of zpool (zpool12) to initialize my pools.
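As an illustration, creating a test pool with the patched binary looks something like this; the pool name testpool and the disk identifiers are placeholders, and zpool12 is simply my ashift=12 build of zpool:

  zpool12 create -f testpool raidz2 /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4 /dev/disk5
  zdb -C testpool | grep ashift    # the vdev configuration should report ashift: 12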

My last question is: How different is the performance of a pool created with ashift 9 vs. one with ashift 12?

Experimental setup

The test system is an HP ProLiant MicroServer with an AMD Athlon II NEO processor @ 1.30 GHz and 4GB of DDR3 ECC memory, running OSX 10.6.8 and MacZFS 74.1.0. The disk write cache is enabled in the BIOS. At least for now, the kernel runs in 32-bit mode.

To test pools of different configurations I ran

  • dd bs=1m count=10000 if=/dev/zero of=test
  • dd bs=1m count=10000 of=/dev/null if=test

3 times on each configuration and averaged the results. Each run moves about 10 GB (10,000 blocks of 1 MB), well above the 4GB of RAM in the box, so the numbers should reflect disk throughput rather than caching.

I considered 3 arrays:

  • 5 drives: 2x30EZRSDTL, 2x20EADS, and 1x20EARS
  • 4 drives: 2x30EZRSDTL and 2x20EADS
  • 3 drives: 2x20EADS and 1x20EARS

and also looked at individual drives. The speed numbers are in MiB/s.

Results

Here I summarize the findings. Let’s start with the most dramatic result. Question 4: How different is the performance of a pool created with ashift 9 vs one with ashift 12?

With the AF drives (WD 20EARS, 30EZRSDTL), having ashift=12 (4K sectors) enabled is a must.

  drive       operation   ashift=9   ashift=12   change
  30EZRSDTL   Write        76.04      81.85       7.6%
              Read        120.43     118.70      -1.4%
  20EADS      Write        66.78      67.54       1.1%
              Read         95.97      95.81      -0.2%
  20EARS      Write        55.36      71.05      28.3%
              Read         99.74      98.97      -0.8%
  Table 1. Single drive pool. ZFS owns the whole disk.

A single drive pool’s write speed increased by 28.3% for the 20EARS and by 7.6% for the 30EZRSDTL when the pool was created with ashift=12. The write speed was not affected for a single 20EADS. The read speed was not affected for any of the drives.

For the multi-disk array configurations the effect was even more pronounced.

  disk count   operation   ashift=9   ashift=12   change
  5            Write        79.04     132.74      67.9%
               Read        175.76     262.01      49.1%
  4            Write        86.12      98.21      14.0%
               Read        172.72     202.69      17.4%
  3            Write        58.26      94.25      61.8%
               Read         82.22     114.12      38.8%
  Table 2. raidz. ZFS owns whole disks.
  disk count   operation   ashift=9   ashift=12   change
  5            Write        12.03      92.40      668.4%
               Read         46.18     232.11      402.6%
  4            Write        50.68      71.18       40.5%
               Read        126.17     170.74       35.3%
  3            Write        38.79      58.65       51.2%
               Read        116.38     113.68       -2.3%
  Table 3. raidz2. ZFS owns whole disks.

The performance benefits range from 14% to 668% depending on the configuration. Apparently there is a huge performance hole when my collection of disks is joined into a raidz2 array with ashift=9: write and read speeds are abysmal at 12 MiB/s and 46 MiB/s. When using zpool12 to initialize the storage, I observe 92 and 232 MiB/s for the same configuration.

The one exception is the read speed of the 3-drive raidz2 array (who would want to build this anyway?), which did not change significantly between ashift 9 and 12.

Question 1: How different are raidz and raidz2 arrays in performance?

  disk count   operation   raidz     raidz2    change
  5            Write       132.74     92.40    -30.39%
               Read        262.01    232.11    -11.41%
  4            Write        98.21     71.18    -27.52%
               Read        202.69    170.74    -15.76%
  3            Write        94.25     58.65    -37.77%
               Read        114.12    113.68     -0.38%
  Table 5. ashift=12. ZFS owns whole disks.

The table shows that there is about a 30% drop in write speed when going from raidz to raidz2 for 5 drives. That’s OK. I can live with it; I’m not going to write to the storage that often. There is also an 11% drop in read speed. That’s OK too, assuming the speed stays above what the 1Gb link can deliver, since my network is going to be the system’s bottleneck anyhow. Of course, the speed will drop as the drives fill up, but I cannot test that right now.

Question 2: How different is the performance of the array when ZFS uses whole disks vs. ZFS only using the disk partitions as recommended by the MacZFS guide? and Question 3: How different is the performance of a whole disk array vs. an array consisting of disk slices?

  disk count   operation   whole drive   full-disk partition   change vs whole drive   2TB partition   change vs whole drive
  5            Write         92.40         90.98               -1.54%                   87.88          -4.90%
               Read         232.11        222.66               -4.07%                  223.28          -3.81%
  4            Write         71.18         70.72               -0.65%                   68.63          -3.59%
               Read         170.74        173.32                1.51%                  168.61          -1.25%
  3            Write         58.65         57.92               -1.25%
               Read         113.68        112.17               -1.33%
  Table 6. raidz2, ashift=12.

I see that I am taking about a 5% hit on writing and a 4% hit on reading when going from whole-disk vdevs to 2TB partitions as vdevs. But I’m saving 2TB of space in the process. I’ll take it. Did I say it’s a server on a budget?

My final configuration uses 6 disks (I have one more 30EZRSDTL; I could not experiment with it because at the time it held all my data). Now that the system is complete, there is a raidz2 array of 6x2TB partitions. The write and read speeds of the array at 25% capacity are 80.9 and 250.9 MiB/s.

A note of caution. Firstly, the performance numbers reflect the sustained read/write speed of the array; I cannot draw any conclusions about random read/write performance, but given the nature of the storage, that’s less important. Secondly, I have an odd collection of disks and an odd system, so my conclusions may not transfer accurately to another setup (YMMV). I thought I would share them to provide a perspective on a real-life application of MacZFS.

Here is the output from bonnie++

Version 1.03d

     ------Sequential Output------ --Sequential Input- --Random-
     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
300M 17916  32 82532  44 46038  44 47504  88 160887  71  1359  15
     ------Sequential Create------ --------Random Create--------
     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  16  7718  96 32695  99 10259  98  7863  90 +++++ +++ 10386  97