ZFS

Reading the ZEVO forums at greenbytes.com I ran into a nifty way to solve most of my ZFS-over-AFP sharing problems. To recap: I have a few ZFS filesystems defined on my server pool. When I enable file sharing, only some of those filesystems show up as volumes on the network, and there is no clear indication of why some do and others do not. From reading comments around the web, it looks like Apple is now hardcoding some HFS file system properties into the software, which relegates alternative file systems to the role of second-class citizens and breaks functionality (including AFP sharing) for them. Bad Apple.

The solution is to trick the AFP server into believing that every file system it sees is HFS: one small library installed on the server, and now all volumes show up in the AFP network browser on the client machine.

Ten’s Complement, a company that has been quietly bringing ZFS on Mac OS X to a level competitive with other platforms, finally announced their product. It’s called ZEVO and comes at four different price points, each with a different set of technological limitations. The basic edition offers only metadata redundancy, limits the pool size to 3TB, and costs $30. The most advanced edition offers all the ZFS bells and whistles and includes a GUI admin interface, but limits the pool size to 20TB. The price for the latter has not been announced, but it will probably be in the $150 range.

I’m not an expert in marketing, but their strategy seems flawed to me. The silver (basic) and gold (the next one after basic) packages appear to target people who want a bit more reliability from their file system but do not have much data to deal with. If I were in that category (and I was in that category six months ago), I would say that a backup drive and Time Machine are good enough for me. ZFS is not a replacement for a good backup anyway, and I’d be wary of spending more money on an add-on to the OS than I spend on the whole OS. So Ten’s Complement would have to make a lot of effort to convince people to buy into this, and the entry price is steep for what ZEVO offers.

The other two packages are for people with larger data collections. These people already know about ZFS and its advantages; there is no need to convince them. But they are a demanding lot: they know what the ZFS spec offers and what ZFS delivers on other platforms. Limiting the storage size to 16 and 20TB seems like an artificial bar that will alienate them, not to mention that this limit is lower than what some of those potential customers already have in their servers. I have 15TB of disk space on 6 drives, so I’m pushing that limit. If I were to upgrade to 4TB drives, I’d be well beyond the top limit.

ZEVO is not for sale yet: it is listed as “available in early 2012”. If I were buying now, I would have to go for the platinum package, but they have not announced its price. For now the free MacZFS option seems much more attractive.

There is a new legacy kernel out for 10.6.8. Let’s see if it makes a difference for my setup.

The numbers are in MB/s.

       old     new     change
write  84.8    96.7    14.0%
read   263.1   290.7   10.5%
Version 1.03d
       ------Sequential Output------ --Sequential Input- --Random-
       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
  300M 32849  72 81068  40 45718  38 53773  94 160056  68  2092  19
       ------Sequential Create------ --------Random Create--------
       -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   16  8128  88 12142  83  7367  93  7690  93 14005  99  7919  99

The ZFS array became significantly faster with the new kernel.
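For reference, a bonnie++ run along these lines would produce output in the format shown above; the mountpoint and the -r override are assumptions on my part, not the exact command I used.

    # 300MB test file and 16x1024 small files in a directory on the pool.
    # -r overrides the detected RAM size, since bonnie++ wants the test
    # file to be at least twice the size of the installed RAM.
    bonnie++ -d /Volumes/tank -s 300 -r 150 -n 16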

Introduction and experimental questions

I’m setting up a home server/NAS using Snow Leopard and MacZFS. I have an odd mixture of WD disks: 2x30EZRSDTL, 2x20EADS, and 1x20EARS. I want to join them into one large storage array, so naturally I am interested in what the optimal configuration would be from a performance/redundancy standpoint.

I need long-term archive storage for music, photos, and videos, so I am looking at a raidz2 or possibly raidz setup. Yes, I know that neither is a sufficient replacement for a good backup strategy, but I’m setting up an offsite backup for the most important data. I can potentially reconstruct the rest of the information; I just don’t want to lose it to a trivial disk error. I figure that raidz2 is a much more reliable configuration, albeit a slower one. But how much slower?

My first question is: How different are raidz and raidz2 arrays in performance?

I read that giving a whole drive to ZFS improves performance, as ZFS can then make use of the disk write cache. On the other hand, setting the drives up following the MacZFS Quick Start guide (basically creating a small EFI partition on each drive) saves some hassle in getting the final pool to mount and unmount correctly.
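For reference, here is roughly what the two setups look like on the command line; the device and pool names are placeholders, and the diskutil line follows the pattern from the MacZFS guide.

    # Option 1: whole-disk vdevs -- hand the raw devices to zpool and let ZFS label them.
    sudo zpool create -f tank raidz /dev/disk2 /dev/disk3 /dev/disk4

    # Option 2: guide-style -- GPT-partition each drive first (this is what creates
    # the small EFI slice), then build the pool out of the resulting ZFS slices.
    sudo diskutil partitionDisk /dev/disk2 GPTFormat ZFS %noformat% 100%
    sudo zpool create -f tank raidz /dev/disk2s2 /dev/disk3s2 /dev/disk4s2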

My second question is: How different is the performance of the array when ZFS uses whole disks vs. only the disk partitions recommended by the MacZFS guide?

Considering that my disks are of different sizes, giving ZFS whole drives seemed like a waste of good TBs: since a raidz vdev can only use as much of each member as its smallest one offers, I would have to sacrifice a TB from each 3TB drive. I am willing to do that if the performance hit from partitioning turns out to be significant. On the other hand, it would be nice to have a few relatively small slices (1TB each 8-)) for things that do not require redundancy but do require Spotlight support.
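A possible way to split one of the 3TB drives along those lines (the device identifier, slice sizes, and names are illustrative):

    # A 2TB slice for the raidz2 vdev plus the remainder ("R") as plain HFS+
    # for non-redundant, Spotlight-friendly storage; diskutil adds the small
    # EFI slice on its own.
    sudo diskutil partitionDisk /dev/disk2 GPTFormat \
        ZFS %noformat% 2T \
        JHFS+ Scratch R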

My third question is: How different is the performance of a whole-disk array vs. an array consisting of disk slices?

Finally, some of my drives, specifically the 20EARS and 30EZRSDTL, are Advanced Format (AF) drives with 4K sectors advertised as 512B sectors. I read about the sector size problem (the ‘ashiftgate’?) and how the wrong sector size may degrade performance. I downloaded the MacZFS source, applied the ashift patch suggested by JasonRM on the MacZFS mailing list, and compiled myself a copy of the zpool tool with ashift=12 hardcoded. Now I can use that version of zpool (zpool12) to initialize my pools.
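A sketch of how the patched tool gets used (the pool name, device paths, and the binary’s location are placeholders):

    # Create the pool with the locally built zpool that hardcodes ashift=12;
    # the stock zpool picks ashift=9 because the AF drives report 512B sectors.
    sudo ./zpool12 create -f tank raidz2 \
        /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4 /dev/disk5
    # Verify which ashift the vdevs ended up with (expect "ashift: 12").
    sudo zdb -C tank | grep ashift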

My last question is: How different is the performance of a pool created with ashift=9 vs. one with ashift=12?

Experimental setup

The test system is an HP ProLiant MicroServer with an AMD Athlon II Neo processor @ 1.30 GHz and 4GB of DDR3 ECC memory, running Mac OS X 10.6.8 and MacZFS 74.1.0. The disk write cache is enabled in the BIOS. At least for now, the kernel runs in 32-bit mode.

To test pools of different configurations I ran

  • dd bs=1m count=10000 if=/dev/zero of=test
  • dd bs=1m count=10000 of=/dev/null if=test

3 times on each configuration and averaged the results.
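In script form, one measurement round looks roughly like this; the mountpoint is a placeholder for wherever the pool under test is mounted.

    # Three write/read rounds; note the transfer rates dd reports and average them.
    cd /Volumes/tank                                # assumed mountpoint of the pool under test
    for run in 1 2 3; do
        dd bs=1m count=10000 if=/dev/zero of=test   # sequential write, ~10GB from /dev/zero
        dd bs=1m count=10000 of=/dev/null if=test   # sequential read of the same file
        rm -f test
    done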

I considered 3 arrays:

  • 5 drives: 2x30EZRSDTL, 2x20EADS, and 1x20EARS
  • 4 drives: 2x30EZRSDTL and 2x20EADS
  • 3 drives: 2x20EADS and 1x20EARS

and also looked at individual drives. The speed numbers are in MiB/s.

Results

Here I summarize the findings. Let’s start with the most dramatic result. Question 4: How different is the performance of a pool created with ashift=9 vs. one with ashift=12?

With the AF drives (WD 20EARS, 30EZRSDTL), enabling ashift=12 (4K sectors) is a must.

drive      operation  ashift=9  ashift=12  change
30EZRSDTL  Write         76.04      81.85    7.6%
           Read         120.43     118.70   -1.4%
20EADS     Write         66.78      67.54    1.1%
           Read          95.97      95.81   -0.2%
20EARS     Write         55.36      71.05   28.3%
           Read          99.74      98.97   -0.8%
Table 1. Single-drive pool. ZFS owns the whole disk.

The single-drive pool write speed increased by 28.3% for the 20EARS and by 7.6% for the 30EZRSDTL when the pool was created with ashift=12. The write speed was not affected for a single 20EADS, and the read speed was not affected for any of the drives.

For the multi-disk array configurations the effect was even more pronounced.

disks  operation  ashift=9  ashift=12  change
5      Write         79.04     132.74   67.9%
       Read         175.76     262.01   49.1%
4      Write         86.12      98.21   14.0%
       Read         172.72     202.69   17.4%
3      Write         58.26      94.25   61.8%
       Read          82.22     114.12   38.8%
Table 2. raidz. ZFS owns whole disks.

disks  operation  ashift=9  ashift=12  change
5      Write         12.03      92.40  668.4%
       Read          46.18     232.11  402.6%
4      Write         50.68      71.18   40.5%
       Read         126.17     170.74   35.3%
3      Write         38.79      58.65   51.2%
       Read         116.38     113.68   -2.3%
Table 3. raidz2. ZFS owns whole disks.

Performance benefits range from 14% to 668% depending on the configuration. Apparently there is a huge performance hole when my collection of disks is joined into a raidz2 array with ashift=9: write and read speeds are abysmal at 12 MiB/s and 46 MiB/s. When using zpool12 to initialize the storage I observe 92 and 232 MiB/s for the same configuration.

The one exception is the read speed of the 3-drive raidz2 array (who would want to build this anyway?), which did not change significantly between ashift=9 and ashift=12.

Question 1: How different are raidz and raidz2 arrays in performance?

disks  operation   raidz   raidz2   change
5      Write      132.74    92.40  -30.39%
       Read       262.01   232.11  -11.41%
4      Write       98.21    71.18  -27.52%
       Read       202.69   170.74  -15.76%
3      Write       94.25    58.65  -37.77%
       Read       114.12   113.68   -0.38%
Table 5. ashift=12. ZFS owns whole disks.

The table shows that there is about a 30% drop in write speed when going from raidz to raidz2 with 5 drives. That’s OK, I can live with it; I’m not going to write to the storage that often. There is also an 11% drop in read speed. That’s OK too, as long as the speed stays above gigabit rates (roughly 125 MB/s), since my network is going to be the system’s bottleneck anyhow. Of course, the speed will drop as the drives fill up, but I cannot test that right now.

Question 2: How different is the performance of the array when ZFS uses whole disks vs. only the disk partitions recommended by the MacZFS guide? And Question 3: How different is the performance of a whole-disk array vs. an array consisting of disk slices?

                   whole   full-disk   change vs       2TB      change vs
disks  operation   drive   partition   whole drive   partition  whole drive
5      Write       92.40       90.98   -1.54%            87.88  -4.90%
       Read       232.11      222.66   -4.07%           223.28  -3.81%
4      Write       71.18       70.72   -0.65%            68.63  -3.59%
       Read       170.74      173.32    1.51%           168.61  -1.25%
3      Write       58.65       57.92   -1.25%
       Read       113.68      112.17   -1.33%
Table 6. raidz2, ashift=12.

I see that I am taking about a 5% hit on writes and a 4% hit on reads when going from whole-disk vdevs to 2TB partitions as vdevs. But I’m saving 2TB of space in the process. I’ll take it. Did I say it’s a server on a budget?

My final configuration uses 6 disks: I have one more 30EZRSDTL that I could not experiment with because at the time it held all my data. Now that the system is complete, there is a raidz2 array of 6x2TB partitions. The write and read speeds of the array at 25% capacity are 80.9 and 250.9 MiB/s.
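For completeness, the final pool can be built along these lines; the slice identifiers are an assumption (after GPT partitioning the ZFS slice typically ends up as s2, behind the hidden EFI slice), as is the pool name.

    # Six 2TB slices, one per drive, in a single raidz2 vdev, created with the
    # ashift=12 zpool build.
    sudo ./zpool12 create -f tank raidz2 \
        /dev/disk1s2 /dev/disk2s2 /dev/disk3s2 \
        /dev/disk4s2 /dev/disk5s2 /dev/disk6s2
    zpool status tank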

A note of caution. First, the performance numbers reflect the sustained read/write speed of the array; I cannot draw any conclusions about random read/write performance. But given the nature of the storage, that is less important. Second, I have an odd collection of disks and an odd system, so my conclusions may not transfer accurately to another setup; YMMV. I thought I would share them to provide a perspective on a real-life application of MacZFS.

Here is the output from bonnie++

Version 1.03d

     ------Sequential Output------ --Sequential Input- --Random-
     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
300M 17916  32 82532  44 46038  44 47504  88 160887  71  1359  15
     ------Sequential Create------ --------Random Create--------
     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  16  7718  96 32695  99 10259  98  7863  90 +++++ +++ 10386  97