Readers of an earlier post on this blog will know about my latest forays into the world of Direct NFS. Part of that means stumbling over configuration hiccups or slamming into brick walls when you find new bugs.

To quickly re-set the table, my organization purchased the Oracle ZFS Storage Appliance (ZFSSA) 7420. Oracle sold us on the Infiniband connectivity as a way to make a possible future transition to Exadata easier. However, the pre-sales POC testing was done over 10Gb Ethernet (10gigE). So it was that everything (including their Infiniband switches and cables) arrived at the datacenter and was installed and connected by the Oracle technicians. There were a few initial hiccups and frustrating inconsistencies with their installation and configuration, but those are outside the scope of this post.

We decided to put a copy of our standby database on the ZFSSA and have it run as a second standby. The performance problems were quick to appear, and they weren't pretty.



Configuring Shares

We configured the ZFS project shares according to the common Oracle best practices for ZFS recordsize and write bias. For example, datafile shares were set to an 8k recordsize (to match the db_block_size) and throughput write bias, whereas redo log shares were set to a 128k recordsize and latency bias. Note that with Oracle Database 12c, Direct NFS over NFSv4, and the more recent ZFSSA firmware, you gain the benefit of Oracle Intelligent Storage Protocol (OISP), which will determine the recordsize and write bias automatically based on the type of file it recognizes.
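For reference, here's roughly how those settings map onto standard ZFS properties. On the ZFSSA itself they're set per share through the BUI or appliance CLI, so treat this as an illustrative sketch; the pool and share names are made up:

    # Datafile shares: 8k records to match db_block_size, throughput write bias
    zfs set recordsize=8k logbias=throughput pool0/oradata

    # Redo log shares: 128k records, latency bias for synchronous writes
    zfs set recordsize=128k logbias=latency pool0/redo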

Copying the Database

To start out, we needed to get a copy of the database onto the ZFSSA shares. This was easily done with RMAN's backup as copy database command, specifying the ZFSSA mount as the format destination. We were fairly impressed with the Direct NFS transfer speed during the copy, and so we were optimistic about how it would stand up to our production load.
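The copy itself was nothing fancy; it looked roughly like this (the mount point below is just a placeholder for our ZFSSA share):

    RMAN> BACKUP AS COPY DATABASE FORMAT '/zfssa/oradata/%U';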

Starting Recovery!

Once everything was set, we started managed recovery on the standby. Our earlier excitement gave way to a sort of soul-crushing disappointment as recovery performance ground to a halt and traffic to the ZFSSA went from hundreds of Mbps to barely a trickle. We could stop recovery and copy a big file with great speed, but something in managed recovery was not playing nicely.
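For the record, starting managed recovery was just the standard Data Guard incantation, something along these lines:

    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
           USING CURRENT LOGFILE DISCONNECT FROM SESSION;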

We found that we could disable Direct NFS (which requires a database restart and software relinking), and managed recovery would actually perform better over kernel NFS, although still not nearly as well as we would need.
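If you want to try the same comparison, Direct NFS is toggled by relinking the Oracle binary (with the database shut down), roughly like so:

    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk dnfs_off   # disable Direct NFS, fall back to kernel NFS
    make -f ins_rdbms.mk dnfs_on    # re-enable Direct NFS later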

This started a blizzard of SR creations, including SRs being spawned from other SRs. We had SRs open for the ZFSSA team, the Direct NFS team, the Data Guard team, and even the Oracle Linux and Solaris teams, even though we were not on Oracle Linux or Solaris (we use RHEL). It came to a point where I had to tell our account manager to have support stop creating new SRs, since every new SR meant I had to explain the situation to a new technician all over again.

At this point we were having twice-daily conference calls with our account manager and technical leads from the various departments. Their minions were working hard on their end to replicate the problem and find a solution, but we were running into a 4th week of this craziness.

The Infiniband Bandit

After many frustrating weeks of changing configurations, cables, cards, and just generally grasping at straws, it was finally narrowed down to the Infiniband. Or rather, to a bug in the OpenFabrics (OFA) Linux kernel module that handles Infiniband, triggered when Direct NFS fires off a whole lot of connections at once, like when Data Guard managed recovery spins up 80 parallel recovery slaves. We tested the 10gigE channel we had for the management UI and performance was like night and day with just the one channel.

Oracle Support suggested it might be related to bug 15824316, which also deals with dramatic performance loss with Direct NFS over Infiniband. The bug in the OFA kernel module was fixed in recent versions of Oracle Enterprise Linux (OEL), specifically in the UEK kernel, but Oracle is not sharing this fix with Red Hat (or anyone else, presumably). Since we're on RHEL, we had little choice but to send all the Infiniband networking hardware back and order up some 10gigE replacements.

We're still in the process of getting the 10gigE switches and cables all in place for the final production setup. If you're curious, it's four 10gigE cards per server, bonded to a single IP and connected to a 10gigE switch that feeds the ZFSSA heads. This 10gigE network is dedicated exclusively to ZFSSA traffic.
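For illustration, the Direct NFS oranfstab entry over that dedicated network looks something like this. The server name, addresses, and paths are placeholders; local is the bonded IP on the database server and path is the ZFSSA head:

    server: zfssa-head1
    local: 192.168.50.10
    path: 192.168.50.100
    export: /export/oradata mount: /zfssa/oradata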

So, in the end, if you're on a recent version of OEL/UEK, you should have nothing to worry about. But if you're on RHEL and planning to use Direct NFS, you're going to want to use 10gigE and NOT Infiniband.

Update - 6 Mar 2014

Some of you have asked, and I want to reiterate: Oracle has claimed that the OFA module was entirely re-written, and their fix is specific to OEL and is not covered by GPL or any similar license. We were told that they have no plans to share their code with RHEL. Also, there is no MOS bug number for the OFA issue; the module was apparently re-written from scratch with no bug to track the issue. If this all sounds rather dubious to you, join the club. But it's what our account manager told us at the end of last year.

Another Update - 6 Mar 2014

Bjoern Rost and I discussed this privately, and after quite a bit of research and analysis he shared this conclusion:

Oracle Support suggested that this issue would not occur with the OFA module used in Oracle Linux with the UEK kernel. RedHat changed their support in RHEL6 from shipping the whole openfabrics stack to just including the drivers that were also present in the upstream mainline kernel. This is RedHat’s policy to ensure stability in the version of the kernel they ship. Oracle offers an OFA package with some additional patches (all GPL/BSD license) for the UEKr1 and UEKr2 kernels. Unfortunately, these two different approaches make it very hard to pinpoint specific patches or create backports for the RedHat kernel version.