When I started my current role a few months ago, I was very interested to learn that direction had been set to migrate away from ASM and onto NFS storage that had some read flash cache in front of it. I'm not the world's biggest fan of ASM and the Grid Infrastructure management overhead it requires, but once it's up and running it has been fairly solid for me. However, I do like to have direct access to the files (although that can be dangerous as well), and so I was excited to see how this would go.

However, the configuration was not quite as straightforward as we had assumed. In this post I'd like to walk through the points where we went off course and fill in some of the gaps that I couldn't find covered in the documentation or other blog posts.

Direct NFS

I won't explain what Oracle Direct NFS is, other than to say that you want it if you're using NFS to host your datafiles and/or backups. Furthermore, if you're going to use Direct NFS, you'll want to review MOS Doc ID 1495104.1 for a list of recommended patches; these can yield some great performance improvements in addition to fixing a few known bugs. The NFS definitions in /etc/fstab are still required, and for the NFS mount options I'll refer you to MOS Doc ID 359515.1.
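
To make that concrete, an /etc/fstab entry for a datafile share might look roughly like the line below. The server name, export, and mount point are placeholders, and the options are only in the spirit of what the note recommends for Linux; take the exact options for your platform and use case from the note itself.

nfs-server:/export/oradata  /mnt/oradata  nfs  rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0  0 0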

There are also plenty of other blog posts detailing how to enable DNFS and configure your oranfstab. I won't repeat that info here. I want to focus on the issue that I specifically ran into.


Splitting the Pools


Our NFS solution would have two different pools of disk handled by different controllers. One pool would be for the faster 10K RPM disks for production, and the other for the slower 7200 RPM disks for development, staging, backups, etc. We had Infiniband connections going from the database host to the NFS servers on a private network dedicated just to this traffic. We'll refer to these servers by their DNS entries, fast-ib and slow-ib.

The idea then is for the Direct NFS connections to use those dedicated Infiniband pipes to access the NFS shares we need. In the case where we want to copy production datafiles to a share for staging, we'll have mounts from both NFS servers on the database host. This is where I learned a couple of things about Linux routing ...

The NFS server Infiniband ports were configured like this:

  • fast-ib: 192.168.200.1
  • slow-ib: 192.168.200.2

The DB host had two dual-port Infiniband cards as well, configured like this:

  • ib0: 192.168.200.100
  • ib1: 192.168.200.101
  • ib2: 192.168.200.102
  • ib3: 192.168.200.103

The idea was to have ib0 be our conduit to fast-ib, and ib1 our conduit to slow-ib. So our /etc/oranfstab was configured as follows, based on what I had seen in other blog posts about oranfstab and Direct NFS:

server: fast-ib
local: 192.168.200.100
path: 192.168.200.1
export: /export/fast mount:/mnt/fast

server: slow-ib
local: 192.168.200.101
path: 192.168.200.2
export: /export/slow mount:/mnt/slow

However, what we saw was that any attempt to mount a database from the slow shares would hang and then core dump. Commenting out the local specification fixed this, but then everything went out of ib3 on the DB host, meaning we didn't have the segregation of traffic that we wanted.

At first I thought it was some complication with Direct NFS and oranfstab syntax that I was missing, as I read and re-read the Oracle documentation and various blog posts. It turns out it was something a bit more basic.

After putting some chum out via Twitter, I got a big bite from Freek d'Hooge, who quickly recognized it as a common Linux routing comprehension (or lack thereof, on my part) issue. To highlight the issue for those skimming this article:

On Linux, if you want to have segregated traffic on separate network interfaces (not load-balancing), those interfaces must be on separate subnets.

In our case, everything was in the same 192.168.200.0/24 subnet, and so all of the traffic could only use one interface. It was using ib3 because that interface was listed first in the routing table, visible by running "route -n":

# route -n | grep "Destination\|ib"

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 ib3
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 ib0
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 ib1
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 ib2


Offlining ib3 (with ifdown) would just cause it to pick the next interface in that subnet from the routing table, which in our case would have been ib0. Freek also mentioned that with dNFS on Unix you can add the dontroute clause to oranfstab to avoid this, but on Linux this option does not work.
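
A quick way to see which interface the kernel will actually choose for a given destination is "ip route get". The output below is only an illustration of what we would have seen while ib3 still owned the first matching route; the exact format varies by kernel version:

# ip route get 192.168.200.2
192.168.200.2 dev ib3  src 192.168.200.103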

Adding to the original misery, our oranfstab file was trying to force traffic onto local interfaces that weren't routing any traffic (ib0 and ib1). When we commented out the "local" lines in oranfstab, Direct NFS just used whatever the route table gave it for the subnet it needed.

Armed with this new knowledge, we changed our NIC configurations, moving the slow-ib traffic to a different subnet (a sketch of the interface change itself follows the lists below):

The NFS server ports are now configured like this:

  • fast-ib: 192.168.200.1
  • slow-ib: 192.168.201.2

The DB interfaces are now configured like this:

  • ib0: 192.168.200.100
  • ib1: 192.168.201.101
  • ib2: [offline]
  • ib3: [offline]
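
The interface change itself is just ordinary NIC configuration. On a RHEL-style system (an assumption on my part) the ib1 config file would end up looking something like this sketch, with everything other than the IP address and netmask being illustrative:

# /etc/sysconfig/network-scripts/ifcfg-ib1 (illustrative sketch)
DEVICE=ib1
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.201.101
NETMASK=255.255.255.0
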
We could choose to bond the four IB interfaces in pairs to increase bandwidth, but for the time being we only had one IB port active on each NFS server anyway, so the bottleneck would be the same. Moving on, though, this is what the oranfstab looks like in our working configuration:

server: fast-ib
local: 192.168.200.100
path: 192.168.200.1
export: /export/fast mount:/mnt/fast

server: slow-ib
local: 192.168.201.101
path: 192.168.201.2
export: /export/slow mount:/mnt/slow

Since we disabled ib2 and ib3, we now know for certain which interface will be used for each server. This means we also now know the IP address and can specify that in the "local" parameter line. We could have left ib2 & ib3 enabled and relied again on default Linux routing, but I prefer to eliminate variables whenever possible.
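
If you want to verify the pairing from the database side rather than just trusting the routing table, the Direct NFS views show which local and remote addresses each channel is actually using. A minimal check, run as the Oracle software owner against an instance that has done some I/O through dNFS, and assuming v$dnfs_channels exposes the svrname, path, and local columns (it does in the versions I've worked with), would be something like:

sqlplus -s / as sysdba <<EOF
select distinct svrname, path, local from v\$dnfs_channels;
EOF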

So now we have our production database files on /mnt/fast and our slower-is-somewhat-acceptable database files for staging and development on /mnt/slow. Traffic to each one is on a dedicated Infiniband port that is routed separately from the other.

Of course, it hasn't been all milk and honey. Next week I'll be back with some of the issues we've encountered actually using Infiniband, and the adjustments and fixes that were made to overcome them.