Originally posted on The Pythian Group blog.

Full credit for this tale should go to my colleague Christo Kutrovsky for the inspiration and basic plan involved.


We recently migrated a large database from Solaris SPARC to Solaris x86-64. All seemed to go well with the migration, but in the next few weeks, we noticed some I/O issues cropping up. Some research led us to find that the ZFS filesystem used to hold the datafiles was killing us on I/O. The default “recordsize” setting for ZFS was 128k.

$ /usr/sbin/zfs get recordsize zfs-data
NAME PROPERTY VALUE SOURCE
zfs-data recordsize 128K default


An Oracle database typically uses 8k for the block size, but in this case it was 16k. We saw basically the same thing that Neelakanth Nadgir described in his blog post, Databases and ZFS:
With ZFS, not only was the throughput much lower, but we used more [than] twice the amount of CPU per transaction, and we are doing 2x times the IO. The disks are also more heavily utilized. We noticed that we were not only reading in more data, but we were also doing more IO operations [than] what is needed.
The fix is to set the ZFS recordsize for a datafile filesystem to match the Oracle instance’s db_block_size. We also read in the ZFS Best Practices Guide that redo logs should be in a separate filesystem with the default ZFS recordsize of 128k. We already had them separate, so we just needed to get our datafiles on a ZFS filesystem with a 16k recordsize.



The first step is to change the ZFS recordsize for the current filesystem. Easy enough, but the change in recordsize will only apply to new files written there. It doesn’t do anything for our existing terabyte-plus of datafiles. This means we need to move the files to a new ZFS filesystem with the correct recordsize already set. That’s where the magic happens.

My first thought was that we’d need whole other slab of disk. Then I learned that we can simply create a new ZFS filesystem on the same pool of disks as the old filesystem. Obviously we won’t be able to hold two complete copies of the datafiles, but this would let us migrate groups of files at a time (for example, a tablespace). When the move is complete, disk space is released from the old filesystem, so the sum of used disk is essentially unchanged.

The rest is just a matter of moving datafiles. Now, we all know you can’t just move or copy Oracle datafiles from a running instance. We also know that clients want as little downtime as possible, so doing a complete shutdown/move/mount/rename/open is out of the question. We could offline a tablespace, move its datafiles, rename them in the database, then online the tablespace again. Here’s a sample scenario, starting with tablespace and datafile creation:

SQL> create tablespace zfstest
2 datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf'
3 size 128m;


Tablespace created.


SQL> alter tablespace zfstest
2 add datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
3 size 128m;


Tablespace altered.


SQL> create table zfsobjects tablespace zfstest
2 as select * from dba_objects;


Table created.


Accept the following:
  • /u01/app/oracle/oradata/orcl/zfs128k/ is a ZFS filesystem with default 128k recordsize
  • /u01/app/oracle/oradata/orcl/zfs16k/ is a ZFS filesystem with a 16k recordsize
So here we have datafiles created on a ZFS filesystem with the default 128k recordsize, and a small table just to test data access. To move the entire tablespace in one go, we do this:

SQL> alter tablespace zfstest offline;


Tablespace altered.


SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf


SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' to
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';


Database altered.


SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf


SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' to
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';
Database altered.


SQL> alter tablespace zfstest online;


Tablespace altered.


SQL> select count(*) from zfsobjects;


COUNT(*)
----------
68844


Pretty simple. We do the one escape to the OS mv command to move the datafile. If you want to be safe, you could use cp instead of mv to copy the file. Just know that disk space from the old filesystem won’t be released until you delete the old files.

Note that this operation could take a very long time in the case of large tablespaces. We could try a somwhat similar approach of taking one datafile offline at a time and moving it. We’d just need to perform a quick recovery before bringing it online. That would look like this:

SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' offline;


Database altered.


SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf


SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' to
'/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';


Database altered.


SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';
Media recovery complete.


SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf' online;


Database altered.


SQL> alter database datafile
'/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' offline;


Database altered.


SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf


SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';


Database altered.


SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';
Media recovery complete.


SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k
/zfstest02.dbf' online;


Database altered.


SQL> select count(*) from zfsobjects;


COUNT(*)
----------
68844


Seems fine. Note that your instance needs to be in ARCHIVELOG mode to be able to take datafiles offline. Obviously, if someone tried to access data on an offline datafile they’d get this error:

SQL> select count(*) from zfsobjects;
select count(*) from zfsobjects
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf'


So if a datafile is particularly large, this can mean still a few minutes of some data being offline. We have one more tool at our disposal that provides for a quicker turnaround: RMAN’s BACKUP AS COPY command. As it says, this creates a copy of a datafile as a backup. The benefit is that it does this while the current datafile (and data) remains online. We then just need to offline, switch, recover, and online each datafile, which takes very little time compared to the time spent moving a large datafile. That process would be something like this.

First, we want to know the file numbers of our datafiles. This is because the RMAN procedure doesn’t provide for an easy way to preserve datafile names when we backup an entire tablespace (which we’re going to do):

SQL> select file_id, file_name
2 from dba_data_files
3 where tablespace_name='ZFSTEST';


FILE_ID FILE_NAME
---------- -------------------------------------------------------
6 /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
7 /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf


So we have file numbers 6 and 7. Now we create our RMAN command file and execute it:

RMAN> backup as copy
2> tablespace zfstest format '/u01/app/oracle/oradata/orcl/zfs16k/%U';
3>
4> sql 'alter database datafile 6 offline';
5> switch datafile 6 to copy;
6> recover datafile 6;
7> sql 'alter database datafile 6 online';
8>
9> sql 'alter database datafile 7 offline';
10> switch datafile 7 to copy;
11> recover datafile 7;
12> sql 'alter database datafile 7 online';
13>
Starting backup at 18-AUG-09
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=134 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00006 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh tag=TAG20090818T140640 RECID=2 STAMP=695225208
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0 tag=TAG20090818T140640 RECID=3 STAMP=695225222
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:07
Finished backup at 18-AUG-09


sql statement: alter database datafile 6 offline


datafile 6 switched to datafile copy
"/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh"


Starting recover at 18-AUG-09
using channel ORA_DISK_1


starting media recovery
media recovery complete, elapsed time: 00:00:00


Finished recover at 18-AUG-09


sql statement: alter database datafile 6 online


sql statement: alter database datafile 7 offline


datafile 7 switched to datafile copy
"/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0"


Starting recover at 18-AUG-09
using channel ORA_DISK_1


starting media recovery
media recovery complete, elapsed time: 00:00:00


Finished recover at 18-AUG-09


sql statement: alter database datafile 7 online


Recovery Manager complete.


Voila! The tablespace is moved with minimal downtime:


SQL> select file_id, file_name
2 from dba_data_files
3 where tablespace_name='ZFSTEST';


FILE_ID
----------
FILE_NAME
--------------------------------------------------------------------------------
6
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh


7
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0


SQL> select count(*) from zfsobjects;


COUNT(*)
----------
68844


Obviously, you can generate all your statements from SQL scripts, and you’ll definitely want to do this if your tablespace has a large amount of datafiles. If you are scripting, you could opt to backup one datafile at a time if you need to preserve file names. In this case, I wasn’t too concerned.

UPDATE: Teammate Marc Billette pointed out that the old datafiles were not removed when doing a datafile copy and switch. I clearly left them on disk, continuing to use space! So, once we are done with all the datafile switches, then our original datafiles become the copies, and we can list and delete those in RMAN like this:

RMAN> list datafilecopy all;


List of Datafile Copies
=======================


Key File S Completion Time Ckp SCN Ckp Time
------- ---- - --------------- ---------- ---------------
4 6 A 18-AUG-09 629982 18-AUG-09
Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf


5 7 A 18-AUG-09 630001 18-AUG-09
Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf


RMAN> delete datafilecopy all;


allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=135 device type=DISK
List of Datafile Copies
=======================


Key File S Completion Time Ckp SCN Ckp Time
------- ---- - --------------- ---------- ---------------
4 6 A 18-AUG-09 629982 18-AUG-09
Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf


5 7 A 18-AUG-09 630001 18-AUG-09
Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf


Do you really want to delete the above objects (enter YES or NO)? YES
deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf RECID=4 STAMP=695225227
deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf RECID=5 STAMP=695225228
Deleted 2 objects


Thanks to Marc for catching this early in the published life of the article!

I hope this post can help other people that run into this problem with Oracle on ZFS (I’ve heard from one on twitter already), or people just wanting to move datafiles. Again, thanks to Pythian DBA Christo Kutrovsky for laying this process out for us!

Note: this process was tested on Oracle 10gR2 (10.2.0.2) and 11gR1 (11.1.0.6).