HDFS snapshots are very cheap to create (the cost is O(1), excluding iNode lookup time). Once a snapshot exists, it is efficient to find the modifications made relative to it and copy only those modifications over for disaster recovery (DR). This makes HDFS mirroring cost-effective.
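As a rough illustration of the idea, the sketch below strings together the stock HDFS and DistCp commands for one snapshot-based copy cycle. It is not the extension's actual implementation. The function name `mirror_cycle`, the `DRY_RUN` switch, the snapshot names `s1`/`s2`, and the paths are all assumptions made for this example. By default the commands are only printed.

```shell
# Sketch of one snapshot-based mirroring cycle (illustrative, not Falcon's code).
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only; 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Args: <source-dir> <target-dir> <previous-snapshot> <new-snapshot>
# Both directories must already be snapshottable (hdfs dfsadmin -allowSnapshot).
# For a real cross-cluster copy, pass full hdfs:// URIs instead of bare paths.
mirror_cycle() {
  local src="$1" tgt="$2" prev="$3" curr="$4"
  # Take a new snapshot of the source directory.
  run hdfs dfs -createSnapshot "$src" "$curr"
  # Report what changed on the source between the two snapshots.
  run hdfs snapshotDiff "$src" "$prev" "$curr"
  # Copy only that difference. DistCp requires -update with -diff, and the
  # target must hold a snapshot named $prev with no changes made since.
  run hadoop distcp -update -diff "$prev" "$curr" "$src" "$tgt"
  # Snapshot the target so the next cycle has a matching baseline.
  run hdfs dfs -createSnapshot "$tgt" "$curr"
}

mirror_cycle /apps/falcon/snapshots/source /apps/falcon/snapshots/target s1 s2
```

Run as is, the script prints the four commands in order. Setting `DRY_RUN=0` would execute them against a live cluster.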
The following is the prerequisite for using HDFS snapshot-based mirroring.
Create and manage snapshots on source/target directories. Mirror data from source to target for disaster recovery using these snapshots. Perform retention on the snapshots created on source and target.
Submit the source and target cluster entities:
    $FALCON_HOME/bin/falcon entity -submit -type cluster -file source-cluster-definition.xml
    $FALCON_HOME/bin/falcon entity -submit -type cluster -file target-cluster-definition.xml
Extension artifacts are expected to be installed on HDFS at the path specified by "extension.store.uri" in the startup properties. The hdfs-snapshot-mirroring-properties.json file, located at "<extension.store.uri>/hdfs-snapshot-mirroring/META/hdfs-snapshot-mirroring-properties.json", lists all the required and optional parameters for scheduling the mirroring job.
Here is a sample set of properties:

## Job Properties
jobName=hdfs-snapshot-test
jobClusterName=backupCluster
jobValidityStart=2016-01-01T00:00Z
jobValidityEnd=2016-04-01T00:00Z
jobFrequency=hours(12)
jobTimezone=UTC
jobTags=consumer=consumer@xyz.com
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3

## Job owner
jobAclOwner=ambari-qa
jobAclGroup=users
jobAclPermission=*

## Source information
sourceCluster=primaryCluster
sourceSnapshotDir=/apps/falcon/snapshots/source/
sourceSnapshotRetentionPolicy=delete
sourceSnapshotRetentionAgeLimit=days(15)
sourceSnapshotRetentionNumber=10

## Target information
targetCluster=backupCluster
targetSnapshotDir=/apps/falcon/snapshots/target/
targetSnapshotRetentionPolicy=delete
targetSnapshotRetentionAgeLimit=months(6)
targetSnapshotRetentionNumber=20

## Distcp properties
distcpMaxMaps=1
distcpMapBandwidth=100
tdeEncryptionEnabled=false
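With the above properties, the Falcon HDFS snapshot-based mirroring extension runs every 12 hours (jobFrequency=hours(12)): it creates snapshots on the source and target directories, mirrors data from /apps/falcon/snapshots/source/ on primaryCluster to /apps/falcon/snapshots/target/ on backupCluster, and performs retention on the snapshots on both sides.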
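The retention properties combine an age limit with a snapshot count. The sketch below shows one plausible reading of that combination: always keep the newest N snapshots, and delete the rest only when they are older than the age limit. This reading is an assumption, since the text here does not spell out the exact rule. The function name `snapshots_to_delete`, its input format, and the sample snapshot names are hypothetical as well.

```shell
# Print the names of snapshots that retention would delete.
# stdin: one "<snapshot-name> <creation-epoch-seconds>" pair per line.
# Args:  <keep-count> <age-limit-days> <now-epoch-seconds>
snapshots_to_delete() {
  local keep="$1" age_days="$2" now="$3"
  local cutoff=$(( now - age_days * 86400 ))
  # Newest first; skip the newest $keep, then select only those past the cutoff.
  sort -k2,2nr | awk -v keep="$keep" -v cutoff="$cutoff" \
    'NR > keep && $2 < cutoff { print $1 }'
}

# Example with made-up snapshots aged 1, 10, 20 and 30 days.
now=1700000000
printf '%s\n' \
  "snap-a $(( now - 1 * 86400 ))" \
  "snap-b $(( now - 10 * 86400 ))" \
  "snap-c $(( now - 20 * 86400 ))" \
  "snap-d $(( now - 30 * 86400 ))" |
  snapshots_to_delete 2 15 "$now" |
  while read -r name; do
    # Printed rather than executed; drop the echo to delete for real.
    echo hdfs dfs -deleteSnapshot /apps/falcon/snapshots/source "$name"
  done
```

With a keep count of 2 and a 15-day limit, the example prints delete commands for snap-c and snap-d only. Raising the keep count to 10, as in the sample source properties, would select nothing from these four snapshots.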
Note: When TDE encryption is enabled on the source/target directories, DistCP ignores the snapshots and treats the job as a regular replication. Although the user does not get the performance benefit of snapshot-based DistCP in that case, the extension is still useful for creating and maintaining snapshots.
Users can submit the extension using the CLI or the REST API. The CLI command looks as follows:
    $FALCON_HOME/bin/falcon extension -submitAndSchedule -extensionName hdfs-snapshot-mirroring -file properties-file.txt
Please refer to the Falcon CLI and REST API documentation for more details on using the CLI and the REST APIs.
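Before submitting, it can help to check that the properties file defines the keys the job needs. The following is a minimal sketch of such a check wrapped around the CLI call. The helper names `missing_keys` and `submit_mirroring` are hypothetical, and the key list is an assumption taken from the sample properties; the authoritative list of required parameters is the hdfs-snapshot-mirroring-properties.json file.

```shell
# Print every key from the argument list that the properties file does not define.
# Usage: missing_keys <properties-file> <key>...
missing_keys() {
  local file="$1"
  shift
  local key
  for key in "$@"; do
    grep -q "^${key}=" "$file" || echo "$key"
  done
}

# Submit only when nothing is missing. The key list below is an assumption;
# adjust it to match hdfs-snapshot-mirroring-properties.json.
submit_mirroring() {
  local file="$1"
  local missing
  missing="$(missing_keys "$file" jobName jobClusterName sourceCluster \
    sourceSnapshotDir targetCluster targetSnapshotDir)"
  if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on a single line.
    echo "missing properties:" $missing >&2
    return 1
  fi
  # FALCON_HOME must point at the Falcon installation.
  "$FALCON_HOME/bin/falcon" extension -submitAndSchedule \
    -extensionName hdfs-snapshot-mirroring -file "$file"
}
```

Calling `submit_mirroring properties-file.txt` returns a non-zero status and lists the missing keys on stderr without invoking Falcon when the file is incomplete.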