Project Falcon has retired. For details please refer to its Attic page.
Falcon - HDFS Snapshot based Mirroring

HDFS Snapshot based Mirroring

Overview

HDFS snapshots are inexpensive to create (the cost is O(1), excluding inode lookup time). Once a snapshot exists, it is efficient to find the modifications made relative to it and copy only those modifications to another cluster for disaster recovery (DR). This makes snapshots a cost-effective basis for HDFS mirroring.

Prerequisites

The following are the prerequisites for using HDFS Snapshot based Mirroring:

  • Hadoop version 2.7.0 or higher.
  • The user submitting and scheduling the Falcon snapshot based mirroring job must have permission to create and manage snapshots on both the source and target directories.
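
One way to confirm the Hadoop version prerequisite is to compare the version reported by `hadoop version` against 2.7.0. The sketch below assumes a `sort` that supports `-V` (version sort); the helper name `version_at_least` is hypothetical and is not part of Hadoop or Falcon.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Assumes a sort implementation that supports -V (version sort).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a cluster node. The first line of `hadoop version`
# output has the form "Hadoop 2.7.3":
#   installed=$(hadoop version | head -n 1 | awk '{print $2}')
#   version_at_least "$installed" 2.7.0 && echo "Hadoop version OK"
```

The check should be run on both the source and the target cluster, since snapshots are created on both sides.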

Use Case

Create and manage snapshots on source/target directories. Mirror data from source to target for disaster recovery using these snapshots. Perform retention on the snapshots created on source and target.

Usage

Setup

  • Submit the source cluster and target cluster entities to Falcon.
    $FALCON_HOME/bin/falcon entity -submit -type cluster -file source-cluster-definition.xml
    $FALCON_HOME/bin/falcon entity -submit -type cluster -file target-cluster-definition.xml
   

  • Ensure that the source directory on the source cluster and the target directory on the target cluster exist.
  • Ensure that these directories are snapshottable by the user submitting the extension. See the HDFS snapshots documentation for more information.
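
The directory setup above can be sketched with the standard HDFS commands `hdfs dfs -mkdir -p` and `hdfs dfsadmin -allowSnapshot`. This is a minimal sketch, not a Falcon-provided script: the `run` and `prepare_snapshot_dir` helpers and the `DRY_RUN` variable are hypothetical, and the paths are taken from the sample properties on this page. By default the commands are only printed so they can be reviewed first.

```shell
# Print the commands by default; set DRY_RUN=0 to execute them.
# DRY_RUN and both function names are illustrative only.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a directory and mark it snapshottable.
# -allowSnapshot normally requires HDFS superuser privileges.
prepare_snapshot_dir() {
    run hdfs dfs -mkdir -p "$1"
    run hdfs dfsadmin -allowSnapshot "$1"
}

# On the source cluster:
prepare_snapshot_dir /apps/falcon/snapshots/source/
# On the target cluster:
#   prepare_snapshot_dir /apps/falcon/snapshots/target/
```

Afterwards, `hdfs lsSnapshottableDir`, run as the user who will submit the extension, lists the snapshottable directories visible to that user.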

HDFS Snapshot based mirroring extension properties

Extension artifacts are expected to be installed on HDFS at the path specified by "extension.store.uri" in the startup properties. The hdfs-snapshot-mirroring-properties.json file, located at "<extension.store.uri>/hdfs-snapshot-mirroring/META/hdfs-snapshot-mirroring-properties.json", lists all the required and optional parameters/arguments for scheduling the mirroring job.

Here is a sample set of properties:

   ## Job Properties
   jobName=hdfs-snapshot-test
   jobClusterName=backupCluster
   jobValidityStart=2016-01-01T00:00Z
   jobValidityEnd=2016-04-01T00:00Z
   jobFrequency=hours(12)
   jobTimezone=UTC
   jobTags=consumer=consumer@xyz.com
   jobRetryPolicy=periodic
   jobRetryDelay=minutes(30)
   jobRetryAttempts=3

   ## Job owner
   jobAclOwner=ambari-qa
   jobAclGroup=users
   jobAclPermission=*

   ## Source information
   sourceCluster=primaryCluster
   sourceSnapshotDir=/apps/falcon/snapshots/source/
   sourceSnapshotRetentionPolicy=delete
   sourceSnapshotRetentionAgeLimit=days(15)
   sourceSnapshotRetentionNumber=10

   ## Target information
   targetCluster=backupCluster
   targetSnapshotDir=/apps/falcon/snapshots/target/
   targetSnapshotRetentionPolicy=delete
   targetSnapshotRetentionAgeLimit=months(6)
   targetSnapshotRetentionNumber=20

   ## Distcp properties
   distcpMaxMaps=1
   distcpMapBandwidth=100
   tdeEncryptionEnabled=false
   

With the above properties, the Falcon HDFS snapshot based mirroring extension does the following every 12 hours:

  • Creates a snapshot of the directory /apps/falcon/snapshots/source/ on primaryCluster.
  • Uses DistCP to copy data from /apps/falcon/snapshots/source/ on primaryCluster to /apps/falcon/snapshots/target/ on backupCluster.
  • Creates a snapshot of the directory /apps/falcon/snapshots/target/ on backupCluster.
  • Runs a retention job on source and target.
    • Retains at least the N latest snapshots and deletes any other snapshots older than the specified age limit.
    • Currently, only the "delete" policy is supported for snapshot retention.
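
The retention rule above can be sketched as a small filter: the N newest snapshots are always kept, and any remaining snapshot is selected for deletion only if it is older than the age limit. This is an illustration of the rule as described here, not Falcon's implementation. The function name `snapshots_to_delete` and the input format (creation time in epoch seconds followed by the snapshot name) are assumptions made for the example.

```shell
# Hypothetical helper, not Falcon's implementation.
# stdin:  one snapshot per line as "<creation-epoch-seconds> <name>"
# $1: current time (epoch seconds)
# $2: age limit (seconds)
# $3: number of latest snapshots to always keep (N)
# stdout: names of snapshots eligible for deletion, newest first
snapshots_to_delete() {
    # Sort newest first, skip the first N lines, then apply the age limit.
    sort -k1,1nr | awk -v now="$1" -v limit="$2" -v keep="$3" \
        'NR > keep && (now - $1) > limit { print $2 }'
}

# Example: with the current time at 100, an age limit of 15 and N=2,
# s4 and s3 are kept as the two newest; s2 and s1 exceed the age limit.
printf '%s\n' '99 s4' '10 s1' '90 s3' '80 s2' | snapshots_to_delete 100 15 2
# prints:
# s2
# s1
```

Each selected name could then be removed with `hdfs dfs -deleteSnapshot <path> <snapshotName>`.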

Note: When TDE encryption is enabled on the source/target directories, DistCP ignores the snapshots and treats the job like a regular replication. Although the user may not get the performance benefit of snapshot based DistCP, the extension is still useful for creating and maintaining snapshots.
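
For orientation, one snapshot-based mirroring pass corresponds roughly to the manual commands below, using `hdfs dfs -createSnapshot` and the `-update -diff` options of `hadoop distcp`. This is a sketch under assumptions, not the exact sequence Falcon runs: the helper `snapshot_mirror_commands`, the NameNode addresses, and the snapshot names are hypothetical. The helper only prints the commands.

```shell
# Hypothetical helper: prints (does not run) the commands for one pass.
# $1: source dir, $2: target dir
# $3: name of the previous snapshot, $4: name of the new snapshot
snapshot_mirror_commands() {
    printf 'hdfs dfs -createSnapshot %s %s\n' "$1" "$4"
    printf 'hadoop distcp -update -diff %s %s %s %s\n' "$3" "$4" "$1" "$2"
    printf 'hdfs dfs -createSnapshot %s %s\n' "$2" "$4"
}

# Placeholder NameNode addresses and snapshot names:
snapshot_mirror_commands \
    hdfs://primary-nn:8020/apps/falcon/snapshots/source \
    hdfs://backup-nn:8020/apps/falcon/snapshots/target \
    snap-1 snap-2
```

DistCp's `-diff` mode expects the previous snapshot to exist on both sides and the target to be unmodified since that snapshot, which is why a snapshot is also taken on the target after each copy.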

Submit and schedule HDFS snapshot mirroring extension

Users can submit the extension using the CLI or the REST API. The CLI command looks as follows:

    $FALCON_HOME/bin/falcon extension -submitAndSchedule -extensionName hdfs-snapshot-mirroring -file properties-file.txt
   

Please refer to the Falcon CLI and REST API documentation for more details on using the CLI and REST APIs.