This project has retired. For details please refer to its Attic page.
Falcon - Configuring Falcon

Configuring Falcon

By default config directory used by falcon is {package dir}/conf. To override this (to use the same conf with multiple falcon upgrades), set environment variable FALCON_CONF to the path of the conf dir.

falcon-env.sh has been added to the falcon conf. This file can be used to set various environment variables that you need for you services. In addition you can set any other environment variables you might need. This file will be sourced by falcon scripts before any commands are executed. The following environment variables are available to set.

# The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path
#export JAVA_HOME=

# any additional java opts you want to set. This will apply to both client and server operations
#export FALCON_OPTS=

# any additional java opts that you want to set for client only
#export FALCON_CLIENT_OPTS=

# java heap size we want to set for the client. Default is 1024MB
#export FALCON_CLIENT_HEAP=

# any additional opts you want to set for prism service.
#export FALCON_PRISM_OPTS=

# java heap size we want to set for the prism service. Default is 1024MB
#export FALCON_PRISM_HEAP=

# any additional opts you want to set for falcon service.
#export FALCON_SERVER_OPTS=

# java heap size we want to set for the falcon server. Default is 1024MB
#export FALCON_SERVER_HEAP=

# What is is considered as falcon home dir. Default is the base location of the installed software
#export FALCON_HOME_DIR=

# Where log files are stored. Default is logs directory under the base install location
#export FALCON_LOG_DIR=

# Where pid files are stored. Default is logs directory under the base install location
#export FALCON_PID_DIR=

# where the falcon active mq data is stored. Default is logs/data directory under the base install location
#export FALCON_DATA_DIR=

# Where do you want to expand the war file. By Default it is in /server/webapp dir under the base install dir.
#export FALCON_EXPANDED_WEBAPP_DIR=

# Any additional classpath elements to be added to the Falcon server/client classpath
#export FALCON_EXTRA_CLASS_PATH=

Advanced Configurations

Configuring Monitoring plugin to register catalog partitions

Falcon comes with a monitoring plugin that registers catalog partition. This comes in really handy during migration from filesystem based feeds to hcatalog based feeds. This plugin enables the user to de-couple the partition registration and assume that all partitions are already on hcatalog even before the migration, simplifying the hcatalog migration.

By default this plugin is disabled. To enable this plugin and leverage the feature, there are 3 pre-requisites:

In {package dir}/conf/startup.properties, add
*.workflow.execution.listeners=org.apache.falcon.catalog.CatalogPartitionHandler

In the cluster definition, ensure registry endpoint is defined.
Ex:
<interface type="registry" endpoint="thrift://localhost:1109" version="0.13.3"/>

In the feed definition, ensure the corresponding catalog table is mentioned in feed-properties
Ex:
<properties>
    <property name="catalog.table" value="catalog:default:in_table#year={YEAR};month={MONTH};day={DAY};hour={HOUR};
    minute={MINUTE}"/>
</properties>

NOTE : for Mac OS users

If you are using a Mac OS, you will need to configure the FALCON_SERVER_OPTS (explained above).

In  {package dir}/conf/falcon-env.sh uncomment the following line
#export FALCON_SERVER_OPTS=

and change it to look as below
export FALCON_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Activemq

* falcon server starts embedded active mq. To control this behaviour, set the following system properties using -D option in environment variable FALCON_OPTS:

  • falcon.embeddedmq=<true/false> - Should server start embedded active mq, default true
  • falcon.embeddedmq.port=<port> - Port for embedded active mq, default 61616
  • falcon.embeddedmq.data=<path> - Data path for embedded active mq, default {package dir}/logs/data

Falcon System Notifications

Some Falcon features such as late data handling, retries, metadata service, depend on JMS notifications sent when the Oozie workflow completes. Falcon listens to Oozie notification via JMS. You need to enable Oozie JMS notification as explained below. Falcon post processing feature continues to only send user notifications so enabling Oozie JMS notification is important.

NOTE : If Oozie JMS notification is not enabled, the Falcon features such as failure retry, late data handling and metadata service will be disabled for all entities on the server.

Enable Oozie JMS notification

  • Please add/change the following properties in oozie-site.xml in the oozie installation dir.
   <property>
      <name>oozie.jms.producer.connection.properties</name>
      <value>java.naming.factory.initial#org.apache.activemq.jndi.ActiveMQInitialContextFactory;java.naming.provider.url#tcp://<activemq-host>:<port></value>
    </property>

   <property>
      <name>oozie.service.EventHandlerService.event.listeners</name>
      <value>org.apache.oozie.jms.JMSJobEventListener</value>
   </property>

   <property>
      <name>oozie.service.JMSTopicService.topic.name</name>
      <value>WORKFLOW=ENTITY.TOPIC,COORDINATOR=ENTITY.TOPIC</value>
    </property>

   <property>
      <name>oozie.service.JMSTopicService.topic.prefix</name>
      <value>FALCON.</value>
    </property>

    <!-- add org.apache.oozie.service.JMSAccessorService to the other existing services if any -->
    <property>
       <name>oozie.services.ext</name>
       <value>org.apache.oozie.service.JMSAccessorService,org.apache.oozie.service.PartitionDependencyManagerService,org.apache.oozie.service.HCatAccessorService</value>
    </property>

  • In falcon startup.properties, set JMS broker url to be the same as the one set in oozie-site.xml property
oozie.jms.producer.connection.properties (see above)
   *.broker.url=tcp://<activemq-host>:<port>

Configuring Oozie for Falcon

Falcon uses HCatalog for data availability notification when Hive tables are replicated. Make the following configuration changes to Oozie to ensure Hive table replication in Falcon:

  • Stop the Oozie service on all Falcon clusters. Run the following commands on the Oozie host machine.
su - $OOZIE_USER

<oozie-install-dir>/bin/oozie-stop.sh

where $OOZIE_USER is the Oozie user. For example, oozie.

  • Copy each cluster's hadoop conf directory to a different location. For example, if you have two clusters, copy one to /etc/hadoop/conf-1 and the other to /etc/hadoop/conf-2.

  • For each oozie-site.xml file, modify the oozie.service.HadoopAccessorService.hadoop.configurations property, specifying clusters, the RPC ports of the NameNodes, and HostManagers accordingly. For example, if Falcon connects to three clusters, specify:

<property>
     <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
     <value>*=/etc/hadoop/conf,$NameNode:$rpcPortNN=$hadoopConfDir1,$ResourceManager1:$rpcPortRM=$hadoopConfDir1,$NameNode2=$hadoopConfDir2,$ResourceManager2:$rpcPortRM=$hadoopConfDir2,$NameNode3 :$rpcPortNN =$hadoopConfDir3,$ResourceManager3 :$rpcPortRM =$hadoopConfDir3</value>
     <description>
          Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
          the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
          used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
          the relevant Hadoop *-site.xml files. If the path is relative is looked within
          the Oozie configuration directory; though the path can be absolute (i.e. to point
          to Hadoop client conf/ directories in the local filesystem.
     </description>
</property>


  • Add the following properties to the /etc/oozie/conf/oozie-site.xml file:

<property>
     <name>oozie.service.ProxyUserService.proxyuser.falcon.hosts</name>
     <value>*</value>
</property>

<property>
     <name>oozie.service.ProxyUserService.proxyuser.falcon.groups</name>
     <value>*</value>
</property>

<property>
     <name>oozie.service.URIHandlerService.uri.handlers</name>
     <value>org.apache.oozie.dependency.FSURIHandler, org.apache.oozie.dependency.HCatURIHandler</value>
</property>

<property>
     <name>oozie.services.ext</name>
     <value>org.apache.oozie.service.JMSAccessorService, org.apache.oozie.service.PartitionDependencyManagerService,
     org.apache.oozie.service.HCatAccessorService</value>
</property>

<!-- Coord EL Functions Properties -->

<property>
     <name>oozie.service.ELService.ext.functions.coord-job-submit-instances</name>
     <value>now=org.apache.oozie.extensions.OozieELExtensions#ph1_now_echo,
         today=org.apache.oozie.extensions.OozieELExtensions#ph1_today_echo,
         yesterday=org.apache.oozie.extensions.OozieELExtensions#ph1_yesterday_echo,
         currentMonth=org.apache.oozie.extensions.OozieELExtensions#ph1_currentMonth_echo,
         lastMonth=org.apache.oozie.extensions.OozieELExtensions#ph1_lastMonth_echo,
         currentYear=org.apache.oozie.extensions.OozieELExtensions#ph1_currentYear_echo,
         lastYear=org.apache.oozie.extensions.OozieELExtensions#ph1_lastYear_echo,
         formatTime=org.apache.oozie.coord.CoordELFunctions#ph1_coord_formatTime_echo,
         latest=org.apache.oozie.coord.CoordELFunctions#ph2_coord_latest_echo,
         future=org.apache.oozie.coord.CoordELFunctions#ph2_coord_future_echo
     </value>
</property>

<property>
     <name>oozie.service.ELService.ext.functions.coord-action-create-inst</name>
     <value>now=org.apache.oozie.extensions.OozieELExtensions#ph2_now_inst,
         today=org.apache.oozie.extensions.OozieELExtensions#ph2_today_inst,
         yesterday=org.apache.oozie.extensions.OozieELExtensions#ph2_yesterday_inst,
         currentMonth=org.apache.oozie.extensions.OozieELExtensions#ph2_currentMonth_inst,
         lastMonth=org.apache.oozie.extensions.OozieELExtensions#ph2_lastMonth_inst,
         currentYear=org.apache.oozie.extensions.OozieELExtensions#ph2_currentYear_inst,
         lastYear=org.apache.oozie.extensions.OozieELExtensions#ph2_lastYear_inst,
         latest=org.apache.oozie.coord.CoordELFunctions#ph2_coord_latest_echo,
         future=org.apache.oozie.coord.CoordELFunctions#ph2_coord_future_echo,
         formatTime=org.apache.oozie.coord.CoordELFunctions#ph2_coord_formatTime,
         user=org.apache.oozie.coord.CoordELFunctions#coord_user
     </value>
</property>

<property>
<name>oozie.service.ELService.ext.functions.coord-action-start</name>
<value>
now=org.apache.oozie.extensions.OozieELExtensions#ph2_now,
today=org.apache.oozie.extensions.OozieELExtensions#ph2_today,
yesterday=org.apache.oozie.extensions.OozieELExtensions#ph2_yesterday,
currentMonth=org.apache.oozie.extensions.OozieELExtensions#ph2_currentMonth,
lastMonth=org.apache.oozie.extensions.OozieELExtensions#ph2_lastMonth,
currentYear=org.apache.oozie.extensions.OozieELExtensions#ph2_currentYear,
lastYear=org.apache.oozie.extensions.OozieELExtensions#ph2_lastYear,
latest=org.apache.oozie.coord.CoordELFunctions#ph3_coord_latest,
future=org.apache.oozie.coord.CoordELFunctions#ph3_coord_future,
dataIn=org.apache.oozie.extensions.OozieELExtensions#ph3_dataIn,
instanceTime=org.apache.oozie.coord.CoordELFunctions#ph3_coord_nominalTime,
dateOffset=org.apache.oozie.coord.CoordELFunctions#ph3_coord_dateOffset,
formatTime=org.apache.oozie.coord.CoordELFunctions#ph3_coord_formatTime,
user=org.apache.oozie.coord.CoordELFunctions#coord_user
</value>
</property>

<property>
     <name>oozie.service.ELService.ext.functions.coord-sla-submit</name>
     <value>
         instanceTime=org.apache.oozie.coord.CoordELFunctions#ph1_coord_nominalTime_echo_fixed,
         user=org.apache.oozie.coord.CoordELFunctions#coord_user
     </value>
</property>

<property>
     <name>oozie.service.ELService.ext.functions.coord-sla-create</name>
     <value>
         instanceTime=org.apache.oozie.coord.CoordELFunctions#ph2_coord_nominalTime,
         user=org.apache.oozie.coord.CoordELFunctions#coord_user
     </value>
</property>


  • Copy the existing Oozie WAR file to <oozie-install-dir>/oozie.war. This will ensure that all existing items in the WAR file are still present after the current update.
su - root
cp $CATALINA_BASE/webapps/oozie.war <oozie-install-dir>/oozie.war

where $CATALINA_BASE is the path for the Oozie web app. By default, $CATALINA_BASE is: <oozie-install-dir>

  • Add the Falcon EL extensions to Oozie.

Copy the extension JAR files provided with the Falcon Server to a temporary directory on the Oozie server. For example, if your standalone Falcon Server is on the same machine as your Oozie server, you can just copy the JAR files.


mkdir /tmp/falcon-oozie-jars
cp <falcon-install-dir>/oozie/ext/falcon-oozie-el-extension-<$version>.jar /tmp/falcon-oozie-jars
cp /tmp/falcon-oozie-jars/falcon-oozie-el-extension-<$version>.jar <oozie-install-dir>/libext


  • Package the Oozie WAR file as the Oozie user
su - $OOZIE_USER
cd <oozie-install-dir>/bin
./oozie-setup.sh prepare-war

Where $OOZIE_USER is the Oozie user. For example, oozie.

  • Start the Oozie service on all Falcon clusters. Run these commands on the Oozie host machine.
su - $OOZIE_USER
<oozie-install-dir>/bin/oozie-start.sh

Where $OOZIE_USER is the Oozie user. For example, oozie.

Disabling Falcon Post Processing

Falcon post processing performs two tasks: They send user notifications to Active mq. It moves oozie executor logs once the workflow finishes.

If post processing is failing because of any reason user mind end up having a backlog in the pipeline thats why it has been made optional.

To disable post processing set the following property to false in startup.properties :

*.falcon.postprocessing.enable=false
*.workflow.execution.listeners=org.apache.falcon.service.LogMoverService

NOTE : Please make sure Oozie JMS Notifications are enabled as logMoverService depends on the Oozie JMS Notification.

Enabling Falcon Native Scheudler

$FALCON_HOME/conf/startup.properties before starting the Falcon Server. For details on the same, refer to Falcon Native Scheduler

Titan GraphDB backend

GraphDB backend needs to be configured to properly start Falcon server. You can either choose to use 5.0.73 version of berkeleydb (the default for Falcon for the last few releases) or 1.1.x or later version HBase as the backend database. Falcon in its release distributions will have the titan storage plugins for both BerkeleyDB and HBase.

----++++Using BerkeleyDB backend Falcon distributions may not package berkeley db artifacts (je-5.0.73.jar) based on build profiles. If Berkeley DB is not packaged, you can download the Berkeley DB jar file from the URL:

http://download.oracle.com/otn/berkeley-db/je-5.0.73.zip

The following properties describe an example berkeley db graph storage backend that can be specified in the configuration file

$FALCON_HOME/conf/startup.properties
# Graph Storage
*.falcon.graph.storage.directory=${user.dir}/target/graphdb
*.falcon.graph.storage.backend=berkeleyje
*.falcon.graph.serialize.path=${user.dir}/target/graphdb

Using HBase backend

hbase-site.xml is provided, which can be used to start the standalone mode HBase enviornment for development/testing purposes.

Basic configuration
##### Falcon startup.properties
*.falcon.graph.storage.backend=hbase
#For standalone mode , specify localhost
#for distributed mode, specify zookeeper quorum here - For more information refer http://s3.thinkaurelius.com/docs/titan/current/hbase.html#_remote_server_mode_2
*.falcon.graph.storage.hostname=<ZooKeeper Quorum>

FALCON_EXTRA_CLASS_PATH in $FALCON_HOME/bin/falcon-env.sh. Additionally the correct hbase client libraries need to be added. For example,

export FALCON_EXTRA_CLASS_PATH=`${HBASE_HOME}/bin/hbase classpath`

Table name We recommend that in the startup config the tablename for titan storage be named <verbatim>falcon_titan<verbatim> so that multiple applications using Titan can share the same HBase cluster. This can be set by specifying the tablename using the startup property given below. The default value is shown.

*.falcon.graph.storage.hbase.table=falcon_titan

Starting standalone HBase for testing

HBase can be started in stand alone mode for testing as a backend for Titan. The following steps outline the config changes required:

1. Build Falcon as below to package hbase binaries
   $ export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=256m" && mvn clean assembly:assembly -Ppackage-standalone-hbase
2. Configure HBase
   a. When falcon tar file is expanded, HBase binaries are under ${FALCON_HOME}/hbase
   b. Copy ${FALCON_HOME}/conf/hbase-site.xml.template into hbase conf dir in ${FALCON_HOME}/hbase/conf/hbase-site.xml
   c. Set {hbase_home} property to point to a local dir
   d. Standalone HBase starts zookeeper on the default port (2181). This port can be changed by adding the following to hbase-site.xml
       <property>
            <name>hbase.zookeeper.property.clientPort</name>
            <value>2223</value>
       </property>

       <property>
            <name>hbase.zookeeper.quorum</name>
            <value>localhost</value>
       </property>
    e. set JAVA_HOME to point to Java 1.7 or above
    f. Start hbase as ${FALCON_HOME}/hbase/bin/start-hbase.sh
3. Configure Falcon
   a. In ${FALCON_HOME}/conf/startup.properties, uncomment the following to enable HBase as the backend
      *.falcon.graph.storage.backend=hbase
      ### specify the zookeeper host and port name with which standalone hbase is started (see step 2)
      ### by default, it will be localhost and port 2181
      *.falcon.graph.storage.hostname=<zookeeper-host-name>:<zookeeper-host-port>
      *.falcon.graph.serialize.path=${user.dir}/target/graphdb
      *.falcon.graph.storage.hbase.table=falcon_titan
      *.falcon.graph.storage.transactions=false
4. Add HBase jars to Falcon classpath in ${FALCON_HOME}/conf/falcon-env.sh as:
      FALCON_EXTRA_CLASS_PATH=`${FALCON_HOME}/hbase/bin/hbase classpath`
5. Set the following in ${FALCON_HOME}/conf/startup.properties to disable SSL if needed
      *.falcon.enableTLS=false
6. Start Falcon

Permissions

falcon user for the falcon_titan table (or whateven tablename was specified for the property *.falcon.graph.storage.hbase.table

falcon_titan.

Without Ranger, HBase shell can be used to set the permissions.

   su hbase
   kinit -k -t <hbase keytab> <hbase principal>
   echo "grant 'falcon', 'RWXCA', 'falcon_titan'" | hbase shell

Advanced configuration

$FALCON_HOME/conf/startup.properties, by prefixing the Titan property with *.falcon.graph prefix.

http://s3.thinkaurelius.com/docs/titan/0.5.4/titan-config-ref.html#_storage for generic storage properties, http://s3.thinkaurelius.com/docs/titan/0.5.4/titan-config-ref.html#_storage_berkeleydb for berkeley db properties and http://s3.thinkaurelius.com/docs/titan/0.5.4/titan-config-ref.html#_storage_hbase for hbase storage backend properties.

Adding Extension Libraries

Library extensions allows users to add custom libraries to entity lifecycles such as feed retention, feed replication and process execution. This is useful for usecases such as adding filesystem extensions. To enable this, add the following configs to startup.properties: *.libext.paths=<paths to be added to all entity lifecycles>

*.libext.feed.paths=<paths to be added to all feed lifecycles>

*.libext.feed.retentions.paths=<paths to be added to feed retention workflow>

*.libext.feed.replication.paths=<paths to be added to feed replication workflow>

*.libext.process.paths=<paths to be added to process workflow>

The configured jars are added to falcon classpath and the corresponding workflows.