Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
Data management on Hadoop encompasses data motion, process orchestration, lifecycle management, and data discovery, among other concerns. Falcon is a new data processing and management platform for Hadoop that addresses these concerns and creates additional opportunities by building on existing components within the Hadoop ecosystem rather than reinventing the wheel.
Falcon enables easy data management for Hadoop via a declarative mechanism. Users of the Falcon platform simply define infrastructure endpoints, data sets, and processing rules declaratively. These declarative configurations explicitly describe the dependencies between the configured entities. This dependency information allows Falcon to orchestrate and manage the various data management functions.
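To make the idea of explicitly declared dependencies concrete, here is a minimal Python sketch. It is not Falcon's API or its entity format: the dictionary layout, the entity names (`primary-cluster`, `raw-clicks`, and so on), and the helper functions `submission_order` and `dependents_of` are all hypothetical, chosen only to show how declared dependencies are enough to derive an ordering and to answer "what is affected if this entity changes?".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical entity declarations: each entity lists the entities it depends on.
# The names and the dict layout are illustrative only, not Falcon's schema.
entities = {
    "primary-cluster": {"type": "cluster", "depends_on": []},
    "raw-clicks": {"type": "feed", "depends_on": ["primary-cluster"]},
    "hourly-clicks": {"type": "feed", "depends_on": ["primary-cluster"]},
    "aggregate-clicks": {
        "type": "process",
        "depends_on": ["primary-cluster", "raw-clicks", "hourly-clicks"],
    },
}


def submission_order(declared):
    """Return entity names so that every entity comes after its dependencies."""
    graph = {name: set(spec["depends_on"]) for name, spec in declared.items()}
    return list(TopologicalSorter(graph).static_order())


def dependents_of(declared, target):
    """Return the entities that directly depend on `target`, sorted by name."""
    return sorted(
        name for name, spec in declared.items() if target in spec["depends_on"]
    )


order = submission_order(entities)
print(order)
print(dependents_of(entities, "raw-clicks"))
```

Because the dependencies are data rather than logic buried in application code, the same declarations can drive ordering, impact analysis, and validation without any change to the applications themselves.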
Falcon graduated to a top-level project in December 2014.
Complex data processing logic is handled by Falcon instead of being hard-coded in applications.
Entities are defined declaratively: Data Set, Infrastructure (Cluster, Database, Filer, etc.), and Process.
Data management is expressed as simple directives instead of being defined verbosely and repeatedly.
Allows process owners to keep application and user workflows specific to their application logic rather than muddying them with common management functions.
Does not do any heavy lifting itself but delegates to tools within the Hadoop ecosystem. Enhances productivity.
Data motion: import and export between databases or filers and HDFS.
Process orchestration: late data handling, retries, etc.; multi-cluster management to support local/global aggregations, rollups, etc.; scheduler integration.
Lifecycle management: retention, replication/BCP/DR, anonymization of PII data, archival, etc.
Data discovery: data classification, audit, lineage.
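As an illustration of a lifecycle function expressed as a simple directive, the following Python sketch treats retention as a single declared value and derives the instances to remove from it. This is an assumption-laden toy, not how Falcon implements retention: the function `expired_instances`, the sample timestamps, and the three-day limit are all hypothetical.

```python
from datetime import datetime, timedelta


def expired_instances(instances, retention, now):
    """Return instance timestamps older than `retention` relative to `now`, oldest first."""
    cutoff = now - retention
    return sorted(ts for ts in instances if ts < cutoff)


# Hypothetical data set instances, identified by the day they cover.
now = datetime(2015, 1, 10)
instances = [datetime(2015, 1, day) for day in (1, 5, 8, 9)]

# The directive: keep three days of data.
retention = timedelta(days=3)

to_delete = expired_instances(instances, retention, now)
print([ts.date().isoformat() for ts in to_delete])
```

The point of the sketch is that the data set owner states only the retention period; which instances to delete, and when, follows from that declaration.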