
Using Falcon in HDP for backup – ongoing work

April 19, 2015

In my project we want to check whether Falcon is suitable for backup between two clusters.

Following the only example I could find: http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/

The main difference between the example and my tests is that I created my own clusters and did not use the HDP sandbox.
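For reference, a Falcon cluster entity for a self-built cluster looks roughly like the sketch below – the endpoints, versions and paths here are placeholders based on my h153 host, so adjust them to your environment:

<?xml version="1.0"?>
<!--
 Cluster entity for the primary cluster (sketch – endpoints and versions are placeholders)
-->
<cluster name="primaryCluster" description="" colo="primary"
         xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <!-- read-only endpoint of the NameNode -->
        <interface type="readonly" endpoint="hftp://h153.amdocs.com:50070" version="2.2.0"/>
        <!-- HDFS write endpoint of the NameNode -->
        <interface type="write" endpoint="hdfs://h153.amdocs.com:8020" version="2.2.0"/>
        <!-- execute endpoint: the ResourceManager -->
        <interface type="execute" endpoint="h153.amdocs.com:8032" version="2.2.0"/>
        <!-- Oozie server that runs the workflows -->
        <interface type="workflow" endpoint="http://h153.amdocs.com:11000/oozie/" version="4.0.0"/>
        <!-- ActiveMQ broker used by Falcon for messaging -->
        <interface type="messaging" endpoint="tcp://h153.amdocs.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>

It is submitted the same way as the other entities:

falcon entity -type cluster -submit -file primaryCluster.xml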

The issues I’ve encountered:

Issue #1

When you run Falcon you'll get this error:

Error: Invalid Execute server or port: h153.amdocs.com:8032
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

To overcome this you need to disable the parameter “yarn.timeline-service.enabled”.

Taken from here – http://mail-archives.apache.org/mod_mbox/falcon-dev/201408.mbox/%3CCAF1jEfAcdchXOY5stdVEgPxZvNcf=-ATPSKZYk1DmX+4Aec1Fw@mail.gmail.com%3E

In the Ambari UI, click on Yarn, click on Configs, under Application Timeline Server uncheck the box next to yarn.timeline-service.enabled, Save, then restart Yarn, then restart Falcon.
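If you're not managing the cluster with Ambari, the equivalent change is to set the property in yarn-site.xml and restart YARN – a minimal sketch:

<!-- yarn-site.xml: turn off the Application Timeline Server integration -->
<property>
    <name>yarn.timeline-service.enabled</name>
    <value>false</value>
</property>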

Issue #2

Trying to submit the process entity showed an error:

falcon entity -type process -submit -file emailIngestProcess.xml
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: falcon is not allowed to impersonate falcon

For this you'll need to change the parameter “hadoop.proxyuser.falcon.groups”
in the HDFS configs (HDFS → Configs in Ambari) to the right user groups.
I just put “*” (asterisk) so it grants everything,
then restarted HDFS and the other services.
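For reference, these proxyuser settings live in core-site.xml and end up looking roughly like this (a wide-open sketch – hadoop.proxyuser.falcon.hosts is the companion property, not mentioned above; in production you'd restrict both instead of using “*”):

<!-- core-site.xml: let the falcon user impersonate users from any group, on any host -->
<property>
    <name>hadoop.proxyuser.falcon.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.falcon.hosts</name>
    <value>*</value>
</property>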

Issue #3

If you're behind a proxy you'll have to change the ingest script in HDFS – this can be done from Hue using the file browser:
/user/ambari-qa/falcon/demo/apps/ingest/fs/ingest.sh
Edit it and add your proxy server (export http_proxy=http://proxyserver:8080 – or whatever port you're using).
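The edit itself is just an export (or two) near the top of the script – a sketch, with proxyserver:8080 standing in for your real proxy:

# added at the top of ingest.sh so the demo data can be pulled through the proxy
export http_proxy=http://proxyserver:8080
export https_proxy=http://proxyserver:8080   # only if the fetch goes over https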

Issue #4

Trying to schedule the rawEmailIngestProcess returns an error:

falcon entity -type process -schedule -name rawEmailIngestProcess
Error: null

This is probably a bug – the process has to have an input feed. Taken from here –
http://mail-archives.apache.org/mod_mbox/falcon-dev/201408.mbox/%3CCAPyZWqot92MOhqqSMHBT2t8d801vmDJ_2b1r0-8N7hWr1+S3ug@mail.gmail.com%3E

The version I'm using probably does not have the fix yet.

I created an empty feed (copied rawEmailFeed.xml and modified it):

<?xml version="1.0" encoding="UTF-8"?>
<!--
  A feed representing Hourly customer email data retained for 90 days
-->
<feed description="Empty feed" name="emptyFeed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers,classification=secure</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/tmp/empty/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0777"/>
    <schema location="/none" provider="none"/>
</feed>

and then loaded it

falcon entity -type feed -submit -file emptyFeed.xml

I modified emailIngestProcess.xml and added an inputs section to it:

19a20,23
> <inputs>
> <input name="input" feed="emptyFeed" start="now(0,0)" end="now(0,0)" />
> </inputs>

and then deleted and resubmitted rawEmailIngestProcess:

falcon entity -type process -delete -name rawEmailIngestProcess
falcon entity -type process -submit -file emailIngestProcess.xml
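and then scheduled it again with the same command as before:

falcon entity -type process -schedule -name rawEmailIngestProcess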

Issue #5

Because I'm installing my own clusters and not using the sandbox, everything has to be configured correctly:

Check out chapter 19.3 in http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.1.0.0/bk_installing_manually_book/bk_installing_manually_book-20140110.pdf

You need to change the Oozie property oozie.service.HadoopAccessorService.hadoop.configurations to something like:

*=/etc/hadoop/conf,h153:8020=/etc/hadoop/conf/,h153:8032=/etc/hadoop/conf/,h156:8020=/etc/hadoop/conf/,h156:8032=/etc/hadoop/conf/

where h153 and h156 are the host names of the two clusters' NameNodes (port 8020) and ResourceManagers (port 8032).
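In oozie-site.xml this ends up as the snippet below (same value as above, just wrapped in a property element – restart Oozie after changing it):

<!-- oozie-site.xml: map each NameNode/ResourceManager endpoint to its Hadoop conf directory -->
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/etc/hadoop/conf,h153:8020=/etc/hadoop/conf/,h153:8032=/etc/hadoop/conf/,h156:8020=/etc/hadoop/conf/,h156:8032=/etc/hadoop/conf/</value>
</property>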

That's it so far.

Next step – backing up Hive tables.
