
Using REST APIs with Cloudera


We added a new function to Oozie, which forces us to do some extra post-installation tasks:

  • Copy the jar to a location Oozie can read
  • Update a few configuration parameters

Posting a question in the Cloudera forum got a quick answer that helped me write the script below.

Note – I could have made the script much shorter by hard-coding the required JSON file and just loading it. But this was more fun, and it may be useful if the Oozie configuration changes in future versions.

Once I got this to work, my next target is to create a new full cluster using only REST APIs.

Update – I had to modify the script and add view=full; otherwise I get only a partial configuration and the script below does not work.

# This script modifies the Oozie configuration for the new functionality.
# It first extracts the current configuration for Oozie in JSON format,
# then it checks 3 parameters: xxxHadoopCounter, and wait-action-0.5.xsd.
# If a parameter is missing it is added to the JSON file,
# then the JSON is loaded back into CDH.

# Usage
function Usage {
  echo "Usage: $0 <host>"
  exit 2
}

export host=$1
[[ -z "$host" ]] && Usage

# get_version
function get_version {
  curl -u admin:admin "http://$host:7180/api/version" 2> /dev/null
}
export apiver=`get_version`

# get_cluster_name
function get_cluster_name {
  curl -u admin:admin "http://$host:7180/api/$apiver/clusters" 2> /dev/null | awk '/name/{print $NF}' | sed 's/,//' | sed 's/"//g'
}
export cname=`get_cluster_name`

tmp_json=/tmp/oozie_`date +%Y%m%d_%H%M%S`.json

# Extract the JSON (view=full is required; the default view returns only a partial configuration)
curl -u admin:admin "http://$host:7180/api/${apiver}/clusters/${cname}/services/oozie/roleConfigGroups/oozie-OOZIE_SERVER-BASE/config?view=full" > $tmp_json 2> /dev/null
cp $tmp_json ${tmp_json}.ORIG

# 1. add the oozie_config_safety_valve
grep -s oozie_config_safety_valve $tmp_json > /dev/null 2> /dev/null
if [ $? -ne 0 ]; then
  awk '
  # match pattern reconstructed: insert the new entry right after the line
  # that opens the items array
  /"items"/ {
    print $0;
    print " \"name\" : \"oozie_config_safety_valve\",\n\"value\" : \"\\\\njava\\n\\n\\noozie.service.ELService.ext.functions.workflow\\\\n\\n\"\n}, { " ; next ;
  }
  {print $0}
  ' $tmp_json > ${tmp_json}.new
  mv ${tmp_json}.new $tmp_json
fi

# 2. add the oozie_executor_extension_classes
grep -s xxxHadoopCounter $tmp_json > /dev/null 2> /dev/null
if [ $? -ne 0 ]; then
  awk '
  # first rule reconstructed: remember the value to append when the
  # oozie_executor_extension_classes entry is found
  /oozie_executor_extension_classes/ { addvalue=",xxxHadoopCounter" }
  /value" :/&&addvalue{sub(/\"$/,"");print $0 addvalue "\"" ;addvalue="";next}
  {print $0}
  ' $tmp_json > ${tmp_json}.new
  mv ${tmp_json}.new $tmp_json
fi

# 3. add the oozie_workflow_extension_schemas
grep -s wait-action-0.5.xsd $tmp_json > /dev/null 2> /dev/null
if [ $? -ne 0 ]; then
  awk '
  # first rule reconstructed, as in step 2
  /oozie_workflow_extension_schemas/ { addvalue=",wait-action-0.5.xsd" }
  /value" :/&&addvalue{sub(/\"$/,"");print $0 addvalue "\"" ;addvalue="";next}
  {print $0}
  ' $tmp_json > ${tmp_json}.new
  mv ${tmp_json}.new $tmp_json
fi

# Load the modified JSON back into CDH
curl -u admin:admin \
  -H "Content-Type: application/json" \
  -X PUT \
  http://$host:7180/api/${apiver}/clusters/${cname}/services/oozie/roleConfigGroups/oozie-OOZIE_SERVER-BASE/config \
  -d "`cat ${tmp_json}`" > /dev/null 2> /dev/null

Small bug when re-downloading Cloudera Parcels


I had an issue with the UIDs of users on different servers, so I had to delete and reinstall the cluster.

Unfortunately – I kept the old parcel files with the old UID:

ls -l /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel*
 -rw-r----- 1 cloudera-scm cloudera-scm 1558200266 May 12 14:51 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel
 -rw-r----- 1 cloudera-scm cloudera-scm 848904192 May 12 14:52 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.part
 -rw-r----- 1 522 522 41 Apr 7 13:33 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.sha

Note that the *.parcel.sha file still carries the old numeric UID (522) instead of the cloudera-scm account.

I saw that the parcel was being downloaded and re-downloaded in an endless loop.

In the log file I saw:
 2015-05-12 14:07:14,322 INFO MainThread:com.cloudera.parcel.components.PeriodicParcelTasks: Set up periodic parcel tasks every 60 minutes.
 2015-05-12 14:07:14,337 INFO ParcelUpdateService:com.cloudera.parcel.components.LocalParcelManagerImpl: Found files CDH-5.3.2-1.cdh5.3.2.p0.
 10-el6.parcel under /opt/cloudera/parcel-repo
 2015-05-12 14:07:14,352 WARN ParcelUpdateService:com.cloudera.parcel.components.LocalParcelManagerImpl: Error reading hash file: CDH-5.3.2-1
 .cdh5.3.2.p0.10-el6.parcel.sha /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.sha (Permission denied)
 at Method)
 at com.cloudera.parcel.components.LocalParcelManagerImpl.readFirstLineFromFile(
 at com.cloudera.parcel.components.LocalParcelManagerImpl.getParcelHash(
 at com.cloudera.parcel.components.LocalParcelManagerImpl.processParcel(
 at com.cloudera.parcel.components.LocalParcelManagerImpl.scanRepo(
 at com.cloudera.parcel.components.LocalParcelManagerImpl$
 at com.cloudera.parcel.components.LocalParcelManagerImpl$
 at java.util.concurrent.ThreadPoolExecutor.runWorker(
 at java.util.concurrent.ThreadPoolExecutor$

Once I deleted the old sha file, the downloading of the parcel ended and I could continue with the installation.
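To catch this earlier next time, here is a small sketch of a helper (the function name is mine, not a Cloudera tool) that lists files in the parcel repo whose owner is not the expected account:

```shell
# List files directly under a directory whose owner is not the expected
# account, e.g. leftover *.parcel.sha files with a stale UID.
# Assumes GNU coreutils stat (the -c '%U' format).
list_wrong_owner() {  # usage: list_wrong_owner <dir> <expected-owner>
    find "$1" -maxdepth 1 -type f | while read -r f; do
        owner=$(stat -c '%U' "$f")   # owner user name (GNU stat)
        if [ "$owner" != "$2" ]; then
            echo "wrong owner ($owner): $f"
        fi
    done
}
```

Running `list_wrong_owner /opt/cloudera/parcel-repo cloudera-scm` before starting Cloudera Manager would have flagged the bad .sha file; anything it prints can be chown'ed or deleted.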

I’d think there should be some kind of warning on the screen that something is wrong, so one won’t have to wait that long.

This happened with CDH 5.4.

Killing Oozie jobs from HUE, while in Kerberos mode, on an HDP cluster


Everything is written – just search in the right place :-)

We’ve followed all (or most) of the instructions to convert an HDP cluster to Kerberos mode.

But when we tried to kill a job from the HUE interface, we got the following error:

Problem: Error performing kill on Oozie job 0000000-150430180450260-oozie-oozi-W:

HTTP Status 401 –

type Status report


description This request requires HTTP authentication.

Apache Tomcat/6.0.37

To solve it, we had to read this –

If the parameter “” is set to true,
then one has to modify the file /etc/oozie/conf/adminusers.txt and add a list of permitted users, one per line.

If it is true and the file is empty, then trying to kill a job produces the error above.

I’ve added several users, as adding just oozie is not sufficient, and we still need to fine-tune the list of users in this file.
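A small sketch of how one might maintain that file idempotently (the helper name is mine, and any user names other than oozie are placeholders, not a recommendation):

```shell
# Append a user to Oozie's admin list only if it is not already listed
# (the file holds one user per line).
add_admin_user() {  # usage: add_admin_user <adminusers-file> <user>
    grep -qx "$2" "$1" 2>/dev/null || echo "$2" >> "$1"
}
```

For example: `for u in oozie someuser; do add_admin_user /etc/oozie/conf/adminusers.txt "$u"; done` — re-running it never duplicates entries.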

Also – one can run the example jobs in Oozie.

The example tar is here – /usr/share/doc/oozie-

After extracting it, change “localhost” to the FQDN of the host.
Then the folder has to be copied to HDFS:
hdfs dfs -copyFromLocal examples /user/oozie/.

I used the following command

oozie job -oozie http://`hostname -f`:11000/oozie -config examples/apps/map-reduce/ -run

And while the job ran (it gets suspended immediately), I killed it successfully from HUE.

Bottom line – sababa (“all good”). I can start the weekend now.

Using Falcon in HDP for backup – ongoing work


My project wants to check whether Falcon is suitable for backups between two clusters.

I followed the only example I could find:

The main difference between the example and my tests is that I created my own clusters and did not use the HDP sandbox.

The issues I’ve encountered:

Issue #1

When you run Falcon you’ll get the error:

Error: Invalid Execute server or port:
Cannot initialize Cluster. Please check your configuration for and the correspond server addresses.”

To overcome this, one needs to disable the parameter “yarn.timeline-service.enabled”.

Taken from here –

In the Ambari UI, click on YARN, click on Configs, under Application Timeline Server uncheck the box next to yarn.timeline-service.enabled, save, then restart YARN, then restart Falcon.

Issue #2

Trying to submit the process entity showed an error:

falcon entity -type process -submit -file emailIngestProcess.xml
Error: org.apache.hadoop.ipc.RemoteException( User: falcon is not allowed to impersonate falcon

For this you’ll need to change the parameter “hadoop.proxyuser.falcon.groups”

in HDFS → Configs to the right user permissions.
I just put “*” (asterisk) so it grants everything,
then restarted HDFS and the other services.
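In core-site.xml terms, this is what the Ambari change ends up writing; the companion hosts entry is shown for completeness and may already be set in your cluster:

```xml
<property>
  <name>hadoop.proxyuser.falcon.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.falcon.hosts</name>
  <value>*</value>
</property>
```

Granting “*” is the quick fix; a tighter group list is the better long-term setting.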

Issue #3

If you’re behind a proxy – you’ll have to change the script in HDFS. This can be done from HUE using the file browser:
/user/ambari-qa/falcon/demo/apps/ingest/fs/
Edit it and add your proxy server (export http_proxy=http://proxyserver:8080 – or whatever port you’re using).

Issue #4
Trying to load the rawEmailIngestProcess returns an error:

falcon entity -type process -schedule -name rawEmailIngestProcess
Error: null

Here there is probably a bug – the feed has to have an input. Taken from here –

The version I’m currently using probably does not have the fix.

I created an empty feed (copied rawEmailFeed.xml and modified it)

<?xml version="1.0" encoding="UTF-8"?>
<feed description="Empty feed" name="emptyFeed" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency> <!-- reconstructed; the original value did not survive -->
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/none"/> <!-- path reconstructed to match stats/meta -->
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0777"/>
    <schema location="/none" provider="none"/>
</feed>

and then loaded it

falcon entity -type feed -submit -file emptyFeed.xml

I modified the emailIngestProcess.xml – added inputs to it:

<inputs>
    <input name="input" feed="emptyFeed" start="now(0,0)" end="now(0,0)" />
</inputs>

and deleted and reloaded rawEmailIngestProcess

falcon entity -type process -delete -name rawEmailIngestProcess
falcon entity -type process -submit -file emailIngestProcess.xml

Issue #5

Because I’m installing my own clusters and not using the sandbox, one has to configure everything correctly:

Check out Chapter 19.3 in

Need to change the property oozie.service.HadoopAccessorService.hadoop.configurations into something like:


Where h153 and h156 are the host names of the two clusters’ name nodes and resource managers.
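Oozie’s documentation describes this property as a comma-separated map from a JobTracker/ResourceManager or NameNode authority to a local Hadoop configuration directory, so for the two clusters it would look something like this (the ports and directory paths here are assumptions, not values from my clusters):

```xml
<property>
  <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
  <value>*=/etc/hadoop/conf,h153:8050=/etc/hadoop/conf,h156:8050=/etc/hadoop/h156-conf</value>
</property>
```

The `*=` entry is the default; the per-authority entries let Oozie talk to the remote cluster with that cluster’s configuration.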

That’s it so far.

Next step – backing up Hive tables.

Removing alternatives of old CDH parcels


The title sounds like English, but it must be Chinese to most people…

Anyway – when changing Cloudera parcels we got stuck with alternatives pointing to the old version.

It seems that when one installs a new Cloudera version using parcels, it runs the alternatives command to set up the new path, but always with the same priority.

An example of the alternatives for zookeeper-client, after upgrading to CDH 5.3.0 without deleting the old CDH 5.0.2 parcel:

/etc/alternatives]# alternatives --display zookeeper-client
zookeeper-client - status is auto.
 link currently points to /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client
/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client - priority 10
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/zookeeper-client - priority 10
Current `best' version is /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client.

To solve this, I ran an awk command that creates `alternatives --remove` commands to delete the old paths.


cd /etc/alternatives
timestamp=`date +%Y%m%d_%H%M%S`
ls -l |awk '/CDH-5.0.2/{print "alternatives --remove",$9,$NF}' > /tmp/remove_CDH502_${timestamp}.sh

ls -ld /tmp/remove_CDH502_${timestamp}.sh
bash /tmp/remove_CDH502_${timestamp}.sh
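The same idea as a reusable sketch (the function name is mine; it only prints the commands, so the output can be reviewed before running it):

```shell
# Print `alternatives --remove` commands for every symlink in <dir>
# whose line in `ls -l` matches <old-version>, e.g. "CDH-5.0.2".
gen_remove_cmds() {  # usage: gen_remove_cmds <alternatives-dir> <old-version>
    ls -l "$1" | awk -v v="$2" '
        $0 ~ v { print "alternatives --remove", $9, $NF }  # $9 = link name, $NF = target
    '
}
```

For example, `gen_remove_cmds /etc/alternatives CDH-5.0.2 > /tmp/remove.sh` and then inspect /tmp/remove.sh before running it with bash.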

Of course the better solution is to have all the alternatives point to /opt/cloudera/parcels/CDH, which is already a link to the right version.

We also have not checked yet whether the alternatives are removed if we delete the older parcels from the server.

Anyway – this works for us now.

Pig script to convert snappy into gzip


My first ever pig script:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
A = load '/path/to/snappy/dir/part*' using PigStorage();
store A into '/path/to/gzip/dir' USING PigStorage();

It’s that simple :)

I don’t know why we chose snappy for compression, as we’ve found that many third-party tools could not read this type of data.
The compression is also not as strong.
E.g. from the Pig output:

Successfully read 6041224101 records (205720552522 bytes) from: "/path/to/snappy/dir/part*"
Successfully stored 6041224101 records (117690493503 bytes) in: "/path/to/gzip/dir"

and from hdfs du -s -h command:

> hdfs dfs -du -s -h "/path/to/snappy/dir/"
191.6 G  574.8 G  /path/to/snappy/dir/
> hdfs dfs -du -s -h "/path/to/gzip/dir"
109.6 G  328.8 G  /path/to/gzip/dir
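As a quick sanity check on the byte counters Pig reported above:

```shell
# Bytes read (snappy input) vs bytes stored (gzip output),
# taken from the Pig job summary above.
snappy_bytes=205720552522
gzip_bytes=117690493503
echo "gzip output is $(( gzip_bytes * 100 / snappy_bytes ))% of the snappy input"
```

That comes out to 57%, which roughly matches the du figures (109.6 G vs 191.6 G).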

Duplicating a Cloudera VM


As opposed to the VMDK provided by Cloudera, these steps show how to create a template from a VM on ESXi, managed by vSphere, with (almost) any component.

Basic steps:

  1. Create a working cluster with all required components on a single VM
  2. Shut the cluster down and turn the CDH services off
  3. Create template from VM and deploy a new VM from the template
  4. Perform hostname changes on the new VM
  5. Restart the cluster in the new VM and check everything works

I’ve checked the following components:

HDFS, Hive, Hue, Impala, Oozie, Spark, YARN (MR2 Included), ZooKeeper

And I used these instructions:
Note – the psql commands changed between v4-8-3 and v5-3-0

Step 1 – installation

Install CDH with required components and check everything works

Step 2 – shut down cluster

From CDH web console – Shutdown cluster and Cloudera Management Services

From Linux CLI:

service cloudera-scm-agent stop
service cloudera-scm-server stop
service cloudera-scm-server-db stop
chkconfig cloudera-scm-agent off
chkconfig cloudera-scm-server off
chkconfig cloudera-scm-server-db off

Note – the chkconfig is important, so the services will not restart automatically on the new VM

Step 3 – Create VM template

From Vsphere UI (or any other vmware tool)

  1. Point at the VM and right-click to “Clone to Template…”
  2. Go to the newly created template and right-click to “Deploy Virtual Machine…”
  3. Start the newly created VM

Step 4 – Perform changes in new VM

Change the hostname and set a new IP (you might need the VM console for this).
A useful command is system-config-network.

Cloudera related changes:

  1. Change the host name in the file /etc/cloudera-scm-agent/config.ini
  2. Update the host name in the following tables in Postgres
    Use DbVisualizer (or any other tool)
    or the command “psql -U cloudera-scm -p 7432 -d scm”
    The password is in /var/lib/cloudera-scm-server-db/data/generated_password.txt
    Note – some of the changes can be done from the CM UI after the hosts table is changed, i.e. not via psql.

Table hosts

update hosts set name='NEWHOSTNAME.FQDN' where host_id=1;

Table hosts_aud

select * from hosts_aud;
update hosts_aud set name='NEWHOSTNAME.FQDN' where host_id=1;

Table processes

select process_id,name,status_links from processes;

Create for each process its own update, something like:

update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8042/"}' where process_id=66;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8084/"}' where process_id=29;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8091/"}' where process_id=30;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50090/"}' where process_id=62;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50070/"}' where process_id=63;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50075/"}' where process_id=64;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25000/"}' where process_id=73;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:19888/"}' where process_id=65;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:11000/oozie"}' where process_id=74;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25010/"}' where process_id=71;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8088/"}' where process_id=67;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8086/"}' where process_id=27;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8087/"}' where process_id=28;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25020/"}' where process_id=72;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8888/"}' where process_id=75;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:18088"}' where process_id=82;
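Writing sixteen nearly identical UPDATEs by hand is error-prone; here is a sketch that generates them from unaligned psql output (`gen_updates` is my own helper name, and it assumes `psql -A -t` output with the default ‘|’ separator):

```shell
# Read "process_id|status_links" lines (psql -A -t output) on stdin and
# emit one UPDATE statement per process, with the host name swapped.
# \047 in the awk format string is a literal single quote.
gen_updates() {  # usage: gen_updates <old-host> <new-host>
    sed "s/$1/$2/g" | awk -F'|' \
        '{printf "update processes set status_links=\047%s\047 where process_id=%s;\n", $2, $1}'
}
```

For example: `psql -U cloudera-scm -p 7432 -d scm -A -t -c "select process_id,status_links from processes" | gen_updates OLDHOSTNAME.FQDN NEWHOSTNAME.FQDN` — review the output, then feed it back into psql.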

Note: Need to check the resource field as well

Tables configs_aud and configs (two similar tables)

select config_id,attr,value from configs_aud where value like '%OLDHOSTNAME%';

Update according to the output (and do the same for the configs table), e.g.:

update configs_aud set value='NEWHOSTNAME.FQDN' where config_id=63;
update configs_aud set value='NEWHOSTNAME.FQDN:7432' where config_id=16;

Table commands

select command_id,arguments from commands where arguments like '%OLDHOSTNAME%';
update commands set arguments='{"@class":"com.cloudera.cmf.command.BasicCmdArgs","alertConfig":null,"args":["NEWHOSTNAME.FQDN","postgresql","NEWHOSTNAME.FQDN:7432","amon","amon","gybJy2O6OM"],"scheduleId":null,"scheduledTime":null}' where command_id=16;

Step 5 – Start Cluster on new VM

On new VM – Linux CLI – Restart services:

service cloudera-scm-server-db restart
service cloudera-scm-server restart
service cloudera-scm-agent restart
chkconfig cloudera-scm-agent on
chkconfig cloudera-scm-server on
chkconfig cloudera-scm-server-db on

From CDH web console – Start Cloudera Management Services and the Cluster itself

From CDH web console – Go to each component’s configuration tab and search for remains of the old host name, e.g.:

  • in Hive – search for “Hive Metastore Database Host”
  • in Hue – search for “HDFS Web Interface Role”
  • in Zookeeper – search for “ZooKeeper Server ID”

From CDH web console – you might also need to “Deploy client configuration”

And lastly – clean up old log files:

find /var/log -type f | grep -i OLDHOSTNAME
rm -f `find /var/log -type f | grep -i OLDHOSTNAME`

Next task – a script or Puppet manifest to do it all.