
small bug when re-downloading Cloudera Parcels

I had an issue with the UID of users being different across servers, so I had to delete and reinstall the cluster.

Unfortunately, I kept the old parcel files, which still carry the old UID:

ls -l /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel*
 -rw-r----- 1 cloudera-scm cloudera-scm 1558200266 May 12 14:51 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel
 -rw-r----- 1 cloudera-scm cloudera-scm 848904192 May 12 14:52 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.part
 -rw-r----- 1 522 522 41 Apr 7 13:33 /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.sha

Note that the *.parcel.sha file is still owned by the old numeric UID (522) of the cloudera-scm account.

I then saw that the parcel was being downloaded and re-downloaded in an endless loop.

In the log file I saw:
 2015-05-12 14:07:14,322 INFO MainThread:com.cloudera.parcel.components.PeriodicParcelTasks: Set up periodic parcel tasks every 60 minutes.
 2015-05-12 14:07:14,337 INFO ParcelUpdateService:com.cloudera.parcel.components.LocalParcelManagerImpl: Found files CDH-5.3.2-1.cdh5.3.2.p0.
 10-el6.parcel under /opt/cloudera/parcel-repo
 2015-05-12 14:07:14,352 WARN ParcelUpdateService:com.cloudera.parcel.components.LocalParcelManagerImpl: Error reading hash file: CDH-5.3.2-1
 .cdh5.3.2.p0.10-el6.parcel.sha
 java.io.FileNotFoundException: /opt/cloudera/parcel-repo/CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.sha (Permission denied)
 at java.io.FileInputStream.open(Native Method)
 at java.io.FileInputStream.(FileInputStream.java:146)
 at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
 at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
 at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:287)
 at com.google.common.io.CharSource.openBufferedStream(CharSource.java:80)
 at com.google.common.io.CharSource.readFirstLine(CharSource.java:157)
 at com.google.common.io.Files.readFirstLine(Files.java:674)
 at com.cloudera.parcel.components.LocalParcelManagerImpl.readFirstLineFromFile(LocalParcelManagerImpl.java:392)
 at com.cloudera.parcel.components.LocalParcelManagerImpl.getParcelHash(LocalParcelManagerImpl.java:348)
 at com.cloudera.parcel.components.LocalParcelManagerImpl.processParcel(LocalParcelManagerImpl.java:182)
 at com.cloudera.parcel.components.LocalParcelManagerImpl.scanRepo(LocalParcelManagerImpl.java:142)
 at com.cloudera.parcel.components.LocalParcelManagerImpl$1.run(LocalParcelManagerImpl.java:155)
 at com.cloudera.parcel.components.LocalParcelManagerImpl$1.run(LocalParcelManagerImpl.java:152)
 at com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable.call(ReadWriteDatabaseTaskCallable.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

Once I deleted the old .sha file, the parcel download completed and I could continue with the installation.
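
For reference, the cleanup itself is simple – a sketch (adjust the parcel name to your version) that removes the stale files so Cloudera Manager can download the parcel again with the correct ownership:

cd /opt/cloudera/parcel-repo
# remove the stale checksum (still owned by the old UID) and the half-downloaded .part file
rm -f CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.sha CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel.part
# make sure anything that remains belongs to the current cloudera-scm user
chown cloudera-scm:cloudera-scm CDH-5.3.2-1.cdh5.3.2.p0.10-el6.parcel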

I’d think there should be some kind of warning on the screen that something is wrong, so one won’t have to wait that long.

This happened with CDH 5.4.

Killing oozie jobs from HUE, while in kerberos mode, on HDP cluster

Everything is already documented – just search in the right place :-)

We’ve followed all (or most) of the instructions to convert an HDP cluster to Kerberos mode.

But when we tried to kill a job from the HUE interface, we got the following error:

Problem: Error performing kill on Oozie job 0000000-150430180450260-oozie-oozi-W:

HTTP Status 401 –

type Status report

message

description This request requires HTTP authentication.

Apache Tomcat/6.0.37

To solve it, we had to read this – https://oozie.apache.org/docs/3.2.0-incubating/AG_Install.html#User_Authorization_Configuration

If the parameter “oozie.service.AuthorizationService.security.enabled” is set to true,
then one has to modify the file /etc/oozie/conf/adminusers.txt – add the list of permitted admin users, one per line.

If it is true and the file is empty, then when trying to kill a job, one gets the error above.

I’ve added several users, as adding just oozie is not sufficient; we still need to fine-tune the list of users that should be in this file.
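
Purely as an illustration (the user names below are placeholders, not a recommendation), adding entries looks like this:

# one user name per line; restart the Oozie service so it re-reads the file
cat >> /etc/oozie/conf/adminusers.txt <<'EOF'
oozie
hue
ambari-qa
EOF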

Also – one can run example jobs in oozie –

The example tar is here – /usr/share/doc/oozie-4.0.0.2.1.3.0/oozie-examples.tar.gz

After extracting the tarball, change “localhost” in job.properties to the FQDN of the host.
Then the examples folder has to be copied to HDFS:
hdfs dfs -copyFromLocal examples /user/oozie/.
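
Putting the preparation together, a rough sketch (paths are the ones from this cluster; adjust to yours):

# in a kerberized cluster, kinit as the appropriate user first
cd ~
tar -xzf /usr/share/doc/oozie-4.0.0.2.1.3.0/oozie-examples.tar.gz
# point the examples at this host instead of localhost
sed -i "s/localhost/$(hostname -f)/g" examples/apps/map-reduce/job.properties
hdfs dfs -copyFromLocal examples /user/oozie/.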

I used the following command

oozie job -oozie http://`hostname -f`:11000/oozie -config examples/apps/map-reduce/job.properties -run

And while the job was running (it gets suspended immediately), I killed it successfully from HUE.

Bottom line – sababa (all good). I can start the weekend now.

Using falcon in HDP for backup – ongoing work

My project wants to check whether Falcon is suitable for backups between two clusters.

Following the only example I could find: http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/

The main difference between the example and my tests is that I’ve created my own clusters and did not use the HDP sandbox.

The issues I’ve encountered:

Issue #1

When you run falcon you’ll get the error:

Error: Invalid Execute server or port: h153.amdocs.com:8032
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

In order to overcome this, one needs to disable the parameter “yarn.timeline-service.enabled”.

Taken from here – http://mail-archives.apache.org/mod_mbox/falcon-dev/201408.mbox/%3CCAF1jEfAcdchXOY5stdVEgPxZvNcf=-ATPSKZYk1DmX+4Aec1Fw@mail.gmail.com%3E

In the Ambari UI, click on Yarn, click on Configs, under Application Timeline Server uncheck the box next to yarn.timeline-service.enabled, Save, then restart Yarn, then restart Falcon.
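
A quick sanity check that the change actually reached the nodes (assuming the standard HDP client configuration path):

grep -A1 'yarn.timeline-service.enabled' /etc/hadoop/conf/yarn-site.xml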

Issue #2

Trying to submit the process entity showed an error:

falcon entity -type process -submit -file emailIngestProcess.xml
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: falcon is not allowed to impersonate falcon

For this you’ll need to change the parameter “hadoop.proxyuser.falcon.groups”

in the HDFS configuration (HDFS –> Configs) to grant the right user permissions.
I’ve just put “*” (asterisk) so it grants everything,
then restarted HDFS and the other services.
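
To verify the value landed in the generated client configuration (again assuming the standard path):

grep -A1 'hadoop.proxyuser.falcon' /etc/hadoop/conf/core-site.xml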

Issue #3

If you’re behind a proxy, you’ll have to change the ingest script in HDFS – this can be done from HUE using the file browser:
/user/ambari-qa/falcon/demo/apps/ingest/fs/ingest.sh
Edit it and add your proxy server (export http_proxy=http://proxyserver:8080 – or whichever port you’re using).

Issue #4
Trying to schedule the rawEmailIngestProcess returns an error:

falcon entity -type process -schedule -name rawEmailIngestProcess
Error: null

Here there is probably a bug – the process has to have an input feed. Taken from here –
http://mail-archives.apache.org/mod_mbox/falcon-dev/201408.mbox/%3CCAPyZWqot92MOhqqSMHBT2t8d801vmDJ_2b1r0-8N7hWr1+S3ug@mail.gmail.com%3E

The version I’m currently using probably does not have the fix.

I created an empty feed (copied rawEmailFeed.xml and modified it)

<?xml version="1.0" encoding="UTF-8"?>
 <!--
 A feed representing Hourly customer email data retained for 90 days
 -->
 <feed description="Empty feed" name="emptyFeed"
 xmlns="uri:falcon:feed:0.1">
 <tags>externalSystem=USWestEmailServers,classification=secure</tags>
 <groups>churnAnalysisDataPipeline</groups>
 <frequency>hours(1)</frequency>
 <late-arrival cut-off="hours(4)"/>
 <clusters>
 <cluster name="primaryCluster" type="source">
 <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
 <retention limit="days(90)" action="delete"/>
 </cluster>
 </clusters>
 <locations>
 <location type="data"
 path="/tmp/empty/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
 <location type="stats" path="/none"/>
 <location type="meta" path="/none"/>
 </locations>
 <ACL owner="ambari-qa" group="users" permission="0777"/>
 <schema location="/none" provider="none"/>
 </feed>

and then submitted it:

falcon entity -type feed -submit -file emptyFeed.xml

I modified emailIngestProcess.xml and added an inputs section to it:

diff
19a20,23
> <inputs>
> <input name="input" feed="emptyFeed" start="now(0,0)" end="now(0,0)" />
> </inputs>

and then deleted and resubmitted rawEmailIngestProcess:

falcon entity -type process -delete -name rawEmailIngestProcess
falcon entity -type process -submit -file emailIngestProcess.xml
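
and then the schedule command from Issue #4 can be retried:

falcon entity -type process -schedule -name rawEmailIngestProcess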

Issue #5

Because I’m installing my own clusters and not using the sandbox, one has to configure everything correctly:

Check out Chapter 19.3 in http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.1.0.0/bk_installing_manually_book/bk_installing_manually_book-20140110.pdf

The property oozie.service.HadoopAccessorService.hadoop.configurations needs to be changed to something like:

*=/etc/hadoop/conf,h153:8020=/etc/hadoop/conf/,h153:8032=/etc/hadoop/conf/,h156:8020=/etc/hadoop/conf/,h156:8032=/etc/hadoop/conf/

where h153 and h156 are the host names of the two clusters’ NameNodes (port 8020) and ResourceManagers (port 8032).

That’s it so far.

Next step – backing up Hive tables.

Removing alternatives of old CDH parcels

The title sounds like English, but it must read like Chinese to most people…

Anyways – when changing Cloudera parcels we got stuck with alternatives pointing to the old version.

It seems that when one installs a new Cloudera version using parcels, it runs the alternatives command to set up the new path, but always with the same priority.

An example for the zookeeper-client alternative, after upgrading to CDH 5.3.0 without deleting the old CDH 5.0.2 parcel:

/etc/alternatives]# alternatives --display zookeeper-client
zookeeper-client - status is auto.
 link currently points to /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client
/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client - priority 10
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/zookeeper-client - priority 10
Current `best' version is /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/bin/zookeeper-client.

To solve this, I ran an awk command that generates the alternatives --remove commands for the old paths.

#!/bin/bash
# Generate and run "alternatives --remove <name> <path>" for every
# alternative whose target still points at the old CDH 5.0.2 parcel.

cd /etc/alternatives
timestamp=`date +%Y%m%d_%H%M%S`
# in the `ls -l` output of a symlink, $9 is the link name and $NF is its target path
ls -l |awk '/CDH-5.0.2/{print "alternatives --remove",$9,$NF}' > /tmp/remove_CDH502_${timestamp}.sh

# review the generated script, then run it
ls -ld /tmp/remove_CDH502_${timestamp}.sh
bash /tmp/remove_CDH502_${timestamp}.sh

Of course, the better solution would be for all alternatives to point to /opt/cloudera/parcels/CDH, which is already a symlink to the right version.

We also have not checked yet whether the alternatives are removed when the older parcels are deleted from the server.

Anyways – this works for us for now.

pig script to convert snappy into gzip

My first ever pig script:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
A = load '/path/to/snappy/dir/part*' using PigStorage();
store A into '/path/to/gzip/dir' USING PigStorage();
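
To run it, save the four lines above to a file (the name below is just for illustration) and execute:

pig -f snappy_to_gzip.pig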

It’s that simple :)

I don’t know why we chose Snappy for compression, as we’ve found out that many third-party tools could not read this type of data.
The compression is also not as strong.
E.g., from the pig output:

Successfully read 6041224101 records (205720552522 bytes) from: "/path/to/snappy/dir/part*"
Successfully stored 6041224101 records (117690493503 bytes) in: "/path/to/gzip/dir"

and from hdfs du -s -h command:

> hdfs dfs -du -s -h "/path/to/snappy/dir/"
191.6 G  574.8 G  /path/to/snappy/dir/
> hdfs dfs -du -s -h "/path/to/gzip/dir"
109.6 G  328.8 G  /path/to/gzip/dir
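
So for the same 6,041,224,101 records, gzip takes about 109.6 GB versus 191.6 GB with Snappy – roughly 43% less space.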

Duplicating a Cloudera VM

Unlike the VMDK provided by Cloudera, these steps show how to create a template from a VM on ESXi, managed by vSphere, with (almost) any component.

Basic steps:

  1. Create a working cluster with all required components on a single VM
  2. Shut the cluster down and switch the CDH services off
  3. Create a template from the VM and deploy a new VM from the template
  4. Perform hostname changes on the new VM
  5. Restart the cluster on the new VM and check everything works

I’ve checked the following components:

HDFS, Hive, Hue, Impala, Oozie, Spark, YARN (MR2 Included), ZooKeeper

And I used these instructions:
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-8-3/Cloudera-Manager-Administration-Guide/cmag_change_hostnames.html
Note – the psql commands changed between v4-8-3 and v5-3-0

Step 1 – installation

Install CDH with required components and check everything works

Step 2 – shut down cluster

From CDH web console – Shutdown cluster and Cloudera Management Services

From Linux CLI:

service cloudera-scm-agent stop
service cloudera-scm-server stop
service cloudera-scm-server-db stop
chkconfig cloudera-scm-agent off
chkconfig cloudera-scm-server off
chkconfig cloudera-scm-server-db off
poweroff

Note – the chkconfig is important, so the services will not restart automatically on the new VM

Step 3 – Create VM template

From the vSphere UI (or any other VMware tool):

  1. Point at the VM and right-click to “Clone to Template…”
  2. Go to the newly created template and right-click to “Deploy Virtual Machine…”
  3. Start the newly created VM

Step 4 – Perform changes in new VM

Change the hostname and set a new IP (you might need the VM console for this).
A useful command is system-config-network.
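
A minimal sketch of doing it by hand on RHEL 6 (NEWHOSTNAME.FQDN and the IP are placeholders):

# set the hostname persistently and for the current session
sed -i 's/^HOSTNAME=.*/HOSTNAME=NEWHOSTNAME.FQDN/' /etc/sysconfig/network
hostname NEWHOSTNAME.FQDN
# make sure /etc/hosts maps the new IP to the new name
echo "192.0.2.10 NEWHOSTNAME.FQDN NEWHOSTNAME" >> /etc/hosts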

Cloudera related changes:

  1. Change the host name in the file /etc/cloudera-scm-agent/config.ini
  2. Update the host name in the following tables in Postgres
    Use DbVisualizer (http://www.dbvis.com/ or any other tool)
    or the command “psql -U cloudera-scm -p 7432 -d scm”
    The password is here – /var/lib/cloudera-scm-server-db/data/generated_password.txt
    Note – some of the changes can be done from the CM UI after the hosts table is changed, i.e. not via psql.

Table hosts

select HOST_ID, HOST_IDENTIFIER, NAME from HOSTS;
update hosts set name='NEWHOSTNAME.FQDN' where host_id=1;

Table hosts_aud

select * from hosts_aud;
update hosts_aud set name='NEWHOSTNAME.FQDN' where host_id=1;

Table processes

select process_id,name,status_links from processes;

Create an update statement for each process, something like:

update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8042/"}' where process_id=66;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8084/"}' where process_id=29;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8091/"}' where process_id=30;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50090/"}' where process_id=62;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50070/"}' where process_id=63;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:50075/"}' where process_id=64;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25000/"}' where process_id=73;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:19888/"}' where process_id=65;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:11000/oozie"}' where process_id=74;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25010/"}' where process_id=71;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8088/"}' where process_id=67;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8086/"}' where process_id=27;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8087/"}' where process_id=28;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:25020/"}' where process_id=72;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:8888/"}' where process_id=75;
update processes set status_links='{"status":"http://NEWHOSTNAME.FQDN:18088"}' where process_id=82;

Note: Need to check resource field as well
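
A possible shortcut for the processes table – only a sketch, it assumes status_links is a plain text column, so test it on a copy first:

psql -U cloudera-scm -p 7432 -d scm -c "update processes set status_links = replace(status_links, 'OLDHOSTNAME.FQDN', 'NEWHOSTNAME.FQDN') where status_links like '%OLDHOSTNAME%';"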

Tables configs_aud and configs (two similar tables)

select config_id,attr,value from configs_aud where value like '%OLDHOSTNAME%';

Update according to the output, e.g.:

update configs_aud set value='NEWHOSTNAME.FQDN' where config_id=63;
update configs_aud set value='NEWHOSTNAME.FQDN:7432' where config_id=16;
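
The same check has to be repeated on the configs table, for example from the command line:

psql -U cloudera-scm -p 7432 -d scm -c "select config_id,attr,value from configs where value like '%OLDHOSTNAME%';"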

Table commands

select command_id,arguments from commands where arguments like '%OLDHOSTNAME%';
update commands set arguments='{"@class":"com.cloudera.cmf.command.BasicCmdArgs","alertConfig":null,"args":["NEWHOSTNAME.FQDN","postgresql","NEWHOSTNAME.FQDN:7432","amon","amon","gybJy2O6OM"],"scheduleId":null,"scheduledTime":null}' where command_id=16;

Step 5 – Start Cluster on new VM

On new VM – Linux CLI – Restart services:

service cloudera-scm-server-db restart
service cloudera-scm-server restart
service cloudera-scm-agent restart
chkconfig cloudera-scm-agent on
chkconfig cloudera-scm-server on
chkconfig cloudera-scm-server-db on

From CDH web console – Start Cloudera Management Services and the Cluster itself

From CDH web console – go to each component’s configuration tab and search for remains of the old host name, e.g.:

  • in Hive – search for “Hive Metastore Database Host”
  • in Hue – search for “HDFS Web Interface Role”
  • in Zookeeper – search for “ZooKeeper Server ID”

From CDH web console – might also need to “Deploy client configuration”

And lastly – clean up old log files:

find /var/log -type f |grep -i OLDHOSTNAME
rm -f `find /var/log -type f |grep -i OLDHOSTNAME`

Next task – write a script or a Puppet module to do it all.

YARN and logstash [FAIL] :(

I really tried, but so far I have failed at sending the logs of YARN tasks to logstash (using CDH 5.3).

(and now will someone please come and say: “hey, it’s easy: just add bla to foo” and make me happy again…)

What I did so far:

  • Installed logstash, kibana and elasticsearch
  • Played with logstash and log4j with a small Java program
  • Added parameters to the YARN log4j properties and messed things up

First – install logstash and friends:

Just follow this excellent blog post – “How to install Logstash with Kibana interface on RHEL”.

Second – Java program:

I took the code from here – “Log4j Hello World Example”.
But, as I’m a CLI person, I just removed the package line – see the Java code below.

To compile – you’ll need java, javac and log4j.
On Red Hat, install using yum, then compile using javac and run (you’ll also need the log4j.properties, see below):

yum -y install java-1.7.0-openjdk java-1.7.0-openjdk-devel log4j
javac -cp /usr/share/java/log4j.jar:. HelloExample.java
java -cp /usr/share/java/log4j.jar:. HelloExample

Third – log4j.properties with SocketAppender for logstash

Took it from here – “Log4j SocketAppender and socket server example”. See the log4j.properties file below.

Fourth – connect CDH log files to logstash

There are several ways to do it:

  • rsyslogd
    • As there is already a rsyslog installed on RHEL, this is easy.
    • Just add a configuration file – /etc/rsyslog.d/logstash.conf – see below an example
    • You’ll need to add each and every log of cloudera processes – each log is located in a different library
    • Restart service rsyslogd
    • rsyslog documetation are here – “Welcome to Rsyslog”  – but I do not find them too user freindly, so i just modify their examples.
  • logstash.conf
    • To use logstash – you’ll need to install logstash on each node (well, obviously…)
    • configure the /etc/logstash/conf.d/logstash.conf – their site have good documentation “Logstash Config Language
    • Start the service logstash

The problem with both rsyslogd and logstash is that they accept a regexp only in the file name of the log, while the directory of the log has to be a full path.
This is not good for the YARN processes, as their logs are located under generated directories that include unique names.
So for most logs one can use /full/path/to/log/*.log, but the YARN logs would need something like /full/path/to/*log*/*log.
I thought I could use the log4j properties snippet as described below.

  • CDH log4j.properties
    • The configuration has to be done via CDH web admin (http://host:7180)
    • Go to each component and search for log4j.properties
    • You’ll get a list of snippets with a description like “For advanced use only, a string to be inserted into log4j.properties for this role only.”
    • After adding the lines of the log4j.properties listed below, the changed service has to be restarted.
    • CDH then regenerates the configuration directory and adds the lines to the conf file

Fifth – YARN fail

After doing all the previous steps, suddenly all my YARN Node Managers appeared as red.
The Java process was up, but:
the log files were not created (/var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-h84.amdocs.com.log.out)
and
port 8042 was not opened.

After investigating, I found that my configuration was bad, and I removed the log4j.properties snippet.

Conclusion so far – my attempt at sending the YARN logs to logstash has failed.

Last – what to do next

Until now I have been playing on our large cluster, interfering with other team members with my logstash experiments.
(Foolishly, I thought it would just work.)

Now I’m going to create a small cluster on a VM and will play there.

Once it succeeds, I’ll post again.

Appendix – here are the files I used

HelloExample.java

import org.apache.log4j.Logger;
public class HelloExample{
	final static Logger logger = Logger.getLogger(HelloExample.class);
	public static void main(String[] args) {
		HelloExample obj = new HelloExample();
		obj.runMe("zzup?");
	}
	private void runMe(String parameter){
		if(logger.isDebugEnabled()){
			logger.debug("This is debug : " + parameter);
		}
		if(logger.isInfoEnabled()){
			logger.info("This is info : " + parameter);
		}
		logger.warn("This is warn : " + parameter);
		logger.error("This is error : " + parameter);
		logger.fatal("This is fatal : " + parameter);
	}
}

log4j.properties

#Define the log4j configuration for the local application
log4j.rootLogger=ERROR, server
#We will use a socket appender
log4j.appender.server=org.apache.log4j.net.SocketAppender
#Port where the socket server (logstash) will be listening for the log events
#Note - this must match the port defined in the logstash config; trailing comments
#are not allowed on property lines, so keep the value by itself
log4j.appender.server.Port=4560
#Host name or IP address of the socket server - should be the logstash server
log4j.appender.server.RemoteHost=HOSTNAME
#Define any connection delay before attempting to reconnect
log4j.appender.server.ReconnectionDelay=10000

/etc/rsyslog.d/logstash.conf

# the imfile module is needed for the $InputFile* directives below
$ModLoad imfile

$InputFileName /var/log/cloudera-scm-server/db.log
$InputFileTag cloudera-scm-server-db:
$InputFileStateFile state-cloudera-scm-server-db
$InputRunFileMonitor

$InputFilePollInterval 10

# change the destination host to your logstash server
if $programname == 'cloudera-scm-server-db' then @@h135.example.com:5544
if $programname == 'cloudera-scm-server-db' then ~
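
/etc/logstash/conf.d/logstash.conf – this one is not something I have finished; a minimal sketch of what I have in mind (logstash 1.x syntax – the log4j input matches the SocketAppender port above, the tcp input matches the rsyslog forwarding port, and elasticsearch is assumed to run locally, so adjust to your setup):

input {
  # events sent directly by the log4j SocketAppender
  log4j {
    mode => "server"
    port => 4560
  }
  # lines forwarded by rsyslog (@@host:5544 above)
  tcp {
    port => 5544
    type => "rsyslog"
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
}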