Kafka upgrade important note


Recently we’ve tested the upgrade process of our product.

Actually, it was a dual upgrade:

HDP 2.5.0 –> HDP 2.6.1 –> HDP 2.6.3

In between we’ve tested that our application can survive such upgrades.

After the second upgrade we discovered some issues, so we wanted to revert to the previous installation.

The cluster was on 4 VMs on VMware, and we had created a snapshot of the vApp, so we could go back quickly and easily enough.

But now we found a new problem – the Kafka topics had no data at all!

After a short “SSSSS-EMEK”, I think I’ve found the problem:

It has to do with the log.retention.hours parameter of Kafka.

We used the default log.retention.hours=168 (i.e. one week), and we took our time with the overall test. The time between the last snapshot (when the data was inserted) and the current time (after reverting to the snapshot) was well over a week, so all the data in Kafka was deleted.


Lessons learned:

  1. If you want to perform upgrade tests, do them fast
  2. Change the retention parameters
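
On the second point, retention can be raised for the duration of a long test, either broker-wide or per topic. A sketch only – the topic name, ZooKeeper host and values here are illustrative:

```shell
# broker-wide default, in server.properties (needs a broker restart):
#   log.retention.hours=720

# or per topic, without a restart (30 days, in milliseconds):
kafka-topics.sh --zookeeper zk-host:2181 --alter \
  --topic my-topic --config retention.ms=2592000000
```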

As for the rest – go and learn (as Hillel said).


I’m a Dinosaur!


Following this article, it suddenly hit me:

I’m a pre-epoch man!

My age is Epoch+12657600 (±43200) …
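
A quick sketch of the decode (assuming the offset means I’m 12657600 seconds older than the epoch; the ±43200 is half a day, so only the date is pinned down, not the hour):

```python
from datetime import datetime, timedelta

# "Epoch+12657600": born 12657600 seconds *before* 1970-01-01,
# hence pre-epoch; 12657600 s is exactly 146.5 days
epoch = datetime(1970, 1, 1)
print(epoch - timedelta(seconds=12657600))  # 1969-08-07 12:00:00
```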

Something works for a change (at least partially)


We have a sudden request from the sales department:

They want to see reports on an iPad to show the potential customers.


Should be easy, right?

  • Monday
    • We got the request
    • As the back-end server now requires a “Mobile” component, I decide to install a new instance on another server
    • and fail
    • The server needs an Oracle server for its metadata
    • During the installation I got some warnings, which I ignored
    • Copying a project from another server resulted in space issues in the instance
    • I’ve resized Oracle tablespaces and recreated the metadata
    • Still warnings during installation (about UTF settings and such)
    • Meanwhile a mail is sent by the product person that the installation is done
  • Tuesday
    • Failed to use the instance, so I switched to another Oracle instance
    • No errors
    • We copy the relevant projects and set up the server
    • Now – how do we test that it works on iPad? The sales person had already flown to the customer
    • So – I install Android SDK + Emulator on my desktop
    • My desktop barely survives the emulator’s memory requirements.
    • I find how to connect the emulator to the network:
emulator.exe -avd Nexus_10_API_25 -dns-server
    • and how to install the APK:
adb.exe install /path/to/Product_GA_Android.apk
    • But I cannot find how to connect the report application to the right URL to see the reports.
  • Wednesday
    • The product’s support person is OOO
    • We manage to put our hands on two iPads
    • New issue – how to connect the iPads to the corporate network?
    • Finding the right person who can do it is a major task
    • I have some other tasks
  • Thursday (today)
    • The product’s support person is here
    • While showing him the emulator I do some extra clicks
    • and voila – I can see the reports in the emulator!!
    • We’re half way there…
    • Applause
  • TODO
    • Now we need the network and security teams to allow the network connection and we can do the final reports tuning.
    • We’ll meet them on Sunday.
    • Getting an iMac so we can have iPad emulator on it (a tough task)
    • Although – I’m sure that when the sales team does a demo for the next customer, the network will not be available.

Now we can rest for the weekend

How to delete a cluster in HortonWorks (for testing purposes)



During my testing of creating clusters using blueprints, I just want to recreate the cluster – not reinstall all the packages.

Unlike with the other vendor we’re using, it is impossible to delete a cluster in HortonWorks.

It is possible to remove components and hosts, but you’ll end up with a single component (usually ZooKeeper) on a single host, and this last component cannot be deleted.

So the steps should be:

Shutting down and cleaning cluster

  • Stop cluster
  • Remove all components in the right order of dependency (zookeeper will be last)
  • Remove all hosts till one is left
  • Stop ambari-agents on all hosts
  • Stop ambari-server
  • Stop PostgreSQL

Removing and recreating the cluster

  • Cleanup the PostgreSQL data dir – /var/lib/pgsql/data
  • Initialise the DB using “postgresql-setup initdb”
  • Setup ambari if required, e.g.
    “ambari-server setup --jdbc-db=postgres --jdbc-driver=/path/to/postgresql-jdbc.jar”
  • Initialize ambari-server:
    “ambari-server setup -j /path/to/java/jdk/ -s”
  • Start ambari-server and agents
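
Strung together, the whole reset looks roughly like this (paths as in the bullets above; run the agent commands on every host):

```shell
# stop agents (every host), then the server and its DB
ambari-agent stop
ambari-server stop
service postgresql stop

# wipe and re-initialise the metadata DB
rm -rf /var/lib/pgsql/data
postgresql-setup initdb

# re-setup and restart Ambari
ambari-server setup --jdbc-db=postgres --jdbc-driver=/path/to/postgresql-jdbc.jar
ambari-server setup -j /path/to/java/jdk/ -s
ambari-server start
ambari-agent start
```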

And voila – empty ambari, ready for playing with blueprints.


TODO:

  • Remove all components using REST APIs
  • Automate the process with a single script
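
Removing services over the Ambari REST API can be sketched like this (host, cluster name and credentials are illustrative; a service must be stopped before it can be deleted):

```shell
# stop, then delete a service (repeat per service, dependencies first)
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop ZOOKEEPER"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://ambari-host:8080/api/v1/clusters/mycluster/services/ZOOKEEPER
curl -u admin:admin -H 'X-Requested-By: ambari' -X DELETE \
  http://ambari-host:8080/api/v1/clusters/mycluster/services/ZOOKEEPER
```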

Upgrading RedHat 6.x to 7.y


Bottom line

I did it!!

My recommendation

Don’t do it…





And the story goes like this:

Our application is based on a cluster of Hadoop, installed with HortonWorks distribution.
One of the customers is going to start with Version X and then sometime upgrade to Version X+1. These versions require a change of platform matrix: from HDP 2.2 to HDP 2.3 or 2.4, and from RHEL 6.5 to RHEL 7.1.

Checking with the vendor, their recommendation was to upgrade the Hadoop cluster first, and then upgrade the OS.

Checking with their competitor vendor, they specifically do not recommend an upgrade of the OS and prefer a fresh installation of RHEL 7.1.

Checking with RedHat they say that it is doable, but they give a long list of constraints.

So I started to check it on some VMs we had.

Following the instructions from here –

First obstacle

I needed IT to give me access to the various channels required.
It took them some time, but they did manage to give it to me.

Second obstacle

Red Hat version (6.5 vs. 6.7).
In the above instructions the first stage is to update all packages of the OS to the latest, i.e.

yum update -y

meaning to upgrade to RHEL 6.7

IT gave me access to local channels with 6.5 only. I did not notice at first, but it can be done afterwards – no harm done.

Third obstacle

Installing the pre-upgrade utility. Should be simple:

yum -y install preupgrade-assistant preupgrade-assistant-ui preupgrade-assistant-contents

But I failed on missing RPMs – mainly openscap. It was not actually missing; there was a version mismatch between the i686 and x86_64 packages.

After a long battle, I found a site to download openscap with the lower version, and installed the RPMs locally.


Only after installing the above RPMs, with this exact version (I tried a few others), did I manage to install the pre-upgrade tool and run it.
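
The local install boils down to feeding both architectures of the same build to yum at once (filenames illustrative):

```shell
# both arches must carry the exact same version-release
yum -y localinstall openscap-*.i686.rpm openscap-*.x86_64.rpm
```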

Fourth obstacle

Output of the pre-upgrade utility itself.

It gave many warnings that need to be checked, and only a few FAILs that have to be resolved.

One of the errors was that I had the wrong OS flavor, as I did not upgrade to 6.7. I managed to do it without IT’s help, using a rhel67.iso file that I had.

Another was about the /usr directory, which can no longer reside on a separate FS because of some changes done in RHEL 7.1.

Other issues related to the eth naming convention (good that this is just a VM with a single network link and not a physical server…)

And a few other issues – some look more important than others, but none are showstoppers.

One of the “funny” issues is that there are many RPMs installed that are not signed by RHEL. When I look at the list, most, if not all of them, are Hadoop-related RPMs…

Running the upgrade

RedHat warns in their instructions that it may take a long time, but in my small VM it was actually very fast, and even the reboot brought the VM alive with no issues.
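
For reference, the upgrade step itself is driven by redhat-upgrade-tool, along these lines (the target version and repo URL here are illustrative – use the ones from your own setup):

```shell
redhat-upgrade-tool --network 7.1 \
  --instrepo http://repo-host/rhel-7-server/x86_64/os
reboot
```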

Fifth obstacle

Let the fun begin.
The VM is connected to the Hadoop cluster, meaning that ambari-agent is running OK.
But no other process of Hadoop is willing to start.

To Be Continued…

Hortonworks fails to create namenode high availability


We worked for more than a day to sort this out, until we found a way to work around it.

We have Ambari 1.6.1 running HDP 2.1 (quite old, but this is what our customer has).

When one tries to enable HA for the namenode, there is a nice wizard telling you to do this and that.
On step 4 we should do:

hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

and then wait till the checkpoint is created.
But the checkpoint is never detected.

We even found a jira about this:

“Ambari NN HA wizard cannot detect checkpoint” –

But the fix is in Ambari 1.7.0 – which does not support HDP 2.1 :-(


Our workaround: we copied the wizard URL to another tab in the browser and just changed the step number.

Then we could continue with the wizard.

Updating CDH configuration using python and REST APIs


I have improved the script from my previous post.

Now I’ve composed a Python script (my first serious Python script… and I think it could be done easily in bash as well…)

The script could be beautified, but I think it is quite readable as it is now.

This particular script modifies two parameters for YARN.

The parameters and their values are hard-coded, but it should not be too complicated to take everything as input variables.

Next step – creating CDH clusters using scripts only.

import urllib2
import sys, getopt
import base64
from urlparse import urlparse
import json
from pprint import pprint
import cdhRest
yarn_json = ' { "items" : [ { "name" : "yarn_nodemanager_resource_cpu_vcores", "value" : "8" }, { "name" : "yarn_nodemanager_resource_memory_mb", "value" : "8192" } ] }'
content_header = {'Content-type':'application/json', 'Accept':'application/vnd.error+json,application/json', 'Accept-Version':'1.0'}
yarnJsonObj = json.loads(yarn_json)
def main(argv):
  chost = ''
  username = ''
  password = ''
  try:
    opts, args = getopt.getopt(argv, "h:u:p:", ["chost=", "user=", "pass="])
  except getopt.GetoptError:
    print 'usage: -h <chost> -u <username> -p <password>'
    sys.exit(2)
  for opt, arg in opts:
    if opt in ("-h", "--chost"):
      chost = arg
    elif opt in ("-u", "--user"):
      username = arg
    elif opt in ("-p", "--pass"):
      password = arg

  base64string = base64.encodestring( '%s:%s' % (username, password))[:-1]
  authheader = 'Basic %s' % base64string
  apiver = cdhRest.getversion(chost, authheader)
  baseurl = "http://" + chost + ":7180/api/" + apiver + "/clusters"
  clusterslist = cdhRest.get_cluster_names(baseurl, authheader)
  for cluster in clusterslist:
    baseurl1 = baseurl + "/" + cluster + "/services"
    service = cdhRest.get_service_name_by_type(baseurl1, "YARN", authheader)
    baseurl1 = baseurl1 + "/" + service + "/roleConfigGroups"
    confgroups = cdhRest.get_conf_groups(baseurl1, "NODEMANAGER", authheader)
    for confgroup in confgroups:
      baseurl2 = baseurl1 + "/" + confgroup + "/config?view=full"
      req = urllib2.Request(baseurl2)
      req.add_header("Authorization", authheader)
      handle = urllib2.urlopen(req)
      thepage = handle.read()
      data = json.loads(thepage)
# example taken from here
      baseURL = baseurl1 + "/" + confgroup + "/config"
      request = urllib2.Request(url=baseURL, data=json.dumps(yarnJsonObj), headers=content_header)
      request.add_header("Authorization", authheader)
      request.get_method = lambda: 'PUT' #if I remove this line then the POST works fine.

      response = urllib2.urlopen(request)

if __name__ == "__main__":
  main(sys.argv[1:])


And this is the helper module (cdhRest.py):

import urllib2
import sys, getopt
import base64
from urlparse import urlparse
import json
from pprint import pprint

# function get version - get the REST api version of cm
def getversion(chost, authheader):
  theurl = "http://" + chost + ":7180/api/version"
  req = urllib2.Request(theurl)
  req.add_header("Authorization", authheader)
  handle = urllib2.urlopen(req)
  ret = handle.read().strip()
  return ret

# get_cluster_names - get list of cluster names
def get_cluster_names(theurl, authheader):
  req = urllib2.Request(theurl)
  req.add_header("Authorization", authheader)
  handle = urllib2.urlopen(req)
  thepage = handle.read()
  data = json.loads(thepage)
  ret = []
  for xx in data["items"]:
    ret = ret + [xx["name"]]
  return ret

# get_service_name_by_type - get service name by type
def get_service_name_by_type(theurl, type, authheader):
  req = urllib2.Request(theurl)
  req.add_header("Authorization", authheader)
  handle = urllib2.urlopen(req)
  thepage = handle.read()
  data = json.loads(thepage)
  for xx in data["items"]:
    if xx["type"] == type:
      return xx["name"]
  return None

# get_conf_groups - get list of configuration groups
def get_conf_groups(theurl, roleType, authheader):
  req = urllib2.Request(theurl)
  req.add_header("Authorization", authheader)
  handle = urllib2.urlopen(req)
  thepage = handle.read()
  data = json.loads(thepage)
  ret = []
  for xx in data["items"]:
    if (xx["roleType"] == roleType):
      ret = ret + [xx["name"]]
  return ret