Thursday 18 August 2022

Significance of chroot in Kafka & Zookeeper


Chroot — you might have heard this keyword many times, but do you know what it signifies?

Brief: Chroot shows up in Kafka configuration as zookeeper.chroot, one of the Kafka configuration parameters. ZooKeeper is robust distributed software, and one of its core features is coordination: Kafka hands off certain coordination tasks to ZooKeeper.

Actually, chroot is nothing but the path under which any dependent software (software that depends on ZooKeeper) writes its content. Generally, one ZooKeeper cluster can be used by different services at the same time (other ecosystem components such as Solr, HBase, Curator and many more); we can even plan to use the same ZooKeeper cluster for different Kafka clusters (such as Kafka cluster-1 and Kafka cluster-2).

Originally chroot used to have the value /, i.e. any service could write anything at the root directory. This can create potential challenges or conflicts when multiple services are in use, because several services might write to the same root and corrupt each other's data or create problems. To avoid this, it was suggested/decided that the Kafka service must have a different path, so that it writes to its own respective location and avoids any conflict. Say, for example, / was changed to /kafka, meaning Kafka will write its data under the /kafka directory.
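For illustration, here is how the chroot typically appears in a broker's server.properties. In Apache Kafka the chroot is appended to the zookeeper.connect string, while some distributions expose it as a separate zookeeper.chroot setting; the ZooKeeper hostnames below are placeholders, not values from any real cluster.

# server.properties -- the trailing /kafka is the chroot (example hosts)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka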

Suppose the chroot parameter value is /kafka but you use / as the chroot path in your tools; you may then see discrepancies in the data or metadata. So make sure you first check what value zookeeper.chroot is set to, and then use the same value in your tool.
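A quick way to verify is to browse ZooKeeper directly. As a sketch (assuming the chroot is /kafka and zk1 is one of your ZooKeeper hosts), the zookeeper-shell utility shipped with Kafka can list the znodes under the chroot:

# list Kafka's znodes under the chroot (host and path are illustrative)
bin/zookeeper-shell.sh zk1:2181 ls /kafka
bin/zookeeper-shell.sh zk1:2181 ls /kafka/brokers/ids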

Enjoy Learning!!!

Monday 22 November 2021

Kafka: Boot-Strap and Advertised Listeners



Last night I was going through the YAML of a containerized Kafka backing service to learn how the containerized world actually works in terms of Kafka. A few of the settings enticed me most (BOOTSTRAP_SERVER, ADVERTISED_LISTENER). Now, these settings are not new — they exist in VM mode too; the point of interest is that the more complex the network, the more complex they turn out to be. In the VM world both these properties are tied to each node: for example, if the host is XXXX then both properties will contain XXXX (along with the port, of course), and if it is a distributed Kafka with another host, say YYYY, then again both properties will contain YYYY along with the port to interact with. But in the containerized world they are resolved through services.

These settings are very important for the client to interact with the service; if they are missing, the client won't be able to connect to the right service — and here the service is Kafka.

So let me give a brief about each of them, and the usage where they come into the picture in our world.

(i) Bootstrap Server: This bootstrap server variable contains the host and port of a Kafka broker. Whenever any client initiates a connection (by client, understand a producer or a consumer — it can be a microservice, a 3rd-party application, etc.), the client initiates that connection to a Kafka broker. Generally it is good practice to include all the endpoints (host:port) of the Kafka cluster in the bootstrap variable shared with the client, because if one host goes down, the application client can at least try to bootstrap against another host/broker, learn the cluster topology from there, and continue its work.

Also note, not all servers or hosts need to be bootstrap servers, but it is good practice to have all the cluster's brokers as part of the list.
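For example, a client-side bootstrap list covering several brokers might look like this (the hostnames are placeholders, not from any real cluster):

bootstrap.servers=broker-1.example.com:9092,broker-2.example.com:9092,broker-3.example.com:9092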

Now, if the question comes to your mind of how an application sees this, you may take any client code and you will see that each producer or consumer has this property implemented. Say, for example, here is how consumer code using Spring Boot implements it (ConsumerConfig and StringDeserializer come from kafka-clients, DefaultKafkaConsumerFactory from spring-kafka):

 

public ConsumerFactory<String, String> consumerFactory() {
    Map<String, Object> configProps = new HashMap<>();
    // Brokers the client contacts first to discover the cluster topology
    configProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    configProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    configProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    return new DefaultKafkaConsumerFactory<>(configProps);
}

 

By reading all this, if a question popped up in your mind as to why the client can't keep this information statically in some property file instead of discovering the brokers during bootstrap, then believe me, you are on the right path — and if you didn't manage to think of it, I'll make it easier for you. If we keep such information at the client end, it becomes static information; if in the future we add something, say a broker host or pod, then we have to edit the relevant information in many places, and this will make your life uneasy. So, to avoid this, it is better that the application bootstraps and discovers the rest from the server itself.

(ii) ADVERTISED_LISTENER: This property or variable again contains the host and port of a Kafka broker. Once the client initiates the connection during bootstrap, the broker returns metadata to the client, and this metadata is nothing but the advertised listener values. The metadata contains the list of all the brokers, and this is what the client then uses for all subsequent connections to produce or consume data. By providing the metadata, the client doesn't have to know the list of all the brokers at all times. The reason the client needs the details of all the brokers is that it will connect directly to one or more of them, based on which broker has data for the topic partition it wants to interact with.

Let's say the client is on node X and Kafka is on node Y, and we place a wrong entry in this listener variable, say for example localhost:9092. Once bootstrap is done, the broker will hand this metadata to the client, and the client will then try to produce to or consume from localhost — but the client is on node X, there is no Kafka there, and it will be a failure at the client end.
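For illustration, here is the shape of these settings in a broker's server.properties (the hostname is a placeholder; the key point is that advertised.listeners must be an address the client can actually reach):

# server.properties (illustrative values)
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://broker-1.example.com:9092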

Bottom line: this is a very important property, and it needs to be filled with correct values.

Enjoy Learning!!!!

Wednesday 17 November 2021

Elasticsearch : Primary Shards

In the DevOps world, Elasticsearch is one of the key components being used widely. To strengthen the knowledge further, here is a short tip on one very important basic aspect which one must consider while designing.

Statement: Did you ever wonder why, in Elasticsearch, the number of primary shards, once an index is created, cannot be increased further unless you recreate the index?

Answer: One of the reasons behind this is explained below:

shard = hash(routing) % number_of_primary_shards

The routing value here is just an arbitrary string, which defaults to the document's _id, but there is a provision to set it to a custom value. The routing string is passed through a hashing function that generates a number, which is then divided by the number of primary shards in the index to return the remainder. The remainder will be in the range 0 to number_of_primary_shards - 1, and gives us the number of the shard where a particular document lives.

So this actually explains why the number of primary shards can be set only when an index is created and never changed: if the number of primary shards were ever altered in the future, all previous routing values would become invalid and documents would never be found.
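To see this concretely, here is a minimal sketch of the routing formula in Java. One assumption to flag: Elasticsearch actually hashes the routing value with Murmur3, so String.hashCode() below is only a stand-in to illustrate the modulo behavior, not the real implementation.

import java.util.List;

public class ShardRouting {

    // Simplified stand-in for shard = hash(routing) % number_of_primary_shards;
    // real Elasticsearch uses Murmur3, not String.hashCode().
    static int shardFor(String routing, int numberOfPrimaryShards) {
        int hash = routing.hashCode() & 0x7fffffff; // force non-negative
        return hash % numberOfPrimaryShards;
    }

    public static void main(String[] args) {
        for (String id : List.of("doc-1", "doc-2", "doc-3")) {
            // The same document maps to a different shard once the
            // primary-shard count changes, so old lookups would miss.
            System.out.printf("%s -> shard %d of 5, but shard %d of 8%n",
                    id, shardFor(id, 5), shardFor(id, 8));
        }
    }
}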

So be very careful while designing the index.

Enjoy Learning!!!
