
Some Real-time Issues faced during Cassandra Node Extension

Adding a node to a Cassandra cluster isn't as simple as it sounds. Several factors should be evaluated before proceeding. At Telenet, we encountered some interesting issues while adding 15 nodes to CASSANDRA_CLUST1 and 14 nodes to CASSANDRA_CLUST2.

Let's get started and try to understand these issues one by one.

Issue - The first new node could not be added to CASSANDRA_CLUST1. The error in system.log was as follows -

java.nio.file.FileSystemException: <PATH>/bb-3116-bti-Data.db:Too many open files.


Root Cause-
The hard and soft open-file limits on the CASSANDRA_CLUST1 nodes were not set as per the DataStax recommendations.
Solution-
We first raised the open-file limit on the new CASSANDRA_CLUST1 nodes. The change was tested on one node, after which re-running the Ansible job added the new nodes to CASSANDRA_CLUST1 successfully.

The existing values in /etc/security/limits.conf were -

<cassandra-user> hard nofile 100000
<cassandra-user> soft nofile 100000 

They were changed to -

<cassandra-user> hard nofile 1048576 
<cassandra-user> soft nproc 32768

Before the next attempt to add the new node, we moved the old directories aside on the CASSANDRA_CLUST1 node -
mv metadata metadata_bkp
mv saved_caches saved_caches_bkp
mv hints hints_bkp
mkdir data metadata saved_caches hints
cd /dbr0001/cas/
mv commitlog commitlog_bkp

Reference from DataStax -

https://docs.datastax.com/en/dse/6.8/dse-admin/datastax_enterprise/config/configRecommendedSettings.html?hl=recommended%2Cproduction%2Csettings

NOTE: The number of open files that can be set for any OS user is capped by the kernel parameter fs.nr_open, which is configured in the sysctl.conf file.
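As a rough illustration of that note, the sketch below compares a requested nofile value with fs.nr_open before limits.conf is edited. The function name nofile_fits_nr_open is made up for this example, and the script assumes a Linux host where /proc/sys/fs/nr_open is readable.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): succeeds when the requested
# nofile value does not exceed fs.nr_open.
# $1 = requested nofile value, $2 = nr_open value (defaults to the live kernel value).
nofile_fits_nr_open() {
    requested=$1
    nr_open=${2:-$(cat /proc/sys/fs/nr_open 2>/dev/null)}
    [ "$requested" -le "$nr_open" ] 2>/dev/null
}

# Check the hard limit used in this post against the running kernel.
if nofile_fits_nr_open 1048576; then
    echo "nofile 1048576 fits within fs.nr_open"
else
    echo "fs.nr_open is lower than 1048576 (or unreadable); check sysctl.conf first"
fi
```

After Cassandra has been restarted, the limits actually applied to the running process can be read from /proc/<pid>/limits, where <pid> is the Cassandra process id.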

Issue - The first new node could not be added to CASSANDRA_CLUST2. The error in system.log was as follows -

java.nio.file.FileSystemException: <PATH>/bb-3116-bti-Data.db:Too many open files. 

On the server side, the new CASSANDRA_CLUST2 node stopped responding, showing high load and excessive memory consumption. A ticket was raised with the DataStax team.


Root Cause-
a. The hard and soft open-file limits on the CASSANDRA_CLUST2 nodes were not set as per the DataStax recommendations.

b. Memory arguments were not optimally set on the CASSANDRA_CLUST2 nodes.

c. The zerocopy_streaming_enabled parameter was not set; DataStax recommended setting it to false.


Solution-

The existing values in /etc/security/limits.conf were -

<cassandra-user> hard nofile 100000
<cassandra-user> soft nofile 100000 

They were changed to -

<cassandra-user> hard nofile 1048576 
<cassandra-user> soft nproc 32768


At node level, the resources/cassandra/conf/jvm8-server.options file stores the -Xmx and -Xms settings, which are read during node startup. We increased the memory arguments from -Xmx19G to -Xmx31G on all the nodes (old and new). A rolling restart of the older nodes was then performed so that they picked up the new memory arguments.

zerocopy_streaming_enabled is set in the cassandra.yaml file under resources/cassandra/conf/ at node level; we set it to false as recommended by DataStax support.
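Before a restart, it can help to confirm that both files contain what was intended. The following is only a minimal sketch: the function names are invented for this example, and it assumes that zerocopy_streaming_enabled appears as a top-level, uncommented key in cassandra.yaml and that heap options start at the beginning of a line in the JVM options file.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Print the -Xms/-Xmx lines that are active (not commented out) in a JVM options file.
show_heap_options() {
    grep -E '^-Xm[sx]' "$1"
}

# Print the value of zerocopy_streaming_enabled from a cassandra.yaml file.
# Prints nothing if the key is absent or commented out.
show_zerocopy_setting() {
    sed -n 's/^zerocopy_streaming_enabled:[[:space:]]*//p' "$1"
}

# Example usage, with the paths mentioned in this post:
#   show_heap_options resources/cassandra/conf/jvm8-server.options
#   show_zerocopy_setting resources/cassandra/conf/cassandra.yaml
```

Running both helpers on every node before the rolling restart is a cheap way to spot a node that was missed by the configuration change.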


Conclusion- 

The two issues above were striking enough to document, although we faced many others while adding these nodes. The nodes were added using Ansible to reduce the number of manual steps. The complete extension activity took almost 3 weeks, as we added 29 nodes in total.

Some concluding observations to add -

  1. To verify that Cassandra has started successfully on a node, we can run - grep "DSE startup complete." logs/cassandra/system.log
  2. nodetool cleanup must be executed on every node, ONE AT A TIME, after all the nodes have been added.
  3. DataStax recommends a maximum of 200 tables per cluster, regardless of the number of keyspaces, so with Cassandra it is important to keep a check on the total number of tables. We tend to create backup tables on the assumption that they cost nothing unless a query accesses them, but with Cassandra that is not the case.
  4. A key URL that I recommend for Cassandra enthusiasts - 
    https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html#:~:text=We%20recommend%20a%20maximum%20of,of%20the%20number%20of%20keyspaces).
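Observations 1 and 2 can be sketched as two small shell helpers. This is an illustration only, not the Ansible job we used: the function names and the NODETOOL override are invented for this example, and running nodetool with -h against another host assumes that remote JMX access is enabled on the nodes (otherwise the same command can be run locally on each node in turn).

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Succeeds once the startup marker appears in the given system.log.
# -F treats the marker as a literal string, so the trailing dot is not a wildcard.
dse_started() {
    grep -qF "DSE startup complete." "$1"
}

# Run cleanup on each host in turn and stop at the first failure, so that
# two nodes are never cleaning up at the same time.
# Set NODETOOL="echo nodetool" for a dry run that only prints the commands.
cleanup_one_at_a_time() {
    for host in "$@"; do
        echo "cleanup starting on $host"
        ${NODETOOL:-nodetool} -h "$host" cleanup || return 1
        echo "cleanup finished on $host"
    done
}

# Example usage (host names are placeholders):
#   dse_started logs/cassandra/system.log && echo "node is up"
#   cleanup_one_at_a_time node1 node2 node3
```

Because the loop waits for each nodetool call to return before moving on, the one-at-a-time rule is enforced by the script itself rather than by the operator.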
