From time to time there happens different problems with connections to Oracle database. Below little report about those I have met.
problems on client
Usually they are trivial and bound with tnsnames.ora. Wrong PATH variable (for example few ORACLE_HOME/bin entries), wrong LD_LIBRARY_PATH, entries in the tnsnames.ora with some unusual blank character, wrong syntax in this file, wrong service_name. For example today 2 different tnsnames.ora files, while part of programs uses the right one, and some the wrong one.
While trivial, sometimes hard to diagnose, especially by phone.
Fortunately it is easy to get very detailed trace.
We need in sqlnet.ora few entries:
TRACE_LEVEL_CLIENT=16 #or SUPPORT
TRACE_FILE_CLIENT=mytrace #useful for identification
TRACE_DIRECTORY_CLIENT=traces #useful for location
The output is quite detailed and usually helpful in the diagnosis.
problems on cman
Of course if one uses it (we do).
I have met 3 major problems with it (and this is relevant for 11g version).
The first one was with the raise of 11g version. The cman needs one rule to connect to itself. Till the 10g it was for example:
(RULE=(SRC=10.0.50.10)(DST=127.0.0.1)(SRV=cmon)(ACT=accept))
With 11g it must be (at least on Linux):
(RULE=(SRC=10.0.50.10)(DST=::1)(SRV=cmon)(ACT=accept))
The Metalink (ID 298916.1) still provides address 127.0.0.1 as a right one, and indeed the cman starts, but one can not connect to it with cmctl. Possibly it works, but one can not control it. I always needed to kill the cman process with kill utility to change anything (and no, we did not use it on production before the problem was solved).
The second problem was with some older RDBMS versions (ie. before 10g) - we needed to add to simple rule some settings, for example:
(RULE=(SRC=10.0.50.12)(DST=10.0.60.11)(SRV=*)(ACT=accept)(ACTION_LIST=(mct=0)))
Usually it is enough to end with ACT attribute. This particular case with RDBMS servers on Windows kind os.
The third one was with error in syntax. Once we loose somewhere in the middle of cman.ora a parenthesis. The cman kept all the rules from the time before error emergence, thus in the rules there were rules located before the point of error as well as those after it. When we added rules before that point and reloaded, everything was fine, if we added after, the rules were ignored. Because we always reloaded, the cman worked and for some time we were unaware of the problem at all.
The true incident we met, when were trying to restart the cman. It went down and we can not start it up. IIRC the returned error was about problem in the configuration file in the end of it (which of course was due to mess in parentheses). The solution here was to display the config with VIM and color the syntax. Eventually we found the place for lacking parenthesis. The problem was trivial, but not so obvious.
To those three I add yet another - in our case it was lack of mapping IP->hostname in /etc/hosts, possibly similar situation if no proper name in DNS service available (if used instead of /etc/hosts mapping). In result the cman service returned ORA-12529, even though the cman.ora configuration does not use host names at all (all rules with IP).
The same happens when the entry in /etc/hosts differs from DST hostname.
problems on server
Both problems on server had something to do with firewalls.
Once it was MTU parameter, which default value appeared too large after some firewall software update.
The other problem was with inactivity timeout. Some of our firewalls disconnect connections, on which there is no traffic during specified time. A solution here seems to be enabling dead connection detection (DCD) feature. This feature implementation is based upon sending for time to time some bytes as a heartbeat - thus there is some traffic on a connection, so it is not broken.