03 Apr 2006

nss_ldap's undocumented nss_reconnect_tries

after a recent ~x86 update and reboot, i found my gentoo work box stalled on boot at the udev stage. there may be a related bug in bugzilla, but i thought i would document it here as well because it is quite interesting.

the stall happens because i use ldap on my box at work, more specifically, nss_ldap. turns out that udev was trying to look up some groups that don't exist in /etc/group, and therefore was trying ldap over the network.

however, because udev comes up before the network devices are initialised, none of the ldap stuff is working. to add more problems to the mix, nss_ldap is set (for whatever reason) to try indefinitely until it gets a response.

so after googling a bit, a helpful suggestion from someone who also encountered this problem was to set "bind_policy soft" in my ldap.conf. that makes nss_ldap return a negative result if it cannot connect to the ldap server, rather than retry indefinitely.
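
for reference, a minimal sketch of that change. the path /etc/ldap.conf is an assumption (it is a common location for nss_ldap's config, but yours may differ):

```
# /etc/ldap.conf (path assumed)
# fail the lookup straight away if the ldap server cannot be reached,
# instead of retrying
bind_policy soft
```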

turns out that the solution worked for my udev start stall. but it produced another problem -- i couldn't log in through ssh. it turns out that nss_ldap would mysteriously (but reliably) fail somewhere during the ssh login process to look up either a group or user name, and that would throw sshd off.

so the situation is that if i set "bind_policy soft", i can boot up, but lookups fail unpredictably when a large number of nss_ldap requests are generated. so i decided to trawl around the nss_ldap source code, and found the offending part that produced this error:

sshd[32527]: nss_ldap: could not search LDAP server - Server is unavailable

armed with that, it turns out rather than blindly getting nss_ldap to refuse a retry, we could limit the number of retries using nss_reconnect_tries. in the changelog, it was added in version 241, but oddly undocumented. i wonder why.

241     Luke Howard <lukeh@padl.com>

* new, more robust reconnection logic
* both "host" and "uri" directives can be used in
* new (undocumented) nss_reconnect_tries,
nss_reconnect_sleeptime, nss_reconnect_maxsleeptime,
nss_reconnect_maxconntries directives

anyway, that was my solution: set the retries to something that is not infinite. that means udev should have a proper error thrown at it when it tries to look up the non-existent groups, and i can still tolerate one or two refused ldap connections when i've reached the connection limit.
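
a rough sketch of what that could look like in ldap.conf. the directive names come from the changelog above; the values are purely illustrative, not recommendations, and the meaning of the sleeptime directives is assumed from their names, so check the nss_ldap source for your version before relying on them:

```
# /etc/ldap.conf (path assumed)
# leave bind_policy at its default, i.e. remove any "bind_policy soft" line

# give up after a bounded number of reconnect attempts (illustrative value)
nss_reconnect_tries 3

# assumed: seconds to sleep between attempts, and the upper bound on that sleep
# (illustrative values)
nss_reconnect_sleeptime 1
nss_reconnect_maxsleeptime 8
```

the idea is to keep the total retry time short enough that udev gets its error quickly during boot, while still leaving room for a retry or two when sshd fires off a burst of lookups.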

i don't know if it is just my situation or what, but hope that might help someone.
