
nss_ldap's undocumented nss_reconnect_tries 03/04/2006

after a recent ~x86 update and reboot, i found my gentoo work box stalled on boot at the udev stage. there may be a related bug in bugzilla, but i thought i would document it here as well because it is quite interesting.

the stall happens because i use ldap on my box at work, more specifically nss_ldap. it turns out that udev was trying to look up some groups that don't exist in /etc/group, and was therefore falling through to ldap over the network.

however, because udev comes up before the network devices are initialised, none of the ldap lookups can succeed. to make matters worse, nss_ldap is set (for whatever reason) to retry indefinitely until it gets a response.

so after googling a bit, i found a helpful suggestion from someone who had also hit this problem: set "bind_policy soft" in ldap.conf. that makes nss_ldap return a negative result if it cannot connect to the ldap server, rather than retrying indefinitely.
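for reference, that setting would look roughly like the fragment below. this is only a minimal sketch: the file path in the comment is an assumption (it differs between distributions), and the rest of the file is left as it was.

```
# /etc/ldap.conf (path assumed; some systems use a different file name)
# give up and return a negative result when the ldap server cannot be
# reached, instead of retrying forever
bind_policy soft
```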

that fixed my udev stall at boot, but it produced another problem -- i couldn't log in through ssh. nss_ldap would mysteriously (but reliably) fail to look up either a group or a user name somewhere during the ssh login process, and that would throw sshd off.

so the situation was that with "bind_policy soft" set, i could boot, but lookups would fail unpredictably whenever a large number of nss_ldap requests were generated. so i decided to trawl through the nss_ldap source code, and found the part that produced this error:

sshd[32527]: nss_ldap: could not search LDAP server - Server is unavailable

armed with that, it turns out that rather than blindly telling nss_ldap never to retry, we can limit the number of retries using nss_reconnect_tries. according to the changelog it was added in version 241, but it is oddly undocumented. i wonder why.

241     Luke Howard <lukeh@padl.com>

        * new, more robust reconnection logic
        * both "host" and "uri" directives can be used in
          ldap.conf
        * new (undocumented) nss_reconnect_tries,
          nss_reconnect_sleeptime, nss_reconnect_maxsleeptime,
          nss_reconnect_maxconntries directives

anyway, that was my solution: set the retries to something that is not infinite. that way udev gets a proper error when it tries to look up the non-existent groups, and i can still tolerate one or two refused ldap connections when the connection limit has been reached.
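as a rough sketch, the change amounts to something like the fragment below. the file path and the value 2 are only illustrations, not recommendations, and i am assuming that "bind_policy soft" is taken back out so that nss_ldap is still allowed to retry a few times. the other directives named in the changelog are left unset here because i have not found their semantics documented.

```
# /etc/ldap.conf (path assumed; some systems use a different file name)
# "bind_policy soft" is assumed to be removed, so lookups are still
# retried, but only a bounded number of times
nss_reconnect_tries 2

# also listed in the changelog for version 241, left unset in this sketch:
#   nss_reconnect_sleeptime
#   nss_reconnect_maxsleeptime
#   nss_reconnect_maxconntries
```

the idea is simply that a finite retry count lets early-boot lookups fail and fall back to local files, while a login that hits one refused connection still gets another attempt.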

i don't know if it is just my situation or what, but i hope this helps someone.

  • 20 comments
  • gentoo linux
  • 3 years, 10 months ago (03/04/2006)
Justin says
20
bless your soul, been dealing with this for over a year and a half, and the other day I decided to look into the issue again and found this blog post. A thousand internets to you
  • 4 weeks ago (11/01/2010)
Glenn says
19
You may have just helped us solve the login problem we've been having for over a year now but never bothered to look into. Good spot! Thanks.
  • 4 months, 1 week ago (02/10/2009)
Ed van der Zalm says
18
Great! It works. What a waste of time

Anyway Thanks a lot
  • 1 year, 2 months ago (12/11/2008)
cmc says
17
This helped me on FreeBSD.

You are a god, thank you. Have my manchildren.
  • 1 year, 4 months ago (25/09/2008)
Chris says
16
A bit of a late reply, but I'd just like to say "thanks" too: I've been tearing my hair out all day with this one, and Googling has produced little more than other unanswered questions until now.

This also fixes another mysterious bug where LDAP groups weren't being added to a user ID, thankfully!
  • 1 year, 4 months ago (20/09/2008)
Trever Fischer says
15
Wonderful! Thanks for this quick fix. I've got the same kind of setup at my home network and for some mysterious reason my ldap server started failing these lookups. The only reason I never noticed was because nscd managed to keep everything in cache.
  • 2 years, 6 months ago (13/08/2007)
Francis Giraldeau says
14
Maybe there is another thing that can be done, which is self-explanatory in the sources:

nss-ldap.c, version 251

/*
 * If the file /lib/init/rw/libnss-ldap.bind_policy_soft exists,
 * then ignore the actual bind_policy definition and use the
 * soft semantics. This file should only exist during early
 * boot and late shutdown, points at which the networking or
 * the LDAP server itself are likely to be unavailable anyway.
 */
if (access("/lib/init/rw/libnss-ldap.bind_policy_soft", R_OK) == 0)
    hard = 0;
  • 2 years, 10 months ago (23/03/2007)
DP110 says
13
Problem solved... ...many thanks!

My master server had a chicken-and-egg problem when it came to initialising udev within the boot process...
I added:
nss_reconnect_tries 2
and now instead of waiting forever, it finally decides that using 'files' from within nsswitch is ok to use and after an approximately 24 second wait, continues and starts slapd later on (with another 24 second wait).

Now that I have bookmarked this page for future reference, I'm going to try the other settings to see if I can reduce the wait times further...

I had considered abandoning LDAP, thanks for sharing this useful information as all my other nodes have now been updated too :)

BTW... the Gentoo nss_ldap i use is sys-auth/nss_ldap-249 ...looks like the baselayout ebuild in portage was fixed for devfs (2.4 kernel) and nss_ldap... not sure what was going on with udev tho (kernel greater than 2.6.9).
  • 3 years, 3 months ago (29/10/2006)
Martin says
12
I had the same problem on my gentoo box of being able to ssh into /etc/passwd accounts, but not into ldap ones. The only thing that worked for me was to downgrade nss_ldap to 239-r1 - obviously not the same issue you had, but I'm leaving it here for the sake of anyone googling
  • 3 years, 3 months ago (26/10/2006)
Nicolas says
11
Thanks. I still do not understand why ssh fails the first connection, but anyway it works now!
  • 3 years, 4 months ago (02/10/2006)
Omar says
10
Thanks. I had the same problem binding groups using ldap on my machine. I searched but couldn't find a solution until I read your page.
  • 3 years, 6 months ago (11/08/2006)
Bendany Qian says
9
I have the same situation.

I am using FreeBSD 6.1, with nss_ldap and pam_ldap on one machine.
When the machine boots and starts the slapd server, nss_ldap tries to connect to the ldap server while it is still starting, so slapd startup is frozen for a couple of minutes. After that it is ok.
I searched for a solution, and somebody told me to change hard->soft in nss_ldap.conf.
It worked, until today: I cannot ssh into my system using an account in ldap. I don't know why!!!

thank god. ;-)

you are saving me a lot of time!
  • 3 years, 7 months ago (29/06/2006)
Randy says
8
I was beating my head against this one for a couple hours. Thanks for posting this fix.
  • 3 years, 7 months ago (16/06/2006)
Mathias says
7
It works, thanks a lot :)
  • 3 years, 8 months ago (13/06/2006)
julien says
6
in /etc/ldap.conf
  • 3 years, 8 months ago (07/06/2006)
julien says
5
on gentoo i tried putting "nss_reconnect_tries 2" but still have the sshd problem... any help?
  • 3 years, 8 months ago (07/06/2006)
Cameron says
4
Just ran into this one. Your page was the first one to pop up on Google - had everything I needed.

Thanks a bunch!
  • 3 years, 8 months ago (03/06/2006)
Mike Gillis says
3
this is very handy. thanks for posting about it.
  • 3 years, 8 months ago (26/05/2006)
Adam Jacob says
2
You can set this as low as "2" in order for it to work.
  • 3 years, 8 months ago (16/05/2006)
Adam Jacob says
1
That's exactly the right fix for this problem. Strange that, for me, I don't get any other failures from nss_ldap... only that first attempt from ssh.

Fascinating.
  • 3 years, 8 months ago (16/05/2006)