
Re: nslcd errors talking to IPVS cluster of LDAP servers



On Thu, 2010-10-07 at 11:02 -0400, Ken Gaillot wrote:
> Our shop runs a bunch of Debian lenny servers, some with LDAP-based 
> shell access using the libnss-ldap package. We decided to give 
> libnss-ldapd a try on a new server. We ran into problems with our LDAP 
> setup.

Which version of nslcd are you using? The one in lenny is 0.6.7. Version
0.7.4 saw some changes to connection error handling and 0.7.10 has some
more fixes that should handle some networking problems better.

Also, the development version no longer disables TCP keepalives
(disabling them was part of the nss_ldap legacy code). With keepalives
disabled, an idle connection could be what times out in the first place.
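
For reference, this is roughly what enabling keepalives on a socket
looks like. It is only a sketch in Python, not the nslcd code (nslcd is
written in C and goes through the LDAP library); the helper name and the
timing values are assumptions picked for illustration.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    idle, interval and count are illustrative values, not nslcd defaults.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are not available on every
    # platform (they are on Linux), so only set them when they exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With probes going out on an otherwise idle connection, a device between
client and server is less likely to consider the connection dead.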

Are you able to test any newer versions? I can provide a backport
version for lenny if you like.

> We traced that issue to log messages like these:
> 
> Oct  3 12:42:51 adonis nslcd[1517]: [cdfac0] ldap_result() failed: Can't contact LDAP server
> Oct  3 12:42:51 adonis nslcd[1517]: [cdfac0] ldap_abandon() failed to abandon search: Other (e.g., implementation specific) error
> Oct  3 12:42:52 adonis nslcd[1517]: [cdfac0] connected to LDAP server ldap://ldap.teamgleim.com
> Oct  3 13:30:20 adonis nslcd[1517]: [578454] ldap_search_ext() failed: Can't contact LDAP server
> Oct  3 13:30:20 adonis nslcd[1517]: [578454] no available LDAP server found, sleeping 1 seconds
> Oct  3 13:30:21 adonis nslcd[1517]: [578454] no available LDAP server found
> Oct  3 13:30:21 adonis nslcd[1517]: [578454] no available LDAP server found, sleeping 29 seconds

It looks like the error on the first request does not trigger a proper
reconnect for the next request. I remember seeing something like this
before, and I think it has been fixed in a later release, although I
can't find the change right now.
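
To illustrate the behaviour one would want here: when a request fails
because the server cannot be contacted, drop the connection and
reconnect before retrying. The following is a minimal Python sketch
under that assumption, not what nslcd actually does. ConnectionLost,
FlakySession and search_with_reconnect are hypothetical names.

```python
class ConnectionLost(Exception):
    """Raised by the hypothetical session when the server is unreachable."""

class FlakySession:
    """Stand-in session whose first search fails, like a stale connection."""
    def __init__(self):
        self.connected = False
        self.connects = 0
        self.failures_left = 1

    def connect(self):
        self.connected = True
        self.connects += 1

    def close(self):
        self.connected = False

    def search(self, query):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionLost("Can't contact LDAP server")
        return ["result for " + query]

def search_with_reconnect(session, query, max_attempts=2):
    """Run a search, reconnecting and retrying if the connection is lost."""
    last_error = None
    for _ in range(max_attempts):
        try:
            if not session.connected:
                session.connect()
            return session.search(query)
        except ConnectionLost as exc:
            # Drop the dead connection so the next attempt reconnects.
            session.close()
            last_error = exc
    raise last_error
```

The point is that the failed request itself marks the connection as
closed, so the following attempt starts from a fresh connection.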

Can you provide some more debugging info for when this happens? You can
run nslcd with the -d option, which causes debugging info to be sent to
stderr.

> It would eventually reconnect, but I'm guessing osiris had already timed 
> out waiting for a response and considered the user accounts to be missing.
> 
> I tried several things:
> 
> * Setting an idle_timeout of 280 did not clear the errors.

I must say that the idle_timeout handling could be improved. The
connection is left open, and only when a new request comes in does nslcd
check whether the connection should be closed first.
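
In other words, the timeout is checked lazily. A small Python sketch of
that logic (the class and attribute names are made up for illustration
and this is not the actual nslcd code):

```python
import time

class LazyIdleConnection:
    """Connection whose idle timeout is only checked when a request arrives."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.is_open = False
        self.last_used = None
        self.connects = 0

    def request(self):
        now = self.clock()
        # Nothing closes the connection while it sits idle; the check
        # happens only here, at the start of the next request.
        if self.is_open and now - self.last_used > self.idle_timeout:
            self.is_open = False
        if not self.is_open:
            self.is_open = True
            self.connects += 1
        self.last_used = now
```

So between requests the connection stays open on the client side even
after idle_timeout has passed, and anything in between is free to drop
it in the meantime.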

> * Restarting nslcd would clear the errors for more than an hour, but 
> then they would start again.
> 
> * Having a cron job run "getent passwd" every four minutes (thus 
> preventing nslcd from losing its connection to the LDAP server) *did* 
> clear the errors.
> 
> * Finally, changing the nslcd LDAP URI from the cluster address to an 
> explicit list of the three real LDAP servers *did* clear the errors.

It seems that the connection times out at some point and is left in a
half-open state.

> For now, we'll probably stick with libnss-ldap since we're familiar with 
> it, but I wanted to mention the issue in case there's something simple 
> I'm missing.

Thanks for reporting this.

-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --
--
To unsubscribe send an email to
nss-pam-ldapd-users-unsubscribe@lists.arthurdejong.org or see
http://lists.arthurdejong.org/nss-pam-ldapd-users