lists.arthurdejong.org

Re: very slow initialization after reboot




Hi Mat, Hi Arthur,

    Thanks a million for pointing me in the right direction! I fixed it by changing a line in nsswitch.conf!

    The line
        hosts: files mymachines myhostname resolve dns [!UNAVAIL=return]
    should be changed to
        hosts: files mymachines myhostname dns resolve [!UNAVAIL=return]

    It can also be fixed by using an IP address instead of a host name for the LDAP server in nslcd.conf, but I prefer the nsswitch.conf change.
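
    For anyone who wants to script the same change, here is a minimal Python sketch. The function name prefer_dns_before_resolve is made up for illustration. It only swaps the 'resolve' and 'dns' service names on a "hosts:" line and leaves any bracketed action such as [!UNAVAIL=return] where it was, which matches the edit described here but may not be what every setup needs.

```python
def prefer_dns_before_resolve(line: str) -> str:
    """Return an nsswitch.conf 'hosts:' line with 'dns' listed before 'resolve'.

    Hypothetical helper for illustration; it is not part of any nss tool.
    """
    key, sep, rest = line.partition(":")
    if not sep or key.strip() != "hosts":
        return line  # not a hosts line: leave it untouched
    tokens = rest.split()
    if "dns" in tokens and "resolve" in tokens:
        i, j = tokens.index("resolve"), tokens.index("dns")
        if i < j:
            # Swap only the two service names; bracketed actions keep their position.
            tokens[i], tokens[j] = tokens[j], tokens[i]
    return "hosts: " + " ".join(tokens)


if __name__ == "__main__":
    print(prefer_dns_before_resolve(
        "hosts: files mymachines myhostname resolve dns [!UNAVAIL=return]"))
```

    The sketch only transforms a string. Writing the result back to /etc/nsswitch.conf is left out on purpose, since that file should be edited with care.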

    Thanks a lot for your help, guys!

Best,

Manhong

On 11/12/19 1:32 PM, Mathieu wrote:
Hi,

Sounds more to me like an OS-related issue than an nslcd one, as it
happens only after a reboot.

Your system seems to use systemd. Issues/latencies at boot time are
known to occur for services that depend on network-online.target (such
as nslcd), as they will wait for that unit to be completely done
before really starting.
And with systemd, knowing when the host is "online" is a wild guess.
For example, disabling IPv6 with sysctl only will cause the kind of
issues you experience, as systemd will patiently and stubbornly try to
activate an IPv6 interface no matter what, hence the "2 mins" delay
before it finally gives up and considers the host online.

So, before trying to debug nslcd, check your systemd logs for any
related "systemd-networkd-wait-online" messages...

--
Mat


On Tue, Nov 12, 2019 at 4:44 PM Manhong Dai <daimh@umich.edu> wrote:
Hi Arthur,

      I did some logging per your advice. All the files are under
https://y.mbni.org/nslcd-debug/

      'nslcd-tcpdump.mp4' shows that the node was not sending any tcp
packets out until the one-minute pause was over.

      'nslcd-strace.mp4' shows how I strace-ed it, and copied the trace
file into three segments, which were tarred in 'nslcd.trace.tar.gz'.

      If you need more information, please feel free to let me know.


Best,

Manhong


On 11/11/19 5:13 PM, Arthur de Jong wrote:
On Mon, 2019-11-11 at 14:03 -0500, Manhong Dai wrote:
After reboot, the first 'id <USER>' took about two minutes and
then failed. After that, all subsequent 'id' commands worked fine.
During the two-minute wait, I tcpdump-ed the packets on both the
LDAP client and LDAP server, but didn't detect any packets until the
first 'id' command failed.
Hi Manhong,

The logs show that the initial connection seems to be set up but the
BIND operation takes a very long time. It is unclear to me why this
takes so long.

In any case the maximum time to wait for a response can be set with the
timelimit option. This should ensure that the process does not block
for too long. Then the reconnect logic of nslcd will kick in (see the
reconnect_sleeptime and reconnect_retrytime options).

If this can be traced to a networking or firewall issue, a way
to reset the reconnect timers is to send a SIGUSR1 signal to nslcd
(assuming you use a recent version of nss-pam-ldapd). On Debian-based
systems for example, the /etc/network/if-up.d/nslcd file ensures that
the timers are reset every time networking is restored.
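
Sending the signal can be sketched in Python as follows. The function name reset_nslcd_timers is made up for illustration, and the default pid file path /var/run/nslcd/nslcd.pid is an assumption that may differ on your distribution. Signalling nslcd normally requires root.

```python
import os
import signal


def reset_nslcd_timers(pidfile: str = "/var/run/nslcd/nslcd.pid") -> int:
    """Send SIGUSR1 to the process whose pid is stored in pidfile.

    The default pidfile path is an assumption; adjust it for your system.
    Returns the pid that was signalled.
    """
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGUSR1)  # asks nslcd to reset its reconnect timers
    return pid
```

A hook that runs when an interface comes up could call reset_nslcd_timers() once, similar in spirit to the Debian if-up.d script mentioned above.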

More ideas for debugging this further are running nslcd under strace
(start it as "strace -t -f -o /var/log/nslcd.trace nslcd -d") to
actually see which operation is blocking so long, checking whether any
network traffic is actually being sent, seeing whether ldapsearch is
able to perform search queries during the blocking time, trying to
connect with netcat to port 389 of the LDAP server to see if there is
a networking issue, and looking at the LDAP server logs.
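
If netcat is not at hand, the connection test above can be approximated with a small Python sketch. The function name probe_tcp and the 5-second default timeout are assumptions for illustration. The measured time includes the host name lookup, so a slow resolver shows up here as well.

```python
import socket
import time


def probe_tcp(host: str, port: int = 389, timeout: float = 5.0):
    """Try a TCP connection and return (connected, seconds_elapsed).

    Hypothetical helper: the elapsed time covers name resolution plus
    the TCP handshake, so it hints at where a delay comes from.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        connected = False
    return connected, time.monotonic() - start


if __name__ == "__main__":
    # "ldap.example.org" is a placeholder; use your LDAP server's name or IP.
    print(probe_tcp("ldap.example.org"))
```

Running it once with the server's host name and once with its IP address should show whether the delay comes from name resolution or from the network path.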

Hope this helps,