Hardware Watchdog Reboots

Moderators: grovkillen, Stuntteam, TD-er

rayE
Normal user
Posts: 122
Joined: 12 Oct 2017, 12:53
Location: Philippines

Re: Hardware Watchdog Reboots

#101 Post by rayE » 13 Jul 2019, 14:03

And ta-da... both nodes have been up and running for 3 days now. No more web hangs etc.
I totally agree. There is a lot of evidence pointing to the router configuration as the underlying problem. This needs to be documented in some kind of online spreadsheet that the "testers" update (maybe); I'm sure the results may help point the developers toward a fix.

Ray

Shardan
Normal user
Posts: 1119
Joined: 03 Sep 2016, 23:27
Location: Bielefeld / Germany

Re: Hardware Watchdog Reboots

#102 Post by Shardan » 20 Jul 2019, 14:58

After some days of testing I think the reason is located with the WiFi reconnects.

As posted above I'm running three nodes with the 0630 version.
These nodes were reset by the WD after a few hours.
The following steps made them run far more stably:

Usually I run a Ubiquiti UniFi Long Range AP for several WiFi networks and about 35 devices, from TV to ESPEasy.

I disabled the WiFi for home automation on that AP and grabbed an old TPLink AP from the basement shelf
so all nodes are running on a separate AP now.
Second, I set the configuration to "Force WiFi No Sleep" = on and "Periodically send Gratuitous ARP" = on (should be set by default).

Sadly we had a complete power outage here, so the uptimes were cut short, but at the moment I have uptimes of around 7 days on all three devices.
I didn't see such uptimes for a while... ;)

I won't say this is a solution.
But it shows quite clearly that the reason for the WD resets is located in the WiFi part, I assume inside the core lib.

Have a nice weekend everyone.
Regards
Shardan

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#103 Post by TD-er » 29 Jul 2019, 10:31

Last night (it was past 2 am) I did make a test build: https://www.dropbox.com/s/uknq183mlb8yx ... 4.zip?dl=0
Can some of you please test these to see if this is:
- Connecting to WiFi (was an issue in last few builds)
- Crashing when disconnected from WiFi (power down AP, or in some AP's you can force disconnect some clients)

In my tests here, the units did not crash and did a reconnect like they should.
I tested several (4M) builds here and all do connect to WiFi and reconnect like they should.
But still it is best not to install it on a node which needs removing ceilings or parts of a wall to reach for a USB update :)

One last note.
All (except one) of my nodes I updated last night are still running, even after 10+ WiFi disconnects.
The one that did reboot did so on a WDT reboot, but I think that one may have other issues (using pulse counter).

If this one does seem to work, then you will make me very happy since it is the end of about a year of debugging.
Still a bit skeptical here since it took so long, but I was very pleased to see it did do a WiFi reconnect like it should.
That's a significant part of the WDT reboots.

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#104 Post by Wiki » 29 Jul 2019, 13:54

I'd like to make you happy.

At my site a Wemos D1 China clone is waiting to be tested. Any suggestions/wishes, which of the 8266-4M files to use, any suggestions/wishes of specialized Wifi-configurations (i.e. fixed IP, reconnect, Wifi b/g,....)?

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#105 Post by TD-er » 29 Jul 2019, 14:05

Wiki wrote:
29 Jul 2019, 13:54
I'd like to make you happy.

At my site a Wemos D1 China clone is waiting to be tested. Any suggestions/wishes, which of the 8266-4M files to use, any suggestions/wishes of specialized Wifi-configurations (i.e. fixed IP, reconnect, Wifi b/g,....)?
Just start with the "normal" 4M build.
There is also a "custom 4M" build included, which has just the most basic plugins available (the ones I use in my own nodes ;) ) See: https://github.com/letscontrolit/ESPEas ... _script.py
P.S. this Python script will be extended in the future and you may also use it yourself to quickly make a special build for yourself.

Just start with DHCP as IP and default settings.
I also did change some stuff related to the start and stop of the AP mode on new nodes.
Please also report issues with that part.
I did a revert of most code, but took that part again from the changes I made last few weeks. Just hope I did not forget any part of it last night.

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#106 Post by Wiki » 29 Jul 2019, 16:51

I have set up the device from scratch (meaning: blanked with blank_4MB.bin) and flashed ESP_Easy_mega-20190716-15-PR_2514_normal_ESP8266_4M.bin

Changes in Setup: activated UDP port 62888, added INA219 & DS18B20 as devices, publishing every 10 sec, switched on the onboard LED, set serial log to info, added NTP server, added syslog server (syslog level debug), MQTT to Domoticz

I've put the device in a location with very poor WiFi. As a first result I am sending the syslog entries from my server for this device, see attached file. As you can see, the device doesn't stay up for more than a few minutes.

Please let me know if you need more / other logs / info, or different configurations. I am prepared.
Attachments
ESPTest.zip
(6.06 KiB) Downloaded 14 times

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#107 Post by Wiki » 29 Jul 2019, 16:57

Additional info:

After flashing I went through the standard procedure of connecting to the device through the AP functionality and adding it to my Wifi using the standard setup procedure using the Web GUI (in an environment of normal Wifi conditions, not the hard conditions where it is running now). Connection and configuration worked flawlessly with optimum performance.

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#108 Post by Wiki » 29 Jul 2019, 17:50

I double-checked the functionality of the device by just moving it to a location with reasonable WiFi. I reduced the syslog level to info due to the huge amount of data which would be produced at debug level. Syslog attached.
Attachments
ESPTest.1.zip
(7.55 KiB) Downloaded 19 times

Shardan
Normal user
Posts: 1119
Joined: 03 Sep 2016, 23:27
Location: Bielefeld / Germany

Re: Hardware Watchdog Reboots

#109 Post by Shardan » 29 Jul 2019, 19:12

I did some testing as described here in the thread.

Moved all ESP devices to a separate access point, set wifi to run permanently etc., see above.

My test devices ran for about 14 days without a problem, then I got WD restarts on all three devices.

As far as I can tell there is at least one problem with the core libs newer than 2.3.0.
If WiFi reconnects, they tend to WD reset. So a first workaround is to set "Force WiFi No Sleep" in the advanced settings.

But even a restart after 14 days isn't the optimum. It seems something is piling up and causes a restart after a time.
Regards
Shardan

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#110 Post by Wiki » 29 Jul 2019, 19:35

And I will give you a prophecy:

From now on these three devices will be more unstable than before if you let them run without reset or powercycle....

grovkillen
Core team member
Posts: 3261
Joined: 19 Jan 2017, 12:56
Location: Hudiksvall, Sweden

Re: Hardware Watchdog Reboots

#111 Post by grovkillen » 29 Jul 2019, 21:11

Are you running the release that TD-er linked?

Wiki
Normal user
Posts: 138
Joined: 23 Apr 2018, 17:55

Re: Hardware Watchdog Reboots

#112 Post by Wiki » 29 Jul 2019, 22:08

Whom do you ask?

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#113 Post by TD-er » 30 Jul 2019, 12:24

Wiki is using the test build I made as far as I can see in his posts.
Shardan is talking about an uptime of over 14 days, so that's definitely not the test build :)

@Shardan, could you also try the test build? (or last night's build)
Just make sure not to use static IP, since that's already reported in an issue on GitHub.

The last build should be much more stable when it comes to a WiFi reconnect.
But still there are other issues at stake here since some of my nodes already rebooted running the test build.

frank
Normal user
Posts: 85
Joined: 15 Oct 2016, 20:17
Location: Nederland

Re: Hardware Watchdog Reboots

#114 Post by frank » 06 Aug 2019, 20:23

mega-20190805 normal 4MB
I do not know if the recent update includes the correction, but since the installation there have been no watchdog reboots and 1 day of uptime. :) :) :)

dynamicdave
Normal user
Posts: 173
Joined: 30 Jan 2017, 20:25
Location: Hampshire, UK

Re: Hardware Watchdog Reboots

#115 Post by dynamicdave » 06 Aug 2019, 20:28

Same here... +1 day uptime with mega-20190805 normal 4MB running on a Wemos D1 Mini.

ThomasB
Normal user
Posts: 406
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#116 Post by ThomasB » 07 Aug 2019, 00:18

The user feedback sounds promising. So today I loaded ESP_Easy_mega-20190805_normal_ESP8266_1M.bin on a Sonoff Basic to try it out. Hopefully its goodness spreads my way too.

This device was running a late June build (mega-20190630_normal). I'm looking forward to seeing fewer reboots on this fresh August release.

- Thomas

ThomasB
Normal user
Posts: 406
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#117 Post by ThomasB » 07 Aug 2019, 22:11

As mentioned in the previous post, mega-20190805 was installed on my Sonoff basic yesterday. Today it rebooted (after running 24 hrs, 15 mins). System info says:

Boot: Manual reboot (1)
Reset Reason: Exception

Rebooting approximately once a day is what this device experienced with the June build it was previously using. I'll continue to monitor the new firmware's performance so I can make a better judgement on the uptime reliability of my test installation.

- Thomas

ManS-H
Normal user
Posts: 233
Joined: 27 Dec 2015, 11:26
Location: the Netherlands

Re: Hardware Watchdog Reboots

#118 Post by ManS-H » 07 Aug 2019, 22:23

Hello,
I don't really know how the Watchdog works.
This is my firmware:
Watchdog-1.jpg (41.37 KiB) Viewed 1616 times
And this is what I see:
Watchdog-2.jpg (29.73 KiB) Viewed 1616 times
My question, is this a normal situation?

ThomasB
Normal user
Posts: 406
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#119 Post by ThomasB » 08 Aug 2019, 04:21

My question, is this a normal situation?
Nice to see the 97 day uptime on your old June-2018 core_2_4_1 build. With the 2.4.1 core, and up to the mid-2018 time frame, reliable run times were achieved by most users. I fondly remember those good old days.

- Thomas

frank
Normal user
Posts: 85
Joined: 15 Oct 2016, 20:17
Location: Nederland

Re: Hardware Watchdog Reboots

#120 Post by frank » 08 Aug 2019, 10:04

The WD reboots are back :( :(

It looks like the lower the WiFi signal, the more reboots there are.
Last edited by frank on 08 Aug 2019, 10:17, edited 1 time in total.

ManS-H
Normal user
Posts: 233
Joined: 27 Dec 2015, 11:26
Location: the Netherlands

Re: Hardware Watchdog Reboots

#121 Post by ManS-H » 08 Aug 2019, 10:15

ThomasB wrote:
08 Aug 2019, 04:21
My question, is this a normal situation?
Nice to see the 97 day uptime on your old June-2018 core_2_4_1 build. With the 2.4.1 core, and up to the mid-2018 time frame, reliable run times were achieved by most users. I fondly remember those good old days.

- Thomas
Thanks for the reply. I use this version on a Sonoff to switch a table lamp, and I have kept it because it does the job.

dynamicdave
Normal user
Posts: 173
Joined: 30 Jan 2017, 20:25
Location: Hampshire, UK

Re: Hardware Watchdog Reboots

#122 Post by dynamicdave » 08 Aug 2019, 17:51

Quick update...

+3 days uptime with mega-20190805 normal 4MB running on a Wemos D1 Mini.

Fingers crossed this will be a huge step forward for mankind.

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#123 Post by TD-er » 08 Aug 2019, 21:08

4 of my boards haven't been rebooted since I installed the firmware on them with the supposed fix for the WDT reboots.

Uptime: 10 days 18 hours 23 minutes
Build Time: Jul 29 2019 01:34:03
Binary Filename: ESP_Easy_mega-20190716-15-PR_2514_normal_core_252_ESP8266_4M.bin

The test builds have the last official nightly timestamp in the filename, it is a bit confusing....

Well, the bugfix was specific for the WDT reboots occurring when reconnecting to WiFi.
There are still others and for example a reboot with Exception as reboot reason is a totally different one.
For example it can be a stupid programming error (divide by zero for example) or trying to dereference an object which has already been deleted.
But also out of memory can be an issue and numerous other reasons.

One of my own nodes still showed a WDT reboot, but I am sure it must have been another reason since I could not find a network reconnect in the logs.
Another one also has had some reboots, but that one is using the pulse counter and that one has a known bug which I still have to fix.

ThomasB
Normal user
Posts: 406
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#124 Post by ThomasB » 08 Aug 2019, 22:38

There are still others and for example a reboot with Exception as reboot reason is a totally different one.
For example it can be a stupid programming error (divide by zero for example) or trying to dereference an object which has already been deleted.
But also out of memory can be an issue and numerous other reasons.
Thanks, made me check the installation again to review my configuration. And I found something unusual, the System Info Plugin is corrupted. Screenshots below show that SYSINFO is not fully initialized (missing Values). BTW, I wasn't using the missing Values in rules and wasn't sending them to a controller. In case you ask, I cleared the memory before flashing ESP_Easy_mega-20190805_normal_ESP8266_1M.bin.

Cold rebooting did not fix it. But deleting the plugin and re-installing it did the trick. Now I have my four System Info Values being reported on the device page.

Maybe this crippled plugin was causing the exception reboots I've been getting on the mega-20190805. Fingers are crossed.

- Thomas
Attachments
esp_sysinfo.jpg (116.58 KiB) Viewed 1540 times
esp_devices.jpg (84.64 KiB) Viewed 1540 times

ThomasB
Normal user
Posts: 406
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#125 Post by ThomasB » 11 Aug 2019, 18:15

Update: Fixing the System Info plugin seems to have eliminated the Exception reboots. But they have been replaced with Watchdog reboots. So I didn't win the reboot lottery, but at least know where the exceptions came from.

- Thomas

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#126 Post by DebugBug » 21 Aug 2019, 21:35

Any new updates on this topic? My devices keep rebooting with a maximum uptime of 1-2 days, running the latest releases.

dynamicdave
Normal user
Posts: 173
Joined: 30 Jan 2017, 20:25
Location: Hampshire, UK

Re: Hardware Watchdog Reboots

#127 Post by dynamicdave » 22 Aug 2019, 09:15

Hi,
I've been running a Wemos D1 Mini for just over 5-days with the latest firmware and have not had any re-boots (yippee).

ScreenShot073.png (20.37 KiB) Viewed 1172 times

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#128 Post by TD-er » 22 Aug 2019, 10:36

Just a question then for those with the modules up for a few days already.
I've noticed some of my modules have AP mode still enabled (not turned off), do you also see your nodes as SSIDs in the WiFi network?

It seems like there is still an issue with not receiving the "got IP" event in the WiFi, which is the trigger to turn off the AP. (and turn it on also)
So I have to look into that for sure, but I was wondering if it is something with the WiFi here, or maybe others also experience this.

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#129 Post by DebugBug » 22 Aug 2019, 18:02

I'm not seeing an SSID for my modules. They have all connected successfully to my wifi and switched off the SSID.
I normally get 0-2 days uptime on my modules between reboots, but the 20190817 release is currently on 4+ days, however the change log does not really show any changes that would explain why? Will keep monitoring the uptime.

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#130 Post by TD-er » 23 Aug 2019, 12:51

I'm working on removing a lot of unneeded code complexity from the WiFi part of the code.
That complexity crept in while searching for the problem over the last year(!!)

This is the one I'm running on most of my nodes myself right now: https://www.dropbox.com/s/tp6mnfmdbmgnf ... 9.zip?dl=0
Even the ones that are really hard to reach when I must reboot them or worse...
But better try it first on nodes that can be downgraded with ease.

In short, I did find quite a few issues that could lead to crashes or nodes unable to reconnect since they would be stuck in AP mode.
Most of those units would then crash in a few minutes, due to another bug....
So either way, the unit should now no longer be left stuck in AP mode and/or not crash and thus get stuck.... You decide if that's an improvement ;)

I also fixed the issue that it was really hard to complete the initial WiFi setup on a newly flashed node. That was also caused by the issue that a node would crash soon when in AP-only mode and a client is connected.

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#131 Post by DebugBug » 23 Aug 2019, 21:11

Thanks, TD-er.
I have now loaded it onto one of my modules and will post back with my findings. One thing I noted already is that it seems to be even faster connecting to the Wifi after boot (static IP).

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#132 Post by TD-er » 23 Aug 2019, 21:33

It can even be faster, if I store the last known BSSID and channel.
Then it reconnects in my tests in just around a second.

Current situation:
Turning on WiFi does take time.
Then it still does perform a WiFi scan, which takes 200 msec per channel (12 channels) and then does switch to the right channel and tries to connect.

From cold boot until connected currently takes roughly 4300 msec + the DHCP time.
If I can use the last known BSSID and channel, the time to connect can be cut down by roughly 2700 msec.
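
A rough back-of-the-envelope check of those numbers, as a minimal sketch. The 200 msec per channel, 12 channels, 4300 msec and 2700 msec figures are the ones quoted in this post; the constant and function names are made up for illustration.

```python
# Figures quoted in the post (all in milliseconds).
SCAN_MS_PER_CHANNEL = 200
CHANNELS_SCANNED = 12
COLD_BOOT_CONNECT_MS = 4300   # cold boot until connected, excluding DHCP
EXPECTED_SAVING_MS = 2700     # estimated gain from reusing BSSID + channel


def full_scan_ms(per_channel_ms: int = SCAN_MS_PER_CHANNEL,
                 channels: int = CHANNELS_SCANNED) -> int:
    """Time spent scanning every channel once."""
    return per_channel_ms * channels


scan_ms = full_scan_ms()
remaining_ms = COLD_BOOT_CONNECT_MS - EXPECTED_SAVING_MS

print(f"full scan: {scan_ms} ms")
print(f"connect time left after the saving: {remaining_ms} ms")
```

The scan alone works out to 2400 msec, which suggests that most of the estimated 2700 msec saving would come from skipping the scan.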

DHCP time is also more optimized now.
I added some extra delays, which makes it faster :)
The problem was that if you tried to do a DHCP request before the unit was connected well, it could lead to timeouts.
So instead of 1 - 3 seconds DHCP it now only takes about 50 msec on my system to do the DHCP.

Connect from cold boot, without available BSSID + channel:

355 : WIFI : Set WiFi to STA
388 : WIFI : Connecting MikroTik attempt #0
2017 : WD   : Uptime 0 ConnectFailures 0 FreeMem 26952 WiFiStatus 6
4342 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 7 Duration: 3765 ms
4344 : WIFI : DHCP IP: 192.168.1.96 (ESP-Easy-0) GW: 192.168.1.1 SN: 255.255.255.0   duration: 47 ms
4388 : NTP  : NTP replied: delay 20 mSec Accuracy increased by 0.036 seconds

Example of reconnect using BSSID + channel:

577824 : WIFI : Disconnected! Reason: '(200) Beacon timeout' Connected for 9 m 33 s
577826 : WIFI : Switch off WiFi
579147 : WIFI : Set WiFi to STA
579180 : WIFI : Connecting MikroTik attempt #0
580392 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 7 Duration: 1038 ms
580394 : WIFI : DHCP IP: 192.168.1.96 (ESP-Easy-0) GW: 192.168.1.1 SN: 255.255.255.0   duration: 42 ms
585517 : NTP  : NTP replied: delay 11 mSec Accuracy increased by 0.167 seconds
585519 : Time adjusted by -16.66 msec. Wander: -0.00 msec/second
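
For anyone who wants to compare connect times across their own syslog files, here is a minimal sketch that pulls the "Duration" value out of the "Connected!" lines. The regular expression is only fitted to the two log excerpts in this post, so treat it as an assumption about the log format rather than a guaranteed one; the function name is hypothetical.

```python
import re

# Pattern fitted to the "Connected!" lines in the log excerpts of this post.
CONNECTED_RE = re.compile(
    r"^(?P<millis>\d+) : WIFI : Connected! AP: (?P<ssid>.+?) "
    r"\((?P<bssid>[0-9A-F:]{17})\) Ch: (?P<channel>\d+) "
    r"Duration: (?P<duration>\d+) ms$"
)

# The two "Connected!" lines copied from the logs in this post.
LOG = """\
4342 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 7 Duration: 3765 ms
580392 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 7 Duration: 1038 ms
"""


def connect_durations(log_text: str) -> list:
    """Return the connect duration (ms) of every 'Connected!' line."""
    durations = []
    for line in log_text.splitlines():
        match = CONNECTED_RE.match(line.strip())
        if match:
            durations.append(int(match.group("duration")))
    return durations


durations = connect_durations(LOG)
print(durations)                     # cold boot first, then the reconnect
print(durations[0] - durations[1])   # difference between the two, in ms
```

For these two excerpts the difference is 2727 ms, in line with the "roughly 2700 msec" mentioned earlier in the post.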

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#133 Post by DebugBug » 23 Aug 2019, 21:40

Those are some really impressive optimisations. I thought it was faster already, but apparently it can be tweaked even more.
I can see that being very interesting indeed if people wanted to build e.g. a window or door sensor, based on a reed switch powering up the device from a battery when needed.
What is currently holding this improvement back from being part of the builds?

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#134 Post by TD-er » 23 Aug 2019, 22:04

Well the last known BSSID + channel are already being used for reconnects.
So a disconnect may only interrupt the connection for about a second.

There are a number of reasons why I have not implemented it to be used at boot:
- to support warm boot (e.g. after a crash or deep sleep), I need to make room for it in the RTC memory. (takes only 7 bytes, so isn't that hard)
- to support it from cold boot, I need to change the settings and store the BSSID/channel along with the WiFi credentials. Channel may change, so I doubt it is really useful.
- I don't know what effect it has on nodes with poor power supply units.

Currently the current peak doesn't start right after boot, but a bit later. (RF calibration takes a short peak of about 500 mA)
So capacitors may have some time to get charged.

I could also disable the RF calibration and just only enable it if a number of attempts failed.
This has a few drawbacks:
- RF calibration needs to be stored somewhere => flash wear out
- RF calibration mainly depends on the voltage supplied on the ESP Vcc.

So as you can see, there are a number of reasons why I have not implemented it.
But I guess I can start with storing it in the RTC, so reconnect after a crash or deep sleep is also really fast.
Maybe I should also add a setting to disable this, if someone encounters issues related to the power supply.
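
To illustrate where the 7 bytes could come from, here is a minimal sketch assuming a 6-byte BSSID (the AP's MAC address) plus a 1-byte channel number. That split is an assumption, and this Python snippet only demonstrates the layout; it is not the firmware code, and the function names are hypothetical.

```python
import struct

# Assumed layout: 6 raw bytes for the BSSID followed by 1 unsigned byte
# for the channel, with no padding. That adds up to 7 bytes.
RTC_WIFI_FORMAT = "<6sB"


def pack_wifi_state(bssid: str, channel: int) -> bytes:
    """Pack a BSSID like 'B8:69:F4:9F:21:FA' and a channel into 7 bytes."""
    mac = bytes.fromhex(bssid.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("BSSID must be 6 bytes")
    if not 1 <= channel <= 14:
        raise ValueError("2.4 GHz channel must be in the range 1..14")
    return struct.pack(RTC_WIFI_FORMAT, mac, channel)


def unpack_wifi_state(blob: bytes) -> tuple:
    """Reverse of pack_wifi_state: returns (bssid_string, channel)."""
    mac, channel = struct.unpack(RTC_WIFI_FORMAT, blob)
    return ":".join(f"{byte:02X}" for byte in mac), channel


blob = pack_wifi_state("B8:69:F4:9F:21:FA", 7)
print(len(blob))
print(unpack_wifi_state(blob))
```

A real implementation would probably also want some validity marker or checksum so that stale or corrupted RTC contents are not trusted, but that would take more than the 7 bytes mentioned here.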

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#135 Post by DebugBug » 23 Aug 2019, 22:11

I see the complexity.
It could be a feature that was not enabled by default, but instead something that people could enable in the advanced settings, with a short explanation in one of those (Info) icons that are used elsewhere in the menus.
I'm not sure how often wifi channels change (if at all) on home networks, but I definitely see your point.

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#136 Post by TD-er » 23 Aug 2019, 22:24

How often a channel changes depends heavily on the number of access points in the neighborhood.
AP's may try to switch to another channel if they see a drop in SNR or see lots of checksum errors.

A good friend of mine lives in an apartment flat, with 60 apartments per flat.
So in his own flat he is surrounded by 60+ access points.
To make matters worse, he faces on both sides of this flat another flat of the same size.
So that's roughly 120 - 180 AP's in view.

That's the kind of environment where making a Skype call via 2.4 GHz WiFi is clearly impossible.
Even keeping a SSH connection open is already a challenge.

I guess that in such an environment, where people do buy stronger and stronger APs to get some WiFi signal, a WiFi channel may hop multiple times per hour.

DebugBug
Normal user
Posts: 6
Joined: 11 Feb 2019, 21:47

Re: Hardware Watchdog Reboots

#137 Post by DebugBug » 26 Aug 2019, 21:53

@TD-er: So, have been running with the version you posted on 23 Aug 2019, 12:51. Unfortunately, I'm still seeing watchdog reboots every 1-2 days.
Currently the version that has been most stable on my devices (12 pcs.) is mega-20190817 (9 days and still counting) and mega-20190121 (also 9 days).

obod0002c
Normal user
Posts: 21
Joined: 10 Aug 2019, 20:31

Re: Hardware Watchdog Reboots

#138 Post by obod0002c » 26 Aug 2019, 23:58

TD-er wrote:
23 Aug 2019, 22:24
How often a channel changes depends heavily on the number of access points in the neighborhood.
Unfortunately I also had massive reconnect issues with one of my Pis.
With just a few APs around, I could see my Android tablet losing and regaining its connection automatically, whereas my Pi had to be restarted manually.
It seems something in the WiFi routines is not implemented in a very straightforward way ...
@TD-er: good luck, hoping for the best for all of us

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#139 Post by TD-er » 27 Aug 2019, 00:36

DebugBug wrote:
26 Aug 2019, 21:53
@TD-er: So, have been running with the version you posted on 23 Aug 2019, 12:51. Unfortunately, I'm still seeing watchdog reboots every 1-2 days.
Currently the version that has been most stable on my devices (12 pcs.) is mega-20190817 (9 days and still counting) and mega-20190121 (also 9 days).
Fortunately that other "stable" build you're experiencing is not too far away from where we are now.
I've said it before, there are several causes for these WDT reboots and one of the most frequent ones (I hope...) was related to the WiFi reconnecting.
The reconnecting issue is now taken care of (was already in the 20190817 build) and in my latest changes I did take care of some issues where a unit (in some situations) would not be able to reconnect to the AP if it was running in AP mode itself.
Then it would not handle incoming traffic (because it was not considering itself connected) which would cause the memory to fill up and lead to other crashes.

So we're handling WDT reboots one at a time. Too bad it is still hard to find the cause of these reboots just by looking at the symptoms.
But we're getting closer and closer to a stable situation again and that's giving hope for the future again.

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#140 Post by georgep » 27 Aug 2019, 18:42

Hello guys!

I'm sorry but I have been away from the forum for some weeks working on other things.

I have found during this time that 'Tasmota' is also not so stable, but where ESP Easy has watchdog reboots, in some (sensor-only) configurations I have found that Tasmota will just "hang" and need a power cycle or hardware (reset button) reset to recover. Clearly this is generally not good! :roll:

I have now 'caught up' with the forum, and to resume my efforts to help here I have just flashed "ESP_Easy_mega-20190823_normal_core_252_ESP8266_4M.bin" onto two units. Following recent advice here, I have set "Force WiFi No Sleep" and "Periodically send Gratuitous ARP" to "true".

I hope to be able to keep in touch here and to help with bug-squashing :D

George

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#141 Post by TD-er » 27 Aug 2019, 23:07

You may also want to check some core 2.6.0 builds of future nightly builds.
There have been some fixes on that core related to obeying timeout settings while still trying to connect to a host.

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#142 Post by georgep » 28 Aug 2019, 17:36

TD-er wrote:
27 Aug 2019, 23:07
You may also want to check some core 2.6.0 builds of future nightly builds.
There have been some fixes on that core related to obeying timeout settings while still trying to connect to a host.
I wasn't aware of the existence of 'nightly' builds, only the 'pre-releases' such as the latest mega-20190827. Am I missing something?

I've now updated to "ESP_Easy_mega-20190827_normal_core_260_sdk222_alpha_ESP8266_4M.bin" and will keep a close eye on what happens :)

George

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#143 Post by TD-er » 29 Aug 2019, 15:03

Those "pre-releases" are indeed what I meant.

I will merge a change soon (not sure if I will merge it today) which does use a different patched version of the core 2.6.0 branch as it does have a patch for a bug where the timeout of a client is not being honored while still in the init state of making a connection.

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#144 Post by georgep » 29 Aug 2019, 15:26

TD-er wrote:
29 Aug 2019, 15:03
I will merge a change soon (not sure if I will merge it today) which does use a different patched version of the core 2.6.0 branch as it does have a patch for a bug where the timeout of a client is not being honored while still in the init state of making a connection.
I will look out for that and update to it as soon as I can.
Thanks for all your good work! :) :)

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#145 Post by georgep » 30 Aug 2019, 10:53

georgep wrote:
29 Aug 2019, 15:26
I will look out for that and update to it as soon as I can.
I know that this isn't the place to report bugs but I don't have any more free time today ...

I just updated to mega-20190830 (same version, same core 2.6.0) and my DHT-11 sensor stopped working.
The value on the 'Devices' page just showed 'NaN' for both temperature and humidity.
Without doing anything else and without touching the hardware I went straight back to mega-20190827 and everything is fine again.
I guess that this is some unexpected consequence of one of the changes.

I'm sorry I don't have any time at all right now to investigate further.

George

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#146 Post by TD-er » 30 Aug 2019, 13:14

I noticed your reply on Github also.
I don't think I have this sensor myself, but since the change in this plugin was mainly made based on a report which later appeared to be an incorrect configuration, I guess we should revert that change.

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#147 Post by georgep » 01 Sep 2019, 11:55

As I posted yesterday on github I upgraded to the 20190830 release and my DHT11 sensor stopped working.

One of the changes in the 30 release was then reverted in a test PR build, which got the sensor working again but then started a series of frequent (every 30min-1hr) watchdog reboots.

I then posted this in the github thread...
This seems to confirm my thought, which is that once a node has "failed" and rebooted, some code or memory is overwritten or corrupted and even if a new image is flashed, the corrupt memory remains and this causes any new and otherwise "stable" image to become erratic and unstable.
My suspicion (as yet unproven) is that once a node is in this corrupt state, only completely clearing the node (flashing a blank image or using 'esptool.py erase_flash') can make it behave correctly again; flashing a new .bin file will not fix it.
Last night I did an online (web-ui) "upgrade" back to the earlier 20190827 release, but with more reboots overnight I decided this morning to "wipe" the unit and flash the 20190827 build, which had previously been stable for me.

Having done that this morning the unit has so far been up for around an hour.

This *might* confirm my suspicions about an initial "crash" causing an ongoing instability which is not affected by hardware or power-off reboots nor by upgrading to a different release, but can only be (seemingly) "fixed" by wiping the node before starting from scratch with a new build.

Can anyone else confirm (or refute) this suspicion?

George

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#148 Post by TD-er » 01 Sep 2019, 16:00

This *might* confirm my suspicions about an initial "crash" causing an ongoing instability which is not affected by hardware or power-off reboots nor by upgrading to a different release, but can only be (seemingly) "fixed" by wiping the node before starting from scratch with a new build.

Can anyone else confirm (or refute) this suspicion?
I cannot directly confirm this, but it is one of the more standard replies from experienced users here, and I have read similar reports stating that problems went away when starting from scratch.

Others have reported a full power down may also fix some issues.

georgep
Normal user
Posts: 37
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#149 Post by georgep » 04 Sep 2019, 14:53

georgep wrote:
01 Sep 2019, 11:55
I decided this morning to "wipe" the unit and flash the 20190827 build, which had previously been stable for me.
Having done that this morning the unit has so far been up for around an hour.
Just a further quick update on this...

I now have five units, two running 20190827 and three running the 'PR_2582' test build.
Prior to flashing each of these units I meticulously 'cleaned' them by both using 'esptool.py erase_flash' *AND* flashing the 'blank_4MB.bin' image.
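
For reference, the two cleaning steps could be scripted roughly like this. It is a minimal sketch that only builds and prints the esptool.py command lines; the serial port name and the 0x0 write offset are assumptions for a typical setup, and the helper name is made up.

```python
import shlex


def wipe_commands(port: str, blank_image: str = "blank_4MB.bin") -> list:
    """Build the two command lines: erase the flash, then write the blank image.

    The port and the 0x0 offset are assumptions; adjust them for your board.
    """
    erase = ["esptool.py", "--port", port, "erase_flash"]
    blank = ["esptool.py", "--port", port, "write_flash", "0x0", blank_image]
    return [erase, blank]


# Print the commands instead of running them, so nothing gets erased by accident.
for command in wipe_commands("/dev/ttyUSB0"):
    print(shlex.join(command))
```

To actually run them, each list can be passed to subprocess.run(command, check=True) once the port has been double-checked.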

All of these units have now run reliably for 72hrs+ and have also all coped admirably with me twice manually changing the channel of my WiFi router - something that has necessitated manual reboots on a couple of other devices in the house.

I will update again when anything relevant happens :)
George

TD-er
Core team member
Posts: 1605
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#150 Post by TD-er » 04 Sep 2019, 18:31

I will update again when anything relevant happens :)
Maybe you should add the 100-day mark to your calendar, that sounds relevant ;)

On a more serious note, it does seem some UDP packets are not dealt with properly when the WiFi is (temporarily) not initialized.
So right at the short interval during reconnect there is still a window in which it may crash.
