Hardware Watchdog Reboots

Moderators: grovkillen, Stuntteam, TD-er

grovkillen
Core team member
Posts: 3303
Joined: 19 Jan 2017, 12:56
Location: Hudiksvall, Sweden

Re: Hardware Watchdog Reboots

#51 Post by grovkillen » 23 Jun 2019, 08:08

Thank you David for such a great insight! I too think many problems are due to poor power. Your idea of adding capacitors is really good and I will try that myself.
ESP Easy Flasher [flash tool and wifi setup at flash time]
ESP Easy Webdumper [easy screendumping of your units]
ESP Easy Netscan [find units]
Official shop: https://firstbyte.shop/
Sponsor ESP Easy, we need you :idea: :idea: :idea:

georgep
Normal user
Posts: 38
Joined: 05 May 2019, 16:32
Location: Somerset, UK

#52 Post by georgep » 23 Jun 2019, 08:26

dynamicdave wrote:
23 Jun 2019, 07:06
So far the Wemos has been running for 620 minutes without rebooting (which is a dramatic improvement).
I just thought I'd share this with the forum as other people could try it and see if it helps their situation.
Hi David.
Thanks for the suggestions - this all makes perfect sense to me.
There is obviously a drive to build these things for the absolute minimum cost and with as few components as possible (particularly "expensive" capacitors).
I'll have a rummage in my 'bits-n-bobs' boxes and see what I can find to modify a couple of my boards and see what effect it has.
What remains interesting, however, is that other firmware doesn't suffer so much (if at all) from all these reboots.
I'll keep reading (and posting) here - this is an issue that's crying out to be fixed!
George

dynamicdave
Normal user
Posts: 185
Joined: 30 Jan 2017, 20:25
Location: Hampshire, UK

#53 Post by dynamicdave » 23 Jun 2019, 08:30

I've just checked and the Wemos D1 Mini is still going strong at 722 mins.

Here's a link to some photos and the same information I've posted on 'discourse node-red'

https://discourse.nodered.org/t/wemos-d ... oots/12495

TD-er
Core team member
Posts: 1815
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

#54 Post by TD-er » 23 Jun 2019, 09:18

It is indeed a good patch for these modules.
Keep in mind that the older Wemos nodes did have proper voltage regulators.
Initially they had one that could handle 500 - 800 mA and the NodeMCU boards even had an AMS1117 which can handle 1A.
The last batches of Wemos nodes I received all have a voltage regulator that can handle only 150 mA.
During RF calibration the ESP may draw current spikes of up to 500 mA, so adding a capacitor is a good idea.
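
As a rough back-of-the-envelope check on why a bulk capacitor helps: if the regulator can only supply 150 mA and the ESP briefly draws 500 mA, the capacitor has to cover the 350 mA shortfall. The sketch below uses the ideal-capacitor relation dV = I * dt / C. The 0.1 ms spike duration and the function name are assumptions for illustration only, not measured values, and ESR and regulator response are ignored. A result larger than the rail voltage simply means the rail would collapse.

```python
def droop_volts(spike_ma, regulator_ma, duration_ms, cap_uf):
    """Ideal-capacitor voltage droop while the cap covers the current shortfall."""
    shortfall_a = max(spike_ma - regulator_ma, 0) / 1000.0
    return shortfall_a * (duration_ms / 1000.0) / (cap_uf * 1e-6)

# 500 mA spike, 150 mA regulator, assumed 0.1 ms spike duration
for cap_uf in (10, 470):
    print(f"{cap_uf} uF: {droop_volts(500, 150, 0.1, cap_uf):.2f} V droop")
```

Under these assumptions a small capacitor alone cannot hold the rail, while a few hundred uF keeps the dip to a fraction of a volt.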

In my own designs I also use those 100 nF ones a lot (close to any consuming part) and my standard design has a 470 uF/10V on the 5V line.
The ESP reference design data sheets suggest a 10 uF capacitor close to the ESP12E/F/S module.
I can imagine even that one is not according to the suggested value on some of the boards.

Oh and just to be sure.
These caps will not fix the HW watchdog resets.
It just limits the reset/reboots to something that's software related.

About the mentioned 20-ish March 2018 build. I am pretty sure it is core 2.3.0 on that one.
That's when all the trouble started, when I did change too many things at once and thus lost track of causes.

The main things changed at that point:
- Event based wifi
- Core 2.4.0 update (to deliver chunked web transfers, which greatly improves memory usage and speed)

During development of event based wifi, I did notice the internal state of the wifi connection status was sometimes incorrect and I still believe that has something to do with some of the reboots.
I think I will look into that to no longer keep that tight control over the wifi connection state, but let it be handled by the existing core libraries and just poll to see if it is ready.

Over the last few weeks I also learned that the reported WiFi state may be somewhat ahead of the actual state, which is of great importance if you're using deep sleep.
Meaning, if you immediately try to use the network when the "got IP" event is sent, you will run into timeouts, which take much longer. Just waiting 50 - 100 msec may help a lot there and you will finish your task sooner and go back to sleep.
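
The idea of polling for readiness and then waiting a short moment before using the network could look roughly like the sketch below. It is plain Python so it can run off-device. `is_connected`, `settle_ms` and the injectable `sleep`/`now` hooks are hypothetical names for illustration, not ESPEasy or ESP8266 core APIs.

```python
import time

def wait_for_network(is_connected, settle_ms=100, timeout_ms=10000,
                     sleep=time.sleep, now=time.monotonic):
    """Poll until the link reports ready, then wait a short settle time.

    Returns True if the network is considered usable, False on timeout.
    """
    deadline = now() + timeout_ms / 1000.0
    while not is_connected():
        if now() >= deadline:
            return False
        sleep(0.01)  # poll every 10 ms instead of reacting to events
    sleep(settle_ms / 1000.0)  # reported state may be ahead of the real state
    return True
```

The settle delay costs 50 to 100 ms once, which is small compared to a network timeout on a node that wants to get back to deep sleep quickly.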

#55 Post by dynamicdave » 23 Jun 2019, 09:28

I think I have some 470uF capacitors so I'll lash-up another test-rig and see what happens.

I think having a 100nF capacitor close to any active device is 'classic textbook stuff' and very sensible.

Cheers from David.

#56 Post by TD-er » 23 Jun 2019, 09:36

Ah you posted while I was editing ;)

Yep, that's something I learned "the hard way" indeed :)
As for my "textbook" experience. I never did follow any classes on electronics design.
All I know is from failing projects and from a (huge) stack of Elektor magazines I bought when I was 12 and then subscribed to the magazine. So I do have all Elektor magazines (in Dutch) back to 2 years before I was born :) (born in 1976)

#57 Post by georgep » 23 Jun 2019, 09:40

TD-er wrote:
23 Jun 2019, 09:36
So I do have all Elektor magazines (in Dutch) back to 2 years before I was born :) (born in 1976)
Wow! I remember those magazines - very advanced for that time I think, and I too learned a lot from them.
Around 1976 I was starting my first job in electronics design (born 1956) when surface mount technology didn't exist and components were all big enough to solder by hand :D

Right - I'm off into the loft to find some capacitors....

#58 Post by dynamicdave » 23 Jun 2019, 09:57

I have an ESP-01S (on a breadboard) that runs off a USB power adapter, type HW-131.
This device would only manage 20 or so minutes of up-time before it rebooted.

I've just added a 470uF and 100nF to the breadboard, so it will be interesting to see how long it stays UP.

Regards, David.

#59 Post by georgep » 23 Jun 2019, 12:16

It's interesting that your board and mine appear so different.
Mine are undoubtedly 'clone' boards bought cheaply from Banggood...

[[I've not linked any images on here before so let's see how this works]]

Image Image

Time to reach for the capacitors and soldering iron :D
George

#60 Post by dynamicdave » 23 Jun 2019, 12:25

Yes it is - how strange??

I've bought about 40 Wemos D1 Minis (over the last 18-months for my IoT Computer Club) from Banggood and Ali-Express.
Some are real ones, others are most certainly clones.
Although they all flash and work perfectly fine.

PS: Just checked the test-rig has run for 936 minutes now without rebooting!!!
Last edited by dynamicdave on 23 Jun 2019, 12:32, edited 1 time in total.

Wiki
Normal user
Posts: 145
Joined: 23 Apr 2018, 17:55

#61 Post by Wiki » 23 Jun 2019, 12:30

TD-er wrote:
23 Jun 2019, 09:18
[...]
About the mentioned 20-ish March 2018 build. I am pretty sure it is core 2.3.0 on that one.
That's when all the trouble started, when I did change too many things at once and thus lost track of causes.
[...]
Wiki wrote:
22 Jun 2019, 14:34
[...]
The version they were running on before was dated on 22nd of May 2018 - without any wd timeouts and uptimes of >100 days.
[...]

#62 Post by Wiki » 23 Jun 2019, 12:32

I'm sorry, I can't get a screenshot uploaded, getting the message "Sorry, the board attachment quota has been reached."

The core version of the release 2018-05-22 is definitely 2.4.1

#63 Post by georgep » 23 Jun 2019, 12:47

Wiki wrote:
23 Jun 2019, 12:32
The core version of the release 2018-05-22 is definitely 2.4.1
Once I've added some capacitors I reckon I'll erase and then flash a fresh copy of ESP_Easy_mega-20190607_normal_core_241_ESP8266_4M.bin (unless anyone here would rather I tested something else) and we'll see how that goes :)

#64 Post by Wiki » 23 Jun 2019, 12:56

Sorry, but flashing a version from 2019 will result in wd reboots again. The developers are assuming that the change from 2.3.0 to 2.4.0 is the specific cause of this problem, but that's not true.

...but nobody believes me.....

#65 Post by dynamicdave » 23 Jun 2019, 14:06

Just for clarification...

I'm using ESP_Easy_mega-20190607_normal_ESP8266_4M.bin on my test rig with the capacitors.

PS: It's been up for 17hrs 17mins as we speak!!

Shardan
Normal user
Posts: 1122
Joined: 03 Sep 2016, 23:27
Location: Bielefeld / Germany

#66 Post by Shardan » 23 Jun 2019, 15:04

Hello all,

I'm using a lot of X5R/X7R MLCC capacitors all over the board of my weather sensor, of course near the ESP8266 too.
Even when using my big lab power supply I get a lot of restarts.
So I don't think power supply and/or capacitors are reasons for resets in general.

With the ESP_Easy_v2.0-20180209_normal_ESP8266_4096 my weather sensor never reaches 5 hours of uptime.
Besides that it shows some strange behaviour:
The device page shows up normally... after a while a refresh runs very long, sometimes until timeout ("Page not reachable").
Switching to the main page and back to the device page brings it back at once. Might be related.

I switched to ESP_Easy_mega-20190607_normal_core_252_ESP8266_4M yesterday, now it is up for 21 hours,
even the flaw with the device page seems to have disappeared.

As nothing else changed (same configuration and rules) I think something in the core has changed.
One thing I noticed that has changed:
With the "normal" version a free heap of 13,500 was reported, with the core 2.5.2 I see a free heap of 20,000.
Free stack is more or less the same.
Regards
Shardan

#67 Post by georgep » 23 Jun 2019, 15:12

Just so as not to repeat exactly what others are doing I decided to try ESP_Easy_mega-20190607_normal_core_260_sdk222_alpha_ESP8266_4M.bin
I'm seeing more free memory than before too.
Let's see how this goes...
Image

#68 Post by Wiki » 23 Jun 2019, 15:58

An uptime of 17 h or 20 h is no sign that a solution has been found.
[Attachment: Unbenannt.JPG]
Currently used build:
[Attachment: Unbenannt1.JPG]
The graph shows the uptime in minutes of a device: Wemos D1, INA219, DS18B20, MQTT to Domoticz, http to another server, connected to the other esp. As you can see, the device randomly runs for a few days, then again for only a few hours. The device ran in the same configuration (as did three other identically configured ones) for >100 days with the version from 2018-05-22.

Documentation of the stable build:
[Attachment: Unbenannt2.JPG]
Last edited by Wiki on 24 Jun 2019, 12:39, edited 1 time in total.

#69 Post by TD-er » 23 Jun 2019, 16:07

Wiki wrote:
23 Jun 2019, 12:56
Sorry, but flashing a version from 2019 will result in wd reboots again. The developers are assuming that the change from 2.3.0 to 2.4.0 is the specific cause of this problem, but that's not true.

...but nobody believes me.....
It is not that I don't believe you, I just did not read your build date very well.
Also the Watchdog resets are not a result of a single issue.
Just the fact that the reboot frequency has been decreased by switching to core 2.5.2 does prove it is not related to a single change/bug.
Another issue may be that the lack of reboots on some nodes does seem to be random.
For example, would the same unit using the same firmware still reach 10+ days uptime after it reboots?

#70 Post by Wiki » 23 Jun 2019, 18:49

@TD-er:

Don't misunderstand: in the past you repeated several times that the wd issue began when you upgraded the development environment from 2.3 to 2.4.

That's not 100% true, the release of 22.05.2018 is built with core 2.4.1. I know that finding the point where the problems began is very, very hard. What I have been doing in the past few weeks was to take the risk of bricking four of my production devices (which were running rock solid with the release of 22nd of May 2018, even after reboots, all with the same configuration concerning hardware, function, reporting etc., and which were instantly pretty unstable with releases post Dec. 2018) and to step back one by one into the past with different releases.

For the past few months there were only guesses as to when the problems started to appear - fumbling around in the dark. I am just trying to put a finger into the wound and to figure out when the wd reboots definitely began - result open - and hopefully to give you an idea where to look.

My results up to now:

releases from:
22.05.2018: stable
14.09.2018: stable (so far; flashed yesterday, 1.5 days up, still to be confirmed)
08.10.2018: unstable, wd-reboots
31.12.2018: unstable, wd-reboots
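
Stepping back through releases like this is essentially a bisection. A minimal sketch of that search follows, assuming that instability persists in every later build once it has been introduced (which may not hold if there are several independent causes). The function name is made up for illustration and the data is just the four results listed above.

```python
def first_unstable(builds, is_stable):
    """Binary search for the earliest unstable build.

    Assumes builds are sorted oldest-first, the first is stable and the
    last is unstable, and that instability persists once introduced.
    """
    lo, hi = 0, len(builds) - 1  # builds[lo] stable, builds[hi] unstable
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_stable(builds[mid]):
            lo = mid
        else:
            hi = mid
    return builds[hi]

# Results reported above; builds in between are untested.
observed = {"20180522": True, "20180914": True,
            "20181008": False, "20181231": False}
print(first_unstable(sorted(observed), observed.get))
```

With every nightly build between the last stable and first unstable release added to the list, each test run halves the remaining range.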

What I am actually curious about:

change log 16.09.2018:
[HW Watchdog] Backgroundtasks instead of yield during rules handling
change log 08.09.2018:
[Watchdog] Add watchdog feed to backgroundtasks() function

#71 Post by georgep » 23 Jun 2019, 19:05

I think it's great to have a few people here all looking at the issues - it may yet lead to some answers ;)

As of today I have three test units running various versions:

* Sonoff Basic: "Stable" build 120 - core 2.3.0
* Sonoff Basic: "Alleged stable" ESP_Easy_mega-20180522_normal_ESP8285_1024.bin
* Wemos D1 Mini: "Bleeding edge" ESP_Easy_mega-20190607_normal_core_260_sdk222_alpha_ESP8266_4M.bin

None have any sensors, switches etc attached or configured - just an out-of-the-box install with NTP and an MQTT broker configured.

I have some other devices controlling various stuff around the house but due to instability issues these are currently running T*****a :o
Once I can find a version that won't keep rebooting I'll look to moving them to ESPEasy :D

I'll try to leave the test units untouched for a few days and see what happens.

#72 Post by TD-er » 23 Jun 2019, 23:02

@wiki
Thanks for testing.
The thing is, at the time we switched core versions, we also had some very nasty build issues, which only became clear after a long time.
There was also the change in how the wifi is handled.

All-in-all way too many things at once and something I really don't ever want to have happening again.... ever!

So I am really glad you're trying to get some structure in this, since I've been trying to solve way too many things.
Also solved a lot of other issues, which makes it even harder to get a good idea on what is causing what issue.

I will look into the changes between 20180914 and 20181008.

Edit:
The commit you mentioned, about changing yield() to backgroundtask() is included in 20180916, so if you can test that build, that would be great.

#73 Post by Wiki » 23 Jun 2019, 23:25

@georgep

Nice to see you joining the party, I think you're welcome.

Recommendation: Don't put effort into flashing devices and letting them run if they have nothing to do. You will need at least some network traffic and some sensors to get an unstable device. Example from one node out of my node mesh:
[Attachment: Unbenannt.JPG]
So this firmware is one that is expected to produce wd-timeout reboots. But look at the running times:
[Attachment: Unbenannt1.JPG]
[Attachment: Unbenannt2.JPG]
As you can see, this particular device could be called stable. It's a Wemos D1, only responsible for switching a relay. It only sends its uptime to Domoticz (MQTT) every 10 min., is connected to esp, and that's it. The reboot in March was the update from the early January 2019 release to the mid-March one, the other two reboots were due to power cycles. So using Sonoffs without sensors / network traffic could lead to wrong conclusions.
Last edited by Wiki on 24 Jun 2019, 12:34, edited 1 time in total.

#74 Post by Wiki » 23 Jun 2019, 23:32

TD-er wrote:
23 Jun 2019, 23:02
[...]
The commit you mentioned, about changing yield() to backgroundtask() is included in 20180916, so if you can test that build, that would be great.
Öööhm, 20180916 is deleted :mrgreen:

But I will try a later one.

#75 Post by TD-er » 23 Jun 2019, 23:34

I am looking through the commits and I am now at around 20180930, where there were several commits related to memory allocation issues.
We were simply running out of memory and also there were some reports of all kinds of reboots.
I think some of them were already HW watchdog reboots by then, so I guess it started somewhere in between 20180914 and 20180930.

Apart from the yield() => backgroundtask() call you mentioned, there are 2 others I think deserve some extra attention:

TokenPos is not checked to be in the range of the allocated array (if it is, I didn't see it) in Calculate, and RPNCalculate assumes token[1] is there.
This may lead to very strange issues.

PR #1641:
Rules parsing, added:
bool condition[],
bool ifBranche[],
byte& ifBlock,
byte& fakeIfBlock)

This was later reverted and later added again.
Using these arrays of bool does also mean you have to perform range checks and I am not sure I saw where those checks are performed.
The rules processing also deserves some extra attention.
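
To illustrate the kind of range check meant here, below is a small Python stand-in. The firmware itself is C++, where writing past the end of a fixed array silently corrupts memory rather than raising an error. The class name, method names and the nesting limit are hypothetical and are not taken from the ESPEasy source.

```python
MAX_NESTING = 4  # hypothetical limit, standing in for the fixed C++ array size

class IfBlockTracker:
    """Tracks nested if-conditions in fixed-size storage with explicit bounds checks."""

    def __init__(self, max_nesting=MAX_NESTING):
        self.condition = [False] * max_nesting
        self.depth = 0

    def enter_if(self, result):
        # The range check: refuse to write past the end of the array.
        if self.depth >= len(self.condition):
            return False
        self.condition[self.depth] = result
        self.depth += 1
        return True

    def leave_if(self):
        # Guard against an unmatched endif underflowing the index.
        if self.depth == 0:
            return False
        self.depth -= 1
        return True
```

The point of the sketch is only that both the upper and the lower bound of the index need checking before the arrays are touched.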


These 2 are mainly related to rules processing, but I am not sure if it is the main cause of the WD reboots we're looking into now, since it may also reboot when rules are not enabled.

#76 Post by georgep » 24 Jun 2019, 09:04

TD-er wrote:
23 Jun 2019, 23:34
Also the rules processing should deserve some extra attention.
These 2 are mainly related to rules processing, but I am not sure if it is the main cause of the WD reboots we're looking into now, since it may also reboot when rules are not enabled.
The most unstable of my units (it would struggle to maintain an uptime of more than a few hours) had the most complex rules of all my units, with several "interlocking" timers triggering each other, toggling button presses, and controlling variables, LED and relay. This is the one that I have now moved over to Tasmota, but I do have a spare device that I can load the same rulesets onto.

Overnight it has also become clear to me that units given little or no work do seem stable...

Code:

/ESP-20180522-26/status/UP 882
/ESP-20190607-22/status/UP 1081
/ESP-R120-25/status/UP 769
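
Uptime messages like these can be turned into a reboot count by watching for the value going down. The sketch below assumes lines in exactly the "topic value" form shown above, with the uptime in minutes. The function names are made up for illustration.

```python
def parse_uptime(line):
    """Split a line like '/ESP-R120-25/status/UP 769' into (node, minutes)."""
    topic, value = line.rsplit(" ", 1)
    node = topic.strip("/").split("/")[0]
    return node, int(value)

def count_reboots(samples):
    """Count reboots per node: uptime going down means the node restarted."""
    last, reboots = {}, {}
    for line in samples:
        node, minutes = parse_uptime(line)
        reboots.setdefault(node, 0)
        if node in last and minutes < last[node]:
            reboots[node] += 1
        last[node] = minutes
    return reboots
```

Fed with the stream of published messages, this gives one number per node that can be compared across firmware versions.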

#77 Post by Shardan » 25 Jun 2019, 12:59

Now three devices running with 20190607 on 2.5.2 core as the 0523 version has shown too many problems.

One is a room sensor, just with a SI7021 and a DS18B20, no rules at all: Many restarts, every 2..3 hours for now, heap fragmentation 11% - 12%, free mem 18936, free stack 3600, heap max free 16960

A rain sensor with rules: Restarts about every 24hours, heap fragmentation 7%, free mem 18976, stack 3600, heap max free 17688

A fully equipped weather sensor with a lot of rules and 12 tasks used, restarts after 2..3 days, heap fragmentation 3%, free mem 17200, stack 3600, heap max free 16760
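
One crude way to compare these three sets of figures is the share of free heap that lies outside the largest free block. This is only an estimate made from the numbers quoted above. It is not claimed to be the formula the firmware uses for its "heap fragmentation" figure, and the function name is made up.

```python
def largest_block_shortfall(free_mem, max_free_block):
    """Percent of free heap that lies outside the largest free block."""
    return 100.0 * (1.0 - max_free_block / free_mem)

nodes = {
    "room sensor": (18936, 16960),
    "rain sensor": (18976, 17688),
    "weather sensor": (17200, 16760),
}
for name, (free_mem, max_block) in nodes.items():
    print(f"{name}: {largest_block_shortfall(free_mem, max_block):.1f}%")
```

The estimate ranks the three nodes in the same order as the reported fragmentation values, with the room sensor highest and the weather sensor lowest.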

I'm syslogging the room sensor now and report back
Regards
Shardan

#78 Post by grovkillen » 25 Jun 2019, 21:12

I updated my record unit yesterday with 20190607 normal core 252. It's been running fine since then:
[Attachment: Screenshot_20190625_210925_com.android.chrome~2.jpg]

#79 Post by georgep » 26 Jun 2019, 10:16

Since I set up my test units with various firmwares *none* have rebooted...

Code:

Local Time:	2019-06-26 09:13:25
Uptime:	2 days 19 hours 15 minutes
Load:	2.50% (LC=5498)
CPU Eco Mode:	false
Free Mem:	20344 (14944 - sendContentBlocking)
Free Stack:	3552 (1440 - LoadTaskSettings)
Heap Max Free Block:	17192
Heap Fragmentation:	10%
Boot:	Cold boot (0)
Reset Reason:	External System
This is just so weird! Prior to this even nodes that were "doing nothing" would not stay up like this.

All I can say that is different is that I explicitly did esptool "erase_flash" TWICE on each unit before I flashed the new code.
Also none of them have UDP linking enabled and all are set up with OpenHAB MQTT as the only controller.
Each has a simple script publishing %uptime% every minute.
I have tried to provoke reboots by flood pinging, flooding with http requests, abusing the buttons ... and ... nothing.

<scratches balding head>

#80 Post by dynamicdave » 26 Jun 2019, 12:36

Hi @georgep,
Did you add the capacitors to the Wemos (or ESP8266) you are using (as I think that will make a world of difference)?

Cheers from David.

#81 Post by georgep » 26 Jun 2019, 12:52

dynamicdave wrote:
26 Jun 2019, 12:36
Did you add the capacitors to the Wemos (or ESP8266) you are using (as I think that will make a world of difference)?
I did.

On the D1 mini I soldered a tantalum bead (remember those? 🤔) and a 0.1uF ceramic in parallel across each of the 5v and 3.3v rails, as close as I could get them.

I also have a (quite old) Wemos NodeMcu which seems to have an altogether better design of power supply yet the firmware reports a supply voltage of around 2.9v. 🙄

The Sonoff units I've left 'as is' for now.

George.

#82 Post by Shardan » 26 Jun 2019, 13:36

Shardan wrote:
25 Jun 2019, 12:59
Now three devices running with 20190607 on 2.5.2 core as the 0523 version has shown too many problems.

One is a room sensor, just with a SI7021 and a DS18B20, no rules at all: Many restarts, every 2..3 hours for now, heap fragmentation 11% - 12%, free mem 18936, free stack 3600, heap max free 16960

A rain sensor with rules: Restarts about every 24hours, heap fragmentation 7%, free mem 18976, stack 3600, heap max free 17688

A fully equipped weather sensor with a lot of rules and 12 tasks used, restarts after 2..3 days, heap fragmentation 3%, free mem 17200, stack 3600, heap max free 16760

I'm syslogging the room sensor now and report back
Bit of an update:
Restarts seem to be random.
The Syslogging doesn't say anything, it just stops.
Main page says
[Attachment: Main.jpg]
which is not very enlightening at least to me.
Regards
Shardan

#83 Post by georgep » 26 Jun 2019, 15:15

Shardan wrote:
26 Jun 2019, 13:36
The Syslogging doesn't say anything, it just stops.
Yeah I've found that too. I've set the highest level of debug on rsyslog yet never seen anything useful relating to a reboot.

I would kind of assume that "Manual Reboot" means that someone clicked the 'Reboot' button on the 'Tools' screen - but if that's what's shown after one of the infamous 'random reboots' then it's all very confusing :-/

{EDIT} I just spotted that my Sonoff unit running ESP_Easy_mega-20180522_normal_ESP8285_1024.bin has just rebooted! Isn't this supposed to be the reliable one?
The display shows:

Code:

Local Time	2019-06-26 14:16:20
Uptime		0 days 0 hours 37 minutes
Load		7% (LC=11703)
Free Mem	18288 (16912 - sendContentBlocking)
Boot		Manual reboot (1)
Reset Reason	Hardware Watchdog
Eeek!

#84 Post by Shardan » 26 Jun 2019, 17:36

Now the device rebooted again (Without button.. :) )
Manual reboot - Hardware Watchdog.

Still the one with biggest heap fragmentation (12% atm) is the one with most restarts..
May be related or not, IDK. I'm not that deep into programming, to say nicely.
Regards
Shardan

#85 Post by Shardan » 27 Jun 2019, 14:45

I was monitoring the devices closely these days due to these resets just by refreshing the "devices"- and "main" pages now and then.

I've noticed some strange things:

A short time before a reset occurs the web pages get incredibly slow.
The "Devices" page can take up to several seconds to refresh, sometimes long enough to produce a "no answer" error from the browser.
It helps to stop the browser loading, open another page, e.g. controllers, and then switch back to devices.

It can go on until the connection is completely lost, no web, no reporting to the controller.
The ESP8266 is still working internally. The LED indicator for heating lights up if I wet the rain sensor, for example.
This is done by a rule so the ESP is working, just the webserver died.

These issues are somewhat random. There is no visible correlation besides one thing:
After rebooting the heap fragmentation can be very different.. One device showed 6% after reboot, with the next reboot it had 19%.
I've got the notion that the devices tend to hang or reboot more often the higher the fragmentation gets. I might err.

Is it possible that there is something like a "race condition" between two or more software components?
Regards
Shardan

#86 Post by georgep » 27 Jun 2019, 15:36

Shardan wrote:
27 Jun 2019, 14:45

A short time before a reset occurs the web pages get incredibly slow.
I've seen exactly that behaviour too - and though I can't correlate it to any rebooting there is clearly "something going on" that isn't right.

We had a mains power outage here this morning so all of my devices restarted .. and within an hour the same device that rebooted yesterday (the supposed stable release 20180522) had rebooted again. Since then (5hrs) I've seen no more reboots, and prior to the power cut the other units had all been up for something like 4days+ with no reboots.

Shardan wrote:
27 Jun 2019, 14:45
The ESP8266 is still working internally. The LED indicator for heating lights up if I wet the rain sensor, for example.
This is done by a rule so the ESP is working, just the webserver died.
<snip>
Is it possible that there is something like a "race condition" between two or more software components?
I've been wondering about this. I've read something recently about the new ESP32 chip having dual processors and therefore [approximate quote] "your code doesn't have to continually yield to the wifi" - which suggests that on the ESP8266 code *DOES* have to yield to the WiFi. Obviously if this is handled properly the worst that should happen is some dropped (and hence retried) data, but clearly we're seeing something more serious here.
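
The "yield to the WiFi" idea can be pictured as breaking long-running work into pieces and handing control back at regular points. The sketch below is a plain Python illustration of that pattern only. `background_tasks` is a hypothetical hook standing in for whatever the firmware calls to let the WiFi stack run, and the interval of 10 items is arbitrary.

```python
def run_with_yields(items, handle, background_tasks, every=10):
    """Handle each item, calling `background_tasks` after every `every` items."""
    for count, item in enumerate(items, start=1):
        handle(item)
        if count % every == 0:
            background_tasks()  # stand-in for handing control to the WiFi stack
```

If one `handle` call itself takes too long, no amount of yielding between items helps, which is one way a single slow code path could still starve the rest of the system.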

I've also wondered if somehow (as I remember from a LONG time ago, in my younger days coding in assembler) there is a situation where errant code can "trample over" and corrupt data or even program memory. In those "olden days" this was often due to not allocating sufficient stack space to handle enough interrupt routine calls/returns. This could correlate with the fact that after one reboot, there seem to be more reboots of the same device, seemingly at shorter intervals. I'm not familiar enough with modern coding methods to know if this is likely, or even possible here.

George

#87 Post by Shardan » 27 Jun 2019, 21:46

Well, the "race condition" thing was just an idea of mine, remembering long ago times :)
Honestly I'm far away from a programmer nowadays, I never learned C or C++ so I can only guess.

The heap is used by many firmware parts so heap fragmentation issues might show these random effects:
I had hanging webserver, restart reasons "Exception" and "WD Restart" randomly. Sometimes after minutes,
sometimes after a day or two.

Watching my nodes I noticed the heap fragmentation increasing to high values up to near 20% on a node,
others starting with high fragmentation levels already...

This might be at least one of the problems causing reboots.
Regards
Shardan

#88 Post by Wiki » 28 Jun 2019, 11:43

georgep wrote:
26 Jun 2019, 15:15
[...]
{EDIT} I just spotted that my Sonoff unit running ESP_Easy_mega-20180522_normal_ESP8285_1024.bin has just rebooted! Isn't this supposed to be the reliable one?
The display shows:

Code:

Local Time	2019-06-26 14:16:20
Uptime		0 days 0 hours 37 minutes
Load		7% (LC=11703)
Free Mem	18288 (16912 - sendContentBlocking)
Boot		Manual reboot (1)
Reset Reason	Hardware Watchdog
[...]
Ooops, sounds not good.

I swear with that release (difference: Wemos D1, ESP_Easy_mega-20180522_normal_ESP8266_4096.bin) I didn't notice any wd reboots during the time they were running on it, I remember uptime >100 days when I flashed them in January - when my problems started.

Probably another difference: you did a clean flash, blanking the device first, didn't you? In May/June last year mine came from the mid-March release (first flashing), were upgraded to a version I think I remember as being from the 18th of April (veeeery slow Web GUI) and finally ended up on the version of 22nd of May, all without blanking them.

waspie
Normal user
Posts: 115
Joined: 09 Feb 2017, 19:35

#89 Post by waspie » 01 Jul 2019, 17:09

For another data point:
I have two units that I custom compile with about 7 plugins and all controllers. I use MQTT import and several dummy tasks and some with all 4 rulesets almost completely full with several timers. One has 4 dallas temp sensors and another has two ultrasonic sensors, a tsl2561, and an mcp23017 channel expander.
I use core 2.6 sdk3 and routinely have uptimes of 2-4 days and am not doing anything with any capacitors.

I have a few others running nextions with mqtt imports and dummy tasks with lots of rules and they're generally up many more days than 3-4 with nothing special done to the power supplies.

About the only thing I'm doing differently is removing all the plugins I don't need.

I'm also using syslog and the UDP network.

Shardan
Normal user
Posts: 1122
Joined: 03 Sep 2016, 23:27
Location: Bielefeld / Germany

Re: Hardware Watchdog Reboots

#90 Post by Shardan » 01 Jul 2019, 19:17

Just to make it a bit more complex....

My three devices I used for testing behave nicely atm... up for 1.5 to 6 days.....
Don't ask me why. I have no acceptable explanation.
Regards
Shardan

TD-er
Core team member
Posts: 1815
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#91 Post by TD-er » 05 Jul 2019, 23:15

I've been looking into this the last few days (almost full time) and I found something which does allow me to trigger this effect.

What I found is that the WD-reboots do happen right at the moment the WiFi connection transitions from "connected" to "Got IP".
The same transition does happen when static IP is being used.
This can happen at the first connect attempt or at any other reconnect, but it doesn't happen always.

The standard WiFi connection cycle can be seen as several stages:
- Connect to AP
- Authenticate connection => When successful, event "connected" is fired.
- Setup network configuration => When successful, event "got IP" is fired.

Every now and then the last step halts the system for some reason. Not even the loop() function is called then, which leads to a WD reboot.
In all the situations where the WD reboot occurs, I do not see the last event happening.
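
The stages above can be sketched as a small log check. This is a host-side Python sketch, not ESP Easy code. It assumes a timestamped event log (for example collected via syslog) with hypothetical event names "connected", "got_ip", "disconnected" and "boot". The function name and the timeout are also made up for illustration.

```python
def find_stalled_connects(events, timeout_s=10.0):
    """Return timestamps of "connected" events that never got a timely "got_ip".

    `events` is a list of (timestamp_seconds, name) tuples in time order.
    A "connected" event counts as stalled when the next event is a "boot",
    or when the next event of any kind arrives more than `timeout_s` later.
    A "connected" event at the very end of the log is left undecided.
    """
    stalled = []
    pending = None  # timestamp of a "connected" still waiting for "got_ip"
    for timestamp, name in events:
        if pending is not None and (name == "boot" or timestamp - pending > timeout_s):
            stalled.append(pending)
            pending = None
        if name == "connected":
            pending = timestamp
        elif name in ("got_ip", "disconnected"):
            pending = None
    return stalled

# Hypothetical log: one clean connect, one reconnect that ends in a reboot.
example_log = [
    (0, "connected"), (1, "got_ip"),
    (100, "disconnected"), (101, "connected"),
    (140, "boot"), (141, "connected"), (142, "got_ip"),
]
stalled_at = find_stalled_connects(example_log)
```

In this made-up log only the reconnect at t=101 is flagged, which is the pattern described above: "connected" is seen, "got IP" never arrives, and the next thing in the log is a reboot.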

To trigger it, I force a WiFi disconnect from my AP (a MikroTik) for the node I'm testing.
Sometimes the disconnect is not even seen at the ESP. There is no disconnected event fired and the connection just continues like nothing has happened.
It also may just do a disconnect after which the ESP does perform a new reconnect and continues work.
But every now and then the reconnect process does lead to a WD reboot.

A WiFi disconnect is something very normal.
It happens, for example, when the WiFi AP changes channels, or when too many errors have been reported (which can also be in the transmission of another client).

Now the curious part.
I've been doing lots of tests to see if changing timings (adding a delay() at some points) does help here.
But it was next to impossible to find a predictable behavior.
Later I managed to change a text in a log line (completely unrelated code) which resulted in completely unstable nodes. Roll back of the change made the node behave like before.
So there is something fishy here. Very likely an array which is addressed out of range, or some missing 0-terminating character or something like that.
This can be in our code, or in the core libraries. Not sure yet.


But at least I'm glad I have some reproducible setup which does a great job in debugging this.
Finally some progress, but no solution yet.

ThomasB
Normal user
Posts: 425
Joined: 17 Jun 2018, 20:41
Location: USA

Re: Hardware Watchdog Reboots

#92 Post by ThomasB » 06 Jul 2019, 00:25

Thanks for digging deep into the reboot issue. It's been a persistent problem for some of us. Your findings are interesting and it appears you are close to solving it.

I don't have any useful information to help in your debugging adventure; Just wanted to stop by and cheer you on.

- Thomas

rayE
Normal user
Posts: 136
Joined: 12 Oct 2017, 12:53
Location: Philippines

Re: Hardware Watchdog Reboots

#93 Post by rayE » 06 Jul 2019, 03:30

Not sure if this helps. I have 2 devices running the same application and set up as follows.
1. DHCP using a fixed IP octet (advanced). Both share the same address prefix, with 30 and 31 as the last octet.
2. ESP82xx Core 2_4_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3
3. Release 20190202.

My laptop is connected to the same router with the web GUI from both devices running. I have noticed on a few occasions that when I use the laptop after it has been in sleep mode, one or both units will perform a manual reboot at exactly the time the laptop wakes up. They also do random manual reboots and will run for a max of 2 days. I have tried fixed IP and many other configs over several months with no success. I have also tried not running the GUI from the devices on my computer, but that also had no effect.

As a further test I have changed the router's channel number from auto to fixed. I'll give it a try.

Ray

georgep
Normal user
Posts: 38
Joined: 05 May 2019, 16:32
Location: Somerset, UK

Re: Hardware Watchdog Reboots

#94 Post by georgep » 06 Jul 2019, 11:48

TD-er wrote:
05 Jul 2019, 23:15
I've been looking into this the last few days (almost full time) and I found something which does allow me to trigger this effect.
<<snip>>
But at least I'm glad I have some reproducible setup which does a great job in debugging this.
Finally some progress, but no solution yet.
@TD-er : Thank you so much for spending time on this, and well done for making what looks like some very positive progress.

I've been relatively quiet lately but all I have been able to find is that while one of my units (running a supposedly-stable old version) has been continually rebooting every day or so, the rest of them had all been up for something like 8 days prior to another mains power outage here a short while ago :-/

Something that could be significant is that my ESP units all have fixed DHCP addresses (reserved on the DHCP server) and my DHCP lease time is set to 7 days, which is much longer than is typical :roll:
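
For reference, a DHCP client does not wait for the full lease to run out. By default (RFC 2131) it tries to renew at 50% of the lease time (T1) and to rebind at 87.5% (T2), unless the server sends other values. A small Python sketch of that arithmetic, with a function name made up for illustration:

```python
from datetime import timedelta

def dhcp_renewal_times(lease):
    """Return the default (T1, T2) timers for a DHCP lease duration.

    T1 (renew) defaults to 50% of the lease and T2 (rebind) to 87.5%,
    assuming the server does not override them with its own options.
    """
    t1 = lease * 0.5
    t2 = lease * 0.875
    return t1, t2

# The 7 day lease mentioned above.
T1, T2 = dhcp_renewal_times(timedelta(days=7))
```

So with a 7 day lease a client would normally try its first renewal after 3.5 days, which is longer than most of the uptimes reported in this thread.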

My [ISP-provided] router is particularly dumb and won't allow me to manipulate live connections (or do very much at all!), but if I can assist in any way, I do have a second OpenWrt router that I could use to set up a separate network for testing, and I would be happy to help.

George

TD-er
Core team member
Posts: 1815
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#95 Post by TD-er » 06 Jul 2019, 17:08

rayE wrote:
06 Jul 2019, 03:30
[...] I have noticed on a few occasions that when i use the lap top that has been in sleep mode that one or both units will perform a manual reboot at exactly the time the laptop wakes up. [...]
It is possible the laptop is doing something that confuses the AP, which then sends all connected nodes a reconnect request.
As I also stated, not all reconnect requests fail and end up in a WD reboot. But what I'm seeing here is that there is a very strong correlation between WD reboots and WiFi reconnects.
In my test setup it is like at least 1-in-4 reconnects that result in a reboot, but it also depends highly on the (network) activity of the node itself.
If a node is inactive, it will often not even notice the disconnect and just continue working.
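
To put the 1-in-4 figure into perspective, here is a back-of-the-envelope sketch in Python. It assumes every reconnect has the same, independent 25% chance of ending in a reboot, which is a simplification (as said, it depends highly on node activity). The function names are made up for illustration.

```python
def survival_probability(reconnects, reboot_chance=0.25):
    """Chance that a node gets through `reconnects` reconnects without a WD reboot.

    Assumes each reconnect fails independently with probability `reboot_chance`.
    """
    if reconnects < 0:
        raise ValueError("reconnects must not be negative")
    return (1.0 - reboot_chance) ** reconnects

def expected_reconnects_until_reboot(reboot_chance=0.25):
    """Average number of reconnects up to and including the one that reboots."""
    return 1.0 / reboot_chance

after_four = survival_probability(4)
```

Under these assumptions a node has less than a one in three chance of surviving four reconnects, so uptime is mostly a matter of how often the AP forces a reconnect.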

So this finding does at least not disprove my theory :)

rayE
Normal user
Posts: 136
Joined: 12 Oct 2017, 12:53
Location: Philippines

Re: Hardware Watchdog Reboots

#96 Post by rayE » 07 Jul 2019, 12:23

It's way too early to be sure, but I just had the best uptime from both my devices since they started running continually in February. I'll just focus on one unit for this post. The previous best uptime was 2,434 (minutes). The latest uptime was 4,512, then there was a power outage, so we start again.

I have changed the router channel setting that both devices use from Auto to a fixed channel. This may be a fluke, but both devices have surpassed the previous best uptime by a long way. Our power supply here is not very reliable at the moment (frequent brownouts), although I can see from the web GUI what caused the reset and can distinguish between a cold boot (brownout) and other causes.

It's an easy change to make on the router so perhaps someone else would like to try this out and confirm or otherwise that it makes a difference?

Ray

TD-er
Core team member
Posts: 1815
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: Hardware Watchdog Reboots

#97 Post by TD-er » 08 Jul 2019, 08:59

It is for sure a good thing to set the WiFi channel to some fixed value.
I would suggest going for channel 1, 6 or 11, since those are the only ones that do not have interference from more than one other AP in range.
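
The reason 1, 6 and 11 come up can be checked with a little arithmetic. A minimal Python sketch, assuming 2.4 GHz channels 1 to 13 with centers at 2407 + 5 * channel MHz and the classic 22 MHz channel width (with 20 MHz wide channels the picture is slightly different). The function names are made up for illustration.

```python
def center_mhz(channel):
    """Center frequency of a 2.4 GHz WiFi channel (1..13) in MHz."""
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1..13 are handled here")
    return 2407 + 5 * channel

def channels_overlap(a, b, width_mhz=22):
    """True if two channels of the given width share spectrum."""
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz

def non_overlapping_from(first=1, width_mhz=22):
    """Greedily pick channels, starting at `first`, that never overlap each other."""
    picked = []
    for channel in range(first, 14):
        if all(not channels_overlap(channel, other, width_mhz) for other in picked):
            picked.append(channel)
    return picked

CLEAN_CHANNELS = non_overlapping_from()
```

Starting from channel 1 this picks 1, 6 and 11. An AP on, say, channel 3 would overlap with neighbours on both 1 and 6.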

As I wrote a few posts before, it is important to make sure the ESP does not have to re-connect to WiFi, since that's a major cause for reboots.

epost
Normal user
Posts: 23
Joined: 09 Feb 2018, 20:04

Re: Hardware Watchdog Reboots

#98 Post by epost » 10 Jul 2019, 11:17

I have a Sonoff PowR2 in use with a small Oled screen connected to it.
Normally the PowR2 suffered a Hardware Watchdog reboot within 8 hours.
The Sonoff PowR2 has a fixed IP number.
After switching off the Serial Port and the UDP Port, and switching on Force WiFi B/G
the device has been working for 4 days without Watchdog Reboot.

rayE
Normal user
Posts: 136
Joined: 12 Oct 2017, 12:53
Location: Philippines

Re: Hardware Watchdog Reboots

#99 Post by rayE » 10 Jul 2019, 11:58

My devices are Sonoff POW rev1. I tried what you're saying several months back and tested over several weeks, BUT after a few days of testing it seemed to make little or no difference in the overall average reboots. This one is a tricky one to get to the underlying problem :-)

I used to have this old saying for intermittent problems (I come from an engineering design background): "it seems to depend on which way the wind is blowing"..............very inconclusive :-)

BUT I (and others) think that in this case it's a pointer to the problem. I have noticed that when there is a storm brewing in my location in the tropics, i.e. low cloud base and heavy humidity, the signal strength from the multiple routers on my site goes all over the place (I monitor the channels and signal strength from an Android application). This presumably causes auto channel hopping of the routers, which in turn leads to more reboots. Maybe this is a pointer to the reboots the devices experience?

My point is: FIX the router channel (as already pointed out by the developers), and if there are fewer reboots then there is a DEFINITE connection, and NOT just a theory, to the ROOT cause of the problem! We need more testing and input from all out there to FIX this problem :-)

Ray

Shardan
Normal user
Posts: 1122
Joined: 03 Sep 2016, 23:27
Location: Bielefeld / Germany

Re: Hardware Watchdog Reboots

#100 Post by Shardan » 10 Jul 2019, 17:08

The reboots and "hangs" are at least partly caused by WiFi reconnects.

Usually all WiFi in my apartment runs on one Unifi Long Range AP.
Two nodes with Mega had permanent problems rebooting after 1...3 hours, hanging, showing blank web page etc.
The Unifi controller diagnostic told me "too much reconnects" for these nodes.

Luckily all home auto nodes run on a separate WiFi on the AP.
I've deleted that Wifi from the main AP and used another old AP I had in the basement shelf.
I configured the AP to a very quiet channel and 20 MHz bandwidth.
The nodes are set to keep WiFi up permanently.

And tada.... both nodes are up and running for 3 days now. No more web hangers etc.
Regards
Shardan
