ESP32 Cluster

Moderators: grovkillen, Stuntteam, TD-er

martinus
Normal user
Posts: 129
Joined: 15 Feb 2020, 16:57

ESP32 Cluster

#1 Post by martinus » 06 Jun 2020, 12:56

Two weeks ago, while on holiday, I wanted to remotely update a few rules on my central ESPEasy unit. The unit got stuck after I submitted the change, and I had no way to push the reset button. This is inconvenient because this main unit handles several pieces of home-automation (HA) logic instead of Domoticz.

I had used an I2C watchdog in the past, but while thinking about a more advanced solution, I came up with a plan for an ESP32 "Cluster" approach. The idea is that whatever happens (as long as we have mains power), at least one of the units will keep working. It can hard reset the other node and, if that does not help, take over some of its work.

So this will be the setup for this experiment:
ESP32_Cluster.png
GPIO 26 on each module is cross-wired to the reset pin of the other module. Now every cluster needs some sort of heartbeat method. I could use another set of GPIO pins with an alternating signal, but maybe it's better to use UDP or HTTP calls between both nodes to reset a timer value. That would provide a basic way of node verification.
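
The "reset a timer value" idea can be sketched as a small timeout monitor. This is only a host-side Python illustration of the logic, not ESPEasy code; the class name `HeartbeatMonitor`, its methods, and the 30-second timeout are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer node last checked in (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # allowed silence before the peer counts as dead
        self._clock = clock             # injectable clock, handy for testing
        self._last_beat = clock()

    def beat(self):
        """Call whenever a UDP packet or HTTP call arrives from the peer."""
        self._last_beat = self._clock()

    def expired(self):
        """True once the peer has been silent for longer than the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30)  # 30 s is an arbitrary example value
    monitor.beat()
    print("peer silent too long:", monitor.expired())
```

When `expired()` turns true, the surviving node would pulse its GPIO 26 to reset the other one.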

Next step would be a similar set of rules (or even config) that can be enabled/disabled on demand so the active node will take over duties from the other node.

This definitely needs more work (and likely customization of the ESPEasy framework), but it would be cool to get something like this working.

Any comments, ideas, or even proven solutions on this would be appreciated...

TD-er
Core team member
Posts: 8739
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: ESP32 Cluster

#2 Post by TD-er » 06 Jun 2020, 13:36

The ESP32 does have a HW watchdog, right?

Do you know what may have gone wrong for the unit to get stuck?
For example, did it reboot and have some GPIO pin low or high to force the unit into flash boot mode?

martinus

Re: ESP32 Cluster

#3 Post by martinus » 06 Jun 2020, 14:38

TD-er wrote: 06 Jun 2020, 13:36
The ESP32 does have a HW watchdog, right?

Do you know what may have gone wrong for the unit to get stuck?
For example, did it reboot and have some GPIO pin low or high to force the unit into flash boot mode?

Yes, the ESP32 also has a built-in watchdog, but it only helps in severe cases where the whole application hangs.
In my case I could still ping the unit, so the network stack was still up, but I had lost access to the web GUI. Maybe it was stuck in some internal loop in which the internal watchdog was still being fed.

Anyway, another ESP32 would make a much more advanced watchdog, and it only adds $6 to the solution. This kind of watchdog could even check whether the web GUI on the other node still responds, so it's more like a true "OSI layer 7" watchdog.
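
As a rough illustration of such an application-level check, the Python sketch below treats the peer as alive only if its web page answers with HTTP 200. It runs on a PC rather than on the ESP32, and the function name `webgui_alive` and the 5-second timeout are assumptions for the example.

```python
import http.client
import urllib.request


def webgui_alive(url, timeout_s=5.0):
    """Return True only if the peer's web page answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP error codes;
        # ValueError covers a malformed URL.
        return False
```

A ping would still succeed against a node whose web server has stopped accepting requests, while this check would fail, which is the point of testing at the application layer.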

Well, still experimental... We will see what it brings us. :mrgreen:

TD-er

Re: ESP32 Cluster

#4 Post by TD-er » 06 Jun 2020, 17:03

I've seen a few times myself that the ESP32 webserver was no longer accepting requests.

No clue yet what may be causing this.

martinus

Re: ESP32 Cluster

#5 Post by martinus » 06 Jun 2020, 18:49

I've also cross-wired the standard serial port of each module to the hardware Serial2 port of the other and started working on a watchdog/debug plugin.
I can now check the serial output from each node using the weblog feature on the other node and see all output even from the early boot stage:

23164: SER 2: ets Jun 8 2016 00:22:57
23164: SER 2:
23165: SER 2: rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
23170: SER 2: configsip: 0, SPIWP:0xee
23178: SER 2: clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
23182: SER 2: mode:DIO, clock div:1
23186: SER 2: load:0x3fff0018,len:4
23190: SER 2: load:0x3fff001c,len:928
23194: SER 2: ho 0 tail 12 room 4
23198: SER 2: load:0x40078000,len:8424
23202: SER 2: ho 0 tail 12 room 4
23206: SER 2: load:0x40080400,len:5868
23210: SER 2: entry 0x4008069c
23764: SER 2: Checking devicetype: 0
23765: SER 2: Alternate checking on I2C: 0
23766: SER 2: �U43 : Info :
23774: SER 2: INIT : Booting version: My Build: May 30 2020 14:35:39 (ESP32 SDK v3.2.3-14-gd3e562907)
23780: SER 2: 44 : Info : INIT : Free RAM:283028
23866: SER 2: 127 : Info : CRC : No program memory checksum found. Check output of crc2.py
24065: SER 2: 288 : Info : INIT : Free RAM:278008
24565: SER 2: 789 : Info : INIT : I2C
24566: SER 2: 789 : Info : INIT : SPI not enabled
24569: SER 2: 791 : Info : P201 Core Init
24665: SER 2: 927 : Info : INFO : Plugins: 22 [Normal] (ESP32 SDK v3.2.3-14-gd3e562907)
24666: SER 2: 927 : Info : EVENT: System#Wake
24965: SER 2: 1236 : Info : WIFI : Start network scan
24971: SER 2: 1238 : Info : IP : Static IP : 192.168.0.137 GW: 192.168.0.1 SN: 255.255.255.0 DNS: 192.168.0.1
24977: SER 2: 1243 : Info : WIFI : Connecting test attempt #0
24982: SER 2: 1250 : Info : Webserver: start
25164: SER 2: 1401 : Info : WIFI : Scan finished, found: 1
26265: SER 2: 2496 : Info : WIFI : Connected! AP: test
26272: SER 2: 2497 : Info : WIFI : Static IP: 192.168.0.137 (Central1)

So when one of the units drops off the network, I can use the other node's web interface to see whether any useful output is produced on the serial connection.
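
One possible use of the captured output is to pull the reset reason out of the boot line shown in the log above. The Python sketch below does this with a regular expression; the pattern and the function name `parse_reset_line` are assumptions based only on the single `rst:...,boot:...` line in this log, so other firmware versions may print a different format.

```python
import re

# Matches the boot line from the log above, e.g.
# "rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)"
RESET_LINE = re.compile(
    r"rst:(0x[0-9a-fA-F]+) \((\w+)\),boot:(0x[0-9a-fA-F]+) \((\w+)\)"
)


def parse_reset_line(log_line):
    """Return (reset_code, reset_name, boot_code, boot_name), or None if absent."""
    match = RESET_LINE.search(log_line)
    if match is None:
        return None
    return (int(match.group(1), 16), match.group(2),
            int(match.group(3), 16), match.group(4))


if __name__ == "__main__":
    line = "23165: SER 2: rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)"
    print(parse_reset_line(line))
```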

TD-er

Re: ESP32 Cluster

#6 Post by TD-er » 06 Jun 2020, 20:35

Ah that's a nice feature.

The log of the ESP32 already has a lot more buffer, but that's only for logs we supply.
Not the logs output by the core libs themselves. (e.g. crashdump and pre-boot logs)

martinus

Re: ESP32 Cluster

#7 Post by martinus » 07 Jun 2020, 13:23

I finished a bare-bones plugin that checks the remote node's web GUI and provides the Serial2 debug mode:
WD1.png
WD2.png
The plugin adds a small "/watchdog" web page that needs no password. It currently only responds with OK.
When I disable the watchdog on one node, the other node resets it after three consecutive failures. So far so good.
Now I have to run a small endurance test to see if things are stable...
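
The "three consecutive failures" rule can be sketched as a small counter that clears on any successful check. This is a Python illustration of the logic only, not the plugin's actual code; the class name `FailureCounter` and its `record` method are hypothetical.

```python
class FailureCounter:
    """Signals a reset after N consecutive failed checks (hypothetical helper)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, check_ok):
        """Feed one check result; return True when a reset is due."""
        if check_ok:
            self.failures = 0       # any success clears the streak
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0       # start counting again after the reset
            return True
        return False


if __name__ == "__main__":
    counter = FailureCounter()
    for ok in (False, False, True, False, False, False):
        if counter.record(ok):
            print("reset the other node")
```

Requiring consecutive failures means a single dropped request does not trigger a reset.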

TD-er

Re: ESP32 Cluster

#8 Post by TD-er » 07 Jun 2020, 14:25

You could also use the already existing /json page for it, or does the call to /watchdog also do something different?
