temp sensors stopped after 25 days of usage

Moderators: grovkillen, Stuntteam, TD-er

GravityRZ
Normal user
Posts: 109
Joined: 23 Dec 2019, 21:24

temp sensors stopped after 25 days of usage

#1 Post by GravityRZ » 13 Sep 2020, 13:06

I was using the 20200812 firmware for 25 days without problems.
Yesterday I noticed that the ESP unit had stopped updating 2 temp sensors (Environment - DS18b20) and a barometer sensor (Environment - BMx280).

It was accessible through the browser but just did not update the sensors.
A reboot fixed the problem.
To be safe I updated to 20200829.

Does anybody know if this sounds familiar?

Ath
Normal user
Posts: 242
Joined: 10 Jun 2018, 12:06
Location: NL

Re: temp sensors stopped after 25 days of usage

#2 Post by Ath » 13 Sep 2020, 14:32

GravityRZ wrote: 13 Sep 2020, 13:06 anybody know if this sounds familiar?
Well, actually, I have several Sonoff S20s, and a couple of them, after about 25-30 days, no longer show up in the ESPEasy UDP network (as shown on the Main page in the web interface).
Strangely, an affected device does still list other active devices, but not itself or any other device that has been running > 25 days. These devices are still accessible from the web interface, but commands sent to them don't respond as expected. Other Sonoffs and ESPs showed up and worked nicely; only the ones running for a longer time are missing. (I have 4 Sonoffs that I bought 2 years ago as 1 purchase, which seem to come from 1 production batch, and 2 others that I bought just a couple of months ago, which even have a different board layout and red & green LEDs instead of blue & green :?)
Some of the other Sonoffs are close to the one that was missing from the list and are connected to the same WiFi router & channel, while some are next to the first one but connected to another WiFi access point. Signal is excellent for both networks: the router is less than 1m (one meter) away and the other access point ~3m. The last network disconnect was ~13 days ago (Last Disconnect Reason: '(202) Auth fail', another says '(201) No AP found'), and that connection time was about the same as for the first one, which I've already reset. (I suspect my NetGear WNDR4300 router of disconnecting for no reason, as I have seen many other connection issues with the device.)
I didn't check whether the first device reported to Domoticz as expected, as it lives in the living room, where its state can be viewed 'live' as needed. The ones still up, however, neither report to Domoticz (MQTT connection) nor respond to commands sent from Domoticz via the Domoticz MQTT Helper plugin, so I assume the MQTT subscription is cancelled/forgotten. Saving the Domoticz MQTT Helper settings doesn't bring it 'back to life', so a reboot will be needed.
Sending commands via the browser as a URL (not from the Tools page) still works as expected.
Others from that batch of 4 have been reset/recycled more recently, a couple of days ago, for several reasons, so they are not good comparison material. The 2 from the new batch have been up and running for 41 days and counting, but they don't work as expected.

All of the Sonoff S20s are running the same ESPEasy release, 20200801 Normal, as some of them have a BME280 built in. It is the download version, not self-built; I updated the entire Sonoff group in one go, since updating involves opening them up and wiring them to the laptop (OTA is not possible on the normal 1M builds).

I've been thinking of opening a GitHub issue for this phenomenon, but then I need it to happen again, and I reset the device(s) less than 2 days ago. But I'll keep an eye on this matter.
I don't use a syslog server, so no logging is available, unfortunately. As this seems related to WiFi, I'm not sure logging would have revealed anything other than the connection being lost after about 25 (or 13) days.
I could try to install an alt-wifi build to see if that changes anything, but it will take at least a few weeks before the results are available.

Edit:
I have an ESP01S, running the same ESPEasy build, with an uptime of 22 days that is still reporting its state in the UDP network and its 2 temperature sensors nicely to Domoticz over MQTT, so I'll be watching that one closely over the next few days. It has been connected to the access point (TP-Link Deco M9 Plus) continuously (0 Reconnects), as it gets a very poor signal from the NetGear router because of where this ESP is located.
/Ton

GravityRZ
Normal user
Posts: 109
Joined: 23 Dec 2019, 21:24

Re: temp sensors stopped after 25 days of usage

#3 Post by GravityRZ » 13 Sep 2020, 14:55

OK, I have regular ESP units, so we can rule out hardware.

I will monitor this as well and get back after 25 days.

The unit with the pulse counter seemed to be working at that time, because it was still sending out pulses.

That unit is set to WiFi no sleep, so maybe that setting affects the functionality in a positive way.

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#4 Post by TD-er » 13 Sep 2020, 22:10

Hmm, 25 days sounds a lot like 2^31 msec, off the top of my head.
A long time ago, I fixed a bug where we had a "Y2K-like" problem with the msec timer overflowing after 49.7 days.
But 25 days sounds just like a signed version of that.
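
As a quick check of that arithmetic, here is a minimal C++ sketch. The helper name `daysUntilWrap` is made up for this illustration and is not part of ESPEasy; it only converts a power of two in milliseconds to days.

```cpp
#include <cassert>
#include <cstdint>

// Milliseconds in one day.
constexpr double kMsPerDay = 24.0 * 60.0 * 60.0 * 1000.0;

// Days until a millisecond counter with `valueBits` usable bits runs out.
// Hypothetical helper, only for checking the numbers in this thread.
constexpr double daysUntilWrap(unsigned valueBits) {
  return static_cast<double>(std::uint64_t{1} << valueBits) / kMsPerDay;
}

// Unsigned 32-bit counter: wraps after 2^32 msec, about 49.7 days.
static_assert(daysUntilWrap(32) > 49.7 && daysUntilWrap(32) < 49.8,
              "2^32 msec should be about 49.7 days");

// Signed 32-bit counter: turns negative after 2^31 msec, about 24.9 days.
static_assert(daysUntilWrap(31) > 24.8 && daysUntilWrap(31) < 24.9,
              "2^31 msec should be about 24.9 days");
```

So a signed 32-bit millisecond value runs out just under 25 days, which fits the uptimes reported in this thread.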

Are there other issues too, apart from the Dallas sensors?
I will have a look at the code for this.

Ath
Normal user
Posts: 242
Joined: 10 Jun 2018, 12:06
Location: NL

Re: temp sensors stopped after 25 days of usage

#5 Post by Ath » 15 Sep 2020, 20:25

TD-er wrote: 13 Sep 2020, 22:10 Hmm 25 days sounds a lot like 2^31 msec, from the top of my head.
A long time ago, I fixed a bug where we had a "Y2K-like" problem with the msec timer overflowing after 49.7 days
But 25 days sounds just like a signed version of that.
...
Today the ESP01S stopped sending the temperature values to Domoticz, and, as expected, the uptime is 25 days and a couple of hours. CPU load shows as 100%, but despite that number, the unit is quite responsive to web requests.
Looking at the Info page, after refreshing a few times, I noticed that the value for Connected is now counting down, instead of counting up as expected (it is at 24d13h43m now) :o Reporting of temperatures to Domoticz stopped ca. 7h ago.

So this is either an ESPEasy issue or something in the Arduino WiFi code (as that has been the source of other WiFi issues over the last couple of years), and overflow of a signed 32-bit integer for a msec counter seems like a very reasonable cause.
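
A countdown like this is what a signed 32-bit millisecond difference would produce once more than 2^31 msec have passed. The sketch below is only a model under stated assumptions: that the Connected value is derived from such a signed difference and displayed as its magnitude, and that the connection started right after boot (timestamp 0). The function names are hypothetical and not taken from the ESPEasy source.

```cpp
#include <cassert>
#include <cstdint>

// Signed difference between two 32-bit millisecond timestamps, the way a
// timeDiff-style helper is assumed to compute it. The conversion of an
// out-of-range value wraps on two's-complement targets.
std::int32_t signedDiffMs(std::uint32_t start, std::uint32_t now) {
  return static_cast<std::int32_t>(now - start);
}

// Whole minutes in the magnitude of a signed millisecond difference.
// Assumption: a display routine that drops the sign would show this value.
std::int64_t magnitudeMinutes(std::int32_t diffMs) {
  std::int64_t wide = diffMs;  // widen first so negating cannot overflow
  if (wide < 0) {
    wide = -wide;
  }
  return wide / 60000;
}
```

Under those assumptions, 25 d 3 h 20 min after connecting (2,172,000,000 msec) the difference is negative and its magnitude is 35382 minutes, i.e. 24 d 13 h 42 min. One minute later it is 35381 minutes, so the displayed time shrinks as real time advances.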

I have to reboot the unit as it is used for monitoring the temperatures.

Edit:
After the reboot, the temperatures are being reported to Domoticz again, CPU load is down to the usual 5-7%, and the time for Connected is counting up again.
/Ton

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#6 Post by TD-er » 15 Sep 2020, 22:58

I have seen the 'connected' time indeed reporting negative values.
Not sure if it is related, but at least it is a good starting point to see what else may be wrong here.

I guess it may be something as simple as the timeDiff result (a signed int) being assigned to an unsigned int.

Those bugs were not manifesting themselves roughly a year ago... so I guess we're on the right path here :)

Ath
Normal user
Posts: 242
Joined: 10 Jun 2018, 12:06
Location: NL

Re: temp sensors stopped after 25 days of usage

#7 Post by Ath » 16 Sep 2020, 08:04

TD-er wrote: 15 Sep 2020, 22:58 I have seen the 'connected' time indeed reporting negative values.
Not sure if it is related, but at least it is a good starting point to see what else may be wrong here.

I guess it may be something simple as the timeDiff (returning a signed int) is assigned to an unsigned int.

Those bugs were not manifesting themselves roughly a year ago... so I guess we're on the right path here :)
Well, those counting down connection times I don't care much about, but the annoying part is the unit no longer sends data, including over the UDP channel, with no other clues than nothing received at the other end, even though the web interface is still working as expected. It does receive data from the UDP channel, as all other ESP's, active < 25 days, are visible on the Main page of the web interface.

A reboot fixes it for the next 25 days (I assume).
/Ton

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#8 Post by TD-er » 16 Sep 2020, 09:22

Well I can imagine the error would be something like this:

Code: Select all

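// Suspected pattern 1: millis() returns an unsigned value; stored in a
// signed int it turns negative after 2^31 msec (about 25 days).
int i = millis();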
if (i > (some_prev_value + 100)) {
...
}
or maybe something like this:

Code: Select all

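// Suspected pattern 2: assuming timePassedSince() returns a signed value,
// a negative result becomes a huge unsigned number and the test below fails.
unsigned int i = timePassedSince(some_prev_value);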
if (i < 1000) {
...
}
What controller(s) do you use?
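
The two suspected patterns above can be tried out off-device with plain 32-bit integers. This is a minimal sketch, not ESPEasy code: `timePassedSince` here is a stand-in that takes `now` as a parameter instead of calling millis(), and the other function names are made up for the illustration. The last function shows a common wrap-safe comparison for contrast.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the helper named in the post: signed milliseconds from
// `start` to `now`. Assumption: modelled on 32-bit timestamps.
std::int32_t timePassedSince(std::uint32_t start, std::uint32_t now) {
  return static_cast<std::int32_t>(now - start);
}

// Pattern 1: the current time is squeezed into a signed 32-bit int.
// Once `now` reaches 2^31 the copy is negative and the test stops passing.
bool elapsedSignedNow(std::int32_t prev, std::uint32_t now) {
  std::int32_t i = static_cast<std::int32_t>(now);
  return i > static_cast<std::int64_t>(prev) + 100;  // 64-bit sum, no overflow
}

// Pattern 2: a signed difference is stored in an unsigned variable.
// A negative difference turns into a value above 2^31, so "< 1000" fails.
bool recentUnsignedCopy(std::uint32_t start, std::uint32_t now) {
  std::uint32_t i = static_cast<std::uint32_t>(timePassedSince(start, now));
  return i < 1000;
}

// Wrap-safe alternative: subtract first, then compare the unsigned
// difference. This keeps working across the 2^31 and 2^32 boundaries.
bool elapsedWrapSafe(std::uint32_t prev, std::uint32_t now,
                     std::uint32_t interval) {
  return static_cast<std::uint32_t>(now - prev) >= interval;
}
```

With `prev` just below 2^31 and `now` 1000 msec later, `elapsedSignedNow` reports that the interval has not elapsed while `elapsedWrapSafe` reports that it has. Whether either pattern actually occurs in the firmware is exactly what still needs to be checked.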

Ath
Normal user
Posts: 242
Joined: 10 Jun 2018, 12:06
Location: NL

Re: temp sensors stopped after 25 days of usage

#9 Post by Ath » 16 Sep 2020, 13:05

Domoticz MQTT, I can switch one, or a few, to Domoticz HTTP, but that will probably not explain why the devices become invisible on the UDP network...
/Ton

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#10 Post by TD-er » 16 Sep 2020, 14:26

Is the time displayed correct on those nodes?
And the unixTime?

Ath
Normal user
Posts: 242
Joined: 10 Jun 2018, 12:06
Location: NL

Re: temp sensors stopped after 25 days of usage

#11 Post by Ath » 16 Sep 2020, 20:06

NTP is enabled on all ESP's here, so time should be fine (didn't see a discrepancy in date/time when the unit wasn't sending out data), and as I assume unixtime is directly derived from the NTP time, that should be fine as well.

It will take a couple of weeks before any of my units reaches the 25 days uptime again, so not much interesting to report, I guess.
/Ton

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#12 Post by TD-er » 16 Sep 2020, 21:47

Well, the ESP will not query the NTP server very often, so in between, all updates to the system time are done using the same timeDiff function.
Meaning that if there is now somehow some bug in the timeDiff function, or the millis() function, then it could be something to look at.

The (apparent) high CPU load and no longer sending data to a controller can also be attributed to a sudden change in the behaviour of the timeDiff function.
Not saying it is the only explanation, but it is one of the few commonalities between the symptoms you described.
And system time no longer incrementing is quite easy to verify, if it happens.

TD-er
Core team member
Posts: 3357
Joined: 01 Sep 2017, 22:13
Location: the Netherlands

Re: temp sensors stopped after 25 days of usage

#13 Post by TD-er » 17 Sep 2020, 10:10

Ah, found an issue which could explain it all.
It is in the function bool WiFiConnected():

Code: Select all

  if (validWiFi) {
    // Connected, thus disable any timer to start AP mode. (except when in WiFi setup mode)
    if (!wifiSetupConnect) {
      timerAPstart = 0;
    }
    STOP_TIMER(WIFI_ISCONNECTED_STATS);
    // Only return true after some time since it got connected.
    return timePassedSince(lastConnectMoment) > 100;
  }
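Presumably, once lastConnectMoment lies more than 2^31 msec in the past, timePassedSince() turns negative and this function returns false even though WiFi is still connected.
I will add a flag to record that this was already checked, and reset the flag when a new WiFi connection is made.

I will also have a look at other parts of the code to see where similar checks may also result in an error.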
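
A minimal sketch of what such a flag could look like, written as a standalone C++ struct so it can be tested off-device. All names except lastConnectMoment are hypothetical, and this is not the actual ESPEasy change; it only illustrates the latch idea described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state holder; not ESPEasy's real data structure.
struct WiFiStableCheck {
  std::uint32_t lastConnectMoment = 0;  // millis() value at connect time
  bool stableChecked = false;           // latched once the 100 msec passed

  // Call when a new WiFi connection is made: restart the check.
  void onConnected(std::uint32_t nowMs) {
    lastConnectMoment = nowMs;
    stableChecked = false;
  }

  // Original-style check: signed difference compared against 100 msec.
  // Turns false again once the difference exceeds 2^31 msec.
  bool connectedUnlatched(std::uint32_t nowMs) const {
    return static_cast<std::int32_t>(nowMs - lastConnectMoment) > 100;
  }

  // Latched check: once it has been true, it stays true until the next
  // onConnected() call, regardless of how much time has passed.
  bool connectedLatched(std::uint32_t nowMs) {
    if (!stableChecked && connectedUnlatched(nowMs)) {
      stableChecked = true;
    }
    return stableChecked;
  }
};
```

The latch relies on the check being called at least once between 100 msec and 2^31 msec after connecting, which seems a safe assumption for a function that is polled continuously.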
