Maniphest T313004

Cold cache page view to metawiki times out then fails with OOM in timeout handler
Closed, Resolved (Public)

Description

I tested login via the secondary DC as part of T279664. The Special:CentralAutoLogin request to metawiki encountered an Excimer timeout exception after 60s, and then it spent 51s trying to render an error page before failing with an OOM (exceeding 666MB). I captured debug logs, which show it attempting to load 32471 page texts, of which 31662 have a title starting with "Centralnotice". After it hit the exception, it started loading them all over again while building the error page.

This is the familiar problem of CentralNotice's abuse of the MessageCache system, previously reported at T33595, T468, T203925, etc.

Some stopgap solution may be warranted given FRTech's historical reluctance to rearchitect the code.

Event Timeline

Change 813737 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Disable the message cache if load() throws an exception

https://gerrit.wikimedia.org/r/813737
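
To illustrate the idea in that patch title: if the load fails once, the cache marks itself disabled so that the error page does not trigger a second full load. The following is a minimal Python sketch of that pattern only, not the MediaWiki code; the class name `MessageCacheSketch`, the `loader` callable, and the `disabled` attribute are all hypothetical.

```python
class MessageCacheSketch:
    """Hypothetical model of a cache that gives up after a failed load."""

    def __init__(self, loader):
        self._loader = loader    # callable returning a dict of key -> text
        self._messages = None    # populated lazily on first lookup
        self.disabled = False    # set once load has thrown

    def get(self, key):
        if self.disabled:
            # A previous load failed: do not retry, report "no override".
            return None
        if self._messages is None:
            try:
                self._messages = self._loader()
            except Exception:
                # Remember the failure so the error handler cannot
                # start the same expensive load a second time.
                self.disabled = True
                raise
        return self._messages.get(key)
```

With this shape, the first lookup propagates the timeout, and every later lookup in the same request returns immediately instead of reloading.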

I have a patch almost ready which introduces $wgMessageCacheExcludedPrefixes, but I still need to investigate why isMainCacheable() is not doing its job.

Many instances of this timeout can be seen in the eqiad logs, so it's not just a codfw issue.

I think isMainCacheable() can be made to work. The timeout occurs in RevisionStore::newRevisionsFromBatch(). Because the content option is passed, it loads the text of all revisions, including those which will fail isMainCacheable(); filtering is only done afterwards. So if we filter the ResultWrapper before it is passed to newRevisionsFromBatch(), the load should be fairly fast.
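
The reordering can be sketched as follows. This is a Python model of "filter first, then load text", not the actual patch; `Row`, `fetch_text`, `known_keys`, and the body of `is_main_cacheable` are hypothetical stand-ins, and the real cacheability rule is not described in this task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass(frozen=True)
class Row:
    title: str   # page title of a message override
    rev_id: int  # revision whose text would be loaded


def is_main_cacheable(title: str, known_keys: Set[str]) -> bool:
    # Assumed rule for this sketch: only known message keys are cacheable.
    key = title[:1].lower() + title[1:]
    return key in known_keys


def load_cacheable_texts(
    rows: Iterable[Row],
    fetch_text: Callable[[int], str],
    known_keys: Set[str],
) -> Dict[str, str]:
    # Filter on the cheap metadata first, so the expensive text fetch
    # only runs for rows that will actually be kept.
    cacheable = [r for r in rows if is_main_cacheable(r.title, known_keys)]
    return {r.title: fetch_text(r.rev_id) for r in cacheable}
```

In the scenario described above, this ordering would mean the roughly 31662 "Centralnotice" rows are dropped before any text is fetched for them, assuming they fail the cacheability check.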

Change 813883 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] MessageCache: Don't load the content for uncacheable rows

https://gerrit.wikimedia.org/r/813883

Another option is to port isMainCacheable() to SQL, i.e. send all possible keys in a big IN() condition. But that may degrade performance on wikis that aren't metawiki. We can benchmark it in production to decide whether further work is necessary.
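
For comparison, the SQL variant could look roughly like the sketch below, which uses Python's `sqlite3` module purely for illustration. The table `message_pages` and its columns are hypothetical, and production code would go through MediaWiki's own database layer.

```python
import sqlite3
from typing import List, Sequence, Tuple


def fetch_known_rows(
    conn: sqlite3.Connection, known_titles: Sequence[str]
) -> List[Tuple[str, int]]:
    """Return only rows whose title is in known_titles, filtered in SQL."""
    if not known_titles:
        return []
    # One placeholder per key. A very large key list may exceed the
    # database's bound-parameter limit and would need to be chunked.
    placeholders = ",".join("?" for _ in known_titles)
    sql = (
        "SELECT page_title, rev_id FROM message_pages "
        "WHERE page_title IN (" + placeholders + ") "
        "ORDER BY page_title"
    )
    return conn.execute(sql, list(known_titles)).fetchall()
```

The trade-off mentioned above shows up here directly: every request sends the full key list to the database, which helps when most rows are uncacheable but adds cost on wikis where few rows would be filtered out.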

Change 813737 merged by jenkins-bot:

[mediawiki/core@master] language: Disable MessageCache if load() throws an exception

https://gerrit.wikimedia.org/r/813737

Change 813883 merged by jenkins-bot:

[mediawiki/core@master] MessageCache: Don't load the content for uncacheable rows

https://gerrit.wikimedia.org/r/813883

Change 814919 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Instrument LoadBalancer with statsd metrics

https://gerrit.wikimedia.org/r/814919

Change 814919 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Instrument LoadBalancer with statsd metrics

https://gerrit.wikimedia.org/r/814919

I confirmed that there are no recent timeouts from MessageCache::loadFromDBWithLock, so presumably it was fixed by 813883.