Svoboda | Graniru | BBC Russia | Golosameriki | Facebook
BBC RussianHomePhabricator
Log In
Maniphest T237063

EventLogging validation errors for EditAttemptStep
Open, Needs TriagePublic

Description

Errors from logstash:

  • None is not of type 'integer'
  • '.event.abort_timing' should be integer

Logstash search: https://logstash.wikimedia.org/goto/317e3a351a88fb9e361996847880321c

Maybe a dozen errors per minute. It isn't obvious to me which field is failing the validation.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In a month, January, @DLynch will check the dashboard.

Still seeing about 1k errors per day for this schema, now with these complaints:

  • '.event.abort_timing' should be integer
  • '.event.loaded_timing' should be integer
  • '.event.ready_timing' should be integer
  • '.event.first_change_timing' should be integer
  • '.event.save_intent_timing' should be integer

Looking at some example events, the problem is that we're sending an explicit "abort_timing": null, but the schema specifies that the fields are not nullable, only *omittable*.

I think that counts as the original issue, with the "it's back!" one resolved.

Quickly skimming the errors in logstash, I notice that the clear majority is action=abort + abort_type=nochange. This would seem to imply that in ve.init.mw.trackSubscriber.js's computeDuration, it attempting return timeStamp - timing.ready; is failing. I suspect this means that most of these errors are when someone navigates away from an edit page after the editor has finished activating but before it has become 'ready'... which should be a very narrow window. I'll need to do some more investigation to work out why it's apparently not.

Updated logstash link in the description.

I can reproduce this locally by clicking "Edit" then pressing Escape quickly. It takes a few tries to get the timing right, but it's reproducible.

I think this can occur because:

  • Tearing down and setting up the editor are both actually multi-step processes, with a few promises in the middle
  • Other code may run after a promise is resolved but before the promise success callback runs
  • We log the 'abort' event at the beginning of the teardown process; in contrast, we log 'ready' and 'loaded' at the end of the setup process.

As a consequence, if the user action to close the editor and the network response with article data arrive at just the right time, it's possible that we will:

  1. Send the 'init' event
  2. Set up the editor
  3. Send the 'abort' event (which expects the 'ready' event to have been sent, since the editor is set up)
  4. Tear down the editor
  5. And only then send the 'ready' and 'loaded' events (which expect the 'abort' event to not have been sent)

Most of our code can handle that, but apparently not the event logging. The events 'abort', 'ready' and 'loaded' will all have messed up timing fields.

Change 708347 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

I haven't really proved that the above is exactly what happens, but I think at this point it's a better use of time to just try it and see if that fixes the problem. I couldn't reproduce it locally, but there might be other scenarios that are just harder to reproduce.

Also, moving where the tracking happens will cause the 'abort' event to be logged a bit later than previously. I doubt that we're using the abort timing for any analysis though, so it shouldn't matter.

I don't think we've ever done any analysis that'd depend on the timing of the abort event at that kind of fine-grained level. Things like "X% of sessions ended with an abort, vs Y% that ended with a saveSuccess" is where it normally comes up.

Change 708347 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

We should check Logstash after this is deployed to confirm that the issue is fixed.

hi - just fyi, this error is still occurring ~1100 in the last 12 hours

Screen Shot 2021-08-06 at 10.26.56 AM.png (2×3 px, 791 KB)

That seems to be consistent with the pre-deploy rate of errors, so I don't think this patch actually touched whatever's causing this.

I looked into this again, and I can reproduce the error locally when using the mobile version, and my default mode is visual, and I close the loading screen before the editor loads.

Change 730071 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

Change 730071 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

This has massively dropped since 10/21 (when the deploy would have reached all the wikis):

image.png (338×426 px, 17 KB)

...but it's still not completely gone. We've clearly hit the major cause, though.

…and it increased again following T332438.

image.png (533×2 px, 59 KB)

Having clicked through for some very rough sampling, they're almost all [ready/loaded/abort]_timing should be integer and are coming from VisualEditor on desktop. I'd say that this suggests we're failing to properly set the init timing in VE sometimes, and that's carrying through to other timings which are based on init.

I believe it's related to switching modes. If you switch from source to VE you (sometimes?) wind up getting the VE init before the WikiEditor abort. This wasn't a problem before because they were maintaining separate timing registries, but now it causes problems because the centralized registry wipes out all timings when it receives that abort.

matmarex edited projects, added Editing-team (Kanban Board); removed Editing-team.

(@DLynch You mentioned today that you're looking into this)

Change 937964 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] When switching from WikiEditor activate VE *after* notifying WikiEditor

https://gerrit.wikimedia.org/r/937964

Change 937964 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] When switching from WikiEditor activate VE *after* notifying WikiEditor

https://gerrit.wikimedia.org/r/937964

This can be checked after July 20th to see the effect on the error rate. (As-of today it's at about 18k eventgate_validation_errors / day.)

It's down to a peak of around 12k/day now, so we've gotten rid of a lot of them. More investigation seems needed.

Change 961250 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: let timing values fall back to -1

https://gerrit.wikimedia.org/r/961250

Talked to @MNeisler about this and confirmed that we wouldn't disrupt any ongoing analysis by using -1 as a value in these fields to avoid the validation error. It's already used in specific event timings where the timing's meaningless as a way to say "don't use me".

Once this is merged and rolls out we can ask Megan for some sample sessions where the ready/loaded/abort events have -1 timings and see if that's helpful for debugging this.

Change 961250 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: let timing values fall back to -1

https://gerrit.wikimedia.org/r/961250

Almost resolved:

image.png (403×2 px, 88 KB)

The remaining trickle seems to be caused by events related to T243641, and should disappear as well once those changes are all live.

Hmm, there are also errors complaining about "init_mechanism":"new-sticky-header".

Huh, so there are. Looks like I didn't add them in https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/805728 -- I think because at the time it wasn't even possible to create a new section from the sticky header, if I'm remembering the timeline correctly.

You've added click-new-sticky-header though – is that supposed to be the same thing? Perhaps something is logging events wrong.

click means it came from a click on an edit link. new means it came from either a redlink or from a section=new action. (url and url-new are those two, but from direct navigation rather than in-page actions.)

Can't rule out that it's getting logged wrong somewhere, but I suspect it's just that we put the "add topic" button into the sticky header a month after we added that logging and so we never tested what clicking that button would get logged as.

A quick test got me a url-new-sticky-header from clicking the sticky header "add topic" with discussiontools disabled, but DiscussionTools itself doesn't adjust the mechanism accounting for this, so it's something else that's causing the plain new-sticky-header. The only way I've managed to trigger that is by visiting an uncreated page, deliberately making my window incredibly short so the sticky header can actually be accessed on it, then choosing to create the page from the sticky header.

You've added click-new-sticky-header though – is that supposed to be the same thing?

In the cold light of the morning I see you were actually pointing out that this was weirdly named, rather than asking what it means. Yeah, I think that click-new-sticky-header should just be new-sticky-header.

Change 969958 had a related patch set uploaded (by DLynch; author: DLynch):

[schemas/event/secondary@master] Add missing `new-sticky-header` init_mechanism to editattemptstep

https://gerrit.wikimedia.org/r/969958

Change 969959 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikiEditor@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969959

Change 969960 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969960

Change 969958 merged by jenkins-bot:

[schemas/event/secondary@master] Add missing `new-sticky-header` init_mechanism to editattemptstep

https://gerrit.wikimedia.org/r/969958

Change 969959 merged by jenkins-bot:

[mediawiki/extensions/WikiEditor@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969959

Change 969960 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969960

The last round of patches went out in the train that made it to most wikis on Nov 2nd. We now seem to be hovering at around 0-5 EditAttemptStep schema validation errors on a given day (there haven't been any since the 11th, at the time I write this).

Remaining errors since the 2nd seem to be one of:

  • '.event.mw_version' should be string (it was null)
  • '.event.action' should be equal to one of the allowed values (it was "https://eu15.proxysite.com/process.php?d=[big-long-id-number]&b=1" and other similar URLs, so that's weird)
  • '.event.editor_interface' should be string, '.event.editor_interface' should be equal to one of the allowed values (it was null)
  • '.event.abort_timing' should be integer (it was null)

The first two there are something being generally wrong with the page that I doubt is the fault of the logging code -- mw.config.get('wgVersion') returning null, and presumably a proxy meddling with page-contents.

Everything's happening at rates low enough that it's probably not an issue for the integrity of our data. (I was able to literally look at every single error that happened in the last two weeks to make that quick summary, the first time that's been possible since we started this task.)

I'll give it another week or so to make sure nothing crops up, but then I think we could close this.

Nice! Yeah some of that (like event.action) could just be accidental/spam events from other sites/producers.