EventLogging validation errors for EditAttemptStep
Open, Needs Triage, Public

Assigned To

Authored By

	awight
	Oct 31 2019, 10:30 PM

Description

Errors from logstash:

~~None is not of type 'integer'~~
'.event.abort_timing' should be integer

Logstash search: https://logstash.wikimedia.org/goto/317e3a351a88fb9e361996847880321c

Maybe a dozen errors per minute. It isn't obvious to me which field is failing the validation.

Details

Subject	Repo	Branch	Lines +/-
Bump EditAttemptStep to 2.0.2	mediawiki/extensions/WikimediaEvents	master	+1 -1
Bump EditAttemptStep to 2.0.2	mediawiki/extensions/WikiEditor	master	+1 -1
Add missing `new-sticky-header` init_mechanism to editattemptstep	schemas/event/secondary	master	+616 -3
EditAttemptStep: let timing values fall back to -1	mediawiki/extensions/WikimediaEvents	master	+4 -1
When switching from WikiEditor activate VE after notifying WikiEditor	mediawiki/extensions/VisualEditor	master	+1 -1
Remove duplicate load error handling code	mediawiki/extensions/VisualEditor	master	+0 -11
Move abort event tracking from the start to the end of the teardown process	mediawiki/extensions/VisualEditor	master	+12 -6
Give revision_id a fallback that'll validate	mediawiki/extensions/VisualEditor	master	+1 -1


Related Objects

Mentioned In: T332438: Centralize EditAttemptStep logging code in WikimediaEvents
T287487: Uncaught TypeError: Cannot read property 'tools' of null
T286815: '.event.abort_timing' should be integer
T261664: EditAttemptStep validation error '0' is not of type 'integer'
Mentioned Here: T243641: Instrument mobile wikitext editor fallback worfklow
T332438: Centralize EditAttemptStep logging code in WikimediaEvents
T261664: EditAttemptStep validation error '0' is not of type 'integer'

Event Timeline


In a month, January, @DLynch will check the dashboard.

MBinder_WMF moved this task from Kanban Board to Next Quarter on the Editing-team board. Mar 17 2021, 10:17 PM

MBinder_WMF edited projects, added Editing-team; removed Editing-team (Kanban Board).

Still seeing about 1k errors per day for this schema, now with these complaints:

'.event.abort_timing' should be integer
'.event.loaded_timing' should be integer
'.event.ready_timing' should be integer
'.event.first_change_timing' should be integer
'.event.save_intent_timing' should be integer

Looking at some example events, the problem is that we're sending an explicit "abort_timing": null, but the schema specifies that the fields are not nullable, only *omittable*.
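To make the null-versus-omitted distinction concrete, here is a minimal sketch. `omitNullFields` is a hypothetical helper, not part of EventLogging or VisualEditor; it only shows the payload shape that an omittable-but-not-nullable integer field would accept.

```javascript
// Hypothetical helper: drop keys whose value is null or undefined, so that
// optional (but non-nullable) schema fields are omitted instead of sent as null.
function omitNullFields( event ) {
	const cleaned = {};
	for ( const [ key, value ] of Object.entries( event ) ) {
		if ( value !== null && value !== undefined ) {
			cleaned[ key ] = value;
		}
	}
	return cleaned;
}

// Shape of an event that fails validation: the field is present but null.
const rejected = { action: 'abort', abort_type: 'nochange', abort_timing: null };
// Same event with the null field left out entirely.
const accepted = omitNullFields( rejected );
console.log( JSON.stringify( accepted ) ); // {"action":"abort","abort_type":"nochange"}
```

Note that omitting the field only avoids the validation error; it doesn't explain why the timing was missing in the first place.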

You probably already know, but just FYI https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#No_union_types_/_No_null_values :)

MBinder_WMF added a project: Growth-Team-Filtering. Apr 15 2021, 6:51 PM

I think that counts as the original issue, with the "it's back!" one resolved.

Quickly skimming the errors in logstash, I notice that the clear majority are action=abort + abort_type=nochange. This would seem to imply that, in ve.init.mw.trackSubscriber.js's computeDuration, the attempt to return timeStamp - timing.ready; is failing. I suspect this means that most of these errors are when someone navigates away from an edit page after the editor has finished activating but before it has become 'ready'... which should be a very narrow window. I'll need to do some more investigation to work out why it's apparently not.
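One plausible way such a subtraction could end up as an explicit null on the wire (an assumption on my part, not something confirmed in this task): if timing.ready was never recorded, the subtraction yields NaN, and JSON.stringify serializes NaN as null. The function below is a simplified stand-in, not the real computeDuration.

```javascript
// Simplified stand-in for a duration computation; NOT the real computeDuration.
function sketchDuration( timeStamp, timing ) {
	// If timing.ready was never set, this is `number - undefined`, which is NaN.
	return timeStamp - timing.ready;
}

const timing = {}; // 'ready' has not been recorded yet
const abortTiming = sketchDuration( 1500, timing );

// NaN has no JSON representation, so it is serialized as null.
const payload = JSON.stringify( { action: 'abort', abort_timing: abortTiming } );
console.log( payload ); // {"action":"abort","abort_timing":null}
```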

matmarex merged a task: T281206: '.event.abort_timing' should be integer. Apr 27 2021, 6:09 PM

matmarex added subscribers: • mewoph, kostajh.

DLynch mentioned this in T286815: '.event.abort_timing' should be integer. Jul 19 2021, 4:52 PM

MNeisler subscribed. Jul 19 2021, 5:26 PM

DLynch merged a task: T286815: '.event.abort_timing' should be integer. Jul 19 2021, 8:33 PM

DLynch added subscribers: cjming, nshahquinn-wmf, mforns, nettrom_WMF.

matmarex mentioned this in T287487: Uncaught TypeError: Cannot read property 'tools' of null. Jul 27 2021, 3:35 PM

Updated logstash link in the description.

I can reproduce this locally by clicking "Edit" then pressing Escape quickly. It takes a few tries to get the timing right, but it's reproducible.

I think this can occur because:

Tearing down and setting up the editor are both actually multi-step processes, with a few promises in the middle
Other code may run after a promise is resolved but before the promise success callback runs
We log the 'abort' event at the beginning of the teardown process; in contrast, we log 'ready' and 'loaded' at the end of the setup process.

As a consequence, if the user action to close the editor and the network response with article data arrive at just the right time, it's possible that we will:

1. Send the 'init' event
2. Set up the editor
3. Send the 'abort' event (which expects the 'ready' event to have been sent, since the editor is set up)
4. Tear down the editor
5. And only then send the 'ready' and 'loaded' events (which expect the 'abort' event to not have been sent)

Most of our code can handle that, but apparently not the event logging. The events 'abort', 'ready' and 'loaded' will all have messed up timing fields.
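The promise-ordering point above can be shown in isolation. This is a generic sketch of JavaScript promise behaviour, not VisualEditor code; the names `setupDone`, `resolveSetup` and `readyLogged` are made up for the illustration.

```javascript
const order = [];
let resolveSetup;
const setupDone = new Promise( ( resolve ) => {
	resolveSetup = resolve;
} );

// The 'ready' event is only logged in the success callback of the setup promise.
const readyLogged = setupDone.then( () => {
	order.push( 'ready' );
} );

// Setup finishes: the promise is resolved, but its callback is only queued.
resolveSetup();
// Code that runs before the queued callback (e.g. the start of a teardown)
// gets to log 'abort' first.
order.push( 'abort' );

readyLogged.then( () => {
	console.log( order ); // [ 'abort', 'ready' ]
} );
```

Logging 'abort' at the end of teardown rather than the start narrows this kind of window, but it doesn't rule out other orderings.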

Change 708347 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

gerritbot added a project: Patch-For-Review. Jul 27 2021, 8:53 PM

I haven't really proved that the above is exactly what happens, but I think at this point it's a better use of time to just try it and see if that fixes the problem. I couldn't reproduce it locally, but there might be other scenarios that are just harder to reproduce.

Also, moving where the tracking happens will cause the 'abort' event to be logged a bit later than previously. I doubt that we're using the abort timing for any analysis though, so it shouldn't matter.

matmarex edited projects, added Editing-team (Kanban Board); removed Editing-team. Jul 27 2021, 8:57 PM

matmarex moved this task from Upcoming to Code Review on the Editing-team (Kanban Board) board.

I don't think we've ever done any analysis that'd depend on the timing of the abort event at that kind of fine-grained level. Things like "X% of sessions ended with an abort, vs Y% that ended with a saveSuccess" is where it normally comes up.

Change 708347 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Move abort event tracking from the start to the end of the teardown process

https://gerrit.wikimedia.org/r/708347

ReleaseTaggerBot added a project: MW-1.37-notes (1.37.0-wmf.17; 2021-08-02). Jul 28 2021, 5:00 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 28 2021, 5:11 PM

We should check Logstash after this is deployed to confirm that the issue is fixed.

Hi, just FYI: this error is still occurring, with ~1100 occurrences in the last 12 hours.

(Screenshot: Screen Shot 2021-08-06 at 10.26.56 AM.png)

That seems to be consistent with the pre-deploy rate of errors, so I don't think this patch actually touched whatever's causing this.

Well, that is very sad.

I looked into this again, and I can reproduce the error locally when I use the mobile version with visual as my default mode and close the loading screen before the editor loads.

Change 730071 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

gerritbot added a project: Patch-For-Review. Oct 11 2021, 9:47 PM

matmarex moved this task from Upcoming to Code Review on the Editing-team (Kanban Board) board. Oct 11 2021, 9:47 PM

Back to waiting-to-check-the-logs.

Change 730071 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Remove duplicate load error handling code

https://gerrit.wikimedia.org/r/730071

Maintenance_bot removed a project: Patch-For-Review. Oct 12 2021, 4:11 PM

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.5; 2021-10-19). Oct 12 2021, 5:00 PM

ppelberg moved this task from Blocked / Needs More Work to Ready to Be Worked On on the Editing-team (Kanban Board) board. Oct 25 2021, 5:22 PM

This has massively dropped since 10/21 (when the deploy would have reached all the wikis):

...but it's still not completely gone. We've clearly hit the major cause, though.

ppelberg moved this task from Ready to Be Worked On to Blocked / Needs More Work on the Editing-team (Kanban Board) board. Nov 30 2021, 5:06 PM

ppelberg moved this task from Blocked / Needs More Work to Upcoming on the Editing-team (Kanban Board) board. Dec 8 2021, 5:09 PM

VPuffetMichel edited projects, added Editing-team; removed Editing-team (Kanban Board). Jun 6 2022, 7:20 PM

VPuffetMichel moved this task from Kanban Board to Upcoming on the Editing-team board.

…and it increased again following T332438.

Having clicked through for some very rough sampling, they're almost all [ready/loaded/abort]_timing should be integer and are coming from VisualEditor on desktop. I'd say that this suggests we're failing to properly set the init timing in VE sometimes, and that's carrying through to other timings which are based on init.

I believe it's related to switching modes. If you switch from source to VE you (sometimes?) wind up getting the VE init before the WikiEditor abort. This wasn't a problem before because they were maintaining separate timing registries, but now it causes problems because the centralized registry wipes out all timings when it receives that abort.
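A rough sketch of how that ordering could produce the errors, assuming a single shared registry that is cleared on any abort. The `timings` object and `track` function are illustrative only and are not the WikimediaEvents implementation.

```javascript
// Hypothetical shared timing registry; names are illustrative only.
const timings = {};

function track( editor, action, now ) {
	const event = { editor: editor, action: action };
	if ( action === 'init' ) {
		timings.init = now;
	} else {
		// All other timings are measured relative to init.
		event[ action + '_timing' ] = now - timings.init;
	}
	if ( action === 'abort' ) {
		// An abort ends the session, so every recorded timing is wiped.
		for ( const key of Object.keys( timings ) ) {
			delete timings[ key ];
		}
	}
	return event;
}

// Switching from source to visual, with the events arriving out of order:
const events = [
	track( 'visualeditor', 'init', 1000 ),
	track( 'wikitext', 'abort', 1010 ), // wipes the init that VE just recorded
	track( 'visualeditor', 'ready', 1400 ) // 1400 - undefined is NaN
];
console.log( JSON.stringify( events[ 2 ] ) );
// {"editor":"visualeditor","action":"ready","ready_timing":null}
```

With separate per-editor registries the wikitext abort would not touch the VisualEditor init, which would match this not having been a problem before the centralization.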

matmarex removed matmarex as the assignee of this task. Apr 11 2023, 8:23 PM

(@DLynch You mentioned today that you're looking into this)

matmarex moved this task from Upcoming to Doing on the Editing-team (Kanban Board) board. Apr 11 2023, 8:26 PM

matmarex mentioned this in T332438: Centralize EditAttemptStep logging code in WikimediaEvents.

Change 937964 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] When switching from WikiEditor activate VE *after* notifying WikiEditor

https://gerrit.wikimedia.org/r/937964

gerritbot added a project: Patch-For-Review. Jul 13 2023, 2:33 PM

DLynch moved this task from Doing to Code Review on the Editing-team (Kanban Board) board. Jul 13 2023, 2:34 PM

Change 937964 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] When switching from WikiEditor activate VE *after* notifying WikiEditor

https://gerrit.wikimedia.org/r/937964

This can be checked after July 20th to see the effect on the error rate. (As of today it's at about 18k eventgate_validation_errors / day.)

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.18; 2023-07-18). Jul 13 2023, 5:00 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 13 2023, 5:11 PM

It's down to a peak of around 12k/day now, so we've gotten rid of a lot of them. More investigation seems needed.

ppelberg moved this task from Blocked / Needs More Work to Ready to Be Worked On on the Editing-team (Kanban Board) board. Aug 1 2023, 4:09 PM

Change 961250 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: let timing values fall back to -1

https://gerrit.wikimedia.org/r/961250

gerritbot added a project: Patch-For-Review. Sep 26 2023, 10:45 PM

DLynch moved this task from Ready to Be Worked On to Code Review on the Editing-team (Kanban Board) board. Sep 26 2023, 10:46 PM

Talked to @MNeisler about this and confirmed that we wouldn't disrupt any ongoing analysis by using -1 as a value in these fields to avoid the validation error. It's already used in specific event timings where the timing's meaningless as a way to say "don't use me".

Once this is merged and rolls out we can ask Megan for some sample sessions where the ready/loaded/abort events have -1 timings and see if that's helpful for debugging this.
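For illustration, the fallback idea could look something like the following. `timingOrFallback` is a hypothetical helper written for this comment, not the code from change 961250.

```javascript
// Hypothetical helper: return an integer timing, or -1 ("don't use me")
// when the value is missing or not a finite number.
function timingOrFallback( value ) {
	return Number.isFinite( value ) ? Math.round( value ) : -1;
}

console.log( timingOrFallback( 1234.6 ) ); // 1235
console.log( timingOrFallback( NaN ) ); // -1
console.log( timingOrFallback( undefined ) ); // -1
console.log( timingOrFallback( null ) ); // -1
```

Because -1 is an integer, the event validates, and sessions with a broken timing can still be found by filtering on that value.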

Change 961250 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: let timing values fall back to -1

https://gerrit.wikimedia.org/r/961250

Maintenance_bot removed a project: Patch-For-Review. Sep 27 2023, 6:30 PM

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.29; 2023-10-03); removed MW-1.41-notes (1.41.0-wmf.18; 2023-07-18). Sep 27 2023, 7:00 PM

Almost resolved:

The remaining trickle seems to be caused by events related to T243641, and should disappear as well once those changes are all live.

Hmm, there are also errors complaining about "init_mechanism":"new-sticky-header".

Huh, so there are. Looks like I didn't add them in https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/805728 -- I think because at the time it wasn't even possible to create a new section from the sticky header, if I'm remembering the timeline correctly.

You've added click-new-sticky-header though – is that supposed to be the same thing? Perhaps something is logging events wrong.

click means it came from a click on an edit link. new means it came from either a redlink or from a section=new action. (url and url-new are those two, but from direct navigation rather than in-page actions.)

Can't rule out that it's getting logged wrong somewhere, but I suspect it's just that we put the "add topic" button into the sticky header a month after we added that logging and so we never tested what clicking that button would get logged as.

A quick test got me a url-new-sticky-header from clicking the sticky header "add topic" with DiscussionTools disabled, but DiscussionTools itself doesn't adjust the mechanism to account for this, so it's something else that's causing the plain new-sticky-header. The only way I've managed to trigger that is by visiting an uncreated page, deliberately making my window incredibly short so the sticky header can actually be accessed on it, then choosing to create the page from the sticky header.
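As a sketch of why the plain new-sticky-header value gets rejected: the schema only accepts init_mechanism values from a fixed list. The lists below are partial and illustrative (only values mentioned in this task), not the real schema enum.

```javascript
// Partial, illustrative lists; the real schema enum is longer and may differ.
const allowedBefore = [ 'click', 'new', 'url', 'url-new', 'click-new-sticky-header' ];
const allowedAfter = allowedBefore.concat( 'new-sticky-header' );

// Mimics an enum check: the value must match one allowed entry exactly.
function isAllowedMechanism( value, allowed ) {
	return allowed.includes( value );
}

console.log( isAllowedMechanism( 'new-sticky-header', allowedBefore ) ); // false
console.log( isAllowedMechanism( 'new-sticky-header', allowedAfter ) ); // true
```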

> You've added click-new-sticky-header though – is that supposed to be the same thing?

In the cold light of the morning I see you were actually pointing out that this was weirdly named, rather than asking what it means. Yeah, I think that click-new-sticky-header should just be new-sticky-header.

Change 969958 had a related patch set uploaded (by DLynch; author: DLynch):

[schemas/event/secondary@master] Add missing `new-sticky-header` init_mechanism to editattemptstep

https://gerrit.wikimedia.org/r/969958

gerritbot added a project: Patch-For-Review. Oct 30 2023, 7:36 PM

Change 969959 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikiEditor@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969959

Change 969960 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969960

Change 969958 merged by jenkins-bot:

[schemas/event/secondary@master] Add missing `new-sticky-header` init_mechanism to editattemptstep

https://gerrit.wikimedia.org/r/969958

Change 969959 merged by jenkins-bot:

[mediawiki/extensions/WikiEditor@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969959

Change 969960 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Bump EditAttemptStep to 2.0.2

https://gerrit.wikimedia.org/r/969960

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.3; 2023-10-31). Oct 30 2023, 9:00 PM

Maintenance_bot removed a project: Patch-For-Review. Oct 30 2023, 9:10 PM

Another wait-for-train pause.

The last round of patches went out in the train that made it to most wikis on Nov 2nd. We now seem to be hovering at around 0-5 EditAttemptStep schema validation errors on a given day (there haven't been any since the 11th, at the time I write this).

Remaining errors since the 2nd seem to be one of:

'.event.mw_version' should be string (it was null)
'.event.action' should be equal to one of the allowed values (it was "https://eu15.proxysite.com/process.php?d=[big-long-id-number]&b=1" and other similar URLs, so that's weird)
'.event.editor_interface' should be string, '.event.editor_interface' should be equal to one of the allowed values (it was null)
'.event.abort_timing' should be integer (it was null)

The first two there are something being generally wrong with the page that I doubt is the fault of the logging code -- mw.config.get('wgVersion') returning null, and presumably a proxy meddling with page-contents.

Everything's happening at rates low enough that it's probably not an issue for the integrity of our data. (I was able to literally look at every single error that happened in the last two weeks to make that quick summary, the first time that's been possible since we started this task.)

I'll give it another week or so to make sure nothing crops up, but then I think we could close this.

Nice! Yeah some of that (like event.action) could just be accidental/spam events from other sites/producers.

DLynch moved this task from Blocked / Needs More Work to Ready for Sign Off on the Editing-team (Kanban Board) board. Jan 29 2024, 6:19 PM
