The data collected by the completion suggester experiment doesn't make sense and is all over the place. Starting on Sept 10, the TestSearchSatisfaction2 and CompletionSuggestion experiments started seeing what should be unique 64-bit numbers coming from multiple IP addresses. Figure out why this is happening and fix it.
Description
Details
Event Timeline
Change 238306 had a related patch set uploaded (by EBernhardson):
Update CompletionSuggestion bucket selection
Change 238355 had a related patch set uploaded (by EBernhardson):
Update CompletionSuggestion bucket selection
Patch SWATted out. Will evaluate the data collected tomorrow morning to decide whether this fixes the problem.
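The patch itself isn't shown here, but for context, a minimal sketch of what deterministic per-session bucket selection might look like, assuming a random 64-bit session token generated client-side (the token format, function names, and the hashing scheme are all illustrative, not the actual patch):

```python
import hashlib
import random

def new_session_token():
    """Generate a random 64-bit session token as a hex string.

    Assumed approach: the token is created once per browsing session and
    stored client-side, so all events from that session share it.
    """
    return "%016x" % random.getrandbits(64)

def in_bucket(token, bucket_name, rate):
    """Deterministically decide whether this session is in a test bucket.

    Hashing token + bucket_name means different tests bucket independently,
    while the same session always gets the same answer for a given test.
    """
    h = int(hashlib.sha256((token + ":" + bucket_name).encode()).hexdigest(), 16)
    return (h % 10**6) / 10**6 < rate
```

The key property is stability: for a fixed (token, bucket) pair the decision never changes, so a session cannot flip between control and test mid-way.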
Looks like Dan found the issue today, reported at https://lists.wikimedia.org/pipermail/analytics/2015-September/004285.html
So this basically means we need to throw away the clientIp information until this can be fixed.
Can we get useful test data without this value? We can still correlate events from the same user on the same page; we just can't correlate them across pages (but chances are a given user won't be opted into the test more than once).
Are there other oddities in the data we can't explain?
Update: we discussed this on IRC and arrived at the conclusion that we can assume relative independence of sets of events. Which is to say, given our low sampling rates, we are unlikely to see lots of sessions from the same users.
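The independence assumption above can be checked with a quick back-of-envelope calculation. A sketch, using hypothetical numbers (the actual sampling rate and sessions-per-user figures aren't stated in this task):

```python
def p_repeat_user(rate, sessions):
    """Probability a single user has two or more sampled sessions,
    given `sessions` total sessions, each independently sampled at `rate`.

    Binomial: 1 - P(0 sessions sampled) - P(exactly 1 session sampled).
    """
    p0 = (1 - rate) ** sessions
    p1 = sessions * rate * (1 - rate) ** (sessions - 1)
    return 1 - p0 - p1

# Hypothetical: a 1-in-200 sampling rate and 20 sessions per user over the
# test window gives well under a 1% chance the same user is sampled twice.
print(p_repeat_user(1 / 200, 20))
```

So at low sampling rates, repeat users should be rare enough to treat sampled sessions as roughly independent.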
Does this mean the answer to the question "Is this test still scientifically valid, can be analysed as-is, and does not need to be re-run?" is "Yes"?
To be valid, I think we have to start the test over as of when the adjusted schema was deployed today. A few changes were made to bucketing (which will also help other tests going forward), so the data collected from now on isn't directly comparable with the data collected before. Maybe? I'm not entirely confident, but putting it out there.
Understood. We should get the test restarted ASAP. How about restarting the test on Thursday the 17th and running it for one week? @mpopov @Ironholds @EBernhardson Thoughts?
I think we can consider the test restarted the moment the new schema started collecting data.