
Joe (Giuseppe Lavagetto)
Spy

Projects (22)


User Details

User Since
Oct 3 2014, 5:57 AM (518 w, 6 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Joe updated subscribers of T368654: Determine which API we should use to fetch Lexeme data from Wikidata when specified in the function-orchestrator.

If you want to use change dispatching from wikidata, which is a proven mechanism, then you'd probably be better off keeping the lexeme data within the wiki structure, and pass it to the orchestrator as a parameter of the function. That would allow you to re-parse the wiki page using the normal mechanism that's already established for wikis, and solve a lot of problems for you (including I think how to fetch the items, but I'd let @LucasWerkmeister comment on that).

Wed, Sep 11, 1:25 PM · Wikidata Dev Team, OKR-Work, Wikidata, Abstract Wikipedia team (25Q1 (Jul–Sep)), WikiLambda, function-orchestrator
Joe added a comment to T368654: Determine which API we should use to fetch Lexeme data from Wikidata when specified in the function-orchestrator.

In Wikidata:Data_access, it says:

The following URL formats are used by the user interface and by the query service updater, respectively, so if you use one of the same URL formats there’s a good chance you’ll get faster (cached) responses:

https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266 (JSON)

Does that mean we would have to specify a specific revision to get the benefit of caching?

Special pages are not cached at the edge, so there is no caching for that url, independently of indicating a revision or not:

$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate
$ curl -Is https://www.wikidata.org/wiki/Special:EntityData/Q42.json?revision=1600533266  | grep cache-control
cache-control: private, s-maxage=0, max-age=0, must-revalidate
Wed, Sep 11, 1:18 PM · Wikidata Dev Team, OKR-Work, Wikidata, Abstract Wikipedia team (25Q1 (Jul–Sep)), WikiLambda, function-orchestrator
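
The header check in the comment above can also be scripted. The following is a minimal sketch, assuming only the standard meaning of the Cache-Control directives; the function names are hypothetical, and the check is deliberately conservative (it ignores heuristic freshness and other headers such as Expires).

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into a {directive: value} mapping."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "private") map to None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_cache_may_store(header: str) -> bool:
    """Conservative check: may a shared cache (such as a CDN edge) reuse this response?"""
    d = parse_cache_control(header)
    if "private" in d or "no-store" in d:
        return False
    # For shared caches, s-maxage takes precedence over max-age.
    ttl = d.get("s-maxage") or d.get("max-age")
    return ttl is not None and ttl.isdigit() and int(ttl) > 0


# The header returned for Special:EntityData in the comment above.
entity_data = "private, s-maxage=0, max-age=0, must-revalidate"
print(shared_cache_may_store(entity_data))
```

With the header quoted above, both `private` and `s-maxage=0` rule out edge caching, with or without a `revision` parameter.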

Tue, Sep 10

Joe added a comment to T374394: TypeError: tests.values().toArray is not a function (Wikicurious_beat_en).

That's not possible; the whole point of the API query is to decide whether or not to show the banner. (On enwiki, the categories I'm filtering on for article (ns0) views are on the corresponding Talk (ns1) pages, so they're also not available in wgCategories.) I was worried more about performance for the API requests. I tested pretty thoroughly on a few devices, so I was less worried about JS errors, but of course WP has an enormous variety of clients.

CN banners work on a different level: one API request might be fast, but if you add it to, let's say, the enwiki CN, you will trigger a flood of several thousand requests per second, adding up to billions more requests to production in a day. That can easily bring down everything, especially since API requests don't get CDN cached.

If something is not possible natively in CN, it's better not to circumvent it by API calls.

Tue, Sep 10, 10:41 AM · Wikimedia-production-error, Wikimedia-CentralNotice-Administration

Mon, Sep 9

Joe reopened T374318: Wikifunctions is down as "In Progress".

For the record, the cause was a relatively aggressive crawler filling up all resources. While we've rate-limited this bot, I think we should use robots.txt to ban crawling from most pages.

Mon, Sep 9, 1:10 PM · Traffic, Abstract Wikipedia team

Thu, Sep 5

Joe added a comment to T338761: Bouncehandler is broken.

I'm finally seeing bounces get processed in logstash https://logstash.wikimedia.org/goto/3d34190bb82088f19669b0c66331d7c9

Thu, Sep 5, 9:45 AM · MW-1.43-notes (1.43.0-wmf.23; 2024-09-17), SRE, MediaWiki-Engineering, Wikimedia-Hackathon-2024, Observability-Metrics, Grafana, MediaWiki-extensions-BounceHandler
Joe added a comment to T338761: Bouncehandler is broken.

I still see the errors in the logs, and it's baffling. In fact, I've tried the command now listed in exim's configuration:

Thu, Sep 5, 7:52 AM · MW-1.43-notes (1.43.0-wmf.23; 2024-09-17), SRE, MediaWiki-Engineering, Wikimedia-Hackathon-2024, Observability-Metrics, Grafana, MediaWiki-extensions-BounceHandler
Joe added a comment to T338761: Bouncehandler is broken.

The check is done using WebRequest::getIP(), which uses REMOTE_ADDR as the source for the address and then overrides it using X-Forwarded-For only for trusted proxies.

Thu, Sep 5, 6:16 AM · MW-1.43-notes (1.43.0-wmf.23; 2024-09-17), SRE, MediaWiki-Engineering, Wikimedia-Hackathon-2024, Observability-Metrics, Grafana, MediaWiki-extensions-BounceHandler
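
The resolution order described in the comment above (start from REMOTE_ADDR, honour X-Forwarded-For only when the sender is a trusted proxy) can be sketched as follows. This is an illustration of the general technique, not MediaWiki's actual implementation; the function name, the parameters and the example addresses are all hypothetical.

```python
import ipaddress
from typing import Iterable, Optional


def resolve_client_ip(remote_addr: str,
                      xff_header: Optional[str],
                      trusted_proxies: Iterable[str]) -> str:
    """Return the client IP, trusting X-Forwarded-For only behind trusted proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            # Anything that is not a valid IP is never treated as a proxy.
            return False
        return any(addr in net for net in trusted)

    client = remote_addr
    if not xff_header:
        return client

    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    # The rightmost hop was appended by the proxy closest to us, so walk backwards
    # and stop as soon as the current address is not a trusted proxy.
    for hop in reversed(hops):
        if not is_trusted(client):
            break
        client = hop
    return client


# Request relayed by two internal proxies on behalf of an external client.
print(resolve_client_ip("10.0.0.1", "203.0.113.7, 10.0.0.2", ["10.0.0.0/8"]))
# Direct request from an untrusted address that spoofs the header.
print(resolve_client_ip("198.51.100.9", "1.2.3.4", ["10.0.0.0/8"]))
```

In the second call the header is ignored, because the connecting address is not in the trusted list.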

Fri, Aug 30

Joe triaged T371782: Create simple web view of requestctl status as Medium priority.
Fri, Aug 30, 1:35 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T371633: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker.

Let me add some perspective, as I've heard people are complaining about this.

Fri, Aug 30, 1:21 PM · User-ItamarWMDE, Wikidata Dev Team, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Shellbox, serviceops, Wikibase-Quality-Constraints, Deployments
Restricted Application added a project to T371633: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker: User-ItamarWMDE.
Fri, Aug 30, 1:16 PM · User-ItamarWMDE, Wikidata Dev Team, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Shellbox, serviceops, Wikibase-Quality-Constraints, Deployments

Wed, Aug 28

Joe awarded T373526: Migrate the ownership of Docker images in production-images repo to mailing lists a Like token.
Wed, Aug 28, 2:24 PM · User-Elukey, Data-Platform-SRE, Machine-Learning-Team, serviceops, Infrastructure-Foundations

Tue, Aug 27

Joe changed the status of T373449: Extract an api class for requestctl from Open to In Progress.
Tue, Aug 27, 3:06 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe changed the status of T373449: Extract an api class for requestctl, a subtask of T371782: Create simple web view of requestctl status, from Open to In Progress.
Tue, Aug 27, 3:06 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe updated the task description for T373449: Extract an api class for requestctl.
Tue, Aug 27, 3:02 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe created T373449: Extract an api class for requestctl.
Tue, Aug 27, 3:01 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe changed the status of T371782: Create simple web view of requestctl status from Open to In Progress.
Tue, Aug 27, 2:59 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe changed the status of T371782: Create simple web view of requestctl status, a subtask of T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks., from Open to In Progress.
Tue, Aug 27, 2:58 PM · Epic, User-CDanis, User-Joe, conftool, Traffic

Mon, Aug 26

Joe added a comment to T292322: Support large files in Shellbox.

For the record, the reason we wanted to support large file uploads was not to worsen the performance of upload-by-url, which has since been fixed by making the process asynchronous. Better performance handling large files would be welcome, though.

Mon, Aug 26, 5:29 AM · Patch-For-Review, MW-1.38-notes (1.38.0-wmf.21; 2022-02-07), SRE-swift-storage, Shellbox, serviceops, MW-on-K8s

Aug 8 2024

Joe closed T322523: Check confd last index in a mw-on-k8s world as Declined.

I don't think we really need this, because I can't remember one episode in which this check has actually fired and it wasn't expected/a false positive.

Aug 8 2024, 8:48 AM · Observability-Metrics, MW-on-K8s, User-fgiunchedi
Joe closed T322523: Check confd last index in a mw-on-k8s world, a subtask of T314118: Reduce IRC flood/spam during incidents, as Declined.
Aug 8 2024, 8:48 AM · Observability-Alerting, serviceops-radar, User-fgiunchedi, SRE
Joe added a comment to T371885: Gaps in Grafana graphs using Thanos.

Now prometheus only reports scraping the correct ports https://prometheus-eqiad.wikimedia.org/k8s/targets?scrapePool=k8s-pods-metrics&search=statsd-exporter

Aug 8 2024, 6:57 AM · SRE Observability (FY2024/2025-Q1), serviceops, MW-on-K8s, Grafana, Observability-Metrics
Joe added a comment to T371885: Gaps in Grafana graphs using Thanos.

Edit: Whoops, I completely missed T371885#10048618 onward before posting this. In any case, question still stands re: the annotation :)

For the endpoints marked down: it looks as if prometheus is scraping both container ports - i.e., 9102 (correct) and 9125 (statsd listen port, incorrect).

Not sure if that could somehow cause problems like those described in the task description, but it would at least explain T371885#10048571.

I wonder if we need to add an explicit prometheus.io/port annotation to ensure only 9102 is scraped?

Aug 8 2024, 6:13 AM · SRE Observability (FY2024/2025-Q1), serviceops, MW-on-K8s, Grafana, Observability-Metrics

Aug 6 2024

Joe closed T369606: Allow integrating requestctl rules into haproxy as Resolved.
Aug 6 2024, 9:41 AM · User-CDanis, User-Joe, conftool, Traffic
Joe closed T369606: Allow integrating requestctl rules into haproxy, a subtask of T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks., as Resolved.
Aug 6 2024, 9:41 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions.

I think we need to be able to pass to the release script a set of base images for mediawiki and web, and build the final images for each of those pairs.

Aug 6 2024, 5:39 AM · Kubernetes, Deployments, Release-Engineering-Team (Priority Backlog 📥)

Aug 5 2024

Joe added a comment to T368064: Some mw-api-int traffic is going cross-DC.

I should add, wgLocalHTTPProxy and all those mechanisms are hacks, and if we want to do such things the right way, we should have a proper configuration table for read-only and read-write paths for different domains, and then instruct the application writers to use one or the other. I'd avoid replicating the hack we have at the traffic layer, because say one request gets erroneously sent to the wrong datacenter - now instead of paying 30 ms in total of penalty, we'll pay 30 ms per query to the mysql master.

Aug 5 2024, 1:16 PM · MediaWiki-Platform-Team (Radar), serviceops
Joe added a comment to T368064: Some mw-api-int traffic is going cross-DC.

There's quite a bit of incorrect information in the wall of text above, but to actually keep it short:
We went with just pointing to the read-write api because there's no system, within MediaWiki, to split requests between write and read and we didn't want to add ad-hoc logic to the service mesh just for that.

Aug 5 2024, 1:16 PM · MediaWiki-Platform-Team (Radar), serviceops
Joe claimed T371783: Create an audit log for conftool.
Aug 5 2024, 5:51 AM · Epic, User-CDanis, User-Joe, conftool
Joe created T371783: Create an audit log for conftool.
Aug 5 2024, 5:50 AM · Epic, User-CDanis, User-Joe, conftool
Joe created T371782: Create simple web view of requestctl status.
Aug 5 2024, 5:47 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe triaged T370745: Integrate requestctl haproxy rules into our TLS terminator as High priority.
Aug 5 2024, 5:42 AM · Patch-For-Review, User-CDanis, User-Joe, conftool, Traffic

Jul 30 2024

Joe added a comment to T371144: support the haproxy silent-drop hysteresis gadget from requestctl.

Thanks for the thorough explanation! I know the traffic folks were a bit worried about controlling stick tables from requestctl but I think this format is ok.

Jul 30 2024, 8:21 AM · User-CDanis, User-Joe, conftool, Traffic

Jul 29 2024

Joe added a comment to T362331: SessionBackend: remove dependency on Kask/Cassandra.

While I do get why this solution seems attractive, I don't think it's really viable at any level.

Jul 29 2024, 3:24 PM · MediaWiki-Platform-Team
Joe added a comment to T371144: support the haproxy silent-drop hysteresis gadget from requestctl.

To remove even more doubt, can you make an example of what you'd expect in haproxy dsl with the following rule:

Jul 29 2024, 10:46 AM · User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T371144: support the haproxy silent-drop hysteresis gadget from requestctl.

Before I think of implementations, I'd like to understand better what you want to offer:

Jul 29 2024, 10:33 AM · User-CDanis, User-Joe, conftool, Traffic
Joe updated the task description for T369606: Allow integrating requestctl rules into haproxy.
Jul 29 2024, 10:17 AM · User-CDanis, User-Joe, conftool, Traffic

Jul 25 2024

Joe created P66931 VogonScript.
Jul 25 2024, 4:02 PM

Jul 24 2024

Joe added a comment to T215217: deployment-prep (beta cluster): Code stewardship request.

I have been thinking about this general problem as well this quarter. I am interested in putting a Kubernetes cluster into deployment-prep, possibly by using Magnum and OpenTofu to provision the service. In theory we could then get folks to add config for this new environment to their Helm charts for MediaWiki and other things that are hosted via Kubernetes in the production realm.

Jul 24 2024, 4:20 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure, Code-Stewardship-Reviews
Joe added a comment to T370808: Consider registering citoid as a verified or friendly bot with Cloudflare.

@joanna_borun I think this should probably be discussed in the next SRE I/F meeting for approval.

I can confirm that with our account we can access the form linked above. If approved I'll paste here all the fields required to submit a request to register the bot.

Jul 24 2024, 10:26 AM · Infrastructure-Foundations, Citoid, Editing-team
Joe added a comment to T215217: deployment-prep (beta cluster): Code stewardship request.

@bd808 to add to your point: We're near another inflection point for deployment-prep: soon the puppet code for configuring mediawiki in deployment-prep is going to be retired and not used in the production environment anymore. By the end of the calendar year, we count on having moved all of production (hopefully, all on kubernetes) to using php 8.x.

Jul 24 2024, 5:33 AM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure, Code-Stewardship-Reviews

Jul 23 2024

Joe created T370745: Integrate requestctl haproxy rules into our TLS terminator.
Jul 23 2024, 7:28 AM · Patch-For-Review, User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T369606: Allow integrating requestctl rules into haproxy.

As @Fabfur points out, in haproxy 3.0+ (but not haproxy 2.8.x) we have the option of evaluating many ACLs together with negation, as part of fetching samples.

Jul 23 2024, 7:20 AM · User-CDanis, User-Joe, conftool, Traffic
Joe updated the task description for T369606: Allow integrating requestctl rules into haproxy.
Jul 23 2024, 6:05 AM · User-CDanis, User-Joe, conftool, Traffic

Jul 22 2024

Joe added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Sorry to ask this very basic question, but I found a bunch of others didn't know: how exactly is Dumps blocking the php 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

Jul 22 2024, 3:00 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe added a comment to T370425: Misbehaving mw-api-ext pods serving 5xx.

The SIGILL thing happened on bare metal as well, albeit quite rarely. We never properly tracked down what happened, but it seemed to have some relation to accessing the shared anonymous memory and the related semaphores, so I guess one of apcu and opcache is responsible. I'm starting to think we might need a liveness probe of some kind for the pod to depend on that?

Jul 22 2024, 2:11 PM · Wikimedia-production-error, serviceops
Joe changed the status of T369606: Allow integrating requestctl rules into haproxy, a subtask of T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks., from Open to In Progress.
Jul 22 2024, 1:30 PM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe changed the status of T369606: Allow integrating requestctl rules into haproxy from Open to In Progress.
Jul 22 2024, 1:30 PM · User-CDanis, User-Joe, conftool, Traffic

Jul 17 2024

Joe updated subscribers of T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Some of the standard filters on the logstash dashboard kept me from seeing all entries; this is the right link: https://logstash.wikimedia.org/goto/77120810b3dbe1cfe36bcf3478ebeaf9 where we can also find the request IDs originally reported in the bug. So the blast radius was quite ample; although I still think this might have been caused by the job running, I'm not 100% sure anymore.

Jul 17 2024, 5:26 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
Joe triaged T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis as Medium priority.

This looks like a genuine overload issue and seems to be caused by MediaModerationScanFileJob. So I'd say the protection worked as intended; I am wary of having circuit breaking on shared-by-everything databases, but here it looks like the only web view that failed also did so due to MediaModeration failing.

Jul 17 2024, 5:20 PM · MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, User-notice, Wikimedia-Incident, DBA, Wikimedia-production-error
Jelto awarded T317794: requestctl can't act on cache hits a Like token.
Jul 17 2024, 2:15 PM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe closed T317794: requestctl can't act on cache hits as Resolved.
Jul 17 2024, 1:58 PM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe closed T317794: requestctl can't act on cache hits, a subtask of T305580: requestctl v1 improvements, as Resolved.
Jul 17 2024, 1:56 PM · SRE, conftool
Joe closed T317794: requestctl can't act on cache hits, a subtask of T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks., as Resolved.
Jul 17 2024, 1:56 PM · Epic, User-CDanis, User-Joe, conftool, Traffic

Jul 16 2024

Joe created P66639 daily moment of zen.
Jul 16 2024, 3:14 PM
Joe added a comment to T369606: Allow integrating requestctl rules into haproxy.

There's an interesting problem to manage with haproxy, which is making me think we should support a much simplified syntax.

Jul 16 2024, 8:28 AM · User-CDanis, User-Joe, conftool, Traffic

Jul 15 2024

kamila awarded T317794: requestctl can't act on cache hits a Love token.
Jul 15 2024, 12:25 PM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe updated subscribers of T369606: Allow integrating requestctl rules into haproxy.

As @Fabfur pointed out to me, conditions can also be expressed inline:

Jul 15 2024, 10:57 AM · User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T369606: Allow integrating requestctl rules into haproxy.

Haproxy has a logic that's very different from varnish, but it should be possible to translate most of our current patterns or ipblocks to something haproxy can read.

Jul 15 2024, 9:58 AM · User-CDanis, User-Joe, conftool, Traffic
Joe claimed T317794: requestctl can't act on cache hits.
Jul 15 2024, 9:41 AM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe added a comment to T317794: requestctl can't act on cache hits.

To clarify a bit, I didn't take the route described in the task. In fact, we want:

Jul 15 2024, 9:38 AM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe closed T369594: Move conftool to gitlab, turn on deb package auto-generation, a subtask of T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks., as Resolved.
Jul 15 2024, 5:56 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe closed T369594: Move conftool to gitlab, turn on deb package auto-generation as Resolved.
Jul 15 2024, 5:56 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool

Jul 12 2024

Joe added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

A couple of notes:

Jul 12 2024, 2:04 PM · SRE, Traffic
Joe claimed T369594: Move conftool to gitlab, turn on deb package auto-generation.
Jul 12 2024, 12:17 PM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool

Jul 11 2024

Joe added a comment to T317794: requestctl can't act on cache hits.

While implementing the dry-run version of the current hotlinking patch, I found out that it's not possible to update the resp object in the vcl_hit hook. As such, it's not possible to update the X-Analytics header with the current suggested approach. This impacts two aspects of the design

Jul 11 2024, 3:57 PM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe triaged T369606: Allow integrating requestctl rules into haproxy as High priority.

Changing the priority to high as we had yet other instances where being able to limit bandwidth on uploads would've helped us.

Jul 11 2024, 9:09 AM · User-CDanis, User-Joe, conftool, Traffic
Joe added a comment to T369594: Move conftool to gitlab, turn on deb package auto-generation.

Yesterday we merged https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/3 which triggered the job building the debian packages for bullseye: https://gitlab.wikimedia.org/repos/sre/conftool/-/jobs/308138

Jul 11 2024, 9:07 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool
Joe updated the task description for T369594: Move conftool to gitlab, turn on deb package auto-generation.
Jul 11 2024, 9:05 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool
Joe triaged T369594: Move conftool to gitlab, turn on deb package auto-generation as Medium priority.
Jul 11 2024, 7:40 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool

Jul 10 2024

Joe updated the task description for T369594: Move conftool to gitlab, turn on deb package auto-generation.
Jul 10 2024, 2:51 PM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool

Jul 9 2024

Joe added a comment to T369594: Move conftool to gitlab, turn on deb package auto-generation.

I don't see the two as conflicting options, although in the case of conftool, I'd prefer to keep distributing it in production as deb packages, which makes a lot of the 'required' stages in python-release kind of not useful.

Jul 9 2024, 9:59 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool
Joe closed T341115: Rationalize and update the use of base images in our docker-pkg repositories as Resolved.
Jul 9 2024, 9:42 AM · serviceops, docker-pkg
Joe created T369606: Allow integrating requestctl rules into haproxy.
Jul 9 2024, 9:30 AM · User-CDanis, User-Joe, conftool, Traffic
Joe added a subtask for T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks.: T310009: Make it easier to create a new requestctl object.
Jul 9 2024, 8:06 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe added a parent task for T310009: Make it easier to create a new requestctl object: T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks..
Jul 9 2024, 8:06 AM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
Joe closed T342577: Data Quality - requestctl not getting set, a subtask of T351117: Move analytics log from Varnish to HAProxy, as Resolved.
Jul 9 2024, 7:58 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
Joe closed T342577: Data Quality - requestctl not getting set as Resolved.

I'll boldly consider this task resolved, please reopen it if the problem is still present.

Jul 9 2024, 7:58 AM · Data Products, SRE, Traffic
Joe added a comment to T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate.

I think it is a waste of time and resources to make such an endpoint part of the rest api, including the fact that it's going to get published and other clients might start using it. But more importantly, it's yet another specialized endpoint that I can't just see clearly in the same place where the others are. I've learned the hard way when moving to k8s that php files under docroots are often sour surprises. Even as part of the rest api, this will at a minimum need to be restricted to just mediawiki.org (how?) and monitored somehow.
It also introduces yet another set of rewrite rules for stuff under rest.php, which we managed to avoid up to now.

Jul 9 2024, 7:48 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review, MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), MediaWiki-Platform-Team (Radar), Event-Platform, MediaWiki-General
Joe created T369594: Move conftool to gitlab, turn on deb package auto-generation.
Jul 9 2024, 7:39 AM · Patch-For-Review, serviceops, User-CDanis, User-Joe, conftool

Jul 8 2024

Joe added a parent task for T317794: requestctl can't act on cache hits: T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks..
Jul 8 2024, 6:28 AM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool
Joe added a subtask for T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks.: T317794: requestctl can't act on cache hits.
Jul 8 2024, 6:27 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe added projects to T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks.: Traffic, conftool, User-Joe, User-CDanis.
Jul 8 2024, 6:27 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe created T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks..
Jul 8 2024, 6:12 AM · Epic, User-CDanis, User-Joe, conftool, Traffic
Joe raised the priority of T352650: Migrate current-generation dumps to run from our containerized images from Medium to High.

The priority of this task has become high, as doing this is currently a blocker to finishing the k8s migration of mediawiki and thus moving to php 8.x, which is desirable for various reasons and is a request from a lot of teams.

Jul 8 2024, 5:42 AM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe added a comment to T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate.

Personal opinion, we should never ever turn off DirectorySlash. It's a perfect landmine waiting to be stepped onto, apart from the other risks.

Jul 8 2024, 5:32 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review, MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), MediaWiki-Platform-Team (Radar), Event-Platform, MediaWiki-General

Jul 7 2024

Joe added a comment to T352113: Core addWiki.php.

From an SRE perspective, I'd expect us to build some automation around what addWiki.php does and probably move some of its functions outside of the realm of MediaWiki for specific things we might not want MediaWiki to be able to do (for instance: create databases and/or elasticsearch indexes).

Jul 7 2024, 6:53 AM · LPL Technical Support, MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), MediaWiki-extensions-WikimediaMaintenance

Jul 2 2024

Joe added a comment to T369080: statsd-exporter in k8s is not configured to use its mapping configuration.

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051428 and followups should fix the issue

Jul 2 2024, 6:15 PM · SRE, Observability-Metrics
Joe created T369048: Create maintenance script to execute jobs provided in json format from standard input.
Jul 2 2024, 1:39 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe closed T363342: glogger crashes regularly in mw-on-k8s containers as Resolved.

This should be solved with this morning's release.

Jul 2 2024, 7:05 AM · serviceops, MW-on-K8s
Joe closed T368640: glogger produces invalid JSON when given input with non-printable characters as Resolved.

This should be solved with the deployment in production.

Jul 2 2024, 7:04 AM · Observability-Logging, serviceops, MW-on-K8s
Joe closed T223413: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes) as Resolved.

Whenever you find a new occurrence of a bug months after the original one has been resolved, please open a new task. The origin of your problem is most probably very different from the one we solved here.

Jul 2 2024, 5:42 AM · serviceops, MW-on-K8s, Growth-Team-Filtering, Growth-Team, Notifications

Jun 28 2024

Joe claimed T368640: glogger produces invalid JSON when given input with non-printable characters.
Jun 28 2024, 9:41 AM · Observability-Logging, serviceops, MW-on-K8s

Jun 27 2024

Joe added a comment to T368544: IPIP encapsulation considerations for low-traffic services.

Theoretically speaking, we could keep low-traffic on liberica/IPVS (instead of liberica/Katran) to be able to get rid of pybal entirely. Besides k8s-based services, we have some other services on low-traffic that need load balancing AFAIK.

But even if we stay on Liberica/IPVS for low-traffic, having the ability of running healthchecks properly, using the same network path as incoming requests would be a nice benefit for low-traffic services.

Jun 27 2024, 3:19 PM · Infrastructure-Foundations, serviceops, netops, Traffic
Joe closed T367269: Download of Azure cloud ranges for requestctl is broken as Resolved.

Uh this task was solved on that day, not sure why I forgot to close it. Sorry Andre if this messes with your UBN stats :)

Jun 27 2024, 3:11 PM · SRE
Joe added a comment to T368544: IPIP encapsulation considerations for low-traffic services.

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Is there a compelling reason why it's better than our current solution?

I believe we wanna move away from all the VXLAN setup and stop requiring L2 connectivity between load balancers and realservers.

Jun 27 2024, 2:46 PM · Infrastructure-Foundations, serviceops, netops, Traffic
Joe added a comment to T368545: weighted maglev viability for low-traffic services.

It is pretty clear to me that the only way to have fair load balancing with maglev is if we do the consistent hashing using the remote port as well.

Jun 27 2024, 2:03 PM · Infrastructure-Foundations, netops, serviceops, Traffic
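
To illustrate why the hash key matters in the comment above, here is a small sketch. It uses plain modulo hashing rather than Maglev's lookup-table construction, and the backend names, client address and port range are made up; the assumption is that a low-traffic service sees most of its requests from very few client IPs.

```python
import hashlib
from typing import List


def pick_backend(key: str, backends: List[str]) -> str:
    """Map a flow key to a backend with a stable hash (not Maglev's actual table)."""
    digest = hashlib.sha256(key.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


BACKENDS = ["rs1", "rs2", "rs3", "rs4"]   # hypothetical realservers
CLIENT_IP = "10.64.0.10"                  # hypothetical single busy client
PORTS = range(32768, 33768)               # 1000 ephemeral source ports

# Hashing on the source IP alone: every connection lands on the same backend.
by_ip = {pick_backend(CLIENT_IP, BACKENDS) for _ in PORTS}

# Hashing on source IP and source port: connections spread across backends.
by_ip_port = {pick_backend(f"{CLIENT_IP}:{port}", BACKENDS) for port in PORTS}

print(len(by_ip), len(by_ip_port))
```

With a single busy client, the IP-only key pins all connections to one realserver, while adding the remote port lets individual connections spread out.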
Joe added a comment to T368544: IPIP encapsulation considerations for low-traffic services.

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Jun 27 2024, 2:00 PM · Infrastructure-Foundations, serviceops, netops, Traffic
kamila awarded T357309: Create a deployment for `shellbox-timedmedia` a Love token.
Jun 27 2024, 9:59 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Jun 26 2024

Joe added a comment to T364400: map the /api/ prefix to /w/rest.php.

Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that when we need to cache we only need to store each effective URL once?

It would be preferable to not do so. The caching gains would be minimal, but more importantly: we hope to minimize the details of the application layer that are spread into the cache configuration (there will always be necessary cases, but the more we avoid it, the easier things are in the future).

Jun 26 2024, 1:42 PM · serviceops, Traffic, MW-Interfaces-Team
Joe added a comment to T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec).

The degradation seems to have started around midnight between June 18th and June 19th.

Jun 26 2024, 8:36 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Growth-Team (FY2024-25 Q1 Sprint 1), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Data Products, User-Michael, Data-Platform, Performance Issue, GrowthExperiments-Homepage