Maniphest T368274

[W.E.1.2.4] Detecting Peacock behavior with LLMs
Open, High, Public

Description

Hypothesis: "If we train an LLM on detecting peacock behavior, then we can learn if it can detect this policy violation with at least >70% precision and >50% recall and ultimately, decide if said LLM is effective enough to power a new Edit Check and/or Suggested Edit."

Event Timeline

leila renamed this task from Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacook behavior with LLMs. Jul 2 2024, 10:00 PM
leila triaged this task as High priority.

Based on our previous research, we have created a dataset containing 9276 articles affected by peacock and other related policy violations on English Wikipedia. For each of them we have a negative (no policy violations) and a positive example, distributed across templates as follows (see the loading sketch after the list):

  • autobiography: 1472
  • fanpov: 350
  • peacock: 2587
  • weasel: 805
  • advert: 4062
  • Total: 9276
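
A hypothetical sketch of how such a dataset could be loaded and labeled for training; the file name and column names are assumptions, not the actual dataset layout:

```python
import pandas as pd

# Hypothetical file and columns ("text", "template"); negatives carry "none".
df = pd.read_csv("peacock_dataset.csv")

# Positive examples carry one of the maintenance-template labels listed above.
TEMPLATES = {"autobiography", "fanpov", "peacock", "weasel", "advert"}
df["label"] = df["template"].isin(TEMPLATES).astype(int)
print(df["label"].value_counts())
```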

Also, while reviewing the latest literature in the field, we found a recent study on detecting violations of the Neutral Point of View (NPOV) policy on Wikipedia. Researchers from the University of Michigan tested ChatGPT 3.5, Mistral-Medium, and GPT-4 for detecting NPOV, finding poor performance. Testing different prompt-engineering strategies, they were only able to reach 64% accuracy.

  • These results show the limitations of LLMs for detecting Wikipedia policy violations.
  • Nonetheless, it is important to highlight that our focus is on a (potentially) simpler policy, peacock behavior.
  • Notice that the experiments in this paper were done using prompt engineering, while in our case we will explore fine-tuning an LLM.
diego renamed this task from [W.E.1.2.4] Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacock behavior with LLMs. Jul 12 2024, 3:53 PM
  • Studied how to create prompts for Gemma2, noting the importance of using the special tokens and format (see the sketch after this list).
  • Designed a zero-shot experiment for detecting peacock behavior.
  • Wrote code for testing the Gemma2 instance hosted by the ML-team.
    • The instance took more than 5 seconds per query.
    • After a few requests (around 200) the instance stopped responding.
    • I've reported this issue to the ML-Team; my understanding is they will be working on fixing this during the next week (cc: Chris Albon)
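
For reference, a minimal sketch of the Gemma 2 chat format used in the zero-shot experiment; the instruction wording is illustrative (the <bos> token is normally prepended by the tokenizer):

```python
def build_gemma2_prompt(article_text: str) -> str:
    """Wrap a zero-shot instruction in Gemma 2's chat turn tokens."""
    instruction = (
        "Does the following Wikipedia text contain peacock behavior "
        "(promotional, subjective praise)? Answer YES or NO.\n\n"
        + article_text
    )
    return (
        "<start_of_turn>user\n"
        + instruction
        + "<end_of_turn>\n<start_of_turn>model\n"
    )

print(build_gemma2_prompt("She is widely regarded as a visionary leader."))
```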

Progress update

  • I've been coordinating with the ML-team, sharing code examples that make their (experimental) infrastructure fail. They will use this code as part of their use-case studies when testing the new LLM infrastructure.
  • In the meantime I've been working on writing code to fine-tune smaller Language Models; this requires (see the sketch after this list):
    • Data preprocessing and cleaning (done)
    • Experimental design (done)
    • Running experiments on the stat machine (in progress)
  • Met with the KR owner (Peter Pelberg) and explained the progress and next steps for this hypothesis.
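
An illustrative sketch of the preprocessing and split steps; the cleaning rules, file name, and split size are assumptions, not the exact pipeline used:

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def clean_wikitext(text: str) -> str:
    """Rough cleanup: drop templates, unwrap wiki links, squeeze whitespace."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                 # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[links]]
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("peacock_dataset.csv")  # hypothetical file, as above
df["text"] = df["text"].map(clean_wikitext)
train_df, val_df = train_test_split(
    df, test_size=0.05, stratify=df["label"], random_state=42)
```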

Any new metrics related to the hypothesis

  • No

Any emerging blockers or risks

  • There is some "congestion" with the GPUs on the stat machines (many users for few GPUs). This requires waiting until GPUs are free. Access to new GPUs would help us move faster on this front.

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

  • No

Have there been any new lessons from the hypothesis?

  • Not this week

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • Fine-tuned model:
    • I've tested the fine-tuning approach, creating a classifier based on a smaller Language Model. I used BERT because we already have other products hosted on Liftwing based on this model, and it has proven fast enough and scales well on our existing infrastructure (a minimal fine-tuning sketch follows this list).
    • I've run several experiments, testing different datasets and model configurations. The best results, expressed as precision and recall on a balanced dataset (the same number of articles with and without peacock behavior), were:
      • Precision: 0.67
      • Recall: 0.15
    • These results are below our target, but should be considered a baseline to compare the LLM experiments against.
    • Depending on the results of those experiments, we should consider trying to improve the fine-tuning approach: these numbers show that the model is learning (finding a signal in the data), and with some tweaks we could probably improve its performance significantly.
  • The ML-team is experimenting with a new LLM, called AYA23. I've done a quick test, and the service seems fast and robust enough to run experiments on it.
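
A minimal sketch of fine-tuning BERT as a binary peacock classifier with Hugging Face transformers; the hyperparameters and dataset handling are illustrative, not the exact configuration used in these experiments:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: peacock / no violation

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical CSV with "text" and "label" columns, as in the earlier sketches.
ds = load_dataset("csv", data_files="peacock_dataset.csv")["train"]
ds = ds.train_test_split(test_size=0.05, seed=42).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peacock-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```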

Any new metrics related to the hypothesis

  • Fine-tuned (BERT) model performance on detecting peacock behavior:
    • Training data: 9213 cases (unbalanced)
    • Validation data: 522 cases (balanced)
    • Precision: 0.67
    • Recall: 0.15

Any emerging blockers or risks

No

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

No

Have there been any new lessons from the hypothesis?

No

Have there been any changes to the hypothesis scope or timeline?

No

Next steps

Run zero-shot experiments using the AYA23 LLM hosted by the ML-team.
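
A sketch of what such a zero-shot query could look like; the endpoint URL and payload schema are placeholders, not the actual hosted API:

```python
import requests

ENDPOINT = "https://example.internal/aya23/v1/generate"  # hypothetical URL

def zero_shot_peacock(text: str) -> str:
    """Send one zero-shot classification prompt to the hosted LLM."""
    prompt = ("Does the following Wikipedia text contain peacock behavior "
              "(promotional, subjective praise)? Answer YES or NO.\n\n" + text)
    resp = requests.post(ENDPOINT, json={"prompt": prompt}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("generated_text", "")

print(zero_shot_peacock("He is the greatest innovator of his generation."))
```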

Progress update

  • I've been working on the few-shot approach without good results. I've tried a set of prompts, changing the format, number, and distribution of examples, but the LLM used (aya23) is not processing these examples correctly and is overfitting to one class (a prompt-building sketch follows this list).
  • In parallel I've been working on improving the fine-tuning approach by refining the hyperparameters. Currently I'm reaching 0.69 precision and 0.23 recall.
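
An illustrative few-shot prompt builder of the kind tried in these experiments; the examples themselves are invented for illustration:

```python
# Labeled in-context examples; the format, count, and class balance of these
# were exactly the variables explored in the experiments above.
EXAMPLES = [
    ("She is a legendary, world-renowned expert.", "YES"),
    ("The company was founded in 1998 in Oslo.", "NO"),
    ("His groundbreaking genius transformed the field.", "YES"),
    ("The river is 120 km long.", "NO"),
]

def build_few_shot_prompt(text: str) -> str:
    """Prepend labeled examples, then ask the model to label the new text."""
    lines = ["Classify each text as YES (peacock behavior) or NO.\n"]
    for example, label in EXAMPLES:
        lines.append(f"Text: {example}\nAnswer: {label}\n")
    lines.append(f"Text: {text}\nAnswer:")
    return "\n".join(lines)
```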

Have there been any new lessons from the hypothesis?

  • There are no well-established procedures for creating a successful few-shot prompt. After reviewing the literature, reviewing examples, and trying several prompts, this approach does not look like a good solution for detecting peacock behavior.

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • This week I worked on improving the few-shot and fine-tuning experiments. Unfortunately, the few-shot approach didn't show relevant improvements, so I decided to discard it.
  • It is important to say that few-shot learning is a new technique, still under development, and there might be several reasons why it didn't work for this task. Anyhow, it might be worth exploring again in the future, when clearer procedures are established.
  • On the other hand, after some tweaks, the fine-tuned BERT model improved significantly, reaching 0.72 precision and 0.4 recall on balanced data.

Have there been any new lessons from the hypothesis?

  • The current results suggest that if we aim to detect at least 40% of the cases of peacock behavior, the model would fail in 28% of its assessments. This is below, but not far from, our target. I think there is a product decision to be made: whether we want to focus on precision (avoiding wrong classifications that could disturb editors' workflows) or on recall (trying to detect all the cases of peacock behavior, even if that implies showing more false positives); see the threshold sweep sketched after this list.
  • So far, I've been focusing on the model's precision, without considering other factors such as serving time (how long it takes to get an answer from the model). This will depend on the resources we are going to have, and also on the length of the revision we are processing. If we decide to proceed with this project, I think we should have that conversation, to see what a reasonable processing time is and whether it is achievable with our current resources.
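
A sketch of that precision/recall trade-off: sweeping the decision threshold on the classifier's positive-class probability (variable names and data are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(y_true, y_prob):
    """Print precision/recall for a range of decision thresholds.

    Lower thresholds favor recall (more positives flagged, more false
    positives); higher thresholds favor precision.
    """
    for t in np.arange(0.1, 0.95, 0.05):
        y_pred = (np.asarray(y_prob) >= t).astype(int)
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred, zero_division=0)
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```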

Have there been any changes to the hypothesis scope or timeline?

  • No