Maniphest T368274

[W.E.1.2.4] Detecting Peacock behavior with LLMs
Open, High, Public

Description

Hypothesis: "If we train an LLM on detecting peacock behavior, then we can learn if it can detect this policy violation with at least >70% precision and >50% recall and ultimately, decide if said LLM is effective enough to power a new Edit Check and/or Suggested Edit."

Event Timeline

leila renamed this task from Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacook behavior with LLMs. Jul 2 2024, 10:00 PM
leila triaged this task as High priority.

Based on our previous research, we have created a dataset containing 9276 articles affected by peacock and other related policy violations on English Wikipedia. For each of them we have a negative (no policy violations) and a positive example, distributed across templates as follows (see the loading sketch after the list):

  • autobiography: 1472
  • fanpov: 350
  • peacock: 2587
  • weasel: 805
  • advert: 4062
  • Total: 9276
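
A hypothetical sketch of how such a dataset could be loaded and labeled for training; the file name and column names are assumptions, not the actual dataset layout:

```python
import pandas as pd

# Hypothetical file and columns ("text", "template"); negatives carry "none".
df = pd.read_csv("peacock_dataset.csv")

# Positive examples carry one of the maintenance-template labels listed above.
TEMPLATES = {"autobiography", "fanpov", "peacock", "weasel", "advert"}
df["label"] = df["template"].isin(TEMPLATES).astype(int)
print(df["label"].value_counts())
```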

Also, while reviewing the latest literature in the field, we found a recent study on detecting violations of the Neutral Point of View (NPOV) policy on Wikipedia. Researchers from the University of Michigan tested ChatGPT 3.5, Mistral-Medium, and GPT-4 for detecting NPOV, finding poor performance. Testing different prompt-engineering strategies, they were only able to reach 64% accuracy.

  • These results show the limitations of LLMs for detecting Wikipedia policy violations.
  • Nonetheless, it is important to highlight that our focus is on a (potentially) simpler policy, peacock behavior.
  • Notice that the experiments in this paper were done using prompt engineering, while in our case we will explore fine-tuning an LLM.
diego renamed this task from [W.E.1.2.4] Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacock behavior with LLMs. Jul 12 2024, 3:53 PM
  • Studied how to create prompts for Gemma2, noting the importance of using the special tokens and format (see the sketch after this list).
  • Designed a zero-shot experiment for detecting peacock behavior.
  • Wrote code for testing the Gemma2 instance hosted by the ML-team.
    • The instance took more than 5 seconds per query.
    • After a few requests (around 200) the instance stopped responding.
    • I've reported this issue to the ML-Team; my understanding is they will be working on fixing this during the next week (cc: Chris Albon)
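
For reference, a minimal sketch of the Gemma 2 chat format used in the zero-shot experiment; the instruction wording is illustrative (the <bos> token is normally prepended by the tokenizer):

```python
def build_gemma2_prompt(article_text: str) -> str:
    """Wrap a zero-shot instruction in Gemma 2's chat turn tokens."""
    instruction = (
        "Does the following Wikipedia text contain peacock behavior "
        "(promotional, subjective praise)? Answer YES or NO.\n\n"
        + article_text
    )
    return (
        "<start_of_turn>user\n"
        + instruction
        + "<end_of_turn>\n<start_of_turn>model\n"
    )

print(build_gemma2_prompt("She is widely regarded as a visionary leader."))
```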

Progress update

  • I've been coordinating with the ML-team, sharing code examples that make their (experimental) infrastructure fail. They will use this code as part of their use-case studies when testing the new LLM infrastructure.
  • In the meantime I've been working on writing code to fine-tune smaller Language Models; this requires (see the sketch after this list):
    • Data preprocessing and cleaning (done)
    • Experimental design (done)
    • Running experiments on the stat machine (in progress)
  • Met with the KR owner (Peter Pelberg) and explained the progress and next steps for this hypothesis.
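
An illustrative sketch of the preprocessing and split steps; the cleaning rules, file name, and split size are assumptions, not the exact pipeline used:

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def clean_wikitext(text: str) -> str:
    """Rough cleanup: drop templates, unwrap wiki links, squeeze whitespace."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                 # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[links]]
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("peacock_dataset.csv")  # hypothetical file, as above
df["text"] = df["text"].map(clean_wikitext)
train_df, val_df = train_test_split(
    df, test_size=0.05, stratify=df["label"], random_state=42)
```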

Any new metrics related to the hypothesis

  • No

Any emerging blockers or risks

  • There is some "congestion" with the GPUs on the stat machines (many users for few GPUs). This requires waiting until GPUs are free. Access to new GPUs would help us move faster on this front.

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

  • No

Have there been any new lessons from the hypothesis?

  • Not this week

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • Fine-tuned model:
    • I've tested the fine-tuning approach, creating a classifier based on a smaller Language Model. I used BERT because we already have other products hosted on Liftwing based on this model, and it has proven fast enough and scales well on our existing infrastructure (a minimal fine-tuning sketch follows this list).
    • I've run several experiments, testing different datasets and model configurations. The best results, expressed as precision and recall on a balanced dataset (the same number of articles with and without peacock behavior), were:
      • Precision: 0.67
      • Recall: 0.15
    • These results are below our target, but should be considered a baseline to compare the LLM experiments against.
    • Depending on the results of those experiments, we should consider trying to improve the fine-tuning approach: these numbers show that the model is learning (finding a signal in the data), and with some tweaks we could probably improve its performance significantly.
  • The ML-team is experimenting with a new LLM, called AYA23. I've done a quick test, and the service seems fast and robust enough to run experiments on it.
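
A minimal sketch of fine-tuning BERT as a binary peacock classifier with Hugging Face transformers; the hyperparameters and dataset handling are illustrative, not the exact configuration used in these experiments:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: peacock / no violation

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical CSV with "text" and "label" columns, as in the earlier sketches.
ds = load_dataset("csv", data_files="peacock_dataset.csv")["train"]
ds = ds.train_test_split(test_size=0.05, seed=42).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peacock-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```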

Any new metrics related to the hypothesis

  • Fine-tuned (BERT) model performance on detecting peacock behavior:
    • Training data: 9213 cases (unbalanced)
    • Validation data: 522 cases (balanced)
    • Precision: 0.67
    • Recall: 0.15

Any emerging blockers or risks

No

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

No

Have there been any new lessons from the hypothesis?

No

Have there been any changes to the hypothesis scope or timeline?

No

Next steps

Run zero-shot experiments using the AYA23 LLM hosted by the ML-team.
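
A sketch of what such a zero-shot query could look like; the endpoint URL and payload schema are placeholders, not the actual hosted API:

```python
import requests

ENDPOINT = "https://example.internal/aya23/v1/generate"  # hypothetical URL

def zero_shot_peacock(text: str) -> str:
    """Send one zero-shot classification prompt to the hosted LLM."""
    prompt = ("Does the following Wikipedia text contain peacock behavior "
              "(promotional, subjective praise)? Answer YES or NO.\n\n" + text)
    resp = requests.post(ENDPOINT, json={"prompt": prompt}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("generated_text", "")

print(zero_shot_peacock("He is the greatest innovator of his generation."))
```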

Progress update

  • I've been working on the few-shot approach without good results. I've tried a set of prompts, changing the format, number, and distribution of examples, but the LLM used (aya23) is not processing these examples correctly and is overfitting to one class (a prompt-building sketch follows this list).
  • In parallel I've been working on improving the fine-tuning approach by refining the hyperparameters. Currently I'm reaching 0.69 precision and 0.23 recall.
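
An illustrative few-shot prompt builder of the kind tried in these experiments; the examples themselves are invented for illustration:

```python
# Labeled in-context examples; the format, count, and class balance of these
# were exactly the variables explored in the experiments above.
EXAMPLES = [
    ("She is a legendary, world-renowned expert.", "YES"),
    ("The company was founded in 1998 in Oslo.", "NO"),
    ("His groundbreaking genius transformed the field.", "YES"),
    ("The river is 120 km long.", "NO"),
]

def build_few_shot_prompt(text: str) -> str:
    """Prepend labeled examples, then ask the model to label the new text."""
    lines = ["Classify each text as YES (peacock behavior) or NO.\n"]
    for example, label in EXAMPLES:
        lines.append(f"Text: {example}\nAnswer: {label}\n")
    lines.append(f"Text: {text}\nAnswer:")
    return "\n".join(lines)
```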

Have there been any new lessons from the hypothesis?

  • There are no well-established procedures for creating a successful few-shot prompt. After reviewing the literature, reviewing examples, and trying several prompts, this approach does not look like a good solution for detecting peacock behavior.

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • This week I worked on improving the few-shot and fine-tuning experiments. Unfortunately, the few-shot approach didn't show relevant improvements, so I decided to discard it.
  • It is important to say that few-shot learning is a new technique, still under development, and there might be several reasons why it didn't work for this task. Anyhow, it might be worth exploring again in the future, when clearer procedures are established.
  • On the other hand, after some tweaks, the fine-tuned BERT model improved significantly, reaching 0.72 precision and 0.4 recall on balanced data.

Have there been any new lessons from the hypothesis?

  • The current results suggest that if we aim to detect at least 40% of the cases of peacock behavior, the model would fail in 28% of its assessments. This is below, but not far from, our target. I think there is a product decision to be made: whether we want to focus on precision (avoiding wrong classifications that could disturb editors' workflows) or on recall (trying to detect all the cases of peacock behavior, even if that implies showing more false positives); see the threshold sweep sketched after this list.
  • So far, I've been focusing on the model's precision, without considering other factors such as serving time (how long it takes to get an answer from the model). This will depend on the resources we are going to have, and also on the length of the revision we are processing. If we decide to proceed with this project, I think we should have that conversation, to see what a reasonable processing time is and whether it is achievable with our current resources.
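
A sketch of that precision/recall trade-off: sweeping the decision threshold on the classifier's positive-class probability (variable names and data are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(y_true, y_prob):
    """Print precision/recall for a range of decision thresholds.

    Lower thresholds favor recall (more positives flagged, more false
    positives); higher thresholds favor precision.
    """
    for t in np.arange(0.1, 0.95, 0.05):
        y_pred = (np.asarray(y_prob) >= t).astype(int)
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred, zero_division=0)
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```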

Have there been any changes to the hypothesis scope or timeline?

  • No