Maniphest T157045

archivebot.py has 'ascii' codec bug
Closed, Resolved (Public)

Description

I run the bot with this command on fa.wikipedia:

python pwb.py archivebot.py کاربر:Dexbot/Archivebot

It shows this error and does not make any edits.

15 Threads found on [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
Looking for: {{کاربر:Dexbot/Archivebot}} in [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
Processing 15 threads
ERROR: Error occurred while processing page [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
ERROR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/data/project/rezabot/pycore/scripts/archivebot.py", line 741, in main
    archiver.run()
  File "/data/project/rezabot/pycore/scripts/archivebot.py", line 608, in run
    whys = self.analyze_page()
  File "/data/project/rezabot/pycore/scripts/archivebot.py", line 595, in analyze_page
    if self.feed_archive(archive, t, max_arch_size, params):
  File "/data/project/rezabot/pycore/scripts/archivebot.py", line 554, in feed_archive
    and not self.key_ok():
  File "/data/project/rezabot/pycore/scripts/archivebot.py", line 523, in key_ok
    s.update(self.page.title().encode('utf8') + '\n')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)
Processing [[fa:ویکی‌پدیا:قهوه‌خانه/فنی]]

Event Timeline

When I use python3, it shows this error:

python3 pwb.py archivebot.py کاربر:Dexbot/Archivebot
15 Threads found on [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
Looking for: {{کاربر:Dexbot/Archivebot}} in [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
Processing 15 threads
ERROR: Error occurred while processing page [[fa:ویکی‌پدیا:قهوه‌خانه/خبررسانی/بایگانی 1]]
ERROR: TypeError: Unicode-objects must be encoded before hashing
Traceback (most recent call last):
  File "./scripts/archivebot.py", line 741, in main
    archiver.run()
  File "./scripts/archivebot.py", line 608, in run
    whys = self.analyze_page()
  File "./scripts/archivebot.py", line 595, in analyze_page
    if self.feed_archive(archive, t, max_arch_size, params):
  File "./scripts/archivebot.py", line 554, in feed_archive
    and not self.key_ok():
  File "./scripts/archivebot.py", line 522, in key_ok
    s.update(self.salt + '\n')

At least for the Python 3 error, the cause should be this: https://docs.python.org/3/library/hashlib.html

Note: Feeding string objects into update() is not supported, as hashes work on bytes, not on characters.
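To illustrate that note, here is a minimal sketch of hashing text by encoding it to bytes first. The helper name key_digest and the choice of sha256 are assumptions made for illustration only; the tracebacks above show just the s.update(...) calls, and this is not necessarily what archivebot.py does or how it was eventually fixed.

```python
import hashlib


def key_digest(salt, title):
    """Hypothetical helper: hash a text salt and a text page title.

    Both arguments are text (unicode) strings. Each one is encoded to
    UTF-8 bytes and joined with a *bytes* newline before being fed to
    update(), so no implicit str/bytes mixing takes place.
    """
    s = hashlib.sha256()  # algorithm chosen arbitrarily for this sketch
    s.update(salt.encode('utf-8') + b'\n')
    s.update(title.encode('utf-8') + b'\n')
    return s.hexdigest()


# Non-ASCII title, written with escapes so the file encoding does not matter.
digest = key_digest(u'some salt', u'\u06a9\u0627\u0631\u0628\u0631:Dexbot/Archivebot')
```

Using b'\n' instead of '\n' keeps both operands of the concatenation as bytes on Python 2 and Python 3 alike.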

In Python 2, "from __future__ import absolute_import, unicode_literals" is probably forcing a type coercion (and an ASCII decode) behind the scenes when the two 'strings' are added?

>>> u'\u0628\u062d\u062b \u06a9\u0627\u0631\u0628\u0631:IranianNationalist'.encode('utf-8') +'\n'
'\xd8\xa8\xd8\xad\xd8\xab \xda\xa9\xd8\xa7\xd8\xb1\xd8\xa8\xd8\xb1:IranianNationalist\n'

>>> from __future__ import absolute_import, unicode_literals

>>> '\u0628\u062d\u062b \u06a9\u0627\u0631\u0628\u0631:IranianNationalist'.encode('utf-8')
'\xd8\xa8\xd8\xad\xd8\xab \xda\xa9\xd8\xa7\xd8\xb1\xd8\xa8\xd8\xb1:IranianNationalist'

>>> u'\u0628\u062d\u062b \u06a9\u0627\u0631\u0628\u0631:IranianNationalist'.encode('utf-8') +'\n'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

>>> '\u0628\u062d\u062b \u06a9\u0627\u0631\u0628\u0631:IranianNationalist'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)
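The session above suggests that, with unicode_literals active, '\n' is a unicode object, so Python 2 tries to decode the UTF-8 bytes as ASCII before concatenating. A small sketch of two ways to avoid that coercion follows. The variable names are made up for illustration, and explicit u'' and b'' prefixes are used so the behaviour does not depend on unicode_literals.

```python
# Text title taken from the REPL session above (first word only).
title = u'\u0628\u062d\u062b'
encoded = title.encode('utf-8')

# This is the step Python 2 performs implicitly for `encoded + u'\n'`:
# decoding the bytes as ASCII, which fails on the first non-ASCII byte.
try:
    encoded.decode('ascii')
    implicit_decode_ok = True
except UnicodeDecodeError:
    implicit_decode_ok = False

# Option 1: append a bytes newline to the already-encoded bytes.
safe_bytes = encoded + b'\n'

# Option 2: concatenate as text first, then encode once.
also_safe = (title + u'\n').encode('utf-8')
```

Both options yield the same byte string, and neither relies on an implicit decode, so they behave the same under Python 2 and Python 3.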

Change 336918 had a related patch set uploaded (by Mpaa):
archivebot.py: fix Unicode encodings in py2 and py3

https://gerrit.wikimedia.org/r/336918

Change 336918 merged by jenkins-bot:
archivebot.py: fix Unicode encodings in py2 and py3

https://gerrit.wikimedia.org/r/336918

Mpaa claimed this task.