Mod:wiki checker

Wiki Checking Code

This code serves two purposes:

Crosscheck text inside a wiki page against an existing database.
Check ownership of images and scan for existing duplicates.

The code is parallelised and designed to be used on the HPC.

Usage

Setting up a Job

The following sections go through the structure of the modifiable part of the script

PBS Variables

These variables can be adjusted in the same way as any PBS job script.

Setting the Database Path

For the first run, a blank database file must be created. This can have any name (in this example, "stringdatabase") and can be created by executing:

touch stringdatabase

The path is then set in the wiki_checker script, under the string_database_path variable:

string_database_path = "/full/path/to/stringdatabase"

As PBS jobs are executed in a temporary path, the full path to the database must be used.

Logging

The output of the checker is sent to the file corresponding to the log_path. This is a write-only task, and therefore the file does not need to be manually created. For the same reason as above, the full path must be used.

Checking a List of Wikis

Each job will perform text or image checking, or both, on a list of wiki URLs given by the wikilist variable:

wikilist = [
"https://wiki.ch.ic.ac.uk/wiki/index.php?title=<wikipagename1>",
"https://wiki.ch.ic.ac.uk/wiki/index.php?title=<wikipagename2>"
]

Each URL should be enclosed by double quotation marks and separated by a comma (a new line can be used in addition for ease of reading).

Text Checking

Setting the check_text variable to True switches on text checking, where each wiki in the list is compared against the database. Setting to False switches off this functionality.

(Added 8th May 2017)

Strings that are similar (using the same threshold) as paragraphs in exclude_paras are ignored. It's useful to add all explicit questions from the exercise to this list - some students will write the questions into their wikis which would otherwise be flagged.

(Added 8th May 2017)

Strings that contain any of the strings in exclude_strings are ignored. For example, JMols can frequently appear as paragraphs. If you're not interested in testing similarity of JMol code, you can add "<jmol>" to prevent them appearing in the log.

Adding a URL to the Database

When add_to_database is set to True, paragraphs are added to the database. Setting to False switches off this functionality.

Similarity Threshold

Set the tolerance to check for a percentage match between new wiki paragraphs and the database paragraphs. A higher number corresponds to a strict match, and 0 will include all possible combinations of paragraphs, and is not recommended!

Ownership of Images

Setting check_images to True will perform analysis on all images in the page. It has two functions:

It checks whether any images do not belong to the page owner. This is a naive algorithm, and looks for the most common username in all images to set the page owner. Under normal circumstances this should not fail, but in the extremely rare occasion where a user has copied every image from one person only it will not flag a warning.

It checks for images that are exact pixel matches of existing images. A flag usually occurs when a user has uploaded the same file with a different name, but can occur when an image is copied from another page or the file is copied directly from another user. It can also rarely occur when exactly the same file is generated by two separate users by chance.

Checking Time of Latest Edits

(Added 6th May 2017)

Setting latest_edits to an integer greater than 0 will switch on printing of the times of that number of edits in history.

This utility can be used to check whether the wiki has been edited after a deadline.

Submitting a Job

The script is itself a PBS job script, and can be submitted directly with qsub:

qsub wiki_checker.py

Analysing the Results

Each wiki page that is checked produces an output of the following structure:

<current date and time>
Target URL:       <url>

<text analysis>

<image analysis>

Execution completed in <real walltime>

Text Analysis

Any similarity flags have the following structure:

Database Match: <URL>
String:         <string1>
Match:          <string2>
Similarity:     <percentage>

The <URL> indicates which database url triggered the match. string1 and string2 are the strings from the wiki under test and the database string respectively. The percentage indicates how similar the strings are.

Image Analysis

Images are flagged if there the image belongs to or has belonged to another user, or it is a pixel match of an existing image.

Ownership issues are shown as:

Image:      <image URL>
Owner:      <username>
Users:      ['<username1>', '<username2>'...]

This is usually not a problem, and can occur when files are overwritten. If none of the users in the Users list correspond to the Owner, then it suggests the image URL has been used and the file was not created/uploaded. The image URL can be checked for further analysis.

Duplicates appear as:

Image:      <image URL>
Owner:      <username>
Duplicates: ['<duplicate1 URL>', '<duplicate2 URL>'...]

Again, this is usually not a problem and can occur when a user has renamed a file after uploading (creating a duplicate on the server). Occasionally it can occur when a user has downloaded another user's image and uploaded it with their own name.

Latest Edition

The times of the latest editions can be printed using the latest_edits to an integer greater than 0.

The output of latest_edits=3 will look like:

Latest Editions:
Time: <time0>, <date0>         User: <user0>
Time: <time1>, <date1>         User: <user1>
Time: <time2>, <date2>         User: <user2>

The username is printed in case an edit is made by a marker. It can also be used to check whether anyone else has edited a wiki page.

Scaling and Parallelisation

The code scales linearly with the number of items in the database. Over time, it will become slower (~6 mins CPU time for 8000 database strings * 44 wiki strings).

It is recommended therefore to make use of multiprocessing by installing the joblib module and setting ncpus in the #PBS node selection line. Bottlenecks in the code are parallelised, and linear scaling with number of cores can be seen when the database is heavily used.

When parallelisation isn't available, the code will default to multithreading for image analysis.

Known Issues

Image checking hangs for 180 seconds

Sometimes image analysis will take 180 seconds instead of the typical ~1 second. This is a server-side issue and can't be fixed.

Installation

Notes

The easiest way to distribute small pieces of code across wiki is pasting it in as text. The code below can be copied and pasted directly into an empty .py file on the server, such as "wiki_checker.py". Note that you must run the following local-session commands on vim to prevent a huge number of tabs being inserted:

:set noautoindent
:set nosmartindent

Code

#! /usr/bin/env python

#############
#           #
# JOB SETUP #
#           #
#############

#This section can be modified as needed 

#PBS -l select=1:ncpus=4:mem=16000MB
#PBS -l walltime=1:00:00
#PBS -j oe

__version__ = '1.0.5'

#String database should be empty for first run 
#It will be populated during use and shouldn't be modified
string_database_path = ""#/path/to/string/database

#Output results to log
log_path             = ""#path/to/log

#Below is a comma separated list of strings of new URLs to check
wikilist = [
#"https://wiki.ch.ic.ac.uk/wiki/index.php?title=<wikipagename1>",
#"https://wiki.ch.ic.ac.uk/wiki/index.php?title=<wikipagename2>"
]


#check_text to perform plagiarism checking
#Set add_to_database to False for test runs, otherwise duplicate strings will be added to the string database
#tolerance sets the threshold to consider a string suspicious. Default is 0.8
check_text           = True
add_to_database      = True
tolerance            = 0.8

#check_images will perform owner and duplicate analysis on images
check_images         = True

#This will print out the submission time of the latest editions
#The number is how far back in history to go (useful if someone has changed the page)
latest_edits         = 3

#Add paragraphs here that you might expect to appear frequently, but aren't potential cases of plagiarism (eg questions from the exercise)
exclude_paras = [
]

#Add strings here that indicate a paragraph is not to be flagged (eg <jmol>)
#Adding <jmol> prevents everyone's JMOL code from being flagged repeatedly
exclude_strings = [
]



#######################
#                     #
# WIKI CHECKER SCRIPT #
#                     #
#######################

#This section should not be modified

from difflib import SequenceMatcher as sim
try:
    from joblib import Parallel, delayed
except:
    is_parallel = False
    import threading
else:
    is_parallel = True
from lxml import html
import time, urllib, os, requests


def wikiVsDB(wiki_url, string_database_path, log_path=None, add_to_database=True, ncpus=4, tolerance=0.8, exclude_strings=None, exclude_paras=None):
    """Compare a wiki URL with the string database
    wiki_url:             Wiki URL to compare
    string_database_path: Path to string database containing previous entries
    log_path:             Path to log results. If none or empty, this defaults to stdout
    add_to_database:      If true, the wiki will be added to the database. Set to False for trial runs
    ncpus:                Numbed of processors to run on
    tolerance:            Level of match before a string is flagged
    exclude_strings:      If any of these strings are in the string to be tested, it is ignored.
    exclude_paras:        If any of these paragraphs are similar to the string to be tested, it is ignored.
    """

    url_dict = urlToDict(wiki_url)
    DB_dict = stringDBToDict(string_database_path)
    url_list = [[k, v] for k, v in url_dict.items()]

    if is_parallel:
        Parallel(n_jobs = ncpus)(delayed(strVsDict)(string, url, DB_dict, log_path, tolerance, exclude_strings, exclude_paras) for string, url in url_list)
    else:
        for string, url in url_list:
            strVsDict(string, url, DB_dict, log_path, tolerance, exclude_strings, exclude_paras)

    if add_to_database:
        addToDB(string_database_path, url_dict)

def strVsDict(string, url, string_url_dict, log_path, tolerance=0.8, exclude_strings=None, exclude_paras=None):
    """Compare a string to a database dict
    string:          String to compare
    url:             URL from where the string came from. For printing purposes
    string_url_dict: Dictionary to compare with in the form {<string0>: <url0>, <string1>: <url1>}
    tolerance:       Level of match before a string is flagged
    """

    a_str = string
    a_url = url
    tol = float(tolerance)

    if not exclude_strings:
        exclude_strings = []

    if not exclude_paras:
        exclude_paras = []

    #For every string, perform tests. If they fail a test, the next string is considered
    #Ordered tests for efficiency
    for b_str, b_url in string_url_dict.items():

        #Make sure string doesn't contain any strings we don't want in the database
        if any([a_str in e_s for e_s in exclude_strings]):
            continue
 
        #Make sure the record isn't already in the database
        if b_url == a_url:
            continue
 
        #Test if strings are similar
        similarity = sim(None, a_str, b_str).ratio()
        if float(similarity) < tol:
            continue
 
        #Make sure this string isn't similar to the paragraph exclusion list (eg a question from the exercise)
        exclude_para_sims = [sim(None, a_str, e_str).ratio() for e_str in exclude_paras]
        if any([float(e_sim) >= tol for e_sim in exclude_para_sims]):
            continue

        #If the string has made it this far, it is flagged and printed in the log
        comp_strings = [
            "Database Match: {}\n".format(b_url),
            "String:         {}\n".format(a_str),
            "Match:          {}\n".format(b_str),
            "Similarity:     {:.1%}\n\n".format(similarity)]
        write(log_path, "".join(comp_strings))
                

def urlToDict(url, min_string_length=10):
    """Convert HTML from a URL to a dictionary of paragraphs"""

    string_dict={}
    wiki = urllib.urlopen(url)

    paras = [p for p in wiki.readlines() if "<p>" in p]
    for p in paras:
        split_list = p.split(">")
        join_list = []

        for s in split_list:
            join_list.append(s.split("<")[0])
        joined = "".join(join_list)
        if len(joined.split()) >= min_string_length:
            string_dict[joined.strip("\n").replace("\t", " ")] = url

    return string_dict

def stringDBToDict(string_database_path):
    """Parses the string database into a dictionary"""

    string_dict = {}

    with open(string_database_path, "r") as db:
        for line in db.readlines():
            split_line = line.split("\t")
            if len(split_line) == 2:
                string, url = line.split("\t")
                string_dict[string.strip()] = url.strip()

    return string_dict

def getTimeStr(m, s):
    m = int(m)
    s = int(s)
    if m == 1:
        m_string = " " + str(m) + " Minute"
    elif m == 0:
        m_string = ""
    else:
        m_string = " " + str(m) + " Minutes"

    if s == 1:
        s_string = " " + str(s) + " Second"
    else:
        s_string = " " + str(s) + " Seconds"

    return "Execution completed in" + m_string + s_string

def addToDB(string_database_path, url_dict=None, url=None, min_string_length=10):

    if url:
        url_dict = urlToDict(url, min_string_length)

    with open(string_database_path, "a") as db:
        for string, url in url_dict.items():
            db.write(string + "\t" + url + "\n")

def wikiImageInfo(url, log_path="", ncpus=4):

    wikipage = requests.get(url)
    wiki_html = html.fromstring(wikipage.content)
    i_s = wiki_html.find_class("image")
    images = ["https://wiki.ch.ic.ac.uk" + i.attrib['href'] for i in i_s if i.attrib.get('href')]

    image_dict = {}
    if is_parallel:
        image_list = Parallel(n_jobs = ncpus)(delayed(checkImage)(image, image_dict, is_parallel) for image in images)
        image_dict = {k: v for k, v in image_list}
    else:
        threads = [None] * len(images)
        for i in range(len(threads)):
            threads[i] = threading.Thread(target=checkImage, args=(images[i], image_dict, is_parallel))
            threads[i].start()

        for i in range(len(threads)):
            threads[i].join()

    users = [image_info["users"] for image_info in image_dict.values()]
    users = [a for b in users for a in b]
    if users:
        user = max(set(users), key=users.count)

        for image, info in image_dict.items():
            if not all(user==u for u in info["users"]):
                write(log_path, "Image:      {}\n".format(image))
                write(log_path, "Owner:      {}\n".format(user))
                write(log_path, "Users:      {}\n\n".format(info["users"]))
            if image_info["duplicates"]:
                if not all(user==u for u in info["users"]):
                    write(log_path, "Image:      {}\n".format(image))
                    write(log_path, "Owner:      {}\n".format(user))
                write(log_path, "Duplicates: {}\n\n".format(info["duplicates"]))

def checkImage(image_url, image_dict, is_parallel):
    image_info = {}
    imagepage = requests.get(image_url, timeout=None)
    image_html = html.fromstring(imagepage.content)
    c_s = image_html.find_class("mw-userlink")
    users = [c.text for c in c_s]
    image_info["users"]=users

    a_s = image_html.find_class("mw-imagepage-duplicates")
    d_s = [a.findall("li") for a in a_s]
    d_s = [a for b in d_s for a in b]
    image_info["duplicates"] = []
    duplicates = [d[0].attrib['href'] for d in d_s if d[0].attrib.get('href')]
    if duplicates:
        for d in duplicates:
            if d.startswith("/"):
                image_info["duplicates"].append("https://wiki.ch.ic.ac.uk" + d)
            else:
                image_info["duplicates"].append(d)
    if is_parallel:
        return [image_url, image_info]
    else:
        image_dict[image_url] = image_info

def getLatest(wiki_url, log_path, history_number):
    if not history_number:
        return

    url = wiki_url + "&action=history"
    wikipage = requests.get(url)
    wiki_html = html.fromstring(wikipage.content)
    history = wiki_html.find_class("mw-changeslist-date")
    history_users = wiki_html.find_class("mw-userlink")

    latest = [h.text for h in history]
    users  = [u.text for u in history_users]

    write(log_path, "Latest Editions:\n")
    for n in range(min(len(latest), history_number)):
        write(log_path, "Time: {:<25s} User: {}\n".format(latest[n], users[n]))
    write(log_path, "\n")

def write(target, string):
    if target:
        with open(target, "a") as log:
            log.write(string)
    else:
        print(string)

ncpus = int(os.environ["NCPUS"]) #Set this in the PBS resources above

for wiki_url in wikilist:
    write(log_path, time.ctime() + "\n")
    write(log_path, "Target URL:     {}\n\n".format(wiki_url))
    i_time = time.time()

    if check_text:
        wikiVsDB(
            wiki_url,
            string_database_path,
            log_path,
            add_to_database,
            ncpus,
            tolerance,
            exclude_strings,
            exclude_paras
        )

    if check_images:
        wikiImageInfo(
            wiki_url,
            log_path,
            ncpus
        )

    if latest_edits:
        getLatest(
            wiki_url,
            log_path,
            latest_edits
        )

    exec_time = time.time() - i_time
    m, s = divmod(exec_time, 60)
    write(log_path, getTimeStr(m, s) + "\n\n\n")