Blayne Dreier
Expeller v1.0 The Cheat Detector
Index
Expeller is meant to curb the problem of source code plagiarism among students. It is a piece of software that searches through student program files and makes textual comparisons between them.
Many Computer Science students will openly admit that they do not like to program. Many of them also become frustrated and anxious due to program due dates and specific instructor-requested guidelines. Rather than go to tutoring or ask for help in understanding programming fundamentals, they become experts at cheating by copying another students source code, manipulating it in order to change its basic structure or appearance, and then submitting the plagiarized result as their own.
A manipulated program is not as easy for a human to detect as, for example, a copied term paper. The reader of linguistic texts can obtain an idea or feeling while he reads each paper, even if all of the papers topics are the same. This is not often possible when reading source code, as the code defines specific actions in a procedure-oriented language, which is not easily read and retained block-by-block. Additionally, programs written for any given programming assignment all have the same fundamental purpose, meaning that a human will be unable to distinguish one program from another, based on the function each describes.
Expeller takes a different approach. Rather than simply basing its comparison on the function of a program, it compares the style. It then outputs a minimized list of all possible plagiarized programs, along with their source, to a simple text file. This file can then be read by the user and the possible plagiarized programs can be examined. By the methods described in this paper, you will see that Expeller is surprisingly accurate by replicating and comparing the techniques used by cheating students.
User Manual
Overview
Expeller detects source code plagiarism among students by making textual comparisons between files. It then outputs the results to a simple, minimized text file for user-review. A few examinations can then be made by the user to determine whether the students are indeed believed to be guilty of cheating.
User Operations
A listing of all user operations is easily reached by using the help runtime argument, which is actually a user option itself. Upon using the help argument, the user will receive the following list of user options:
-d <full path>
Provides
the user with the ability to select the first directory in which Expeller will
search for files.
-o
<output file name>
Allows
the user to choose the name of the output file that will contain all of the
possible cheaters.
-f
<filename or extension>
Filters
the search to only include specific file types (by extension) or specific file
names (such as if all of the students files are named Lab_1).
-s
Flag that enables the searching of subdirectories within the specified directory in the -d argument.
-t <cheat threshold>
Allows the user to set a threshold for which
cheaters will be reported.
help
Displays the help screen with a listing of these user operations.
Screens
Expeller is completely text based to allow it to be used remotely, so there are only four terminal screens that the user will encounter:
Help Screen

The Help Screen lists all of Expellers user options (described on previous page).
Run Screen

In this screen, we are running the program with various user options. The directory (d) option tells Expeller to begin searching in /afs/uncc.edu/usr/f/cbdreier/sp/samplefiles, the output (o) option says that the output will be in cheaters.txt, the filter (f) option tells Expeller to only search for files with the .java extension, the subdirectories (s) option tells Expeller to search all subdirectories, and the threshold (t) option says to only output files that are 85% (or greater) alike.
Run-time Screen

When Expeller runs, it displays all of the values for user configurable options. It then lets the user know that it is loading files by displaying one period for each file loaded and then displaying the total number of files loaded. It does the same for comparing files and then displays the total time that the comparisons took. Lastly, Expeller tells the user how many possible cheats it found and the output files name.
User Review Screen

The last screen isnt a screen per se (unless you are using Expeller via terminal), but a display of the results file. Here we see that the results file is telling the user to make four manual comparisons (down from 120). We have each file pairs cheat possibility listed as well as the full path location of the offending files. At the bottom, the total number of possible cheats is displayed. This can be helpful in cases of very long cheaters files.
Defects
There are no known defects in Expeller.
Extensions to the Project
There is a clear direction that the project should now take. First, the modCountTokens(File file1, File file2); method should be expanded upon by adding more regular expressions. Second, a new module should be implemented. This module should be a parsed tokenized string matching algorithm. The student that builds this module will benefit greatly by having taken a compiler construction course. In order to build this module, the following description should be followed:
Tokens are created by categorizing statements. This is easily understood by example. If your program contains a statement: int thisNumber = 10; the token to replace it would be VARDEF which stands for variable declaration. Similarly, a statement such as import java.io.*; would be replaced with an IMPORT token. The table found on page twelve is taken from one of the best cheat detectors in the industry, JPlag (http://www.jplag.de/). It is a generic token table, so it should not be considered plagiarism to implement it into Expeller. Additionally, one can add his/her own tokens to make this table more complete. Tokens can be added for any language, thereby expanding Expellers abilities.
The easiest way to carry out the implementation of the module described above is to use Expellers getFileAsString(File file, boolean deleteComments); method, with deleteComments set to true. This will return a file as a string object and allow you to replace matches to regular expressions with tokens. (Tokens are simply strings with values equal to the token name.) You will find information about how to create and match regular expressions in the java.util.regex package, as well as in the modCountTokens(File file1, File file2); method of Expeller where tokens are matched with regular expressions. They are then counted and totals are compared via the percentMatch(int number1, int number2); method.
After the token strings are built by replacing all token matches by token names, an algorithm will need to be written to match the token strings of each file to the other. The programmer will need to keep in mind that he/she will need to match token strings (combinations of tokens), as well as individual tokens, between the two files. The programmer will need to devise a sophisticated algorithm, remembering that tokens in the supposed cheaters program can be moved around without affecting its operation.
Lastly, more modules can be added that compare the two files in any fashion. However, the programmer should remember to test each module on real sample data, and throw out any modules that skew Expeller away from actual detection of cheating. This was done many times during the development of Expeller v1.0. Also, to carry a tradition and to let the current programmer know how many revisions have been made to Expeller, please add one tenth to the version number for every new senior project on which the code is based.
A note on building modules: Expeller was built to be completely modular. At minimum, all one has to do to add to the sophistication of Expellers results is add a module to the main code, and then run it and add its percent match to the total percent match of the compareFiles method. In order to keep Expeller additions this simplistic, when creating a method, always only allow it to accept, as parameters, two files, and always return the percent match for that module.

What I would do differently
When I started this project, I had not taken any compiler construction courses, and therefore knew nothing of their topics. I will say that one introduction course in compiler construction has changed the direction of this project so drastically that any student that works on Expeller in the future should have first taken one of these classes.
If I were to begin this project again, I would start with the algorithm described in the Extensions to the Project section of this paper. Even though the current modules of Expeller work incredibly well, seeing the results of the suggested algorithm could spark ideas for even more sophisticated modules than the ones that are currently implemented. Since the project only exists, for most students, over a short period of time (one to two semesters) brainstorming is costly as implementation tends to take the most time.
Also, if there was more than one member in the group, I would have assigned one of the members to do some undercover investigative research. He would have asked a professor that teaches an introductory Java programming course (1214 or 1215) to sit in on about a quarter of his/her classes (especially around the time programs are due). Students would probably not be tipped off that he was not a regularly enrolled student, and he could seek out the cheaters and record their methods. They would most likely not hesitate to provide him with information, especially if they had guaranteed anonymity. Additionally, he may be able to work out some scenario where the professor will make an assignment for every student in the class to alter a program in order to fool the professor into believing that its an original. He could then collect this assignment from the teacher and record what the students have altered, thereby strengthening Expeller.
Expeller works by detecting mimics of cheat methods. The most sophisticated module algorithm is useless if it does not detect real student cheating habits.
Technical Manual
Installation Process
There is no installation process, per se. You simply copy the expeller.java file into a directory anywhere on the system that the target files are located and compile it with the systems native Java Runtime Environment (JRE).
Technical Overview
The most thorough way to describe Expeller in a technical manner is to explain its methods in detail. I will first describe the functional methods (the methods that allow Expeller to function as a program) and then move to the modular methods
.expeller(String[] args) Constructor
Since Expeller runs more as a standalone program and not simply a class, the constructor takes care of running all of the various methods in order. First, it prints out a version number. Second it gathers and dissects the argument user options using the dissectArgs(LinkedList argList) method. It then prints the set user options, and begins loading files with the growFileAndDirectoryArray() method. Fourth, it compares the files using the compareFiles() method and keeps track of the total time that the comparisons took so that it can be outputted to the user. Lastly, it tells the user how many cheats were found and what file they have been outputted to.
dissectArgs(LinkedList
argList)
Arguments are supplied to Expeller at runtime. They need to be recursively evaluated because no argument is mandatory. dissectArgs takes, as a parameter, a LinkedList of arguments, which will act as a stack, from the constructor method. It then checks to see if a help argument is first (meaning the user would like to have a display of all the user options) and then proceeds to check each other argument in the list. Invalid arguments are those that are not help and do not start with a -, so the validity of this is checked first. If an invalid flag is found Expeller tells the user of this error. After Expeller knows that its dealing with a valid argument, the letter for the argument is examined, an action (determined by the argument option letter, described in the User Options section) is carried out. The argument is then popped off the stack, and then the method checks to see if the stack is empty. If not, dissectArgs is called again recursively.
compareFiles()
This method runs the comparison modules of Expeller, records percent matches from each module, computes the total percent match between the two files and then prints the result to an output file.
Files are stored in two arrays, file1[] and file2[]. They are cross-matched via two for loops so that every file is compared against every other. Each pair of files is run through all four modules, the score is recorded, and the loop continues with the next pair. After the total results are computed, the final scores that are at or above the threshold user option are written to an output file. The file stream is then closed.
percentMatch(int
number1, int number2)
This method finds the percent match between two numbers. It uses the generic percent difference formula, (difference/average)*100, and then subtracts the result from 100. If the sum of the two numbers is equal to 0, this method returns a percent match of 100 because this would mean that both numbers are equal to 0 (this method will never receive negative parameters) and the absence of any attribute is just as important as its presence. If the sum is not zero, the method continues. After the percent match is calculated, if it is greater than 0 (percent difference is less than 100), the percent match is returned. If the percent match is less than 0 (percent difference is greater than 100), a 0 is returned.
calculateComparisons(int
numberOfFiles)
This method is used to calculate the number of comparisons that Expeller will have to make, based on the number of files that it found. The formula for comparisons based on number of files is: number of comparisons = (n-1) + (n-2) + (n-3) + + (n-(n-1)), n being equal to the number of files.
getDepthOfFile(File
file)
When we create the arrays for depth length comparisons in the modCountBlockSize method, we need to know the depth of each file. This method calculates that depth by incrementing the depth value for every found { and decrementing it for every found }, all the while keeping track of the deepest depth found.
getFileAsString(File
file, Boolean deleteComments)
This method is the heart of textual comparisons in Expeller. It takes as arguments a file and a boolean that determines whether the returned file string will contain comments.
The method reads in each line of a file and determines if a comment is contained in the line by using the regular expression, (?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*) to make matches and replace them with a blank line, and by searching for occurrences of /* and */ and removing them. If a comment is completely contained within a line, the line is removed, and if a comment is in a line along with valid code, the comment is removed and the code is saved. After all comments are removed from a line, the length is tested, if it is greater than 0, the line is added to the final file string. If not, it is forgotten and the next line is read in. Therefore, during the process of removing comments, this method also removes blank lines (white space) which are not used in comparison.
greaterOfTwo(int
int1, int int2)
This method simply returns the greater of two numbers. It is used to size arrays.
printManual()
When a user needs to know what user option arguments that he can provide Expeller at runtime, he will use the argument help. This will trigger the printManual method, which will print all of the available user options to the terminal. Additionally, if a user enters an invalid command, this method is run to educate him on how his error could be corrected.
growFileAndDirectoryArray()
In order to compare files within a directory, they need to be loaded into some data structure. The easiest way to cross-compare elements is by putting them in arrays and running two for loops, one encapsulated in the other. This method searches through the directory specified at runtime, loads all of the files in that directory into an array and if the subdirectory argument has been given, it examines any directories within the initial directory and also adds files that they contain to the files array.
Additionally, the idea of the filter argument is carried out in this method. If the current found file does not match the filter pattern, it is not added to the file array.
While files are added to the array, a . is printed for each one, to let the user know that the loading is occurring.
modCompareLength(File
file1, File file2)
This module is one of two dumb modules implemented. It simply counts the character length of the string representation of a file1 and file2, and then compares the lengths based on the percentMatch method. It is thought of as dumb because there is no sophisticated calculation involved. However, dumb is not to be confused with ineffective; the results are very extremely useful.
modCompareNumberOfLines(File
file1, File file2)
This method effectively does the same thing as modCompareLength, except that it compares the number of lines in each file rather than the number of characters. It is another dumb module and its results are most important when dealing with files of very different lengths. It very easily rules those out as possible matches. So, even though modCompareLength and modCompareNumberOfLines are simplistic, in some ways, they play a greater role than the more sophisticated modules.
modCountBlockSize(File
file1, File file2)
Most cheaters will change variable names and move code around. They are looking for the easy way out and that does not include re-writing any methods. If they do, they might as well not cheat to begin with. This method counts the number of lines at each depth of a program. Depths are described in the getDepthOfFile method, and are simply thought of as a number value for every instance of { or }. Any text outside of curly braces is thought to be in depth 0, any in the first { is depth 1 and so on.
The number of lines at each depth is calculated and stored in an array at the index equal to the depth value. Each individual depths number of lines of each file is compared with that of the other file and a percent match array is formed. Then all of the percent matches are summed and divided by the number of entries in the percent match array, giving Expeller a total percent match for this module, for any two given files.
The module breaks a file string up into lines via delimiting by the \n character. It then determines if the line contains a { in order to increment the block counter. Once inside the {, it begins to add to the array index [blockCounter] for that depth. If a } is found, the block counter is decremented and lines succeeding it are added to the next shallow depth. If no curly braces are found on a line, one is added to the current depth.
The percent match array is sized by the largest depth array size. Meaning if a file1 has one more depth than file2, the percent match array will have the length of file1s block count array. The value for this index in the percent match array will be zero because one file is 100% different than another in respect to that depth value.
modCountTokens(File
file1, File file2)
When plagiarizing another students code, it is very easy to change variable names. However, changing the variable types, loop types, included libraries, and system calls is more than difficult; it requires knowledge of the language something that the cheater most likely doesnt have.
Counting generally found keywords in files and then comparing the number of matches is a very effective way of catching copies of files. However, comparing the matches of one single keyword between two files is counter-productive -- it would be easy to rewrite one form of loop. So, as more regular expressions are added to this module, the more effective it will become.
This
module works by taking an array of regular expressions, cycling through it, and
counting the number of matches within each file, and then recording each match
at the regular expressions index within the expressions match array. It then
calculates a percent match between each index of file1s match array and the
corresponding index of file2s match array.
Source Code
Java was chosen as the language for Expeller because of its usability across many platforms without the need to change any of the source code. It was determined that having all of the modules and functional code contained in one file would be better than having separate module classes, for ease of transferability (sharing). Rather than requiring the user to download and install a package on a system with which he/she may not be familiar, all that is required to install Expeller is the compilation of one file.
package expeller;
import java.io.*;
import java.util.*;
import java.util.regex.*;
public class expeller {
/* File Array - will contain all of the files
found to be compared */
private
File[] filesArray;
/* Directory Array - will contain all of the
directories where files
are found */
private File[] directoriesArray;
/* Array used to hold each file in the Linked
List for them to be
compared against each other */
Object[] filesListToArray;
/* Holds the directory from which the program
was started to know
where to begin searching */
String topDir =
System.getProperty("user.dir");
/* The
program will need to move through many directories
recursively, so initially, the curDir will
be the topDir */
String curDir = topDir;
/* A flag is a complete argument to the
program, the '-' and whatever
letters or words are attached to it */
String curFlag;
/* The argument key that is attached to the
'-' */
String curArg;
/* This string can be changed to represent the
current version number
of Expeller, which will be displayed at
runtime */
final String VERSION_NUMBER =
"1.0";
/* Linked list that contains all of the
directories to be searched */
LinkedList directories = new LinkedList();
/* Linked list that contains all of the files
to be searched */
LinkedList files = new LinkedList();
/* Linked List that contains all of the
runtime arguments to the
program */
LinkedList argsList = new LinkedList();
/* One argument that can be given to the
program is a filter, so that
the program will only search for a
specific filename or extension,
this
variable holds the filter */
String fileNameOrTypeFilter = "No
filter";
/* Contains the output filename for the file
where the possible
cheaters will be located. By default, it's
cheaters.txt, but the
name can also be supplied as an argument
at runtime */
String outputFileName =
"cheaters.txt";
/* An argument can be given that tells the
program whether to search
through subdirectories */
boolean subDirectories;
/* Argument to supply the percent match
threshold for which cheaters
will be reported */
int percentMatchThreshold = 75;
int cheaters = 0;
/* This is the number of built-in intelligent
modules. Intelligent
modules are deemed intelligent by their
complexity and their
ability to provide a sophisticated
comparison of two files */
final int NUMBER_OF_INTELLIGENT_MODULES = 2;
/* This is the number of built-in dumb
modules. Dumb modules return
some simple comparison about the file. The
two dumb modules
currently implemented count the lines and
character lengths of
files */
final int NUMBER_OF_DUMB_MODULES = 2;
/* This public method allows Expeller to be
run as a stand alone
program */
public static void main(String[] args) {
expeller Expeller = new expeller(args);
}
/* This constructor allows Expeller to be
called from another program
as a stand alone class */
public expeller (String[] args) {
/* Program title and version number */
System.out.println("\nExpeller v"
+ VERSION_NUMBER);
/* If there were command line arguments,
add them to an array and
then dissect each one */
if(args.length != 0) {
for(int i = 0; i < args.length; i++) {
argsList.add(args[i]);
}
dissectArgs(argsList);
}
/*
These are informational lines to tell the user which options
he/she has selected at runtime */
System.out.println();
System.out.println("Root search
directory: " + curDir);
System.out.println("Filename or
extension filter: " +
fileNameOrTypeFilter);
System.out.println("Search
subdirectories: " + subDirectories);
System.out.println("Output file name:
" + outputFileName);
System.out.println("Percent match
threshold: " +
percentMatchThreshold);
System.out.println();
/* This line displays to let the user know
that the program is
currently loading files */
System.out.print("Loading
files");
/* This method actually builds the files
and directories arrays in
order to do recursive searching for
files */
growFileAndDirectoryArray();
/* Copies the elements in the files
LinkedList to an Array to
compare each file against each other */
filesListToArray = files.toArray();
/* Output spacing at runtime */
System.out.println();
System.out.println();
/* This line tells the user how many files
have been loaded for
comparison */
System.out.println("(" +
files.size() + " files loaded)" + "\n");
/* This variable will be used to show the
number of comparisons to
the user */
int numberOfComparisons =
calculateComparisons(files.size());
/* This line tells the user that file
comparisons are currently
happening */
System.out.print("Comparing
Files");
/* This variable is used to hold the time
before the comparisons
begin so that it can be compared to the
end time and the time
that the comparisons took can be
calculated */
long startTime = System.currentTimeMillis();
/* This method actually handles the
comparison of the files */
compareFiles();
/* Holds the system time at the end of
comparisons to calculate the
time it took to compare them all */
long endTime = System.currentTimeMillis();
/* Calculates how long the comparisons took
*/
long secondsForComparisons = (endTime -
startTime)/1000;
/* These lines tell the user how many
comparisons took place and
how long they took */
System.out.println();
System.out.println("(" +
numberOfComparisons + " comparisons
completed in "
+ secondsForComparisons + "
seconds)");
System.out.println();
/* This line tells the user how many cheats
were found and where
the
output file is located */
System.out.println(
"(Output of " + cheaters +
" possible Cheats located in " +
outputFileName + ")");
System.out.println();
}
public void dissectArgs(LinkedList argsList)
{
/* Get the first flag from the argsList */
curFlag = argsList.getFirst().toString();
/* Then remove the flag so that we can get
the argument next */
argsList.removeFirst();
/* If the flag is the help/? flag */
if
(curFlag.equalsIgnoreCase("help")) {
/* Print the Help Menu */
printManual();
System.exit(0);
}
/* If the first flag starts correctly, with a
"-" */
else if(curFlag.charAt(0) == '-') {
/* If it's a directory flag */
if
(curFlag.equalsIgnoreCase("-d")) {
/* Try getting it off the stack */
try {
/* Take the first argument off the
stack and make it curArg */
curArg =
argsList.getFirst().toString();
/* Now remove the first argument on the
top of the stack so
that next time the first argument is
pulled, it will be the
next argument on the stack */
argsList.removeFirst();
}
/* If the flag was thrown and no argument
was supplied, catch it
*/
catch (Exception e) {
/* Print out an error letting the user
know what's going on */
System.out.println("\nNo directory
argument supplied");
/* Print the manual so that the user
can correct his/her error
*/
printManual();
/* Kill the program so that the user
can think about his/her
error and re-initiate the program
with valid arguments */
System.exit(0);
}
/* Temporarily storing the directory
object to check if its
actually a directory */
File tempCurDir = new File(curArg);
/* Check if the directory object is
actually a directory. If so,
make the current directory this
directory */
if (tempCurDir.isDirectory())
curDir = curArg;
/* If the current argument is not a
directory, throw error and
exit */
else {
System.out.println("\nERROR:
Directory does not exist.");
printManual();
System.exit(0);
}
}
/* If the flag is a directory flag */
else if (curFlag.equalsIgnoreCase("-f"))
{
try {
curArg =
argsList.getFirst().toString();
argsList.removeFirst();
}
catch (Exception e) {
System.out.println("\nNo file
argument supplied");
printManual();
System.exit(0);
}
/* Pull the current argument off the
stack and make it the
fileNameOrType variable so that we can
use it to search files
*/
fileNameOrTypeFilter = curArg;
}
/* If the flag is the outputFile flag */
else if
(curFlag.equalsIgnoreCase("-o")) {
try {
curArg =
argsList.getFirst().toString();
argsList.removeFirst();
}
catch (Exception e) {
System.out.println("\nNo file
argument supplied");
printManual();
System.exit(0);
}
/* Pull the current argument off the
stack and make it the
outputFileName variable so that we can
use it to output
Cheaters */
outputFileName = curArg;
}
/* If the flag is the percentMatchThreshold
flag */
else if
(curFlag.equalsIgnoreCase("-t")) {
try {
curArg =
argsList.getFirst().toString();
argsList.removeFirst();
}
catch (Exception e) {
System.out.println("\nNo file
argument supplied");
printManual();
System.exit(0);
}
/* Pull the current argument off the
stack and make it the
percentMatchThreshold variable so that
we can use it to output