Ambitious sophomore advances research in code stylometry
The Department welcomes comments, suggestions and corrections. Send email to editor [at] cs [dot] umd [dot] edu.
Some freshmen spend their summers lounging around the pool or marathoning TV shows--not Andrew Liu.
The ambitious UMD sophomore Computer Science major spent his previous summer at the United States Army Research Laboratory (ARL), where he used data mining techniques to further research in the field of code stylometry.
In spring 2014, Liu applied to ARL and was matched with Aylin Caliskan-Islam, a doctoral student at Drexel University studying computer science privacy issues. Under ARL’s Open Campus initiative, Caliskan-Islam had the opportunity to work in world class research facilities and collaborate with talented scientists and engineers. Together, Liu and Caliskan-Islam took on the project to find patterns and relationships in code stylometry.
What is stylometry? In general, it is the study of linguistic style. Linguists use stylometry techniques to analyze text and detect an author’s distinct writing style. Code stylometry is the same thing except applied to source code.
One motivation behind this field of research is to identify cases of copyleft and copyright. With the prevalence of open source software, plagiarism is becoming a bigger issue. Another motivation is the ability to identify authors of malware.
“If we have access to samples of code that known malware writers write, we can use this technology to identify the author,” said Liu.
Data Mining Techniques
For their project, Liu and Caliskan-Islam took data sets from Google Code Jam, one of the top programming competitions in the world. They analyzed code from 250 different programmers and used an average of 650 lines of code per author. In order to guarantee uniformity, they limited their data to C++ code.
The team relied on data mining techniques to find patterns and relationships in the source code. They first took attributes from each sample and analyzed them. Attributes are features like loop type, line length, number of functions, and amount of white space. After analyzing many pieces of code, they then used algorithms called classifiers to process the data.
“We used a random forest classifier,” says Liu. “It picks random attributes and processes them. When data is too specific and the results follow the data too narrowly, this can lead to broad fallacious assumptions; this is called overfitting. We introduced a factor of randomness to prevent this from happening.”
How are ASTs different normal linear programs? Linear programs provide information about lexical attributes. While lexical features are easier to visualize (white space, line length, etc.), they are also easier to obfuscate. If malware writers and plagiarizers can easily obfuscate their code by chopping up their lines, it becomes harder to identify the authors.
In contrast, it is much more difficult to change AST attributes. AST attributes are less surface-level and give information about the internal structure of the code. They ultimately provide a more in-depth analysis of an author’s coding style.
After running the source code, translating them into ASTs, checking attributes, converting them into data files, and using a data mining suite called WEKA to execute algorithms, Liu and Caliskan-Islam discovered that they could predict the author of a piece of source code with 95% accuracy.
This is a really good number, but there are two important underlying assumptions. Firstly, the team had a set of authors to guess from. Secondly, they already compiled data for each author. For real world implications, this technique only works for guessing within a set of data with known authors.
Even still, the team found interesting results. Based on statistical analysis, they discovered that the “good programmers write less code” idea is not necessarily true. For the Google Code Jam, some of the more efficient programmers wrote more lines of code.
One future step for this research is to use control flow graphs instead of ASTs. Just like how ASTs provide more in-depth attributes than linear programs, control flow graphs provide more details and attributes than ASTs.
“We also want to extend this research to different languages,” Liu said. “It would be cool to test not just C++, but also maybe Python or Java.”
Most importantly, the team wants to apply the same methods to binary code. Malware is written in the form of binaries, and it is often difficult to decompile binary into source code. If researchers can detect binary code styles, it will have more useful applications.
Caliskan-Islam says that working with Liu has been a great experience. "I introduced him to machine learning and feature extraction as the lead of the project," she says. "He was [then] able to understand in-depth machine learning concepts quickly. This is a significant achievement for a rising sophomore."
Liu’s personal next steps are to learn more about cybersecurity through the ACES program, participate in more hackathons, and complete Google’s Software Engineering Internship this upcoming summer. One of the greatest things he realized from his ARL experience is that the research world is constantly moving and building off prior research.
“I also realized that data is becoming more and more important,” Liu says. “Scientists don’t have enough background knowledge to prove everything. When the data set is very large, it can provide very useful conclusions. I think that data [science] will be the future of computer science.”