Record: oai:ARNO:634486

Author: T. Heijligers
Title: Statistical Lexical Analysis
Supervisor: Vadim Zaytsev
Faculty: Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Programme: Software Engineering, MSc
Keywords: Lexer; conditional random fields
Abstract: How does a statistical lexer, built with Conditional Random Fields, a sophisticated Natural Language Processing (NLP) algorithm used for word segmentation and POS tagging, compare to a deterministic one? Can it be used for determining source lines of code, for multilingual lexing, or as a language detection tool?

Segmenting code into tokens works best when the model learns whether a character is a token beginning, token internal, token end, or a single-character token, as opposed to only learning whether a character is a token beginning or not. Code is labeled using character 1-gram, 2-gram and 3-gram fragments as features. Lexing can be done by performing segmentation and labeling at the same time (1-step method) or by first segmenting and then labeling (2-step method). The 1-step method is much more effective.
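The labeling scheme described above can be sketched as follows. This is a hypothetical illustration, not the thesis's implementation: each character of a token receives B (beginning), I (internal), E (end) or S (single-character token), and character n-gram fragments around a position serve as CRF features. The feature naming convention is an assumption for the example.

```python
def bies_labels(tokens):
    """Label every character of every token with B/I/E/S."""
    labels = []
    for tok in tokens:
        if len(tok) == 1:
            labels.append("S")  # single-character token
        else:
            # first char = B, middle chars = I, last char = E
            labels.extend(["B"] + ["I"] * (len(tok) - 2) + ["E"])
    return labels

def ngram_features(text, i, n_max=3):
    """Character 1-, 2- and 3-gram fragments covering position i."""
    feats = {}
    for n in range(1, n_max + 1):
        for off in range(-n + 1, 1):  # every n-gram that includes i
            start = i + off
            if 0 <= start and start + n <= len(text):
                feats[f"{n}g[{off}]"] = text[start:start + n]
    return feats

tokens = ["x", "=", "42"]
print(bies_labels(tokens))          # ['S', 'S', 'B', 'E']
print(ngram_features("x=42", 2))    # fragments around the character '4'
```

In a real pipeline these per-character feature dictionaries would be fed to a CRF library (e.g. CRF++ or sklearn-crfsuite) to train the joint segmentation-and-labeling model.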

A monolingual statistical lexer for Python, Java, HTML, Javascript, CSS and Rascal, trained on up to 50 files, achieves an overall f1-score between 0.89 and 0.99. When the statistical lexer is used to determine source lines of code, however, the results are disappointing: the lexer is often confused by string literals, variable names and comments.

A multilingual statistical lexer for HTML, Javascript and CSS has low f1-scores; it is not capable of recognising which languages are present in the input. Further investigation could improve those results.
Document type: master's thesis