Description: In this paper, we systematically explore feature definition and selection strategies for sentiment polarity classification. We begin by exploring basic questions, such as whether to use stemming, term-frequency versus binary weighting, negation-enriched features, and n-grams or phrases. We then move on to more complex aspects, including feature selection using frequency-based vocabulary trimming, part-of-speech and lexicon selection (three types of lexicons), and expected Mutual Information (MI). Using three product and movie review datasets of various sizes, we show, for example, that some techniques are more beneficial for larger datasets than for smaller ones. A classifier trained on only a few features ranked high by MI outperformed one trained on all features on the large datasets, yet on the small dataset this did not hold. Finally, we perform a space and computation cost analysis to further understand the merits of the various feature types.
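As an illustration of the MI-based selection mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes binary (presence/absence) term features and the standard expected mutual information between a term's presence and the class label, measured in bits. The function names `expected_mutual_information` and `top_k_terms` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def expected_mutual_information(docs, labels):
    """Return {term: MI(term presence; class label)} in bits.

    docs   -- list of token lists, one per document (assumed pre-tokenized)
    labels -- list of class labels, aligned with docs
    """
    n = len(docs)
    label_counts = Counter(labels)

    # present[(term, label)] = number of documents with that label containing the term
    present = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # binary weighting: count each term once per document
            present[(term, label)] += 1

    vocab = {term for term, _ in present}
    scores = {}
    for term in vocab:
        df = sum(present[(term, c)] for c in label_counts)  # document frequency
        mi = 0.0
        for c, n_c in label_counts.items():
            n_with = present[(term, c)]   # docs of class c containing the term
            n_without = n_c - n_with      # docs of class c lacking the term
            # Sum P(t, c) * log2(P(t, c) / (P(t) * P(c))) over term present/absent.
            for joint, n_t in ((n_with, df), (n_without, n - df)):
                if joint > 0:  # zero-count cells contribute nothing
                    mi += (joint / n) * math.log2(joint * n / (n_t * n_c))
        scores[term] = mi
    return scores


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest MI (ties broken alphabetically)."""
    scores = expected_mutual_information(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

A classifier would then be trained on only the terms returned by `top_k_terms`. How large `k` should be relative to the vocabulary is an open choice in this sketch.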
File list (Check if you may need any files):
2808-14159-1-PB.pdf