Please note, due to scheduled maintenance online transactions will not be possible between 08:00 and 12:00 BST, on Sunday 17th February 2019 (03:00-07:00 EDT, 17th February, 2019). We apologise for any inconvenience
Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
Email your librarian or administrator to recommend adding this journal to your organisation's collection.
* Views captured on Cambridge Core between <date>. This data will be updated every 24 hours.
Usage data cannot currently be displayed