Results obtained by transformer-based token classification models are now considered a benchmark for the Automatic Terminology Extraction (ATE) task. However, the unsatisfactory results (they rarely exceed an F1 score of 0.7) raise the question of whether this approach is appropriate and which text features a model trained on this type of annotation actually memorizes or infers. In this paper, we describe a series of experiments with a fine-tuned RoBERTa base model on the ACTER dataset, RD-TEC, and three Wikipedia articles, which demonstrate that the results such models achieve on the ATE task depend considerably on the type of texts being processed and their relationship to the training data. While the results are relatively good for some texts with highly specialized vocabulary, poor results seem to correlate with domain terms whose constituent tokens are frequent in general English texts. Another property that affects the results is the degree of overlap between the vocabulary of the test data and the vocabulary of terms in the training data: words that were labeled as terms in the training data tend to be labeled as terms in other, unrelated domains as well. Moreover, we show that the results obtained by these models are unstable: models trained on more data do not recover all the items identified by models trained on a smaller dataset and can exhibit substantially lower performance.
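
To make the setup concrete, the sketch below illustrates how ATE is typically framed as token classification with a RoBERTa base model using the Hugging Face Transformers library. It is a minimal illustration, not the exact configuration used in our experiments: the BIO label scheme, the example sentence, and the untrained classification head are assumptions for demonstration purposes only.

```python
# Minimal sketch (assumed setup, not the authors' exact pipeline) of ATE as
# token classification with roberta-base and Hugging Face Transformers.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Assumed BIO-style tag set: tokens inside terms vs. ordinary tokens.
labels = ["O", "B-TERM", "I-TERM"]
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id
)

# Toy sentence (hypothetical example); after fine-tuning on annotated data,
# sub-tokens belonging to domain terms should receive B-/I-TERM tags.
words = ["Heart", "failure", "is", "a", "clinical", "syndrome", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, sequence_length, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Map sub-token predictions back to words, taking each word's first sub-token.
# Note: the classification head is randomly initialized here, so predictions
# are meaningless until the model has been fine-tuned on term-annotated text.
word_ids = enc.word_ids()
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    print(f"{words[wid]:>10}  {id2label[pred_ids[idx]]}")
```

In this framing, term extraction reduces to predicting a tag for every sub-token and then reassembling tagged spans into candidate terms, which is why the behavior of the model is so sensitive to how term-internal tokens are distributed in the training and test data.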