A Failed Text Classification

  • — version 1@20190401
  • –version 2@20190402: change to category to 2

Today I tried a text classification task where the data is about the message on the flights and labeled into 5 levels. Observably, it is a supervised problem. And I though there are bunch of solutions already for this kind of problem. So that I started with full of confidence. But…. the result was so bad, no more than 35% accurate for 5 classification. Only a little bit better than guess.

ModelAcc%
word tf-idf with kernal SVM32.6
word tf-idf with random forest31.3
Naïve Bayes35.3
word embedding(FastText) with GRU30.3

The detail can be found in Google Colaboratory.

I am considering following reasons leading this failure:

  1. The module is easy to over fitting. For example, when GRU model’s training loss decreasing, the invalidation loss was decreasing in the beginning, but after 40 epochs, it started to increasing or jumped up/down.
  2. Since I used trained embedding model(FastText) which is based on wiki but the dataset is in civil aviation. The words and word vectors may far away.
  3. In the data source, there might be lack of significant or clear rules to classify them to 5 categories. If we just label it to binary “attention/no worry”. The result will be better.

@20190402: I change the category from 5 to 2, hopefully the result would be better. But NO improvement. Only 63.5% for 2 categories.

Leave a comment

Your email address will not be published. Required fields are marked *