A Failed Text Classification

- version 1@20190401
- version 2@20190402: changed the number of categories to 2

Today I tried a text classification task. The data consists of flight-related messages, each labeled with one of 5 levels, so it is clearly a supervised problem. I thought there were already plenty of solutions for this kind of problem, so I started full of confidence. But... the result was bad: no more than about 35% accuracy for 5-class classification, only a little better than guessing.

Model	Accuracy (%)
word tf-idf with kernel SVM	32.6
word tf-idf with random forest	31.3
Naïve Bayes	35.3
word embedding (FastText) with GRU	30.3
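To put the accuracies in the table in context, it helps to compare them with two trivial baselines: guessing uniformly at random, and always predicting the most frequent class. The following is a minimal sketch; the function name `chance_baselines` and the label distribution are hypothetical and are not taken from the actual dataset.

```python
from collections import Counter


def chance_baselines(labels):
    """Return (uniform_guess_accuracy, majority_class_accuracy) for a label list."""
    counts = Counter(labels)
    n = len(labels)
    # Guessing each of the K classes with equal probability is right 1/K of the time.
    uniform = 1.0 / len(counts)
    # Always predicting the most frequent class is right as often as that class occurs.
    majority = max(counts.values()) / n
    return uniform, majority


# Hypothetical label distribution over 5 levels, NOT the real flight-message data.
toy_labels = [1] * 30 + [2] * 25 + [3] * 20 + [4] * 15 + [5] * 10
uniform, majority = chance_baselines(toy_labels)
print(f"uniform guess: {uniform:.1%}, majority class: {majority:.1%}")
```

If the real labels are imbalanced, the majority-class baseline is the fairer reference point: a model at 35% is well above a 20% uniform guess, but may be close to what always predicting the largest class would score.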

The details can be found in the Google Colaboratory notebook.

I am considering the following reasons for this failure:

The model overfits easily. For example, while the GRU model’s training loss kept decreasing, the validation loss decreased at first, but after about 40 epochs it started to increase or jump up and down.
I used a pretrained embedding model (FastText) trained on Wikipedia, but the dataset is from civil aviation, so the dataset’s vocabulary and the pretrained word vectors may be far apart.
The data source may lack significant or clear rules for classifying the messages into 5 categories. If we just label them as binary, “attention/no worry”, the result should be better.
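The first two hypotheses above can be checked with a few lines of code. The sketch below shows patience-based early stopping on a validation-loss curve, and an out-of-vocabulary rate for a tokenized message against an embedding vocabulary. The function names, the loss values, the sample message, and the vocabulary are all hypothetical; the second check also assumes a plain word-to-vector lookup without subword fallback.

```python
def best_epoch(val_losses, patience=5):
    """Patience-based early stopping over a list of validation losses.

    Returns (stop_epoch, best_epoch), both 0-indexed. Training would stop at
    stop_epoch and the weights from best_epoch would be kept.
    """
    best_loss = float("inf")
    best_idx = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_idx
    # Patience never ran out: training went through every epoch.
    return len(val_losses) - 1, best_idx


def oov_rate(tokens, embedding_vocab):
    """Fraction of tokens that have no entry in the pretrained vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in embedding_vocab)
    return missing / len(tokens)


# Hypothetical validation-loss curve that improves, then drifts upward.
val_losses = [1.60, 1.52, 1.47, 1.45, 1.46, 1.49, 1.48, 1.53]
stop, best = best_epoch(val_losses, patience=3)
print(f"stop at epoch {stop}, keep weights from epoch {best}")

# Hypothetical aviation message and a toy general-domain vocabulary.
tokens = "flight delayed due to atc slot restriction at etops alternate".split()
vocab = {"flight", "delayed", "due", "to", "slot", "restriction", "at", "alternate"}
print(f"OOV rate: {oov_rate(tokens, vocab):.0%}")
```

A high out-of-vocabulary rate on the real data would support the domain-mismatch hypothesis. A best epoch far earlier than the last one would support the overfitting hypothesis.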

@20190402: I changed the number of categories from 5 to 2, hoping the result would be better. But NO improvement: only 63.5% accuracy for 2 categories.

NEO_AKSA
