[NLP] Project Review: Text Difficulty Classification

Some lessons learned while implementing Machine Learning Algorithms

Che-Jui Huang
7 min read · Oct 25, 2022

Background

The goal of the project is to predict text difficulty from a given dataset. Unfortunately, I will not be able to share many details about the data source or the code our team implemented. However, I still believe there are a few lessons worth noting, and I would like to document them in this article. Overall, the project has two parts: Text Classification and Topic Modeling. For the text classification task, our team tried a traditional machine learning method, a standard deep learning method, and the state-of-the-art BERT model. For Topic Modeling, we experimented with the LDA, NMF, and GSDMM algorithms. Here is the structure of this article:

  1. Data Processing
  2. Traditional ML Approach
  3. Standard Deep Learning Approach
  4. State-of-the-Art BERT
  5. Topic Modeling

Problem Definition

Many occasions require simplification of complex English texts to make them comprehensible to audiences like children, students, adults with learning/reading difficulties, and non-native speakers. There is an apparent need to recognize such complex sentences before simplification can be carried out. We wish to automate this process via a classification model that flags such instances.

Moreover, we believe aiming for higher precision is more important than higher recall. A higher-recall model flags more suspected complex sentences, but many of them turn out to be false positives that require human review. This means our goal of automating the process becomes questionable when lots of human judgment is still required.

Data Processing


Data noise reduction is probably easier with numerical datasets, since anyone can manipulate the numbers and find a quantitative method to exclude outliers. Text data, on the other hand, is harder to comprehend, and it is not obvious which documents count as outliers. However, we came up with a few assumptions that helped us proceed with noise reduction.

  1. We believed that longer texts contain more content and can easily become complex
  2. We believed that more punctuation marks lead to more complex documents
  3. If we could identify document topics, some topics would generally be harder to understand for reasons such as domain-specific vocabulary. For example, Science versus Sports: a Sports article is probably a lot more comprehensible than a Science article

Essentially, the bullet points above are document descriptions. We believed that deriving features from these assumptions could raise our precision score. These descriptions become new features that anyone could apply to their own projects. In short, THEY DO MAKE SENSE!! Guided by our assumptions, we started our experiments with the modeling approaches mentioned above.
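To make this concrete, here is a minimal sketch of how such descriptive features could be derived with pandas. The column names and the two example sentences are placeholders of my own; the actual feature set and data cannot be shared.

```python
import pandas as pd

# Hypothetical toy data -- the real dataset and schema are not shared.
df = pd.DataFrame({
    "text": [
        "The cat sat on the mat.",
        "Sebaceous glands, which are microscopic, lubricate the skin; consequently, they matter.",
    ]
})

# Assumption 1: longer documents tend to be more complex.
df["n_tokens"] = df["text"].str.split().str.len()

# Assumption 2: more punctuation tends to signal more complex structure.
df["n_punct"] = df["text"].str.count(r"[^\w\s]")

# These descriptive features can later be concatenated with the usual
# bag-of-words or embedding inputs.
print(df[["n_tokens", "n_punct"]])
```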

Modeling

Traditional Machine Learning Methods

After data processing, we transformed the documents into document term-frequency matrices. The first thing we tried was the popular Naive Bayes classifiers. These algorithms are super fast and efficient in both training and predicting on new documents. However, they rely on strong (naive) independence assumptions, which often leads to a less generalized model. Besides Naive Bayes, we also tried other algorithms from scikit-learn. Ultimately, we selected Logistic Regression, which gave us relatively good results and was quite efficient to train and predict with. While tuning the hyperparameters (a minimal sketch of the pipeline follows the list below), we observed a few interesting insights…

  1. Increasing the vocabulary size of the document term-frequency matrix (max_features) did not improve overall accuracy beyond 20,000 words.
  2. Stop words seemed to be an important indicator for our task.
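For illustration, here is a minimal scikit-learn sketch of this kind of pipeline, with the vocabulary capped around 20,000 terms and stop words kept (since they turned out to be informative). The toy sentences and labels are placeholders, not our actual dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: 1 = complex, 0 = simple.
X_train = ["Sebaceous glands lubricate the skin.", "The cat sat on the mat."]
y_train = [1, 0]

# Keep stop words (stop_words=None) and cap the vocabulary at 20,000 terms,
# past which accuracy plateaued in our experiments.
clf = make_pipeline(
    CountVectorizer(max_features=20_000, stop_words=None),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(clf.predict(["Consequently, the nuclear reactor was decommissioned."]))
```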

Standard Deep Learning Approach

Unlike the traditional machine learning approach, we further explored deep learning architectures. These included GloVe embeddings and Word2Vec embeddings followed by LSTM layers. Before the emergence of BERT, using pre-trained embeddings for text classification tasks was quite popular. Unfortunately, the deep learning approach did not significantly improve our overall accuracy. The good news was that, using SHAP values, we discovered that our perception of this classification task might be incorrect.

For instance, the SHAP values suggested that some difficult words contributed more toward a sentence being classified as simple. Such words included “sebaceous, gland, microscopic, lubricate…”

This discovery suggested that individual words/terms may not be the key to our problem!
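For reference, a stripped-down Keras sketch of the kind of frozen-embedding + LSTM classifier described above might look like the following. The vocabulary size, sequence length, and the zero-filled GloVe matrix are placeholders; the real embedding matrix would be built from the pre-trained vectors and our tokenizer.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes -- they depend on the tokenizer fitted on the (unshared) corpus.
vocab_size, embed_dim, max_len = 20_000, 100, 128
glove_matrix = np.zeros((vocab_size, embed_dim))  # rows would be filled from GloVe vectors

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
        trainable=False),                              # frozen pre-trained embeddings
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # binary: simple vs. complex
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```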

State-of-the-Art BERT

We decided to take advantage of the state-of-the-art architecture, BERT. We chose BERT because its pre-trained weights were trained on BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables, and headers), which closely resembles the kind of English documents in our dataset.

With all the insights gained from the previous experiments, we altered a few features. Besides the standard BERT inputs, WordPiece tokens and attention masks, we incorporated numerical document properties. Unfortunately, I cannot share the exact inputs, but I can say that these numerical properties improved model convergence and stabilized the BERT training process.
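As a rough illustration of the idea (not our exact architecture), the sketch below concatenates BERT’s [CLS] representation with a vector of numerical document properties before the classification layer. The feature count, class count, and checkpoint name are assumptions on my part.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithDocFeatures(nn.Module):
    """Hypothetical sketch: fuse BERT's [CLS] output with numerical
    document properties (e.g. length, punctuation counts)."""

    def __init__(self, n_numeric: int, n_classes: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden + n_numeric, n_classes)

    def forward(self, input_ids, attention_mask, numeric_feats):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]           # [CLS] token representation
        combined = torch.cat([pooled, numeric_feats], dim=-1)
        return self.classifier(combined)               # classification logits
```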

After some investigation and altering the model architecture, we hoped to explain our predictions. Unfortunately, we realized that whether attention weights are helpful in text classification tasks is questionable. One research paper suggested that:

  1. Looking at attention weight distributions can be misleading
  2. There is a risk of mistakenly believing that a small number of representations are responsible for the decision
  3. There is a risk of mistakenly believing that some tokens are more important than others that are actually more influential to the model’s predictions
  4. Depending on the model structure preceding the attention layer, attention weights might be much worse at describing token importance

Regardless, we attempted the Local Interpretable Model-agnostic Explanations (LIME) method for BERT. Frankly speaking, the results were hard to interpret, but we figured out that the model did consider stop words as informative features.
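For context, applying LIME to a text classifier typically looks like the sketch below. The `predict_proba` wrapper here is a hypothetical placeholder; in practice it would tokenize the raw strings and return class probabilities from the fine-tuned BERT model.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder: a real wrapper would run the fine-tuned BERT model
    # and return an (n_texts, 2) array of class probabilities.
    return np.tile([0.5, 0.5], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["simple", "complex"])
explanation = explainer.explain_instance(
    "The sebaceous gland lubricates the skin.",
    predict_proba,
    num_features=10,
)
print(explanation.as_list())  # (token, weight) pairs, stop words included
```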


Conclusion

We believe that more exploration of the dataset is necessary. For this project, we processed the data and built our models purely from assumptions and hypotheses. This could introduce bias, especially since we are not native English speakers. Furthermore, the state-of-the-art model, BERT, does have an advantage in NLP classification tasks. Nonetheless, debates remain about BERT’s interpretability, and it is still uncertain how informative attention weights are for text classification.

Topic Modeling

The motivation behind the topic modeling approach was to test whether our assumption №3 held: some topics are by nature more complicated. Under the assumption that some topics are genuinely more difficult than others, we believed that topic assignments could provide valuable insight as a new feature for the supervised learning task. Just as with the supervised learning tasks, we defined a few steps to accomplish our goal.

Data noise reduction

We reduced the documents by keeping only the words/terms identified as [NOUN, PRONOUN, VERB, ADJECTIVE]. We proposed that such reduction would benefit topic modeling because some POS tags carry more information than others. Part-of-speech tags are expected to help the algorithm find coherent topics through words. For example, “Consequently (ADV)” does not provide topic-related information compared to “Nuclear (NOUN)”.
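A minimal sketch of this POS-based reduction, using spaCy and its Universal Dependencies tag names (NOUN, PRON, VERB, ADJ) as a stand-in for the categories listed above (an assumption, since our exact tag set and library are not shared), could look like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Universal Dependencies tags corresponding to the POS categories we kept.
KEEP = {"NOUN", "PRON", "VERB", "ADJ"}

def reduce_document(text: str) -> str:
    """Keep only tokens whose POS tag is in KEEP."""
    doc = nlp(text)
    return " ".join(tok.text for tok in doc if tok.pos_ in KEEP)

# Drops function words such as 'Consequently' (ADV) and 'the' (DET),
# keeping topic-bearing content words like 'nuclear' and 'reactor'.
print(reduce_document("Consequently, the nuclear reactor was shut down."))
```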

Selecting The Optimal Number of Topics

The most challenging part of topic modeling must be determining the number of topic clusters k. Ultimately, we decided to use the coherence score to choose the optimal number of topics for LDA, NMF, and GSDMM. The coherence score was effective in helping us narrow the range of topic numbers to search. However, the trade-off was that it required us to repeatedly build a model for every candidate k, which is computationally expensive and time-consuming.
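As an illustration for LDA, the gensim sketch below rebuilds a model for each candidate k and scores it with the c_v coherence measure. The toy token lists are placeholders, and the coherence variant and k range shown here are simplifications of what we actually searched.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Placeholder token lists (after the POS reduction step).
tokenized_docs = [
    ["nuclear", "reactor", "energy", "plant"],
    ["baseball", "season", "championship", "team"],
    ["nuclear", "energy", "policy", "plant"],
    ["basketball", "season", "team", "championship"],
]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

scores = {}
for k in range(15, 21):  # rebuild a model for every candidate k -- the expensive part
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```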

Human Judgement

After reviewing the coherence scores per model, we targeted k from 15 to 20 for LDA, 15 to 30 for NMF, and 15 to 20 for GSDMM. We evaluated the results using word clouds, checking whether the top words per algorithm per cluster were indeed coherent. For example, a coherent topic could be “[Sports, baseball, basketball, season, championship]”, whereas a poor topic would be “[Human, tropical, weather, computer, internet]”.

As a result, GSDMM seemed to perform the best among the three algorithms. GSDMM, a variant of LDA, had a definite upper hand in terms of results. Under the hood, GSDMM follows the Movie Group Process, a procedure that optimizes cluster completeness and homogeneity. More importantly, our documents were short, and GSDMM is well known for performing well on shorter texts. Before making the final call, we carried out a further evaluation of GSDMM: (1) computing silhouette scores with K-means, and (2) checking whether well-formed clusters appeared via visualization methods like t-SNE or UMAP, as we would expect if the topics were as coherent as GSDMM suggested.

To summarize, we did see that GSDMM formed better clusters, and each cluster was overall well separated. Meanwhile, the average silhouette scores ranged between 0.8 and 0.95. (We ran a bootstrapped cluster evaluation.)
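A simplified sketch of that bootstrapped silhouette evaluation with K-means is shown below. The document vectors are random placeholders, and k = 18 is just an example value within the range we targeted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(500, 50))  # placeholder document representations
k = 18                                    # example topic count from the targeted range

scores = []
for _ in range(20):                       # bootstrapped cluster evaluation
    idx = rng.choice(len(doc_vectors), size=len(doc_vectors), replace=True)
    sample = doc_vectors[idx]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
    scores.append(silhouette_score(sample, labels))

print(np.mean(scores), np.std(scores))    # average silhouette across resamples
```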

Final words

Nonetheless, there are limitations to our study. Our data reduction strategy can be biased by our limited understanding of the English language. Furthermore, regarding the selection of K topics, the coherence scores could peak in other ranges above K = 100; such ranges were not explored in our project. These areas can be investigated further in the future.

