Using word2vec to Analyze News Headlines and Predict Article Success


Charlene Chambliss

Senior Software Engineer

As part of my effort to learn in public early in my data science journey, I wrote this article about an end-to-end analysis of a dataset of news headlines. (Apologies: I can no longer find the original dataset, but I got it from the UCI ML Repository.)

The article includes:

  • Preprocessing/cleaning the text data, using NLTK
  • Using word2vec to create word and title embeddings, then visualizing them as clusters using t-SNE
  • Visualizing the relationship between title sentiment and article popularity
  • Attempting to predict article popularity from the embeddings and other available features, using XGBoost (gradient-boosted trees)
  • Using model stacking (ensembling) to improve the performance of the popularity model (this step was not successful, but was still a valuable experiment!)
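
To give a feel for the "title embeddings" step in the list above, here is a minimal sketch of one common approach: averaging the word vectors of the words in a title. This summary doesn't say how the title embeddings were actually built, so the averaging strategy is an assumption. The tiny `TOY_VECTORS` table, the `tokenize` helper, and the `title_embedding` function are all hypothetical stand-ins; in a real analysis the vectors would come from a trained word2vec model and the cleaning would be done with NLTK.

```python
import re

# Hypothetical toy word vectors (3 dimensions), standing in for the output
# of a trained word2vec model. Real embeddings usually have 100+ dimensions.
TOY_VECTORS = {
    "stocks": [1.0, 0.0, 0.5],
    "rally": [0.5, 0.5, 0.0],
    "economy": [0.0, 1.0, 0.5],
}


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens.

    A simplified stand-in for a fuller preprocessing pipeline.
    """
    return re.findall(r"[a-z0-9]+", title.lower())


def title_embedding(title, vectors):
    """Average the vectors of all in-vocabulary words in the title.

    Words missing from the vocabulary are skipped. If no word is known,
    a zero vector of the right size is returned.
    """
    dim = len(next(iter(vectors.values())))
    known = [vectors[tok] for tok in tokenize(title) if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    print(title_embedding("Stocks rally!", TOY_VECTORS))
```

Averaging is simple and gives every title a fixed-length vector no matter how many words it has, which makes the result easy to feed into t-SNE or a tree-based model. The trade-off is that it ignores word order.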

The full text of the article (with code snippets and a link to the Jupyter Notebook) is here.