Word embeddings are an efficient way to represent the content of words, as well as the latent information contained in a document (a collection of words). Using a dataset of news article titles, which includes features for source, sentiment, topic, and popularity (number of shares), I set out to see whether we could understand the relationships between articles through their respective embeddings.
The objectives of the project are:
- Preprocess/clean the text data using NLTK
- Create word and title embeddings with word2vec, then visualize them as clusters using t-SNE
- Visualize the relationship between headline sentiment and article popularity
- Try to predict article popularity from the embeddings and other available features
- Use model stacking to improve the performance of the popularity model (this step was not successful, but it was still a valuable experiment!)
The entire notebook is hosted here using Nbviewer.
Imports and Preprocessing
We'll start with the imports:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
Then read in the data:
main_data = pd.read_csv('News_Final.csv')
main_data.head()
# Grab all the titles
article_titles = main_data['Title']
# Create a list of strings, one for each title
titles_list = [title for title in article_titles]
# Collapse the list of strings into a single long string for processing
big_title_string = ' '.join(titles_list)
from nltk.tokenize import word_tokenize
# Tokenize the string into words
tokens = word_tokenize(big_title_string)
# Remove non-alphabetic tokens, such as punctuation
words = [word.lower() for word in tokens if word.isalpha()]
# Filter out stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if not word in stop_words]
# Print first 10 words
words[:10]
Next, we need to load the pre-trained word2vec model. You can find several such models here. Since this is a news dataset, I am using the Google News model, which was trained on about 100 billion words (wow).
# Load word2vec model (trained on an enormous Google corpus)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)
# Check dimension of word vectors
model.vector_size
So the model will generate 300-dimensional word vectors, and all we have to do to create a vector is pass the model a word. Each vector looks like this:
economy_vec = model['economy']
economy_vec[:20] # First 20 components
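As a quick optional sanity check (not part of the original pipeline), we can also ask gensim for a word's nearest neighbors by cosine similarity, which gives a feel for the relationships the embedding space encodes:
# Nearest neighbors of 'economy' in the embedding space, ranked by cosine similarity
# (the exact neighbors depend on the pre-trained Google News vectors)
model.most_similar('economy', topn=5)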
word2vec (understandably) cannot create a vector for a word that is not in its vocabulary. So we need to specify "if word in model.vocab" when building the full list of word vectors.
# Filter the list of vectors to include only those that Word2Vec has a vector for
vector_list = [model[word] for word in words if word in model.vocab]
# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model.vocab]
# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)
# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
df.head(3)
Dimensionality Reduction with t-SNE
Next, we will use t-SNE to compress these word vectors (read: perform dimensionality reduction) and see whether any patterns emerge. If you aren't familiar with t-SNE and how to interpret it, check out this excellent interactive distill.pub article about t-SNE.
The choice of t-SNE parameters matters, because different values can produce very different results. I tested several perplexity values between 0 and 100 and found that each produced roughly the same shape. I also tested several learning rates between 20 and 400 and decided to keep the learning rate at its default (200).
For visibility (and processing time), I used 400 word vectors instead of about 20,000.
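If you'd like to reproduce that kind of parameter check yourself, a minimal sketch of a perplexity sweep might look like the following (the specific values here are illustrative, not a record of exactly what I ran):
from sklearn.manifold import TSNE

# Fit t-SNE at a few perplexity values on the same 400 word vectors and
# compare the resulting shapes by eye
for perplexity in [10, 30, 60, 100]:
    projection = TSNE(n_components = 2, init = 'random', random_state = 10,
                      perplexity = perplexity).fit_transform(df[:400])
    plt.scatter(projection[:, 0], projection[:, 1], alpha = 0.5)
    plt.title('Perplexity = {}'.format(perplexity))
    plt.show()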
from sklearn.manifold import TSNE
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
# Use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(df[:400])
Now we are ready to plot the reduced array of word vectors. I used adjust_text to intelligently separate the text labels for readability:
sns.set()
# Initialize figure
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)
# Import adjustText, initialize list of texts
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))
# Append words to list
for word in words_to_plot:
texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word], fontsize = 14))
# Plot text using adjust_text (because overlapping text is hard to read)
adjust_text(texts, force_points = 0.4, force_text = 0.4,
expand_points = (2,1), expand_text = (1,2),
arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
If you are interested in trying adjust_text for your own plotting needs, you can find it here. Be sure to import it using the camelcase name adjustText, and note that adjustText is currently not compatible with matplotlib 3.0 or higher.
Encouragingly, even with the vector embeddings reduced to 2 dimensions, we see certain items clustering together. For example, we have months in the left/upper-left corner, corporate finance terms near the bottom, and more generic, non-topical words (such as "complete", "real", "swing") in the middle.
Note that if we run t-SNE again with different parameters, we may find that the results have some similarities, but we can't guarantee to see the exact same pattern. t-SNE is not deterministic. Relatedly, the tightness of the clusters and the distance between the clusters are not always meaningful. It is primarily used as an exploratory tool rather than a decisive indicator of similarity.
Averaging Word Embeddings
We have seen how word embeddings can be applied to this dataset. Now we can move on to a more interesting ML application: finding titles that cluster together and seeing what patterns emerge.
Rather than using Doc2Vec, which has no pre-trained model available and would therefore require a longer training process, we can use a simpler (and sometimes even more effective) trick: averaging the embeddings of the word vectors in each document. In our case, a document refers to a title.
We will need to redo the preprocessing steps while keeping the titles intact; as we will see, this is a bit more involved than splitting out individual words. Thankfully, Dimitris Spathis has created a series of functions that I found work perfectly for this precise use case. Thanks, Dimitris!
def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)
# Our earlier preprocessing was done when we were dealing only with word vectors
# Here, we need each document to remain a document
def preprocess(text):
text = text.lower()
doc = word_tokenize(text)
doc = [word for word in doc if word not in stop_words]
doc = [word for word in doc if word.isalpha()]
return doc
# Function that will help us drop documents that have no word vectors in word2vec
def has_vector_representation(word2vec_model, doc):
"""check if at least one word of the document is in the
word2vec dictionary"""
return not all(word not in word2vec_model.vocab for word in doc)
# Filter out documents
def filter_docs(corpus, texts, condition_on_doc):
"""
Filter corpus and texts given the function condition_on_doc which takes a doc. The document doc is kept if condition_on_doc(doc) is true.
"""
number_of_docs = len(corpus)
if texts is not None:
texts = [text for (text, doc) in zip(texts, corpus)
if condition_on_doc(doc)]
corpus = [doc for doc in corpus if condition_on_doc(doc)]
print("{} docs removed".format(number_of_docs - len(corpus)))
return (corpus, texts)
Now we will use them for processing:
# Preprocess the corpus
corpus = [preprocess(title) for title in titles_list]
# Remove docs that don't include any words in W2V's vocab
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))
# Filter out any empty docs
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))
x = []
for doc in corpus: # append the vector for each document
x.append(document_vector(model, doc))
X = np.array(x) # list to array
t-SNE, Round 2: Document Vectors
Now that we have successfully created our array of document vectors, let's see whether we get similarly interesting results when plotting them with t-SNE.
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
# Again use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(X[:400])
fig, ax = plt.subplots(figsize = (14, 10))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)
from adjustText import adjust_text
texts = []
titles_to_plot = list(np.arange(0, 400, 40)) # plots every 40th title in first 400 titles
# Append words to list
for title in titles_to_plot:
texts.append(plt.text(tsne_df[title, 0], tsne_df[title, 1], titles_list[title], fontsize = 14))
# Plot text using adjust_text
adjust_text(texts, force_points = 0.4, force_text = 0.4,
expand_points = (2,1), expand_text = (1,2),
arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
Very interesting! We can see that t-SNE has collapsed the document vectors into a space where the documents are laid out according to whether their content relates to countries, world leaders, and foreign affairs, or has more to do with technology companies.
Now let's explore article popularity. The conventional wisdom is that the more sensational or clickbait-y the headline, the more likely an article is to be shared, right? Next, we'll see whether there is any evidence for that in this particular dataset.
Popularity and sentiment analysis
First, we need to drop all articles with missing popularity measurements or missing sources. A missing popularity measurement is represented in this data as -1.
# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)
main_data = main_data.drop(main_data[(main_data.Facebook == -1) |
(main_data.GooglePlus == -1) |
(main_data.LinkedIn == -1)].index)
# Also drop all rows where we don't know the source
main_data = main_data.drop(main_data[main_data['Source'].isna()].index)
main_data.shape
We still have about 81,000 articles to work with, so let's see whether we can find a relationship between sentiment and the number of shares.
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
platforms = ['Facebook', 'GooglePlus', 'LinkedIn']
colors = list(sns.husl_palette(10, h=.5)[1:4])
for platform, subplot, color in zip(platforms, subplots, colors):
sns.scatterplot(x = main_data[platform], y = main_data['SentimentTitle'], ax=subplot, color=color)
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)
plt.show()
It's a bit hard to tell whether there is any relationship here, because a few articles are significant outliers in terms of their share counts. Let's try log-transforming the x-axis to see if we can reveal any patterns. We will also use regplot, so seaborn will overlay a linear regression on each plot.
# Our data has over 80,000 rows, so let's also subsample it to make the log-transformed scatterplot easier to read
subsample = main_data.sample(5000)
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
for platform, subplot, color in zip(platforms, subplots, colors):
# Regression plot, so we can gauge the linear relationship
sns.regplot(x = np.log(subsample[platform] + 1), y = subsample['SentimentTitle'],
ax=subplot,
color=color,
# Pass an alpha value to regplot's scatterplot call
scatter_kws={'alpha':0.5})
# Set a nice title, get rid of x labels
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)
plt.show()
Contrary to what we might expect (given our intuitions about highly emotional, clickbait headlines), in this dataset we find no relationship between headline sentiment and article popularity as measured by the number of shares.
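To put a rough number on that visual impression (this check is an addition of mine, not part of the original notebook), we can compute a rank correlation between title sentiment and share counts on each platform:
# Spearman rank correlation between title sentiment and shares on each platform;
# values near 0 support the "no relationship" reading of the plots above
for platform in platforms:
    rho = main_data['SentimentTitle'].corr(main_data[platform], method='spearman')
    print('{}: Spearman rho = {:.3f}'.format(platform, rho))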
To get a better sense of the popularity distributions themselves, let's make one final plot of log(popularity) by platform.
fig, ax = plt.subplots(3, 1, figsize=(15, 10))
subplots = [a for a in ax]
for platform, subplot, color in zip(platforms, subplots, colors):
sns.distplot(np.log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True})
# Set a nice title, get rid of x labels
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of Popularity by Platform', fontsize=24)
plt.show()
As one last piece of exploration, let's look at the sentiment itself. Do the publishers seem to differ?
# Get the list of top 12 sources by number of articles
source_names = list(main_data['Source'].value_counts()[:12].index)
source_colors = list(sns.husl_palette(12, h=.5))
fig, ax = plt.subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)
ax = ax.flatten()
for ax, source, color in zip(ax, source_names, source_colors):
sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
ax=ax, color=color, kde_kws={'shade':True})
ax.set_title(source, fontsize=14)
ax.set_xlabel('')
plt.xlim(-0.75, 0.75)
plt.show()
The distributions look very similar, but it's a bit hard to say how similar they are when each is on a different plot. Let's try overlaying them all on one graph.
# Overlay each density curve on the same plot for closer comparison
fig, ax = plt.subplots(figsize=(12, 8))
for source, color in zip(source_names, source_colors):
sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
ax=ax, hist=False, label=source, color=color)
ax.set_xlabel('')
plt.xlim(-0.75, 0.75)
plt.show()
We can see that the title-sentiment distributions of the sources are very similar: it doesn't look like any one source is an outlier in terms of positive or negative titles. Instead, all 12 of the most common sources have distributions centered around 0 with moderately sized tails. But does this tell the complete story? Let's take one more look at the numbers:
# Group by Source, then get descriptive statistics for title sentiment
source_info = main_data.groupby('Source')['SentimentTitle'].describe()
# Recall that `source_names` contains the top 12 sources
# We'll also sort by highest standard deviation
source_info.loc[source_names].sort_values('std', ascending=False)[['std', 'min', 'max']]
We can see at a glance that the Wall Street Journal has the largest standard deviation, the largest range, and the lowest minimum sentiment of any of the other top sources. This suggests that the Wall Street Journal may be unusually negative in its article titles. Verifying this rigorously would require a hypothesis test, which is beyond the scope of this post, but it is an interesting potential finding and future direction.
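For the curious, such a test might be sketched as follows, comparing the spread of the Wall Street Journal's title sentiment against the other top sources with Levene's test. This was not run as part of the original analysis, and the 'WSJ' label is an assumption; substitute whatever label actually appears in source_names.
from scipy.stats import levene

# Hypothetical source label -- check source_names for the exact string
wsj_mask = main_data['Source'] == 'WSJ'
wsj_sentiment = main_data.loc[wsj_mask, 'SentimentTitle']
other_sentiment = main_data.loc[main_data['Source'].isin(source_names) & ~wsj_mask, 'SentimentTitle']

# A small p-value would suggest the difference in spread is not just noise
print(levene(wsj_sentiment, other_sentiment))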
Predicting Popularity
Our first task in preparing the data for modeling is to rejoin the document vectors with their respective titles. Fortunately, when we preprocessed the corpus, we processed corpus and titles_list in sync, so the vectors and the titles they represent still match up. Meanwhile, in main_data we have dropped all articles with -1 popularity, so we need to drop the vectors that represent the titles of those articles.
Modeling these enormous vectors as-is isn't feasible on this machine, but we will see what we can do with a bit of dimensionality reduction. I will also engineer a new feature based on the publish date: "DaysSinceEpoch", which is based on Unix time (you can read more about it here).
import datetime
# Convert publish date column to make it compatible with other datetime objects
main_data['PublishDate'] = pd.to_datetime(main_data['PublishDate'])
# The Unix epoch (January 1, 1970)
t = datetime.datetime(1970, 1, 1)
# Subtract this time from each article's publish date
main_data['TimeSinceEpoch'] = main_data['PublishDate'] - t
# Create another column for just the days from the timedelta objects
main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].astype('timedelta64[D]')
main_data['TimeSinceEpoch'].describe()
As we can see, all of these articles were published within about 250 days of one another.
from sklearn.decomposition import PCA
pca = PCA(n_components=15, random_state=10)
# as a reminder, x is the array with our 300-dimensional vectors
reduced_vecs = pca.fit_transform(x)
df_w_vectors = pd.DataFrame(reduced_vecs)
df_w_vectors['Title'] = titles_list
# Use pd.concat to match original titles with their vectors
main_w_vectors = pd.concat((df_w_vectors, main_data), axis=1)
# Get rid of vectors that couldn't be matched with the main_df
main_w_vectors.dropna(axis=0, inplace=True)
Now we need to drop the non-numeric and non-dummy columns so that we can feed the data to our models. We will also apply scaling to the DaysSinceEpoch feature, since it is much larger in magnitude than the reduced word vectors, sentiment scores, etc.
# Drop all non-numeric, non-dummy columns, for feeding into the models
cols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source']
data_only_df = pd.get_dummies(main_w_vectors, columns = ['Topic']).drop(columns=cols_to_drop)
# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Reshape so we can feed the column to the scaler
standardized_days = np.array(data_only_df['DaysSinceEpoch']).reshape(-1, 1)
data_only_df['StandardizedDays'] = scaler.fit_transform(standardized_days)
# Drop the raw column; we don't need it anymore
data_only_df.drop(columns=['DaysSinceEpoch'], inplace=True)
# Look at the new range
data_only_df['StandardizedDays'].describe()
# Get Facebook data only
fb_data_only_df = data_only_df.drop(columns=['GooglePlus', 'LinkedIn'])
# Separate the features and the response
X = fb_data_only_df.drop('Facebook', axis=1)
y = fb_data_only_df['Facebook']
from sklearn.model_selection import train_test_split

# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
Let's run XGBoost on the data without any optimization, to see how it performs out of the box.
from sklearn.metrics import mean_squared_error
# Instantiate an XGBRegressor
xgr = xgb.XGBRegressor(random_state=2)
# Fit the regressor to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
To say the least, these results are not encouraging. Can we improve the performance with hyperparameter tuning? I have adapted and repurposed a hyperparameter tuning grid from this Kaggle article.
from sklearn.model_selection import GridSearchCV
# Various hyper-parameters to tune
xgb1 = xgb.XGBRegressor()
parameters = {'nthread':[4],
'objective':['reg:linear'],
'learning_rate': [.03, 0.05, .07],
'max_depth': [5, 6, 7],
'min_child_weight': [4],
'silent': [1],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [250]}
xgb_grid = GridSearchCV(xgb1,
parameters,
cv = 2,
n_jobs = 5,
verbose=True)
xgb_grid.fit(X_train, y_train)
According to xgb_grid, our best parameters are as follows:
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
Try again with the new parameters:
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5, 'min_child_weight': 4,
'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
# Try again with new params
xgr = xgb.XGBRegressor(random_state=2, **params)
# Fit the regressor to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
That's an improvement of about 35,000, but I'm not sure that says a lot. At this point, we might conclude that the data in its current state doesn't seem to be sufficient for this model to perform well. Let's see if we can improve things with a bit more feature engineering: we'll train some classifiers to separate the articles into two groups: "duds" (0 or 1 shares) versus "not duds".
The idea is that if we can give the regressor a new feature (the probability that an article will have very few shares), it may do a better job of predicting the highly shared articles, thereby reducing the residuals on those articles and shrinking the mean squared error.
Detour: Detecting Duds
From our earlier log-transformed plots, we can see that in general there are two blocks of articles: one cluster at 0, and another cluster (the long tail) starting around 1. We can train a few classifiers to identify whether an article will be a "dud" (in the 0-1 share bin), and then use those models' predictions as features for the final regressor, which will predict popularity. This is called model stacking.
# Define a quick function that will return 1 (true) if the article has 0-1 share(s)
def dud_finder(popularity):
if popularity <= 1:
return 1
else:
return 0
# Create target column using the function
fb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].apply(dud_finder)
fb_data_only_df[['Facebook', 'is_dud']].head()
# 28% of articles can be classified as "duds"
fb_data_only_df['is_dud'].sum() / len(fb_data_only_df)
Now that we have the dud-finding function, we'll initialize the classifiers. We will use a Random Forest, an optimized XGBClassifier, and a K-Nearest Neighbors classifier. I will omit the part where I tune the XGB, since it looks exactly the same as the tuning we did before.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X = fb_data_only_df.drop(['is_dud', 'Facebook'], axis=1)
y = fb_data_only_df['is_dud']
# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Best params, produced by HP tuning
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5, 'min_child_weight': 4,
'n_estimators': 200, 'nthread': 4, 'silent': 1, 'subsample': 0.7}
# Try xgc again with new params
xgc = xgb.XGBClassifier(random_state=10, **params)
rfc = RandomForestClassifier(n_estimators=100, random_state=10)
knn = KNeighborsClassifier()
preds = {}
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'], [xgc, rfc, knn]):
model.fit(X_train, y_train)
preds[model_name] = model.predict(X_test)
Test the models and print the classification reports:
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
for k in preds:
print("{} performance:".format(k))
print()
print(classification_report(y_test, preds[k]), sep='\n')
The best f1-score comes from the XGBClassifier, followed by the Random Forest, and finally KNN. However, we can also see that KNN does the best job on recall (successfully identifying duds). This is why model stacking is valuable: sometimes even an excellent model like XGBoost will underperform on a task like this, where evidently the function to be identified can be approximated well locally. Including KNN's predictions should add some much-needed diversity.
# Plot ROC curves
for model in [xgc, rfc, knn]:
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.show()
Popularity Prediction: Round 2
Now we can average the probability predictions from the three classifiers and use the result as a feature for the regressor.
averaged_probs = (xgc.predict_proba(X)[:, 1] +
                  knn.predict_proba(X)[:, 1] +
                  rfc.predict_proba(X)[:, 1]) / 3
X['prob_dud'] = averaged_probs
y = fb_data_only_df['Facebook']
Next comes another round of hyperparameter tuning with the new feature included, which I will omit. Let's see how we do on performance:
# Re-split the data, now that X includes the prob_dud feature and y is the Facebook share count again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
Oh! This performance is essentially the same as before any model stacking. That said, we should remember that MSE, as an error measure, tends to overweight outliers. In fact, we can also calculate the mean absolute error (MAE), which is useful for assessing performance on data with significant outliers. In mathematical terms, MAE computes the l1 norm of the residuals (essentially the average absolute error), rather than the l2 norm used by MSE. We can compare the MAE to the square root of the MSE, also known as the root mean squared error (RMSE).
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred))
The mean absolute error is only about 1/3 of the RMSE! Maybe our model isn't as bad as we initially thought.
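As a toy illustration of why the two metrics diverge (not from the original notebook): when one prediction is badly off, the squared term inflates RMSE far more than MAE.
# A single large residual dominates RMSE much more than it dominates MAE
residuals = np.array([1, 1, 1, 1, 100])
print(np.mean(np.abs(residuals)))        # MAE  = 20.8
print(np.sqrt(np.mean(residuals ** 2)))  # RMSE ~ 44.7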
As a final step, let's look at the feature importances from the XGBRegressor:
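The importance chart itself isn't reproduced in this write-up; one way to generate a comparable plot with xgboost's built-in helper would be:
# Plot feature importances from the fitted regressor
fig, ax = plt.subplots(figsize=(12, 8))
xgb.plot_importance(xgr, ax=ax)
plt.show()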
Neat! Our model found prob_dud to be the most important feature, and our custom StandardizedDays feature was the second most important. (Features 0 through 14 correspond to the reduced title-embedding vectors.)
Although this round of model stacking did not improve overall performance, we can see that we successfully captured an important source of variability in the data, which the model picked up on.
If I wanted to extend this project to make the model more accurate, I might consider augmenting the data with external data, including Source as a variable via binning or hashing, running the models on the original 300-dimensional vectors, and using each article's "time-sliced" popularity at different points in time (a companion dataset to this one) to predict the final popularity.
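For example, one lightweight way to fold Source back into the feature matrix (a sketch of the hashing idea only; nothing like this was run in this project) would be scikit-learn's FeatureHasher:
from sklearn.feature_extraction import FeatureHasher

# Hash each source name into a small, fixed number of columns instead of
# creating one dummy column per source
hasher = FeatureHasher(n_features=16, input_type='string')
source_hashed = hasher.transform([[source] for source in main_data['Source']]).toarray()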
If you found this analysis interesting, feel free to use the code and extend it further! The notebook is here (note that the order of some cells may differ slightly from the order shown here), and the raw data used by this project is here.