Being an avid redditor (lurker) myself, I’ve always wondered how unique certain subreddits are. For the uninitiated, subreddits are equivalent to sub-topics of a message board. As an example, the r/Singapore subreddit would cover all or most discussions about Singapore and can range from the fascinating to the truly … strange.
On the topic of subreddits, i’m a sucker for reading into ‘juicy’ subreddits that have posts spanning interpersonal relationships. It’s not uncommon for a random internet stranget to spill their heart out and treat other strangers as their ‘aunt agony’. Interestingly, there exists two similar subreddits relationship
and confessions
.
Wouldn’t it be cool if, on the basis of historical posts, we can develop a method to differentiate between r/relationships or r/confessions subreddit? Well, we can - and it is pretty straightforward!
Using common approaches in Natural Language Processing (NLP) - an increasingly popular Data Science topic - this post will go through some of the key steps invovled in this process. More importantly, it is to share an easily generalisable methodology that can be used on other subreddits as well.
A caveat, however, is that this method is constrained to text-based subreddits which only have text in their posts. Posts with images are not going to be used as it is outside of the scope of this post.
Table of Contents
Importing our libraries
import numpy as np
import requests
import pandas as pd
import time
import random
import regex as re
import matplotlib.pyplot as plt
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from sklearn.metrics import classification_report, roc_curve
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import warnings
from psaw import PushshiftAPI
# After the imports
warnings.filterwarnings(action='ignore')
Data Acquisition
Scrap data using the PushShiftAPI to extract more than 1000 posts per subreddit to overcome Reddit’s imposed limitation.
%time
api = PushshiftAPI()
confessions = pd.DataFrame(list(api.search_submissions(subreddit='confessions',
filter=['author','title','subreddit','selftext'],
limit=5000)))
relationships = pd.DataFrame(list(api.search_submissions(subreddit='relationships',
filter=['author','title','subreddit','selftext'],
limit=5000)))
# store the scrapped data.
confessions.to_csv('./data/confessions.csv')
relationships.to_csv('./data/relationships.csv')
Data Cleaning
We create a filter_columns
function that filters out the title, self text and subreddit name (our target)
We use the .count()
function in our DataFrame object to understand the class balance of our dataset. Ideally, we want the number of entries of type confessions and/or relationships to be the same.
def filter_columns(df):
columns_to_retain = ['title','selftext','subreddit','author']
return df[columns_to_retain]
df_relationships_clean = filter_columns(df_relationships)
df_conf_clean = filter_columns(df_confessions)
`
display(df_relationships_clean['title'].count())
display(df_conf_clean['title'].count())
Below is a sample of our data:
df_relationships_clean.head()
title | selftext | subreddit | author | |
---|---|---|---|---|
0 | Hi I'm here to find my friends without anybody... | NaN | relationships | 0100100001010000 |
1 | My (M31) mind might be broken when i thi k abo... | [removed] | relationships | obviousThrowaway274 |
2 | How do I (26m) apologize to my ex (25f) in a d... | Long story short, we broke up 4 months ago and... | relationships | Throwitallaway73734 |
3 | Do you believe it's better to solve an argumen... | [removed] | relationships | EvenKealed |
4 | Am i broken? | [removed] | relationships | obviousThrowaway274 |
df_conf_clean.head()
title | selftext | subreddit | author | |
---|---|---|---|---|
0 | Thought a girl was giving me a quarter and the... | So, this was back in 2nd grade. It's a normal ... | confessions | jessthatrandomperson |
1 | How can I enjoy my last few days? | I am going to die very soon. \n\nI am terrifie... | confessions | throwaway948118 |
2 | I am a narcissistic asshole and I know it and ... | I am basically just a manipulative horrible pe... | confessions | royjorbison |
3 | I use Reddit as an audience for my puns | I can't go ten sentences without thinking of a... | confessions | anikdylan27 |
4 | I'm sorry for being an asshole last night | To the guy I met last night, who's name escape... | confessions | roodeeMental |
Prior to this, we may wish to remove posts that have ‘Moderator’ as an author to train our model on more ‘authentic’ posts.
df_relationships_clean.loc[:,'author'] = df_relationships_clean.author.map(lambda x : x.lower())
df_conf_clean.loc[:,'author'] = df_conf_clean.author.map(lambda x : x.lower())
df_relationships_clean = df_relationships_clean[~df_relationships_clean.author.str.contains('moderator')]
df_conf_clean = df_conf_clean[~df_conf_clean.author.str.contains('moderator')]
df_relationships_clean.isna().sum()
title 0
selftext 16
subreddit 0
author 0
dtype: int64
df_conf_clean.isna().sum()
title 0
selftext 739
subreddit 0
author 0
dtype: int64
We also observe empty selftext in both subreddits. we shall drop rows with empty selftext.
df_relationships_clean = df_relationships_clean.dropna(axis=0)
df_conf_clean = df_conf_clean.dropna(axis=0)
Ensure only posts with selftext
more than 10 words are selected.
df_relationships_clean ['selftext_len'] = df_relationships_clean .selftext.map(lambda x: len(x.split()))
df_relationships_clean = df_relationships_clean [df_relationships_clean .selftext_len > 10]
df_conf_clean['selftext_len'] = df_conf_clean.selftext.map(lambda x: len(x.split()))
df_conf_clean = df_conf_clean[df_conf_clean.selftext_len > 10]
Next, we drop our duplicates:
df_relationships_clean.drop_duplicates(inplace=True)
df_conf_clean.drop_duplicates(inplace=True)
Displaying our class balances after dropping the rows:
display(df_relationships_clean.count())
display(df_conf_clean.count())
For relationships
:
title 2925
selftext 2925
subreddit 2925
author 2925
selftext_len 2925
dtype: int64
For confessions
:
title 3893
selftext 3893
subreddit 3893
author 3893
selftext_len 3893
dtype: int64
Seeing that a value of 2900 is the limiting number, we randomly select 2900 entries from both sets.
subset_relationships_clean = df_relationships_clean.sample(n=2900,random_state=666)
subset_conf_clean = df_conf_clean.sample(n=2900,random_state=666)
Class Balance
# combine both subsets into a DF
df_pre = subset_relationships_clean.append(subset_conf_clean,ignore_index=True)
df_pre.subreddit.value_counts(normalize=True)
Target Encoding
We then perform an encoding of our target : 1 corresponds to posts of type confessions
while 0 corresponds to posts of type relationships
.
# create target class columns 0 = relationships, 1 = confessions - encoding
df_pre['label'] = df_pre.subreddit.map({'relationships':0,'confessions':1}).astype('int')
df_pre.head()
title | selftext | subreddit | author | selftext_len | label | |
---|---|---|---|---|---|---|
0 | I (F18) am questioning the intentions of a ran... | My lovely (quite attractive) new boyfriend (M1... | relationships | boterbabbelaartje | 334 | 0 |
1 | Jealousy | My boyfriend(29m) and I(30f) have been togethe... | relationships | taramarie87 | 103 | 0 |
2 | I [28F] wants sex all the time. I've made this... | Lately, I've been wanting more sex. To have se... | relationships | missionblueberry | 236 | 0 |
3 | I [32m] am having issues with jealousy with my... | Hooo boy. Here we go. \n\nMy wife and I have b... | relationships | dcsrm | 438 | 0 |
4 | Is my girlfriend into wedgies? | My girlfriend (F 21yrs old) and I (M 32yrs old... | relationships | davidsardinas36 | 73 | 0 |
Cleaning Function
Ensure formatting of text by:
- Converting all to lower cases
- removing groups of words in parentheses
- remove line breaks
- removing special characters
We encapsulate this cleaning into the function clean_text
# convert the stop words to a set.
stops = set(stopwords.words('english'))
def clean_text(text):
#01 convert titles, selftext into lowercase
lower_text = text.lower()
#02 remove brackets and parenthesis from the title and selftext.
no_br_paret_text = re.sub(r'\(.+?\)|\[.+?\]',' ',str(lower_text))
#03 remove special characters
removed_special = re.sub(r'[^0-9a-zA-Z ]+',' ',str(no_br_paret_text))
#04 remove xamp200b
remove_xamp200b = re.sub(r'ampx200b',' ',str(removed_special))
#05 remove digits
result = re.sub(r'\d+', '', remove_xamp200b).split()
#06 split into individual words
meaningful_words = [w for w in result if not w in stops]
#07 Join the words back into one string separated by space,
# and return the result.
return(" ".join(meaningful_words))
df[['title','selftext']] = df_pre[['title','selftext']].applymap(clean_text)
df.head()
A sample of our pre-cleaned data:
title | selftext | subreddit | author | selftext_len | label | |
---|---|---|---|---|---|---|
0 | questioning intentions random new girl asked b... | lovely new boyfriend told girl cig outside sch... | relationships | boterbabbelaartje | 334 | 0 |
1 | jealousy | boyfriend together almost years two beautiful ... | relationships | taramarie87 | 103 | 0 |
2 | wants sex time made known whose | lately wanting sex sex time bit back story rel... | relationships | missionblueberry | 236 | 0 |
3 | issues jealousy wife stage acting | hooo boy go wife married years coming ups down... | relationships | dcsrm | 438 | 0 |
4 | girlfriend wedgies | girlfriend together months used phone look som... | relationships | davidsardinas36 | 73 | 0 |
Post Cleaning
pd.DataFrame(data=zip(df_pre['selftext'],df['selftext']),columns=['pre','post']).head(5)
pre | post | |
---|---|---|
0 | My lovely (quite attractive) new boyfriend (M1... | lovely new boyfriend told girl cig outside sch... |
1 | My boyfriend(29m) and I(30f) have been togethe... | boyfriend together almost years two beautiful ... |
2 | Lately, I've been wanting more sex. To have se... | lately wanting sex sex time bit back story rel... |
3 | Hooo boy. Here we go. \n\nMy wife and I have b... | hooo boy go wife married years coming ups down... |
4 | My girlfriend (F 21yrs old) and I (M 32yrs old... | girlfriend together months used phone look som... |
Data Exploration
Split title and self text into two classifiers where the output of title_classifier and self_text classifier would provide indication of which subreddit the posts belong to.
#split titles, and self text into seperate df
df_title = df[['title','label']]
df_selftext = df[['selftext','label']]
def get_freq_words(sparse_counts, columns):
# X_all is a sparse matrix, so sum() returns a 'matrix' datatype ...
# which we then convert into a 1-D ndarray for sorting
word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)
# argsort() returns smallest first, so we reverse the result
largest_count_indices = word_counts.argsort()[::-1]
# pretty-print the results! Remember to always ask whether they make sense ...
freq_words = pd.Series(word_counts[largest_count_indices],
index=columns[largest_count_indices])
return freq_words
# Let's use the CountVectorizer to count words for us for each class
# create mask
X_1 = df_selftext[df_selftext['label'] == 1]
X_0 = df_selftext[df_selftext['label'] == 0]
cvt = CountVectorizer(ngram_range=(1,1),stop_words='english')
X_1_all = cvt.fit_transform(X_1['selftext'])
X_0_all = cvt.fit_transform(X_0['selftext'])
columns_1 = np.array(cvt.get_feature_names()) # ndarray (for indexing below)
columns_0 = np.array(cvt.get_feature_names())
freq_words_1 = get_freq_words(X_1_all, columns_1)
freq_words_0 = get_freq_words(X_0_all, columns_0)
print('Confessions:')
display(freq_words_1[:10])
print("\n")
print('Relationships:')
display(freq_words_0[:10])
Here are some key words that appear in the Confessions data set - which would mean that the words landlord
,jeopardise
, etc. would make it more than likely for the post to be of confessions
class.
Confessions:
landlord 3063
msg 2325
jeopardise 1979
teachings 1721
eyes 1674
pur 1506
user 1438
overworking 1405
generic 1133
lacking 1109
dtype: int64
Same for relationships
:
Relationships:
like 6690
time 4761
know 4694
want 4630
really 4235
feel 4000
relationship 3744
said 3245
things 3070
told 2999
dtype: int64
Data Modeling
Train Test Split
Here, we start with our model development. Before that, we perform a train/test split to ensure that we can validate our model performance.
X_text = df_selftext['selftext']
y_text = df_selftext['label']
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(X_text,y_text,stratify=y_text)
Model Playground
We create the class LemmaTokenizer
to do both lemmatize each word of each entry. I.e. given a list of words, we
lemmatize
each word.
Firstly, we try the Naive Bayes model - MultinomialNB as there are multiple nominal features in the form of the various tokens.
classifiers = []
vectorizers = [('cvec', CountVectorizer(stop_words='english',tokenizer=LemmaTokenizer())),
('tfvec', TfidfVectorizer(stop_words='english',tokenizer=LemmaTokenizer()))]
for vectorizer in vectorizers:
bayes_pipe = Pipeline([
(vectorizer),
('mnb', MultinomialNB())
])
scores = cross_val_score(bayes_pipe, X_text_train, y_text_train,cv=5,verbose=1)
b = bayes_pipe.fit(X_text_train, y_text_train)
y_pred = b.predict(X_text_test)
print(classification_report(y_text_test, y_pred, target_names=['class 0','class 1']))
print('Cross val score for mnb classifier using {} vectorizer is {}'.format(vectorizer[0],scores))
print('Accuracy score for mnb classifier using {} vectorizer is {}'.format(vectorizer[0],bayes_pipe.score(X_text_test, y_text_test)))
precision recall f1-score support
class 0 0.77 0.95 0.85 725
class 1 0.93 0.71 0.81 725
accuracy 0.83 1450
macro avg 0.85 0.83 0.83 1450
weighted avg 0.85 0.83 0.83 1450
Cross val score for mnb classifier using cvec vectorizer is [0.80114943 0.80689655 0.86321839 0.81724138 0.79770115]
Accuracy score for mnb classifier using cvec vectorizer is 0.8289655172413793
precision recall f1-score support
class 0 0.65 0.99 0.78 725
class 1 0.98 0.46 0.63 725
accuracy 0.73 1450
macro avg 0.82 0.73 0.71 1450
weighted avg 0.82 0.73 0.71 1450
Cross val score for mnb classifier using tfvec vectorizer is [0.71149425 0.70689655 0.74712644 0.73448276 0.7045977 ]
Accuracy score for mnb classifier using tfvec vectorizer is 0.7282758620689656
Thus the recall scores for multinomial NB with countvectorizer seems to provide higher recall when compared to the tfidf vectorizer.
In the meantime, we create a function to encapsulate our evaluation process such that it returns only the false positive rate and true positive rate with a sklearn
processing pipeline.
# store predicted_proba scores for later evaluation under ROC curve
def generate_roc(pipeline):
b = pipeline.fit(X_text_train, y_text_train)
print(f"Train Score:{round(b.score(X_text_train, y_text_train),2)} / Test Score {round(b.score(X_text_test, y_text_test),2)}")
fpr, tpr, _ = roc_curve(y_text_test, b.predict_proba(X_text_test)[:,1],pos_label=1)
return [fpr,tpr]
Rewriting the CountVectorizer Naive Bayes and TF-IDF Naive Bayes into their respective pipelines:
cv_bayes_pipe = Pipeline([
(vectorizers[0]),
('mnb', MultinomialNB())
])
tfidf_bayes_pipe = Pipeline([
(vectorizers[1]),
('mnb', MultinomialNB())
])
Pipeline for Logistic Regression Baseline
pipe = Pipeline([
('cvec', CountVectorizer(stop_words='english',tokenizer=LemmaTokenizer())),
('lr', LogisticRegression(solver='saga',max_iter=300))
])
Obtain hyperparameters for our vectorizer and logistic regressor.
We can use a grid search to find the optimal hyperparameters for our pipelines:
pipe_params = {
'cvec__max_features': [2500, 3000, 3500],
'cvec__ngram_range': [(1,1), (1,2)],
'lr__penalty' : ['elasticnet'],
'lr__C' : np.arange(0.1,1,0.1),
'lr__l1_ratio' : np.arange(0,1.1,0.2)
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5,verbose=1,n_jobs=-1)
gs.fit(X_text_train, y_text_train)
print(gs.best_score_)
0.9154022988505747
gs.best_params_
{'cvec__max_features': 2500,
'cvec__ngram_range': (1, 1),
'lr__C': 0.1,
'lr__l1_ratio': 1.0,
'lr__penalty': 'elasticnet'}
The best score for our logistic regression pipeline:
gs.best_estimator_.score(X_text_test,y_text_test)
0.9186206896551724
Using the hyperparameters:
# try model on title
optimal_pipe = Pipeline([
('cvec', CountVectorizer(tokenizer=LemmaTokenizer(),max_features=2500,ngram_range=(1,1))),
('lr', LogisticRegression(solver='saga',max_iter=300,C=0.1,l1_ratio=1.0,penalty='elasticnet'))
])
X_title = df_title['title']
y_title = df_title['label']
optimal_pipe.fit(X_text_train, y_text_train)
We try the model on our title
dataset to obtain the accuracy of the model to classify the subreddit from titles alone.
y_logr_pred = optimal_pipe.predict(X_text_test)
print(classification_report(y_text_test, y_logr_pred, target_names=['class 0','class 1']))
precision recall f1-score support
class 0 0.55 1.00 0.71 725
class 1 0.99 0.18 0.31 725
accuracy 0.59 1450
macro avg 0.77 0.59 0.51 1450
weighted avg 0.77 0.59 0.51 1450
Next, we explore the use tfidfvectorizer instead of countvectorizer to account for document similarity
tfidf_pipe = Pipeline([
('tfvec', TfidfVectorizer(stop_words='english',tokenizer=LemmaTokenizer())),
('lr', LogisticRegression(solver='saga',max_iter=300))
])
tfidf_params = {
'tfvec__max_features': [2500, 3000, 3500],
'tfvec__ngram_range': [(1,1), (1,2)],
'lr__penalty' : ['elasticnet'],
'lr__C' : np.arange(0.1,1,0.1),
'lr__l1_ratio' : np.arange(0,1.1,0.2)
}
gs = GridSearchCV(tfidf_pipe, param_grid=tfidf_params, cv=3,verbose=1,n_jobs=-1)
gs.fit(X_text_train, y_text_train)
print(gs.best_score_)
0.9183908045977012
It seems that tfidf vectorizer performs best with the logistic regression model.
tfidf_best_pipe = Pipeline([
('tfvec', TfidfVectorizer(max_features=3500,ngram_range=(1,1),stop_words='english',tokenizer=LemmaTokenizer())),
('lr', LogisticRegression(solver='saga',max_iter=300,C=0.9,l1_ratio=1.0,penalty='elasticnet'))
])
# test model against test text data and rest of titles
y_text_tfidf_pred = gs.best_estimator_.predict(X_text_test)
y_title_tfidf_pred = gs.best_estimator_.predict(X_title)
print("Text Report (results based on test data) \n" +
classification_report(y_text_test, y_text_tfidf_pred, target_names=['class 0','class 1']))
print("Titles (all titles) Report \n" +
classification_report(y_title, y_title_tfidf_pred, target_names=['class 0','class 1']))
Text Report (results based on test data)
precision recall f1-score support
class 0 0.93 0.91 0.92 725
class 1 0.91 0.94 0.92 725
accuracy 0.92 1450
macro avg 0.92 0.92 0.92 1450
weighted avg 0.92 0.92 0.92 1450
Titles (all titles) Report
precision recall f1-score support
class 0 0.92 0.17 0.29 2900
class 1 0.54 0.98 0.70 2900
accuracy 0.58 5800
macro avg 0.73 0.58 0.49 5800
weighted avg 0.73 0.58 0.49 5800
While the optimised model with tfidf vectorizer performs remarkably well with high precision and recall, when used with the titles
dataset, we can see that that it is somewhat overfit, unable to classify the titles correctly.
# look at sample predictions
pd.DataFrame(data=zip(X_text_test,y_text_test,y_text_tfidf_pred),columns=['text','actual','predicted']).head(5)
text | actual | predicted | |
---|---|---|---|
0 | title says watched porn since got nasty furry ... | 1 | 1 |
1 | understand bad bad read lemon fanfic main vide... | 1 | 1 |
2 | lovely quite attractive boyfriend met girl wee... | 0 | 0 |
3 | dated briefly three months never turned someth... | 0 | 0 |
4 | background dating almost years good many fight... | 0 | 0 |
Model Evaluation & Summary
cv_log_roc = generate_roc(optimal_pipe)
tfidf_log_roc = generate_roc(tfidf_best_pipe)
cv_nb_roc = generate_roc(cv_bayes_pipe)
tfidf_nb_roc = generate_roc(tfidf_bayes_pipe)
tfidf_knn_roc = generate_roc(knn_best_pipe)
Train Score:0.59 / Test Score 0.59
Train Score:0.93 / Test Score 0.92
Train Score:0.89 / Test Score 0.83
Train Score:0.81 / Test Score 0.73
Train Score:1.0 / Test Score 0.81
# Evaluation
roc_data ={
'cv_nb' : cv_nb_roc,
'tfidf_nb_roc' : tfidf_nb_roc,
'cv_log_roc' : cv_log_roc,
'tfidf_log_roc' : tfidf_log_roc,
'tfidf_knn_roc' : tfidf_knn_roc
}
#### Plot figure
plt.figure(1,figsize=(10,5))
plt.plot([0, 1], [0, 1], 'k--')
for key,roc in roc_data.items():
plt.plot(roc[0], roc[1], label=key)
plt.xlabel('Sensitivity')
plt.ylabel('1 - Specificity')
plt.title('ROC curve')
plt.legend(loc='best')
plt.savefig("./img/roc_curve.png",dpi=300)
plt.show()
The crossvectorizer
+ logistic regression
model seems to perform similar to the tfidf
vectorizer and logistic regression
model. When looking at the accuracy score of all the models, the tfidf+ logistic regression model performs the best with an accuracy of 92% in terms of predicting if the selftext is either an r/confessions or r/relationships post.