How to Extract Data using Reddit API (AI Chatbot Data Collection)
Reddit is an American social news aggregation, web content rating, and discussion website. Researchers commonly collect text data from Reddit for AI chatbot projects. This blog presents the necessary steps for extracting data with the Reddit API.
Steps to obtain the necessary credentials
Step 1: Sign up or log in using your Google account or your own email. After logging in, click your profile icon and you will see “User Settings”.
Step 2: This takes you to a new page; click “Safety & Privacy”.
Step 3: On the “Safety & Privacy” page, scroll down and click “Manage third-party app authorization”. At the top left of the new page you will see “are you a developer? create an app”; click it.
Step 4: In this step, enter an application name and choose the “script” type. If you use Python via Anaconda (or any local setup), enter a localhost redirect URI (for example, http://localhost:8080), then click “create app”.
Step 5: You can now obtain the client ID and client secret, which should not be shared; you will also supply a user agent string of your choice when connecting (a sketch for keeping the credentials out of your code appears after the connection snippet below).
Code in Python
First, install PRAW with !pip install praw (or pip install praw from a terminal). Then we can import praw and create an instance of Reddit:
import praw

reddit = praw.Reddit(username='your reddit username',
                     password='your reddit password',
                     client_id='client id',
                     client_secret='client secret',
                     user_agent='agent_name')
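As a side note, since the client ID and secret should stay private, you may prefer to read them from environment variables instead of hardcoding them. The variable names below are my own choice, and reddit.user.me() is a quick way to confirm the connection works:

import os
import praw

# Assumed environment variable names: export them in your shell first,
# e.g. export REDDIT_CLIENT_ID="..."
reddit = praw.Reddit(username=os.environ["REDDIT_USERNAME"],
                     password=os.environ["REDDIT_PASSWORD"],
                     client_id=os.environ["REDDIT_CLIENT_ID"],
                     client_secret=os.environ["REDDIT_CLIENT_SECRET"],
                     user_agent="chatbot-data-collector/0.1")

print(reddit.user.me())  # prints your Reddit username if authentication succeeded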
Then we can extract data through the Reddit API. You can choose any topic you want; here I simply pick the “python” subreddit via subreddit() and use the limit parameter to set the maximum number of submissions. When iterating over a submission’s comments, some items may be MoreComments placeholders that have no body attribute, so you can use try and except to skip them and keep only the comments whose text and replies can be read.
subreddit = reddit.subreddit("python")
hot_python = subreddit.hot(limit=3)

title_and_reply = []
for submission in hot_python:
    if not submission.stickied:
        for comment in submission.comments:
            try:
                title_and_reply.append({"body": comment.body,
                                        "reply": [reply.body for reply in comment.replies]})
            except AttributeError:  # skip MoreComments placeholders
                pass

title_and_reply[0]
>>> {'body': 'I was working on this last night actually. Open Street Maps has an API and then they have the Overpass API as well so I was trying to figure out which one to use and how to pull based on single location requests',
'reply': ['All OSM data is on bigquery']}

We can see that some comments may have several replies:
len(title_and_reply[2]["reply"])
>>> 3
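As an alternative to the try/except above, PRAW’s comment forests provide a replace_more() method that resolves or removes the MoreComments placeholders before you iterate; a minimal sketch:

for submission in reddit.subreddit("python").hot(limit=3):
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    for comment in submission.comments:
        print(comment.body[:80])  # every remaining item is a real comment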
Here I want to mention that when a topic is posted, it may receive replies, and those replies may in turn receive further replies, so together they form a conversation. The first thing we need to do is separate the questions from the answers.
questions = []
answers = []
for item in title_and_reply:
    questions.append(item['body'])
    answers.append(item['reply'])

# within a reply chain, each reply answers the one before it,
# so consecutive replies form additional question/answer pairs
new_questions = []
new_answers = []
for con in answers:
    for i in range(len(con) - 1):
        new_questions.append(con[i])
        new_answers.append(con[i + 1])

Q = questions + new_questions
A = answers + new_answers
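If you plan to feed these pairs into a chatbot training pipeline, one simple option (my own addition, and the file name is arbitrary) is to save them as JSON Lines. I use new_questions and new_answers here because both contain plain strings:

import json

with open("reddit_qa.jsonl", "w") as f:
    for q, a in zip(new_questions, new_answers):
        # one question/answer pair per line
        f.write(json.dumps({"question": q, "answer": a}) + "\n")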
Let us take a look at Q and A. The 10th elements of Q and A read like a natural conversation.
# take a look at the 10th topic and reply
m=10
print("Topic: "+ Q[m])
print("Reply: "+ A[m][0])
>>> Topic: What's going on here? Am I the only one who gets a strong 'fake' vibe at many comments in this thread?
Reply: You're not, some of them almost seem like bots.
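One caveat: the elements of A that came from answers are lists of replies, while those from new_answers are single strings, so indexing with A[m][0] only works for the list part. If you want uniform string answers, a small normalization step (my own suggestion) helps:

# pair each question with a single string answer, skipping empty reply lists
qa_pairs = [(q, a[0] if isinstance(a, list) else a)
            for q, a in zip(Q, A) if a]
print(qa_pairs[10])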
For more information, please visit https://praw.readthedocs.io/en/latest/