Recently, my three teammates and I have been preparing an AI chatbot project. The first step is to obtain and collect data, and there are three main data sources: the Reddit API, the Twitter API, and Stack Overflow.
This blog will show how to use BeautifulSoup to scrape data from the Stack Overflow website. My goal here is to obtain as many Python questions and answers as possible, and even some replies under the comments (like short conversations).
First, let us take a look at the website. If you filter by the Python tag, you can see there are well over a million questions, shown 15 to a page. So how do we collect the most useful and meaningful ones? The answer is to sort by Votes: the more votes a question has, the more popular and useful it tends to be.
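Under the hood, this view corresponds to the listing URL we will request later in this post; the tag, the votes sorting, and the page size are all encoded in it:
url = 'https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'
print(url.format(1))  # first page of Python questions sorted by votes, 15 per page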
In the second step, open your favorite Python editor, then import pandas, requests, and BeautifulSoup.
import pandas as pd
import requests
from bs4 import BeautifulSoup
Next, let us create some helper functions to scrape the data. The function below collects all the links to Python questions on a single page.
def href(soup):
    # get all href links from one page
    href = []
    for i in soup.find_all("a", class_="question-hyperlink", href=True):
        href.append(i['href'])
    return href
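As a quick sanity check (assuming the page layout has not changed), you can fetch one listing page and pass its soup to the function:
req = requests.get(url='https://stackoverflow.com/questions/tagged/python?tab=votes&page=1&pagesize=15')
soup = BeautifulSoup(req.text, "html.parser")
print(href(soup)[:3])  # first three question links found on the page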
After we obtain all the links from the different pages, we need to double-check and clean them to make sure every entry is valid (not empty).
def clean_empty_hrefs(hrefs):
    # remove all empty lists
    list_hrefs = []
    for i in hrefs:
        if i != []:
            list_hrefs.append(i)
    # merge all elements into one list
    hrefs_list = []
    for i in list_hrefs:
        for j in i:
            hrefs_list.append(j)
    return hrefs_list
After that, we also need a function to add a prefix to links that look like `questions/231767/topic/…..`. Valid links should start with https://stackoverflow.com/, so the function below adds that prefix to any link that does not already have it, and appends the answertab=votes query so that the top-voted answers come first.
def add_prefix(hrefs_list):
    # add the 'https://stackoverflow.com' prefix to links that do not have it,
    # and append a query so that answers are sorted by votes
    new_href = []
    prefix = 'https://stackoverflow.com'
    for h in hrefs_list:
        if 'https' not in h:
            m = prefix + h + "?answertab=votes#tab-top"
            new_href.append(m)
        else:
            new_href.append(h + "?answertab=votes#tab-top")
    return new_href
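Here is a small sketch of how these two helpers chain together; the nested list mimics hrefs collected from two pages, one of them empty, and the question path is a made-up example:
sample_hrefs = [['/questions/12345678/example-question'], []]  # hypothetical link for illustration
cleaned = clean_empty_hrefs(sample_hrefs)
full_links = add_prefix(cleaned)
print(full_links)
# ['https://stackoverflow.com/questions/12345678/example-question?answertab=votes#tab-top']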
Next, the last two functions fetch the soup for each question page via the links we obtained above and scrape the question and answers out of it.
def single_page_scraper(url):
    req = requests.get(url=url)
    soup = BeautifulSoup(req.text, "html.parser")
    return soup

def single_page_question_answer(url):
    # the class name below may change over time
    page = single_page_scraper(url).find_all("div", class_="s-prose js-post-body", itemprop="text")
    question = [i.find("p").get_text() for i in page][0]
    answer = [i.find("p").get_text() for i in page][1:3]
    return question, answer
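For example, given any link produced by add_prefix (the URL below is hypothetical), the function returns the first paragraph of the question and of up to two answers:
url = 'https://stackoverflow.com/questions/12345678/example-question?answertab=votes#tab-top'  # hypothetical
question, answer = single_page_question_answer(url)
print(question)  # first paragraph of the question body
print(answer)    # first paragraphs of up to two answers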
The final step is to combine all the functions we created above:
import itertools

def questions_answers(start_page, end_page):
    soups = []
    for page in range(start_page, end_page):
        req = requests.get(url='https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'.format(page))
        soup = BeautifulSoup(req.text, "html.parser")
        soups.append(soup)
    print("Soups are ready!")
    # obtain all hrefs
    hrefs = []
    for soup in soups:
        hrefs.append(href(soup))
    hrefs_list = clean_empty_hrefs(hrefs)
    new_hrefs_list = add_prefix(hrefs_list)
    print("All hrefs are ready!")
    questions = []
    answers = []
    for url in new_hrefs_list:
        try:
            q, a = single_page_question_answer(url)
            questions.append(q)
            answers.append(a)
        except:
            pass
    print("Questions and answers are ready!")
    new_answers = []
    for i in range(len(answers)):
        try:
            new_answers.append(answers[i][0])
        except:
            new_answers.append(None)
    print("Almost done!")
    new_q = []
    new_a = []
    merge_answer = list(itertools.chain.from_iterable(answers))
    for i in range(len(merge_answer) - 1):
        new_q.append(merge_answer[i])
        new_a.append(merge_answer[i + 1])
    return questions + new_q, new_answers + new_a
Actually, you can change the tag and scrape whatever topic you want to study or research. The URL used inside the function is 'https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'.format(page), so you can simply change "python" to, say, "statistics" or "machine-learning".
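If you plan to switch tags often, a small variation (my own sketch, not part of the function above) is to build the listing URL from a tag parameter:
def tag_listing_url(tag, page):
    # build the listing URL for any Stack Overflow tag, sorted by votes, 15 questions per page
    return 'https://stackoverflow.com/questions/tagged/{}?tab=votes&page={}&pagesize=15'.format(tag, page)

print(tag_listing_url('statistics', 1))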
Now, we can use the final function to extract data from page 1 to page 499:
Questions,Answers=questions_answers(1,500)
>>>
Soups are ready!
All hrefs are ready!
Questions and answers are ready!
Almost done!
# randomly check the 90th element in Questions and Answers
Questions[90],Answers[90]
>>>
-- Question: This thread discusses how to get the name of a function as a string in Python:
How to get a function name as a string?
-- Answer: Using the python-varname package, you can easily retrieve the name of the variables
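Since pandas is already imported, one natural follow-up (my own addition, not part of the scraper above) is to put the pairs into a DataFrame and save them for the chatbot project; the file name is just an example:
qa = pd.DataFrame({"question": Questions, "answer": Answers})
qa = qa.dropna()  # drop pairs where no answer text was found
qa.to_csv("stackoverflow_python_qa.csv", index=False)
print(qa.shape)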
Conclusion
You can loop through more pages, but the scraper can easily break if you request too many pages in one go. My suggestion is to scrape about 500 pages at a time. If you are interested in data extraction with the Reddit API, please visit my previous blog.
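One workaround, assuming the failures come from sending too many requests too quickly, is to scrape in batches of 500 pages and pause between batches; the batch size and sleep time below are just guesses:
import time

all_questions, all_answers = [], []
for start in range(1, 1501, 500):      # three batches: pages 1-500, 501-1000, 1001-1500
    q, a = questions_answers(start, start + 500)
    all_questions.extend(q)
    all_answers.extend(a)
    time.sleep(60)                     # wait a minute before requesting the next batch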