Recently, my three teammates and I have been preparing an AI chatbot project. The first step is to obtain and collect data, and there are three main data sources: the Reddit API, the Twitter API, and Stack Overflow.
This blog will show how to use BeautifulSoup to scrape data from the Stack Overflow website. My goal here is to obtain as many Python questions and answers as possible, and even some replies under the comments (like short conversations).
First, let us take a look at the website. If you filter by the Python tag, you can see there are well over a million questions, shown 15 to a page. So how do we collect the most useful and meaningful ones? The answer is to sort by Votes: the more votes a question has, the more popular and useful it tends to be.
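Under the hood, this view corresponds to the listing URL we will request later in this post; the tag, the votes sorting, and the page size are all encoded in it:
url = 'https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'
print(url.format(1))  # first page of Python questions sorted by votes, 15 per page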
In the second step, open your favorite Python editor, then import pandas, requests, and BeautifulSoup.
import pandas as pd
import requests
from bs4 import BeautifulSoup
Next, let us create some helper functions to scrape the data. The function below collects all the links to Python questions on a single page.
def href(soup):
    # get all href links from one page
    href = []
    for i in soup.find_all("a", class_="question-hyperlink", href=True):
        href.append(i['href'])
    return href
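As a quick sanity check (assuming the page layout has not changed), you can fetch one listing page and pass its soup to the function:
req = requests.get(url='https://stackoverflow.com/questions/tagged/python?tab=votes&page=1&pagesize=15')
soup = BeautifulSoup(req.text, "html.parser")
print(href(soup)[:3])  # first three question links found on the page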
After we obtain all the links from the different pages, we need to double-check and clean them to make sure every entry is valid (not empty).
def clean_empty_hrefs(hrefs):
    # remove all empty lists
    list_hrefs = []
    for i in hrefs:
        if i != []:
            list_hrefs.append(i)
    # merge all elements into one list
    hrefs_list = []
    for i in list_hrefs:
        for j in i:
            hrefs_list.append(j)
    return hrefs_list
After that, we also need a function to add a prefix to links that look like `questions/231767/topic/…..`. Valid links should start with https://stackoverflow.com/, so the function below adds that prefix to any link that does not already have it, and appends the answertab=votes query so that the top-voted answers come first.
def add_prefix(hrefs_list):
    # add the 'https://stackoverflow.com' prefix to links that do not have it,
    # and append a query so that answers are sorted by votes
    new_href = []
    prefix = 'https://stackoverflow.com'
    for h in hrefs_list:
        if 'https' not in h:
            m = prefix + h + "?answertab=votes#tab-top"
            new_href.append(m)
        else:
            new_href.append(h + "?answertab=votes#tab-top")
    return new_href
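Here is a small sketch of how these two helpers chain together; the nested list mimics hrefs collected from two pages, one of them empty, and the question path is a made-up example:
sample_hrefs = [['/questions/12345678/example-question'], []]  # hypothetical link for illustration
cleaned = clean_empty_hrefs(sample_hrefs)
full_links = add_prefix(cleaned)
print(full_links)
# ['https://stackoverflow.com/questions/12345678/example-question?answertab=votes#tab-top']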
Next, the last two functions fetch the soup for each question page via the links we obtained above and scrape the question and answers out of it.
def single_page_scraper(url):
    req = requests.get(url=url)
    soup = BeautifulSoup(req.text, "html.parser")
    return soup

def single_page_question_answer(url):
    # the class name below may change over time
    page = single_page_scraper(url).find_all("div", class_="s-prose js-post-body", itemprop="text")
    question = [i.find("p").get_text() for i in page][0]
    answer = [i.find("p").get_text() for i in page][1:3]
    return question, answer
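For example, given any link produced by add_prefix (the URL below is hypothetical), the function returns the first paragraph of the question and of up to two answers:
url = 'https://stackoverflow.com/questions/12345678/example-question?answertab=votes#tab-top'  # hypothetical
question, answer = single_page_question_answer(url)
print(question)  # first paragraph of the question body
print(answer)    # first paragraphs of up to two answers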
The final step is to combine all the functions we created above:
import itertools

def questions_answers(start_page, end_page):
    soups = []
    for page in range(start_page, end_page):
        req = requests.get(url='https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'.format(page))
        soup = BeautifulSoup(req.text, "html.parser")
        soups.append(soup)
    print("Soups are ready!")
    # obtain all hrefs
    hrefs = []
    for soup in soups:
        hrefs.append(href(soup))
    hrefs_list = clean_empty_hrefs(hrefs)
    new_hrefs_list = add_prefix(hrefs_list)
    print("All hrefs are ready!")
    questions = []
    answers = []
    for url in new_hrefs_list:
        try:
            q, a = single_page_question_answer(url)
            questions.append(q)
            answers.append(a)
        except:
            pass
    print("Questions and answers are ready!")
    new_answers = []
    for i in range(len(answers)):
        try:
            new_answers.append(answers[i][0])
        except:
            new_answers.append(None)
    print("Almost done!")
    new_q = []
    new_a = []
    merge_answer = list(itertools.chain.from_iterable(answers))
    for i in range(len(merge_answer) - 1):
        new_q.append(merge_answer[i])
        new_a.append(merge_answer[i + 1])
    return questions + new_q, new_answers + new_a
Actually, you can change the tag and scrape whatever topic you want to study or research. The URL used inside the function is 'https://stackoverflow.com/questions/tagged/python?tab=votes&page={}&pagesize=15'.format(page), so you can simply change "python" to, say, "statistics" or "machine-learning".
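If you plan to switch tags often, a small variation (my own sketch, not part of the function above) is to build the listing URL from a tag parameter:
def tag_listing_url(tag, page):
    # build the listing URL for any Stack Overflow tag, sorted by votes, 15 questions per page
    return 'https://stackoverflow.com/questions/tagged/{}?tab=votes&page={}&pagesize=15'.format(tag, page)

print(tag_listing_url('statistics', 1))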
Now, we can use the final function to extract data from page 1 to page 499:
Questions,Answers=questions_answers(1,500)
>>>
Soups are ready!
All hrefs are ready!
Questions and answers are ready!
Almost done!
# randomly check the 90th element in Questions and Answers
Questions[90],Answers[90]
>>>
-- Question: This thread discusses how to get the name of a function as a string in Python:
How to get a function name as a string?
-- Answer: Using the python-varname package, you can easily retrieve the name of the variables
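Since pandas is already imported, one natural follow-up (my own addition, not part of the scraper above) is to put the pairs into a DataFrame and save them for the chatbot project; the file name is just an example:
qa = pd.DataFrame({"question": Questions, "answer": Answers})
qa = qa.dropna()  # drop pairs where no answer text was found
qa.to_csv("stackoverflow_python_qa.csv", index=False)
print(qa.shape)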
Conclusion
You can loop through more pages, but the scraper can easily break if you request too many pages in one go. My suggestion is to scrape about 500 pages at a time. If you are interested in data extraction with the Reddit API, please visit my previous blog.
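One workaround, assuming the failures come from sending too many requests too quickly, is to scrape in batches of 500 pages and pause between batches; the batch size and sleep time below are just guesses:
import time

all_questions, all_answers = [], []
for start in range(1, 1501, 500):      # three batches: pages 1-500, 501-1000, 1001-1500
    q, a = questions_answers(start, start + 500)
    all_questions.extend(q)
    all_answers.extend(a)
    time.sleep(60)                     # wait a minute before requesting the next batch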