Data extraction with Beautiful Soup made easy

Data extraction is crucial in the NLP domain, whether the source is web pages, tweets, or something else. And that, ladies and gentlemen, is where Beautiful Soup comes in. Simply put, Beautiful Soup is a Python package for parsing HTML and XML documents.

In the rest of this blog, I shall try to explain how to use Beautiful Soup to extract data from some URLs I’ve collected and stored in an array, so buckle up.

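Before using this package, you need to import it. The code below also uses the requests library to download each page, so import that too.

import requests
from bs4 import BeautifulSoup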

NOTE: I have a few URLs linking to news articles on the official CNN website in an array named url_array. I shall access each URL stored in that array and extract the title and the body of each article into two different arrays using Beautiful Soup, as shown below. So before you try this, make sure you have your own array of URLs and any other imports you need, which are not covered here!

#the titles of the articles would be extracted and placed here
headline_array = []
#Body of each article would be extracted and placed in this array
full_text_array = []

Extracting the heading of each article can be done as follows. The easiest possible way to find the title in the extracted text is by using “soup.title”.

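### Extract and put all headlines to an array
for i in range(len(url_array)):
    mainContent = requests.get(url_array[i])
    soup = BeautifulSoup(mainContent.text, 'lxml')
    title = soup.title
    # soup.title is the <title> tag (None if the page has none); keep only its text
    headline_array.append(title.get_text(strip=True) if title is not None else "")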

That’s it. Your headline_array now contains the headline of each article.
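To see what “soup.title” actually returns, here is a minimal sketch that runs without any network access. The sample HTML is made up for illustration (it is not real CNN markup), and it uses Python's built-in "html.parser" instead of 'lxml' so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration (not real CNN markup).
sample_html = """
<html>
  <head><title>  Example headline - Example News  </title></head>
  <body><p>Body text.</p></body>
</html>
"""

# "html.parser" ships with Python; the article's own code uses 'lxml'.
soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is a Tag object for the <title> element, or None if there is none.
title_tag = soup.title

# get_text(strip=True) returns only the text, with surrounding whitespace removed.
headline = title_tag.get_text(strip=True) if title_tag is not None else ""

print(title_tag.name)
print(headline)
```

Note that the page title often includes the site name as well, so you may want to trim it further depending on your source.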

Unfortunately, extracting article data is not that easy. You need to have an idea of the HTML tags used to display the text in your article/webpage of choice. I have used articles from the official CNN website, and if you inspect them closely, you will observe that the text of each paragraph is displayed in a “div” tag with the class “zn-body__paragraph”. Hence the code is as below.

NOTE: You will need to make the necessary modifications to the code depending on your article/source.

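## Scrape and put all content to a full text array
for i in range(len(url_array)):
    mainContent = requests.get(url_array[i])
    soup = BeautifulSoup(mainContent.text, 'lxml')
    # string=True (the current name for text=True) matches only divs whose .string is set
    results = soup.find_all("div", {"class": "zn-body__paragraph"}, string=True)
    article = ""  # start each article from an empty string
    for j in range(len(results)):
        append_text = results[j].text.strip()
        article = article + append_text
    # store the assembled article
    full_text_array.append(article)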

Here, the result obtained in the variable “results” still contains the HTML tags around each paragraph.

To remove all of them and clean the text, we need to use results[j].text.strip(), which keeps only the text and trims the surrounding whitespace.
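The difference between a raw result and its cleaned text can be sketched on a small, made-up snippet. The HTML below is hypothetical; only the class name “zn-body__paragraph” comes from the article, and "html.parser" is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name is taken from the article.
sample_html = """
<div class="zn-body__paragraph">  First paragraph of the story. </div>
<div class="zn-body__paragraph">Second paragraph.</div>
<div class="other">Not part of the article.</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select only the divs that carry the paragraph class.
results = soup.find_all("div", class_="zn-body__paragraph")

# Converting a result to a string shows that it still includes the markup.
raw = str(results[0])

# .text drops the tags, and .strip() trims leading/trailing whitespace.
clean = results[0].text.strip()

print(raw)
print(clean)
```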

So after each section of the article is extracted with Beautiful Soup, it is stripped and cleaned. The sections are then attached one by one to the variable “article” to make up the full article.

Finally, it is appended to the full_text_array so that each element in the array contains an entire article.
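Putting the steps together, the whole flow might look roughly like the sketch below. This is just one possible arrangement, not the only way to do it: the helper name extract_article and the sample pages are hypothetical, and it works on HTML strings rather than live URLs so it can be run offline. With real pages you would pass in requests.get(url).text instead.

```python
from bs4 import BeautifulSoup


def extract_article(html, paragraph_class="zn-body__paragraph"):
    """Return (headline, body) for one page of HTML.

    Hypothetical helper; paragraph_class must match your source's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title
    headline = title.get_text(strip=True) if title is not None else ""
    paragraphs = soup.find_all("div", class_=paragraph_class)
    # Join with a space so sentences from adjacent paragraphs do not run together.
    body = " ".join(p.get_text(strip=True) for p in paragraphs)
    return headline, body


# Made-up pages standing in for downloaded articles.
pages = [
    "<html><head><title>Story one</title></head><body>"
    '<div class="zn-body__paragraph"> Alpha. </div>'
    '<div class="zn-body__paragraph">Beta.</div></body></html>',
    "<html><head><title>Story two</title></head><body>"
    "<p>No matching divs here.</p></body></html>",
]

headline_array = []
full_text_array = []
for html in pages:
    headline, body = extract_article(html)
    headline_array.append(headline)
    full_text_array.append(body)

print(headline_array)
print(full_text_array)
```

The second page shows what happens when the class name does not match the source: the body comes back empty, which is a quick sign that the selector needs adjusting.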

And that’s it. Hope this helps :)

Computer Sci Undergrad at UCSC. ML enthusiast. Loves coffee :)