Extracting external links from a Tweet

Hashini Samaraweera
2 min read · Apr 18, 2021

To extract data from any type of tweet, you need a Twitter Developer account. You have to apply for this status from your current Twitter account and the process takes a couple of days to a week. Twitter is very thorough!

After your Twitter account has Developer status, collect the following four values from the Keys and Access Tokens section in your Twitter settings.

CONSUMER_KEY = 'consumer_key'
CONSUMER_SECRET = 'consumer_secret'
OAUTH_TOKEN = 'oauth_token'
OAUTH_TOKEN_SECRET = 'oauth_token_secret'
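Hard-coding these secrets in a script makes them easy to leak by accident. One common alternative, sketched here as an assumption rather than as part of the original setup, is to read them from environment variables. The helper name load_credentials and the choice of environment variable names are hypothetical.

```python
import os

# The four values named above; using the same names as environment
# variables is an assumption made for this sketch.
REQUIRED_KEYS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "OAUTH_TOKEN",
    "OAUTH_TOKEN_SECRET",
)

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling load_credentials() with no arguments reads from the real environment; passing a plain dict instead is handy for trying it out.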

To set up, you need to define these values and import the necessary packages.

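import tweepy

# tweepy is used here because the later calls (api.user_timeline,
# tweet.full_text) follow tweepy's interface.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)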

NOTE: I’m going to use the official CNN Twitter account to extract data, so I define its Twitter handle as the userID below. Make sure you set this parameter to the Twitter handle of the account of your choice.

userID = "cnnbrk"

Next, we fetch the account's most recent tweets (up to 200, excluding retweets) together with all of their data, as given below.

tweet_details = []

def get_tweets(twitter_acc):
    # Fetch up to 200 recent tweets (no retweets) with untruncated text
    tweets = api.user_timeline(screen_name=twitter_acc, count=200,
                               include_rts=False, tweet_mode='extended')
    tweet_details.extend(tweets)

get_tweets(userID)

We define a list called tweet_details and add each tweet object returned by the Twitter API call to it as an element.

To obtain the text of a tweet, you can simply read its full_text attribute, which is available because we requested tweet_mode = 'extended'.
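A minimal sketch of reading the tweet text, assuming each element of tweet_details exposes a full_text attribute (as the link-extraction loop further down also assumes). The stand-in objects and the helper name tweet_texts are made up so the example runs without Twitter credentials.

```python
from types import SimpleNamespace

# Stand-in objects for illustration only; with real data you would
# pass tweet_details instead of sample_tweets.
sample_tweets = [
    SimpleNamespace(full_text="First tweet https://t.co/abc123"),
    SimpleNamespace(full_text="Second tweet, no link"),
]

def tweet_texts(tweets):
    """Return the full text of every tweet in the given list."""
    return [tweet.full_text for tweet in tweets]

for text in tweet_texts(sample_tweets):
    print(text)
```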

But our objective is to obtain the external links/URLs embedded in the tweet, directing to an external source/article. Hence simply follow the steps below.

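import re
import urllib.request

url_array = []
for tweet in tweet_details:
    # Find every http(s) URL in the tweet text (typically t.co short links)
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.full_text)

    for url in urls:
        try:
            # Follow redirects and keep the final destination URL
            res = urllib.request.urlopen(url)
            actual_url = res.geturl()
            url_array.append(actual_url)
        except Exception:
            # The link could not be opened; print it for inspection
            print(url)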

We extract the links embedded in each tweet, follow each one to its final destination, and append the resolved URL to a separate list called url_array. Links that cannot be opened are printed instead.
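To see what the regular expression matches on its own, without any network access, here is a small sketch that applies the same pattern to a made-up string. The helper name find_urls and the sample text are illustrative assumptions, not part of the original code.

```python
import re

# Same pattern as in the extraction loop, compiled once for reuse
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def find_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)

# Made-up tweet text for illustration
sample_text = "Breaking: storm update https://t.co/abc123 more at http://example.com/news?id=7"
print(find_urls(sample_text))
# → ['https://t.co/abc123', 'http://example.com/news?id=7']
```

Because the pattern accepts characters such as '.' and ',', a URL followed directly by sentence punctuation will have that punctuation included in the match, so some results may need trimming before they are opened.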

That’s it, folks! Hope it helps!
