Data Mining Twitter with Tweepy 08-22-2016, 04:55 AM
In between genetic algorithms and neural networks, I'm starting to do more crazy shit at work I thought I'd never be capable of. One of these topics is data mining, which is way easier than I thought it would be. Here I'll just go through the basics of mining Twitter.
Getting API access
Our first step is registering our application. Log into Twitter and head here to register and get your keys, secrets and tokens. After creating your app, head to the "keys and access tokens" tab to get what you need to authenticate your app's usage.
First, get the Consumer Key and Consumer Secret and store them in a text file; we'll structure key storage later.
![[Image: 20f46e4fca8d488fa250265917a0ef79.png]](http://image.prntscr.com/image/20f46e4fca8d488fa250265917a0ef79.png)
Second, click the "Create Access Token" button at the bottom of the page and store your token and secret with the consumer key and secret.
![[Image: ffcc65e050024dbe8aeae3976697c3e1.png]](http://image.prntscr.com/image/ffcc65e050024dbe8aeae3976697c3e1.png)
We'll store the keys in a YAML file like so, since its syntax is simple.
![[Image: f59cf43c3978407cb58822ad2a0187b1.png]](http://image.prntscr.com/image/f59cf43c3978407cb58822ad2a0187b1.png)
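In case the screenshot doesn't load: judging from the lookups in the key-fetching script further down (consumer key/secret, access token/secret), the file is laid out roughly like the template below. This is a sketch inferred from that script, not a copy of the original file. The placeholder values and the write_auth_template helper are made up for illustration.

```python
# Layout of auth.yml, inferred from the lookups in get_keys().
# The values are placeholders; paste your own keys in their place.
AUTH_TEMPLATE = """\
consumer:
  key: YOUR_CONSUMER_KEY
  secret: YOUR_CONSUMER_SECRET
access:
  token: YOUR_ACCESS_TOKEN
  secret: YOUR_ACCESS_TOKEN_SECRET
"""

def write_auth_template(path="auth.yml"):
    """Write the placeholder template to `path` (hypothetical helper)."""
    # mode "x" refuses to overwrite a file that already exists,
    # so real keys can't be clobbered by accident
    with open(path, "x") as f:
        f.write(AUTH_TEMPLATE)
```

Calling write_auth_template() once gives you a file to fill in. Keep it out of version control, since it holds your secrets.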
Before anything else, let's fetch our keys from that YAML file. Copying the below script should do the trick.
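Code:
from yaml import safe_load

def get_keys():
    # read and parse auth.yml into a dictionary
    with open('auth.yml', 'r') as f:
        _dict = safe_load(f)
    # return the four credentials in the order set_auth() expects
    return [_dict["consumer"]["key"], _dict["consumer"]["secret"],
            _dict["access"]["token"], _dict["access"]["secret"]]
This reads and parses the file into a dictionary and returns its values as a list.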
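A missing or misspelled field in auth.yml only shows up as a bare KeyError. As a rough sketch, the lookups could be wrapped in a check that names the missing field. extract_keys and REQUIRED_FIELDS are hypothetical names, and the function takes the already-parsed dictionary, so it can be tried without touching the file.

```python
# Hypothetical helper: pull the four credentials out of the parsed
# YAML dictionary and complain clearly if one is missing.
REQUIRED_FIELDS = [
    ("consumer", "key"),
    ("consumer", "secret"),
    ("access", "token"),
    ("access", "secret"),
]

def extract_keys(config):
    keys = []
    for section, field in REQUIRED_FIELDS:
        try:
            value = config[section][field]
        except (KeyError, TypeError):
            # TypeError covers an empty file or empty section (parsed as None)
            raise ValueError("auth.yml is missing %s.%s" % (section, field))
        keys.append(value)
    return keys

# example with placeholder values
sample = {
    "consumer": {"key": "ck", "secret": "cs"},
    "access": {"token": "at", "secret": "as"},
}
print(extract_keys(sample))
```

The returned list has the same order as before, so it can be handed to the authentication step unchanged.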
On to the API. First, we need to install Tweepy. Python being as convenient as ever, just run pip install tweepy. Once Tweepy finishes installing, we'll get started with authenticating and fetching tweets.
Authenticating with Tweepy
OAuth and caching your keys are a piece of cake with this module. Just do the following, making sure to include the key-fetching code above in the same file:
Code:
# we'll be using the Cursor later, so you don't
# need to import it yet
from tweepy import API, Cursor, OAuthHandler

# wrap the setup in a function for later adaptability
def set_auth(con_key, con_secret, acc_token, acc_secret):
    # declare auth and api as global variables
    # so they can be reached elsewhere
    global auth, api
    # create the authentication handler and
    # set the access tokens
    auth = OAuthHandler(con_key, con_secret)
    auth.set_access_token(acc_token, acc_secret)
    # initialize an API endpoint using the
    # authentication object
    api = API(auth)
Getting Tweets (i)
After running the previous two snippets together (shown below), we can point the Tweepy Cursor to our account's timeline and get our 10 latest tweets like so.
Code:
from tweepy import API, Cursor, OAuthHandler
from yaml import safe_load

# define the functions from above here
...

# use the star operator on the return of get_keys() to pass the list
# elements as separate arguments when authenticating
set_auth(*get_keys())

# for the last 10 tweets in the authenticated account's timeline...
for tweet in Cursor(api.home_timeline).items(10):
    # print out the author and content
    print("@%s (%s):\n%s\n" % (tweet.user.screen_name, tweet.user.name, tweet.text))
(we could put this in a function, but it's not totally necessary at this point)
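If the star operator is new to you, here's a tiny stand-alone illustration of what set_auth(*get_keys()) does. describe and the placeholder values are made up for the example, and nothing here talks to Twitter.

```python
# made-up stand-in for set_auth(): four positional parameters
def describe(con_key, con_secret, acc_token, acc_secret):
    return "consumer=%s/%s access=%s/%s" % (con_key, con_secret, acc_token, acc_secret)

# made-up stand-in for the list returned by get_keys()
keys = ["ck", "cs", "at", "as"]

# these two calls are equivalent: * spreads the list over the parameters
unpacked = describe(*keys)
spelled_out = describe(keys[0], keys[1], keys[2], keys[3])
print(unpacked == spelled_out)
```

This only works because the list has exactly as many elements as the function has parameters, in the same order.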
Quick note: we might run into problems with Unicode when printing to the console, so you may want to add the following to sanitize anything you're going to print:
Code:
# unicode sanitization
def uni(obj):
    return str(obj).encode('iso-8859-1', errors='backslashreplace').decode('iso-8859-1')

# example
print(uni(unicode_string_variable))
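To see what that one-liner actually does, here's a small check, assuming Python 3. Characters that fit in ISO-8859-1 (like é) pass through untouched, and everything else is turned into a backslash escape. The ascii_safe variant is my own addition, in case your console can't even display Latin-1.

```python
def uni(obj):
    # same behaviour as uni above
    return str(obj).encode('iso-8859-1', errors='backslashreplace').decode('iso-8859-1')

def ascii_safe(obj):
    # hypothetical stricter variant: escape everything outside ASCII
    return str(obj).encode('ascii', errors='backslashreplace').decode('ascii')

# U+00E9 (e with acute accent) is in ISO-8859-1, U+2603 (snowman) is not
sample = "caf\u00e9 \u2603"

# the accented e is kept, the snowman becomes the six characters \u2603
print(uni(sample) == "caf\u00e9 \\u2603")

# here both non-ASCII characters are escaped
print(ascii_safe(sample))
```

Note that the escaping is one-way: it's meant for safe printing, not for storing tweets.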
Et voila! Here's my output
Spoiler:
Getting Tweets (ii)
One of my favorite things about this module is its capacity to keep the connection open as a stream and keep the tweets coming. Plus, we can track pretty much whatever we want.
The first step is to create our Stream class, so Tweepy knows what to do with incoming data.
Code:
# Tweepy 3.x API (StreamListener was removed in Tweepy 4.0)
from tweepy.streaming import StreamListener # base listener class
from tweepy import API, OAuthHandler, Stream # we can get rid of the Cursor class, so don't import it
from json import loads

class MyStream(StreamListener):
    def on_data(self, data):
        # parse the data from JSON to a dictionary
        tweet = loads(data)
        # print out what we did before, sanitized with uni() from earlier
        print("@%s (%s):\n%s\n" % (tweet["user"]["screen_name"], uni(tweet["user"]["name"]), uni(tweet["text"])))
        # return True to keep the stream open
        return True

    def on_error(self, status):
        # status is an integer HTTP status code, so convert it before concatenating
        print("Stream error: " + str(status))
        return True
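One caveat, as far as I know: the stream doesn't only deliver tweets. Notices such as deletions or rate-limit warnings arrive as JSON without a "user" or "text" field, which would make the on_data above raise a KeyError. Here's a rough sketch of a parsing helper that skips those. format_message is a hypothetical name, and the sample payloads are trimmed-down stand-ins, not real API output.

```python
from json import loads

def format_message(raw):
    """Return a printable string for a tweet, or None for anything else."""
    message = loads(raw)
    # non-tweet notices (e.g. {"delete": ...} or {"limit": ...}) lack these fields
    if not isinstance(message, dict) or "user" not in message or "text" not in message:
        return None
    user = message["user"]
    return "@%s (%s):\n%s\n" % (user["screen_name"], user["name"], message["text"])

# trimmed-down sample payloads for illustration
tweet = '{"text": "hello", "user": {"screen_name": "someone", "name": "Some One"}}'
notice = '{"limit": {"track": 5}}'

print(format_message(tweet))
print(format_message(notice))
```

Inside on_data you'd call it, print the result if it isn't None, and still return True so the stream keeps going.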
And now we can run the stream!
Code:
# start the stream with our credentials
stream = Stream(auth, MyStream())
# filter by @users, #hashtags, or topics
stream.filter(track=['#python', 'Programming', '@GolangBestLang'])
And there you have it! I'll do another tutorial (hopefully before I head out of town) on storing your data then analyzing it with Vega and Vincent.