
Tweet-Mining Lab (Weeks 8–9)

Due Tuesday, March 27, 2018 (before class)

For this lab, you will be mining tweets using Twitter's search API, choosing your own search terms, and analyzing the results. For most of this lab, instead of step-by-step instructions, you will have open-ended analytical questions to answer, using techniques learned in the Pandas Foundations course on DataCamp.

For this lab, you will have two weeks instead of the usual one week. As usual, you may work with a partner. But for this project, you must work with a partner with whom you have not collaborated yet in this course.

To get started, you need a Twitter developer account. To get set up, follow Steps 1–3 of Zach Whalen's How to Make a Twitter Bot instructions. Note that instead of using his Google Sheet and entering your credentials there, you'll put those credentials directly in your Python code (below).

Then you need to decide what kind of tweets to mine. Perform some searches on the Twitter website/app, looking at both the "top" and "latest" results. (The search API returns the "latest" results.) The search API will not return tweets more than 12 days old, so choose a topic that has been interesting within the last week or so, or that is interesting in general.

Once you've honed in on a topic, choose multiple search terms for that topic so you can make some comparisons between them. For example, "Duke" and "UNC"; "Trump" and "Putin"; etc. You'll want search terms that come up frequently enough to return interesting results, but not so frequently (like "Trump") that thousands of tweets reach back only a few minutes or hours.

Then use the following code (substituting your search terms and including your API credentials) to collect tweets and save them to a file.

import tweepy
import csv
import sys

# Enter each search term inside quotes, separated by ' OR '.
# The whole search query should be a single string, which is the only item in a list.
# (I know this is weird.)
search_query = ['"Twitter" OR "@twitter" OR "#ilovehashtags"']
filename = 'tweets.csv'

# Enter your app authentication credentials here.
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

maxTweets = 10000 # Some arbitrary large number



# You shouldn't have to change anything below this line.

# auth & api handlers
# using AppAuthHandler allows you to collect more tweets before it times out
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if (not api):
    print ("Can't Authenticate")
    sys.exit(-1)

tweetsPerQry = 100  # this is the max the API permits

# If results from a specific ID onwards are required, set since_id to that ID.
# Otherwise default to no lower limit and go back as far as the API allows.
sinceId = None

# If only results below a specific ID are required, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet matching the search query.
max_id = -1

# create output file and add header
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    header = ['id_str','user_id_str','created_at','user_screen_name','user_statuses_count','user_followers_count','user_friends_count','text','entities_urls','retweeted_status_id']
    writer.writerow(header)

# function for adding data to csv file
def write_csv(row_data, filename):
    with open(filename, 'a', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(row_data)


tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
while tweetCount < maxTweets:
    try:
        if (max_id <= 0):
            if (not sinceId):
                new_tweets = api.search(q=search_query, count=tweetsPerQry)
            else:
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        since_id=sinceId)
        else:
            if (not sinceId):
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        max_id=str(max_id - 1))
            else:
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        max_id=str(max_id - 1),
                                        since_id=sinceId)
        if not new_tweets:
            print("No more tweets found")
            break
        for status in new_tweets:
            # retweets: save the full text and the ID of the original (retweeted) status
            if status.text[0:3] == 'RT ' and hasattr(status, 'retweeted_status'):
                retweeted_status_id = status.retweeted_status.id_str
                try:
                    # use the extended (full) text of the original tweet when available
                    retweeted_status_text = status.retweeted_status.extended_tweet['full_text']
                except Exception:
                    retweeted_status_text = status.text
                try:
                    write_csv([status.id_str, status.user._json['id_str'], status.created_at, status.user._json['screen_name'], status.user._json['statuses_count'], status.user._json['followers_count'], status.user._json['friends_count'], retweeted_status_text, status.entities['urls'], retweeted_status_id], filename)
                except Exception as e:
                    print(e)
            else:
                # original (non-retweet) tweets: prefer the extended (full) text, falling back to the truncated text
                try:
                    write_csv([status.id_str, status.user._json['id_str'], status.created_at, status.user._json['screen_name'], status.user._json['statuses_count'], status.user._json['followers_count'], status.user._json['friends_count'], status.extended_tweet['full_text'], status.entities['urls'], ''], filename)
                except Exception:
                    try:
                        write_csv([status.id_str, status.user._json['id_str'], status.created_at, status.user._json['screen_name'], status.user._json['statuses_count'], status.user._json['followers_count'], status.user._json['friends_count'], status.text, status.entities['urls'], ''], filename)
                    except Exception as e:
                        print(e)
        tweetCount += len(new_tweets)
        print("Downloaded {0} tweets".format(tweetCount))
        max_id = new_tweets[-1].id
#        time.sleep(0.25)
    except tweepy.TweepError as e:
        # Just exit if any error
        print("some error: " + str(e))
        break

print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, filename))

Now that you have tweets collected in a local CSV file, read them into a DataFrame, double-check the tweet content, and build a Jupyter notebook that includes both code and text (like in your midterm projects) answering the following questions (a couple of starter sketches follow the list):

  • Who tweeted the most in your archive? (number of tweets in your archive)
  • Who in the archive has tweeted the most all-time? (total tweets, not just number of tweets in the archive)
  • Who in the archive has the most followers?
  • Who in the archive has the greatest ratio of following ('friends') to followers? (This is a good indication of a bot.)
  • Which tweet was retweeted the most?
  • How many tweets were published each day? each hour? every 15 minutes? (You'll have to use pd.to_datetime() on the tweet's timestamp, and then use .set_index('created_at') and .resample() for this; see the first sketch after this list.) Answer this one with both a table _and_ a plot (or series of plots).
  • Using subplots or multiple-line plots, compare and contrast tweet frequency for each of your search terms (e.g., "Trump" tweets vs. "Putin" tweets). Use whatever time scale makes the most sense for your archive (daily, hourly, etc.). (Since Twitter searches are case-insensitive, but Python string matching is case-sensitive, use .str.lower() on your tweet text column to make it easier to filter the tweets containing each search term; see the second sketch after this list.)
  • Which word was used most?
  • Which account was mentioned most?
  • BONUS (if you have time): which URL was linked to most?
  • BONUS (if you have time): which domain was linked to most?
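
To get you started, here is a minimal sketch of reading the archive into pandas and resampling tweet counts over time. It assumes the CSV was produced by the collection script above (so the column names match its header); the filename 'tweets.csv' and the hourly bins are just placeholders, so swap in your own filename and whatever frequency fits your archive.

import pandas as pd

# read the archive written by the collection script above
tweets = pd.read_csv('tweets.csv')

# parse the timestamps and use them as the DataFrame index
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('created_at')

# count tweets per hour ('D' for daily bins, '15T' for 15-minute bins)
tweets_per_hour = tweets.resample('H').size()

print(tweets_per_hour)    # the table
tweets_per_hour.plot()    # the plot (displays inline in a notebook)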
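
And a sketch of comparing tweet frequency across search terms, continuing from the sketch above (so tweets is already indexed by created_at). Lower-casing the text first keeps the matching case-insensitive. The terms 'duke' and 'unc' are just the example terms from earlier in the lab; substitute your own, and pick whatever resample frequency fits your archive.

import matplotlib.pyplot as plt

# lower-case the tweet text so matching is case-insensitive
tweets['text_lower'] = tweets['text'].str.lower()

# plot an hourly count for each search term on the same axes
for term in ['duke', 'unc']:    # example terms; use your own
    matches = tweets[tweets['text_lower'].str.contains(term, na=False)]
    matches.resample('H').size().plot(label=term)

plt.legend()
plt.ylabel('tweets per hour')
plt.show()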

Submit to Canvas when finished. We'll also have some time to show off findings in class.

Have fun!
