Due Tuesday, March 27, 2018 (before class)
For this lab, you will be mining tweets using Twitter's search API, choosing your own search terms, and analyzing the results. For most of this lab, instead of step-by-step instructions, you will have open-ended analytical questions to answer, using techniques learned in the Pandas Foundations course on DataCamp.
For this lab, you will have two weeks instead of the usual one week. As usual, you may work with a partner. But for this project, you must work with a partner with whom you have not yet collaborated in this course.
To get started, you need a Twitter developer account. To get set up, follow Steps 1–3 of Zach Whalen's How to Make a Twitter Bot instructions. Note that instead of using his Google Sheet and entering your credentials there, you'll be putting those credentials in your Python code (below).
Then you need to decide what kind of tweets to mine. Perform some searches on the Twitter website/app, looking at both the "top" and "latest" results. (The search API will return the "latest" results.) Then decide what you'd like to mine on Twitter. The search API will not return tweets more than about seven days old, so choose something that is interesting in the last week, or just generally.
Once you've homed in on a topic, choose multiple search terms for that topic, so you can make some comparisons between them. For example, "Duke" and "UNC"; "Trump" and "Putin"; etc. You'll want search terms that come up frequently enough to return interesting results, but not so often (like "Trump") that thousands of tweets span only a few minutes or hours.
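When you settle on your terms, they get joined with OR into a single quoted query string (the same format the collection script below expects). A small helper like this can build that string for you; the function name `build_search_query` is my own, not part of the assignment code:

```python
def build_search_query(terms):
    """Join search terms with OR, quoting each so multi-word terms match exactly."""
    return ' OR '.join('"{0}"'.format(t) for t in terms)

# The script below expects the whole query as the single item in a list.
search_query = [build_search_query(['Duke', 'UNC', '#marchmadness'])]
print(search_query)
# ['"Duke" OR "UNC" OR "#marchmadness"']
```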
Then use the following code (substituting your search terms and including your API credentials) to collect tweets and save them to a file.
```python
import tweepy
import csv
import sys

# Enter each search term inside quotes, between 'OR'.
# The whole search query should be inside a single string, which is the single item in a list.
# (I know this is weird.)
search_query = ['"Twitter" OR "@twitter" OR "#ilovehashtags"']

filename = 'tweets.csv'

# Enter your app authentication credentials here.
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

maxTweets = 10000  # Some arbitrary large number

# You shouldn't have to change anything below this line.

# auth & api handlers
# using AppAuthHandler allows you to collect more tweets before it times out
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
if (not api):
    print("Can't Authenticate")
    sys.exit(-1)

tweetsPerQry = 100  # this is the max the API permits

# If results from a specific ID onwards are required, set since_id to that ID.
# else default to no lower limit, go as far back as the API allows
sinceId = None

# If results only below a specific ID are required, set max_id to that ID.
# else default to no upper limit, start from the most recent tweet matching the search query.
max_id = -1

# create output file and add header
with open(filename, 'w', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    header = ['id_str', 'user_id_str', 'created_at', 'user_screen_name',
              'user_statuses_count', 'user_followers_count', 'user_friends_count',
              'text', 'entities_urls', 'retweeted_status_id']
    writer.writerow(header)

# function for adding data to the csv file
def write_csv(row_data, filename):
    with open(filename, 'a', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(row_data)

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
while tweetCount < maxTweets:
    try:
        if (max_id <= 0):
            if (not sinceId):
                new_tweets = api.search(q=search_query, count=tweetsPerQry)
            else:
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        since_id=sinceId)
        else:
            if (not sinceId):
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        max_id=str(max_id - 1))
            else:
                new_tweets = api.search(q=search_query, count=tweetsPerQry,
                                        max_id=str(max_id - 1), since_id=sinceId)
        if not new_tweets:
            print("No more tweets found")
            break
        for status in new_tweets:
            if status.text[0:3] == 'RT ' and hasattr(status, 'retweeted_status'):
                # retweet: recover the full text of the original tweet if available
                try:
                    retweeted_status_text = status.retweeted_status.extended_tweet['full_text']
                except AttributeError:
                    retweeted_status_text = status.text
                retweeted_status_id = status.retweeted_status.id_str
                try:
                    write_csv([status.id_str,
                               status.user._json['id_str'],
                               status.created_at,
                               status.user._json['screen_name'],
                               status.user._json['statuses_count'],
                               status.user._json['followers_count'],
                               status.user._json['friends_count'],
                               retweeted_status_text,
                               status.entities['urls'],
                               retweeted_status_id], filename)
                except Exception as e:
                    print(e)
            else:
                # original tweet: prefer the extended (280-character) text if present
                try:
                    write_csv([status.id_str,
                               status.user._json['id_str'],
                               status.created_at,
                               status.user._json['screen_name'],
                               status.user._json['statuses_count'],
                               status.user._json['followers_count'],
                               status.user._json['friends_count'],
                               status.extended_tweet['full_text'],
                               status.entities['urls'],
                               ''], filename)
                except AttributeError:
                    try:
                        write_csv([status.id_str,
                                   status.user._json['id_str'],
                                   status.created_at,
                                   status.user._json['screen_name'],
                                   status.user._json['statuses_count'],
                                   status.user._json['followers_count'],
                                   status.user._json['friends_count'],
                                   status.text,
                                   status.entities['urls'],
                                   ''], filename)
                    except Exception as e:
                        print(e)
        tweetCount += len(new_tweets)
        print("Downloaded {0} tweets".format(tweetCount))
        max_id = new_tweets[-1].id
        # time.sleep(0.25)
    except tweepy.TweepError as e:
        # Just exit if any error
        print("some error: " + str(e))
        break

print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, filename))
```
Now that you have tweets collected into a local CSV file, read them into a DataFrame, double-check the tweet content, and build a Jupyter notebook, with both code and text (as in your midterm projects), that answers the following questions:
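As a starting point for the notebook, here is a minimal loading sketch. In the lab you would read your real file with `pd.read_csv('tweets.csv', parse_dates=['created_at'])`; here a fabricated two-row sample stands in for `tweets.csv` so the snippet is self-contained:

```python
import io
import pandas as pd

# Fabricated sample data, for illustration only; the column names match
# the header written by the collection script above.
sample = io.StringIO(
    'id_str,user_id_str,created_at,user_screen_name,user_statuses_count,'
    'user_followers_count,user_friends_count,text,entities_urls,retweeted_status_id\n'
    '1,11,2018-03-20 14:02:11,alice,120,45,30,go heels,[],\n'
    '2,22,2018-03-20 14:05:47,bob,300,80,75,go devils,[],\n'
)
df = pd.read_csv(sample, parse_dates=['created_at'])

print(df.shape)           # rows and columns collected
print(df['text'].head())  # spot-check the tweet content
print(df['user_screen_name'].value_counts().head())  # most active accounts
```

Parsing `created_at` as a datetime up front makes the time-based comparisons (e.g. tweet volume per hour for each search term) much easier later.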
Submit to Canvas when finished. We'll also have some time to show off findings in class.
Have fun!