As we move into more real-world examples, it'll be nice to access actual, live, large datasets. In fact, we've got a fun example coming up involving the eigenvalues of a matrix generated by scraping data off the web. For the time being, let's ease into data wrangling by interfacing with Discourse to see how we're doing there. First and foremost, Discourse is a website. It's a bit more than that, though, and you can interact with it quite fully via an API, or Application Programming Interface. Check out the following two pages:
The first is just a pretty standard webpage on Discourse, though you might not have looked at it before. The second contains the same information, but formatted as JSON. That highly structured format allows us to manipulate it with a computer quite easily. For example, here is a list of the publicly viewable categories currently on our Discourse, along with their ids.
import requests

# Fetch the category listing as JSON and parse it into a Python dictionary.
response = requests.get("http://discourse.marksmath.org/categories.json")
response_json = response.json()
# Pair each category's id with its name, then sort by id.
categories_with_ids = [
    (category['id'], category['name'])
    for category in response_json['category_list']['categories']
]
categories_with_ids.sort()
categories_with_ids
There are a couple of things going on here. First, the Requests Library (standard with Anaconda) is a great library for grabbing information off of the web. That's exactly what the requests.get() command did. Second, the response.json() command transformed the result into a Python dictionary, which JSON very closely resembles. In fact, we should probably take a quick look at JSON and Python dictionaries.
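If you'd like to see what that list comprehension is doing without hitting the network, here's a minimal sketch that runs on a small hand-made dictionary. It only mimics the shape of the parsed response as used above. The ids and names in it are invented for illustration and are not the real categories.

```python
# Hand-made stand-in for the parsed categories.json response.
# The ids and names below are invented for illustration only.
sample_response = {
    "category_list": {
        "categories": [
            {"id": 7, "name": "Example category B"},
            {"id": 3, "name": "Example category A"},
        ]
    }
}

# Same comprehension as above: one (id, name) tuple per category.
pairs = [
    (category["id"], category["name"])
    for category in sample_response["category_list"]["categories"]
]
pairs.sort()  # tuples sort by their first entry, which is the id
print(pairs)
```

The sort puts the tuples in order of id because Python compares tuples entry by entry, starting with the first.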
JSON (JavaScript Object Notation) is a file format that is commonly used to transmit data between servers and browsers. Here's a sample:
{
"first_name": "Mark",
"last_name": "McClure",
"family": {
"wife": "Adrienne",
"kids": ["Audrey"],
"Adrienne's kids": ["Adelaide", "Amelia"]
},
"favorite_letter": "A",
"favorite_number": null
}
I've got a file named example.json with exactly that content on my hard drive. Let's load it into Python using the json library - Python's standard tool for manipulating JSON:
import json

# The with statement closes the file for us, even if loading fails.
with open("example.json", "r") as fp:
    json_in = json.load(fp)
json_in
This result is a Python dictionary. Note that the syntax is very similar to the JSON we started with, but not identical. It uses Python syntax and objects, rather than JavaScript. The null in the original JSON, for example, appears as None in the resulting Python dictionary.
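You can try the same translation without a file on disk. The sketch below uses json.loads, which parses a string rather than a file, on a cut-down version of the sample. The "likes_math" entry is made up here just to show how a JSON boolean comes through.

```python
import json

# A cut-down version of the sample above, held in a string instead of a file.
# "likes_math" is an invented entry, added only to show a JSON boolean.
text = '{"favorite_letter": "A", "favorite_number": null, "kids": ["Audrey"], "likes_math": true}'

data = json.loads(text)  # JSON text -> Python objects
print(data["favorite_number"] is None)  # JSON null becomes None
print(data["likes_math"] is True)       # JSON true becomes True
print(type(data["kids"]))               # a JSON array becomes a Python list

round_trip = json.dumps(data)  # Python objects -> JSON text
print(round_trip)
```

Going back the other way with json.dumps turns None into null and True into true again.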
A dictionary is kind of like an array, but the elements are accessed by key rather than by position. For example:
json_in['first_name']
A more general term for this type of data structure is associative array.
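Here's a short sketch of a few other ways to poke at a dictionary, using part of the sample data from above. Nothing here is specific to JSON; these are ordinary dictionary operations.

```python
# Part of the sample data from above, written directly as a Python dictionary.
person = {
    "first_name": "Mark",
    "family": {"wife": "Adrienne", "kids": ["Audrey"]},
    "favorite_number": None,
}

# Keys chain together to reach nested values; lists are still indexed by position.
first_kid = person["family"]["kids"][0]

# .get() returns a fallback instead of raising KeyError when a key is missing.
middle_name = person.get("middle_name", "(not recorded)")

# The in operator tests for keys, not values.
has_last_name = "last_name" in person

print(first_kid, middle_name, has_last_name)
```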
The inclusion of the category ids in our first data query allows us to form URLs to dig further into those categories. For example, here are all topics in the 'Quiz or test prep' category.
response = requests.get("http://discourse.marksmath.org/c/10.json")
response_json = response.json()
topic_titles = [topic['title'] for topic in response_json['topic_list']['topics']]
topic_titles
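Rather than typing each URL by hand, you could build them from the (id, name) pairs. The following is just a sketch: category_url is a hypothetical helper of my own naming, the /c/<id>.json pattern is read off of the example above, and I've assumed https works for these pages.

```python
BASE_URL = "https://discourse.marksmath.org"  # assumed scheme and host


def category_url(category_id):
    """Return the JSON URL for a category, following the /c/<id>.json pattern."""
    return f"{BASE_URL}/c/{category_id}.json"


# The one (id, name) pair we know from the example above.
known_categories = [(10, "Quiz or test prep")]

# Map each category name to the URL that lists its topics.
urls_by_name = {name: category_url(cat_id) for cat_id, name in known_categories}
print(urls_by_name["Quiz or test prep"])
```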
And here are all the topics in the 'Quiz or test prep' category that have been answered, together with their id and most recent poster.
topic_titles = [
    (topic['title'], topic['id'], topic['last_poster_username'])
    for topic in response_json['topic_list']['topics']
    if topic['posts_count'] > 1
]
topic_titles
Way to take all the good ones, Cornelius! Well, at least someone named CestBeau had something else to say. Let's find out what!
response = requests.get("https://discourse.marksmath.org/t/41.json")
posts = response.json()['post_stream']['posts']
[post['cooked'] for post in posts if post['username'] == 'CestBeau']
You can get your hands on a lot more info with an API key. For example, here are all the usernames in the class:
# Usernames of the students in the class.
api_key_string = "api_key=LONG_CRAZY_STRING&api_username=mark"
response = requests.get("https://discourse.marksmath.org/admin/users.json?" + api_key_string)
users_json = response.json()
# One single-element list per student; scores get appended to these later.
non_students = ['mark', 'audrey', 'system', 'discobot']
score_data = [[user['username']] for user in users_json if user['username'] not in non_students]
print(str(len(score_data)) + " students")
score_data
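One way to keep a key like that out of the code itself is to read it from an environment variable. The sketch below only builds the URL and sends nothing. The variable names DISCOURSE_API_KEY and DISCOURSE_API_USERNAME and the helper build_admin_users_url are my own inventions for illustration. Also, I believe newer Discourse releases expect the key in Api-Key and Api-Username request headers rather than in the query string, so check the API documentation for your version.

```python
import os
from urllib.parse import urlencode


def build_admin_users_url(base_url="https://discourse.marksmath.org"):
    """Build the admin users URL with credentials taken from the environment.

    DISCOURSE_API_KEY and DISCOURSE_API_USERNAME are hypothetical variable
    names chosen for this sketch.
    """
    api_key = os.environ.get("DISCOURSE_API_KEY")
    if api_key is None:
        raise RuntimeError("Set DISCOURSE_API_KEY before building the URL")
    api_username = os.environ.get("DISCOURSE_API_USERNAME", "mark")
    # urlencode escapes any characters that are not safe in a query string.
    query = urlencode({"api_key": api_key, "api_username": api_username})
    return f"{base_url}/admin/users.json?{query}"
```

With the variable set in your shell, build_admin_users_url() returns a string you can hand straight to requests.get, and the key never appears in a notebook you might share.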
And here are those users who have received a 'like' from me on their personalized function.
# Check the assigned topics.
# Gotta check that I gave it a like as well.
topics_to_check = [
    {'topic_number': 19, 'points': 5},  # Personalized function
]
for topic_info in topics_to_check:
    topic_number = str(topic_info['topic_number'])
    topic_url = "https://discourse.marksmath.org/t/" + topic_number + ".json"
    post_stream = requests.get(topic_url).json()['post_stream']
    post_ids = post_stream['stream']  # ids of every post in the topic
    posts = post_stream['posts']      # only the first page of posts
    # Keep requesting pages until we have all the posts.
    page_count = 1
    while len(posts) < len(post_ids):
        page_count = page_count + 1
        response = requests.get(topic_url + "?page=" + str(page_count))
        posts.extend(response.json()['post_stream']['posts'])
    # Collect the users whose post received a like from me.
    participants = []
    for post in posts:
        likers_url = "https://discourse.marksmath.org/post_action_users.json?post_action_type_id=2"
        likers_url = likers_url + '&id=' + str(post['id'])
        likers = requests.get(likers_url)
        liker_ids = [liker['id'] for liker in likers.json()['post_action_users']]
        if 1 in liker_ids:  # user id 1 is presumably my own account
            participant = post['username']
            if participant not in participants:
                participants.append(participant)
    # Award the points for this topic to each participant, zero to everyone else.
    for entry in score_data:
        if entry[0] in participants:
            entry.append(topic_info['points'])
        else:
            entry.append(0)
score_data
Unfortunately, I certainly cannot share my API key with you!!
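You can still experiment with the paging idea from that last example, since it doesn't need a key or even a network connection. Here's a minimal sketch with the page-fetching step swapped out for a stand-in. collect_posts and fetch_page are hypothetical names, the fake pages are invented, and the max_pages and empty-page guards are additions of mine so the loop can't spin forever.

```python
def collect_posts(fetch_page, expected_count, max_pages=50):
    """Gather posts page by page until expected_count posts are in hand.

    fetch_page is a hypothetical callable that takes a page number
    (starting at 1) and returns that page's list of posts.
    """
    posts = []
    page = 1
    while len(posts) < expected_count and page <= max_pages:
        new_posts = fetch_page(page)
        if not new_posts:  # stop rather than loop forever on an empty page
            break
        posts.extend(new_posts)
        page += 1
    return posts


# Invented stand-in for a five-post topic served over three pages.
fake_pages = {
    1: [{"id": 101}, {"id": 102}],
    2: [{"id": 103}, {"id": 104}],
    3: [{"id": 105}],
}
all_posts = collect_posts(lambda page: fake_pages.get(page, []), expected_count=5)
collected_ids = [post["id"] for post in all_posts]
print(collected_ids)
```

To use it for real, you'd pass a fetch_page that calls requests.get on the topic URL with the page number attached, as in the loop above.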