The Discourse API

As we move into more real-world examples, it'll be nice to have access to actual, live, large datasets. In fact, we've got a fun example coming up involving the eigenvalues of a matrix generated by scraping data off of the web. For the time being, let's ease into data wrangling by interfacing with Discourse to see how we're doing there. First and foremost, Discourse is a website. It's a bit more than that, though: you can interact with it quite fully via an API, or Application Programming Interface. Check out the following two pages:

The first is just a pretty standard webpage on Discourse, though you might not have looked at it before. The second contains the same information, but formatted as JSON. That highly structured format allows us to manipulate it with a computer quite easily. For example, here is a list of the publicly viewable categories currently on our Discourse, along with their ids.

In [1]:
import requests
response = requests.get("http://discourse.marksmath.org/categories.json")
response_json = response.json()
categories_with_ids=[(category['id'],category['name']) for category in response_json['category_list']['categories']]
categories_with_ids.sort()
categories_with_ids
Out[1]:
[(1, 'Uncategorized'),
 (3, 'Site Feedback'),
 (5, 'Meta'),
 (6, 'Questions'),
 (7, 'Problems'),
 (8, 'Assignments'),
 (9, 'Ask and Answer'),
 (10, 'Quiz or test prep')]

There are a couple of things going on here. First, the Requests Library (standard with Anaconda) is a great library for grabbing information off of the web. That's exactly what the requests.get() command did. Second, the response.json() command transformed the result into a Python dictionary, which JSON very closely resembles. In fact, we should probably take a quick look at JSON and Python dictionaries.
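To make the list comprehension above a bit more concrete, here is a minimal offline sketch. The `sample` dictionary is a made-up, heavily trimmed stand-in for what `response.json()` returns (the real response carries many more fields), and `extract_categories` is a hypothetical helper name, not part of Requests or Discourse.

```python
# A made-up, trimmed stand-in for the parsed categories.json response.
sample = {
    "category_list": {
        "categories": [
            {"id": 6, "name": "Questions"},
            {"id": 1, "name": "Uncategorized"},
            {"id": 3, "name": "Site Feedback"},
        ]
    }
}

def extract_categories(payload):
    """Return (id, name) pairs sorted by id (hypothetical helper)."""
    categories = payload["category_list"]["categories"]
    return sorted((c["id"], c["name"]) for c in categories)

print(extract_categories(sample))
```

The two nested keys, 'category_list' and then 'categories', are the same ones used in the cell above; everything else in the sample is invented for illustration.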

JSON and Dictionaries

JSON (JavaScript Object Notation) is a text format that is commonly used to transmit data between servers and browsers. Here's a sample:

{
  "first_name": "Mark",
  "last_name": "McClure",
  "family": {
    "wife": "Adrienne",
    "kids": ["Audrey"],
    "Adrienne's kids": ["Adelaide", "Amelia"]
  },
  "favorite_letter": "A",
  "favorite_number": null
}

I've got a file named example.json with exactly that content on my hard drive. Let's load it into Python using the json library, Python's standard tool for manipulating JSON:

In [2]:
import json
# The with statement closes the file automatically, even if loading fails.
with open("example.json", "r") as fp:
    json_in = json.load(fp)
json_in
Out[2]:
{'family': {"Adrienne's kids": ['Adelaide', 'Amelia'],
  'kids': ['Audrey'],
  'wife': 'Adrienne'},
 'favorite_letter': 'A',
 'favorite_number': None,
 'first_name': 'Mark',
 'last_name': 'McClure'}

This result is a Python dictionary. Note that the syntax is very similar to the JSON we started with, but not identical. It uses Python syntax and objects, rather than JavaScript. The null in the original JSON, for example, appears as None in the resulting Python dictionary.
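The translation between the two notations can be seen without any file at all. Here is a small sketch using json.loads and json.dumps, which work on strings rather than files; the "likes_math" key is made up just to show how true is handled.

```python
import json

# JSON text, with JSON's null and true ("likes_math" is a made-up key).
text = '{"favorite_number": null, "likes_math": true, "kids": ["Audrey"]}'

data = json.loads(text)        # JSON text -> Python objects
print(data)                    # null becomes None, true becomes True

round_trip = json.dumps(data)  # Python objects -> JSON text
print(round_trip)              # None and True go back to null and true
```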

A dictionary is kind of like an array, but the elements are accessed by key, rather than by position. For example:

In [3]:
json_in['first_name']
Out[3]:
'Mark'

A more general term for this type of data structure is associative array.
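Here are a few more ways to poke at a dictionary, sketched on a trimmed copy of the example data so that it runs without example.json. The "middle_name" key is deliberately absent, to show what happens with a missing key.

```python
person = {
    "first_name": "Mark",
    "family": {"wife": "Adrienne", "kids": ["Audrey"]},
    "favorite_number": None,
}

# Nested values are reached by chaining keys (and list positions).
first_kid = person["family"]["kids"][0]

# person["middle_name"] would raise a KeyError;
# .get() returns a default value instead.
middle_name = person.get("middle_name", "unknown")

# Iterating over a dictionary yields its keys.
keys = sorted(person)

print(first_kid, middle_name, keys)
```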

Digging a little further

The inclusion of the category ids in our first data query allows us to form URLs to dig further into those categories. For example, here are all topics in the 'Quiz or test prep' category.

In [4]:
response = requests.get("http://discourse.marksmath.org/c/10.json")
response_json = response.json()
topic_titles = [topic['title'] for topic in response_json['topic_list']['topics']]
topic_titles
Out[4]:
['About the Quiz or test prep category',
 'An interpolating polynomial',
 'A binary exansion',
 'Newton steps',
 'Bisection steps',
 'A floating point representation',
 'The IVT',
 'A Taylor approximation']
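Since these URLs all follow the same pattern, it can be convenient to build them from the ids. Here is a minimal sketch; the helper names `category_url` and `topic_url` are hypothetical, and the URL patterns are simply the ones that appear in the cells of this notebook (which mixes http and https; the sketch sticks to https).

```python
BASE_URL = "https://discourse.marksmath.org"

def category_url(category_id):
    """URL of the JSON topic list for one category (hypothetical helper)."""
    return f"{BASE_URL}/c/{category_id}.json"

def topic_url(topic_id):
    """URL of the JSON for one topic (hypothetical helper)."""
    return f"{BASE_URL}/t/{topic_id}.json"

# Two of the (id, name) pairs from the category listing above.
categories_with_ids = [(9, "Ask and Answer"), (10, "Quiz or test prep")]
urls = {name: category_url(cid) for cid, name in categories_with_ids}
print(urls["Quiz or test prep"])
```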

And here are all the topics in the 'Quiz or test prep' category that have received at least one reply, together with their ids and most recent posters.

In [5]:
topic_titles = [(topic['title'],topic['id'],topic['last_poster_username']) 
    for topic in response_json['topic_list']['topics'] if topic['posts_count'] > 1
]
topic_titles
Out[5]:
[('An interpolating polynomial', 46, 'mark'),
 ('A binary exansion', 41, 'CestBeau'),
 ('Newton steps', 43, 'Cornelius'),
 ('Bisection steps', 44, 'Cornelius'),
 ('A floating point representation', 42, 'Cornelius')]

Way to take all the good ones Cornelius! Well, at least someone named CestBeau had something else to say. Let's find out what!

In [6]:
response = requests.get("https://discourse.marksmath.org/t/41.json")
posts = response.json()['post_stream']['posts']
[post['cooked'] for post in posts if post['username'] == 'CestBeau']
Out[6]:
['<p>Way to take all the good ones cornelius</p>']
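Note that the 'cooked' field comes back as rendered HTML, tags and all. If plain text is wanted, one option is to strip the tags with the standard library's html.parser. This is only a sketch: `TextExtractor` and `cooked_to_text` are hypothetical names, and real posts with code blocks or images would need more care.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)

def cooked_to_text(cooked):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(cooked)
    parser.close()
    return "".join(parser.chunks).strip()

print(cooked_to_text("<p>Way to take all the good ones cornelius</p>"))
```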

API keys

You can get your hands on a lot more info with an API key. For example, here are all the usernames in the class:

In [7]:
# Fetch the user list (needs an admin API key) and keep the student usernames.
api_key_string = "api_key=LONG_CRAZY_STRING&api_username=mark"
response = requests.get("https://discourse.marksmath.org/admin/users.json?" + api_key_string)
users_json = response.json()

score_data = [[user['username']] for user in users_json
              if user['username'] not in ['mark', 'audrey', 'system', 'discobot']]
print(f"{len(score_data)} students")
score_data
22 students
Out[7]:
[['funmanbobyjo'],
 ['CeasarsRevenge'],
 ['bitchin_camero'],
 ['MatheMagician'],
 ['Spin'],
 ['Lorentz'],
 ['opernie'],
 ['Sampson'],
 ['Cornelius'],
 ['Aisling'],
 ['poster'],
 ['jorho85'],
 ['theoldernoah'],
 ['kp53'],
 ['pbahls'],
 ['dumptruckman'],
 ['nathan'],
 ['anonymous_user'],
 ['Bara223'],
 ['dakota'],
 ['brian'],
 ['CestBeau']]
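Pasting a key straight into a notebook makes it easy to leak by accident. One common alternative, sketched here under assumptions, is to read the key from an environment variable and let urllib.parse.urlencode build the query string. The variable name DISCOURSE_API_KEY and the helper `admin_users_url` are hypothetical; the parameter names api_key and api_username are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

def admin_users_url(api_key, api_username="mark"):
    """Build the admin users URL with a properly escaped query string."""
    query = urlencode({"api_key": api_key, "api_username": api_username})
    return "https://discourse.marksmath.org/admin/users.json?" + query

# DISCOURSE_API_KEY is a hypothetical variable name; the fallback is the
# same placeholder string used above, not a real key.
api_key = os.environ.get("DISCOURSE_API_KEY", "LONG_CRAZY_STRING")
url = admin_users_url(api_key)
```

Using urlencode rather than string concatenation means that any special characters in the key are escaped correctly.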

And here are the scores for the personalized function topic: each user whose post there received a 'like' from me gets 5 points, and everyone else gets 0.

In [8]:
# Check the assigned topics.
# Gotta check that I gave it a like as well.
topics_to_check = [
    {'topic_number':19, 'points':5}, # Personalized function
]
for topic_info in topics_to_check:
    topic_number = str(topic_info['topic_number'])
    response = requests.get("https://discourse.marksmath.org/t/" + topic_number + ".json")
    post_ids = response.json()['post_stream']['stream']

    posts = response.json()['post_stream']['posts']
    page_count = 1
    while len(posts) < len(post_ids):
        page_count = page_count + 1
        response = requests.get("https://discourse.marksmath.org/t/" + topic_number + ".json?page="+str(page_count))
        posts.extend(response.json()['post_stream']['posts'])
    
    participants = []
    for post in posts:
        post_id = post['id']  # renamed to avoid shadowing the built-in id()
        # Ask who 'liked' this post (the text above treats action type 2 as a like).
        likers_url = "https://discourse.marksmath.org/post_action_users.json?post_action_type_id=2"
        likers_url = likers_url + '&id=' + str(post_id)
        likers = requests.get(likers_url)
        liker_ids = [liker['id'] for liker in likers.json()['post_action_users']]
        # User id 1 is presumably my own account, so this checks for my like.
        if 1 in liker_ids:
            participant = post['username']
            if participant not in participants:
                participants.append(participant)
    for entry in score_data:
        if entry[0] in participants:
            entry.append(topic_info['points'])
        else:
            entry.append(0)
score_data
Out[8]:
[['funmanbobyjo', 5],
 ['CeasarsRevenge', 5],
 ['bitchin_camero', 5],
 ['MatheMagician', 5],
 ['Spin', 5],
 ['Lorentz', 0],
 ['opernie', 0],
 ['Sampson', 5],
 ['Cornelius', 5],
 ['Aisling', 5],
 ['poster', 0],
 ['jorho85', 5],
 ['theoldernoah', 5],
 ['kp53', 0],
 ['pbahls', 5],
 ['dumptruckman', 5],
 ['nathan', 5],
 ['anonymous_user', 5],
 ['Bara223', 5],
 ['dakota', 5],
 ['brian', 5],
 ['CestBeau', 5]]
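The paging loop in that cell is worth a closer look, since long topics come back one page at a time. Here is an offline sketch of the same pattern. The `fetch_page` function is a hypothetical stand-in for `requests.get(...).json()` with `?page=N`, the fake topic and its usernames are invented, and the page size of 20 is just an assumption for the sketch. Unlike the cell above, this version also stops if a page comes back empty, so it cannot loop forever.

```python
PAGE_SIZE = 20  # assumed page size, for this sketch only

# A fake topic with 45 posts, standing in for the server's data.
ALL_POSTS = [{"id": n, "username": f"user{n % 3}"} for n in range(1, 46)]

def fetch_page(page):
    """Hypothetical stand-in for requests.get(url + '?page=N').json()."""
    start = (page - 1) * PAGE_SIZE
    return {
        "post_stream": {
            "stream": [p["id"] for p in ALL_POSTS],       # ids of every post
            "posts": ALL_POSTS[start:start + PAGE_SIZE],  # one page of posts
        }
    }

def fetch_all_posts():
    """Keep requesting pages until every post id in 'stream' is covered."""
    first = fetch_page(1)["post_stream"]
    post_ids = first["stream"]
    posts = list(first["posts"])
    page = 1
    while len(posts) < len(post_ids):
        page += 1
        new_posts = fetch_page(page)["post_stream"]["posts"]
        if not new_posts:  # guard: stop rather than loop forever
            break
        posts.extend(new_posts)
    return posts

posts = fetch_all_posts()
print(len(posts))
```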

Unfortunately, I certainly cannot share my API key with you!!