As we move into linear algebra, it'll be nice to have access to large, live, real datasets. In fact, we've got a fun example coming up involving the eigenvalues of a matrix generated by scraping data off the web. What could be better?
For the time being, let's ease into data wrangling by interfacing with Discourse to see how we're doing there. Discourse, of course, is first a website. Check out the following two pages:
The first is just a pretty standard webpage on Discourse, though you might not have looked at it before. The second contains the same information, but formatted as JSON. That highly structured format allows us to manipulate it with a computer quite easily. For example, here is the list of categories currently on our Discourse.
import requests
response = requests.get("http://discourse.marksmath.org/categories.json")
if response.status_code == 200:
    response_json = response.json()
    print([category['name'] for category in response_json['category_list']['categories']])
else:
    print("Uh-oh")
There are a couple of things going on here. First, the Requests library (standard with Anaconda) is a great library for grabbing information off of the web. That's exactly what the requests.get() command did. Second, the response.json() command transformed the result into a Python dictionary, which JSON very closely resembles. You should probably refresh your memory of Python dictionaries if you want to follow this well.
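To see the JSON-to-dictionary correspondence without touching the network, here is a minimal sketch that parses a small hand-written JSON string with the standard json module. The sample text only mimics the shape of categories.json; its names and ids are invented for illustration and are not the real categories.

```python
import json

# Hypothetical sample that mimics the shape of categories.json;
# the names and ids here are invented for illustration.
sample_text = '''
{"category_list": {"categories": [
    {"id": 1, "name": "Example A"},
    {"id": 2, "name": "Example B"}
]}}
'''

# json.loads turns JSON text into nested Python dicts and lists,
# much as response.json() does for the body of a response.
sample_json = json.loads(sample_text)

# JSON objects become dicts and JSON arrays become lists.
categories = sample_json['category_list']['categories']
names = [category['name'] for category in categories]
print(type(sample_json).__name__, type(categories).__name__)
print(names)
```

Once the data is parsed, digging into it is just ordinary dictionary and list indexing, which is all the list comprehensions in this section do.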
The corresponding category IDs, which are used in URLs, are:
[category['id'] for category in response_json['category_list']['categories']]
So, we can now dig further into the Numerical category (with id=6) and look at the list of topics as follows:
response = requests.get("http://discourse.marksmath.org/c/6.json")
response_json = response.json()
topic_titles = [topic['title'] for topic in response_json['topic_list']['topics']]
topic_titles
Here they are sorted by popularity according to like_count.
import numpy as np
like_counts = [topic['like_count'] for topic in response_json['topic_list']['topics']]
order = np.argsort(like_counts)
sorted_topic_titles = [topic_titles[i] for i in order]
sorted_topic_titles.reverse()
sorted_topic_titles
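As an aside, the same kind of ordering can be had without NumPy by sorting the topic dictionaries directly. This is only a sketch of an alternative, not the approach used above; the list sample_topics is hypothetical data shaped like the entries of response_json['topic_list']['topics'].

```python
# Hypothetical topics, shaped like the entries of
# response_json['topic_list']['topics']; titles and counts are invented.
sample_topics = [
    {'title': 'Topic A', 'like_count': 2},
    {'title': 'Topic B', 'like_count': 7},
    {'title': 'Topic C', 'like_count': 0},
]

# Sort the topic dicts directly by like_count, largest first.
by_likes = sorted(sample_topics, key=lambda topic: topic['like_count'], reverse=True)
sample_sorted_titles = [topic['title'] for topic in by_likes]
print(sample_sorted_titles)
```

Topics with equal like counts may come out in a different relative order than with the argsort approach, so the two lists need not agree exactly when there are ties.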
Great! But can we do anything worthwhile with this? How about we take a look at where folks are when it comes to trust levels? Note that an api_key is required for the requests that follow, so not just anyone can do it. Altogether, my api_key_string has the form:
api_key_string = "api_key=LONG_CRAZY_STRING&api_username=mark"
response = requests.get("http://discourse.marksmath.org/admin/users.json?" + api_key_string)
users_json = response.json()
len(users_json)
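A query string like api_key_string can also be assembled from a dictionary rather than typed by hand. The sketch below uses urlencode from the standard library; the dictionary name api_params is hypothetical, and the values are the same placeholders as above, not a real key.

```python
from urllib.parse import urlencode

# Placeholder credentials, as in the text; not a real key.
api_params = {'api_key': 'LONG_CRAZY_STRING', 'api_username': 'mark'}

# urlencode joins the pairs with '&' and escapes any special characters.
query_string = urlencode(api_params)
print(query_string)
```

Requests can also build the query string itself if such a dictionary is passed as the params argument of requests.get.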
Not all of those users are numerical analysis students, though, so let's just grab our class:
numerical_users = []
user_base_url = "http://discourse.marksmath.org/admin/users/"
for user in users_json:
    # Fetch the detailed record for this user.
    response = requests.get(user_base_url + user['username_lower'] + ".json?" + api_key_string)
    if response.status_code == 200:
        user_json = response.json()
        user_groups = user_json['groups']
        # Keep only members of the Numerical group.
        if 'Numerical' in [group['name'] for group in user_groups]:
            numerical_users.append(user_json)
user_names = [user['username'] for user in numerical_users]
user_names.sort(key=lambda s: s.lower()) # Ignore case
user_names
There's quite a lot of information associated with each user: 55 keys in all (like the 'username' key above). Here are the keys that are associated with activity, along with the corresponding information for the first user:
activity_keys = [
'badge_count','days_visited','like_count','like_given_count','post_count','posts_read_count',
'time_read','topic_count','topics_entered','trust_level'
]
activity_info = [dict([(key,user_dict[key]) for key in activity_keys]) for user_dict in numerical_users]
activity_info[0]
OK, let's plot a histogram of the trust levels.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
data = [info['trust_level'] for info in activity_info]
m = min(data)
M = max(data)
bins = np.ndarray.flatten(np.array([(i-0.45,i+0.45) for i in range(m,M+1)]))
hist_info = plt.hist(data, bins = bins)
ax = plt.gca()
ax.set_xlim(m-0.5,M+0.5)
ax.set_ylim(0,max(hist_info[0])+0.5)
ax.set_xticks([0,1,2,3]);
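Since trust levels are small integers, the bar heights in a histogram like this one are just counts per level, and they can be checked with a quick tally. The sketch below uses made-up values in sample_levels as a stand-in for the list data above; the numbers are not the class's actual trust levels.

```python
from collections import Counter

# Hypothetical trust levels, standing in for the list called data above.
sample_levels = [1, 2, 1, 0, 2, 1, 3, 1]

# Count how many users sit at each trust level.
level_counts = Counter(sample_levels)
for level in sorted(level_counts):
    print(level, level_counts[level])
```

Each printed count corresponds to the height of one bar, which is why the bins above are centered on the integers.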
Since you get 10 points per user level, I got to assess Discourse grades by examining the (now suppressed) output of the following.
[(user['username'], user['trust_level']) for user in numerical_users]
And, since you get 10 points for doing the "Your favorite function" question, I had to look at the output of the following as well.
response = requests.get("http://discourse.marksmath.org/t/50.json")
participants = response.json()['details']['participants']
[participant['username'] for participant in participants]