I re-watched The Social Network yesterday. Dunno why, really.
Anyways, I noticed something about a wget meme at the start of the film, where he downloads some pics to have on his site. I figured I'd give it a go. Turns out it's a pretty good meme.
After reading some docs and fooling around for a while, I was looking for an actual thing to do with wget. And that's when it hit me: what if you could download up to the Nth xkcd comic?
And the game was on.
First, I was gonna do this in Python. It's the only hacky/easy language I'm fluent in and there's no way I would mess around in Java or, God forbid, C. So Python it is then.
First things first, how about just getting the goddamn memes with wget? Not so fast, buddy. Tried that. Got back a robots.txt file. Ughhh.
So for those of you who don't know what that means: robots.txt is a file a site uses to tell "robots" (bots, crawlers, programs, that kind of stuff) which parts of it they shouldn't access. Apparently, wget falls under that category, and apparently it also complies with the robots.txt rules, so you can't get past that. But actually, you can. You have the option to ignore them. So I just add "-e robots=off" and "--wait 1" to wget's arguments and we're set!
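For reference, here's a rough sketch of how that call could be put together from Python. It is not the actual script: build_wget_args is a made-up helper name and the URL is only an example. The two flags are the ones mentioned above.

```python
def build_wget_args(url, wait_seconds=1):
    """Build the argument list for a wget call that ignores robots.txt.

    Hypothetical helper, for illustration only.
    """
    return [
        "wget",
        "-e", "robots=off",            # tell wget not to honor robots.txt
        "--wait", str(wait_seconds),   # pause between requests
        url,
    ]


args = build_wget_args("https://xkcd.com/")
print(" ".join(args))
```

To actually run it you could hand the list to subprocess.call(args), assuming wget is installed and on your PATH.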
Ughhh. Now it doesn't download anything. What the hell?
Apparently, xkcd has set some pretty good memes, 'cause you can't just wget the whole thing. For reference, I tried just getting all the images (and overriding robots.txt of course) on another site and I did manage to get all the images (or most of them anyway).
BUT, there's a solution. It's really hacky and ugly, and I'm sure there's a better way, but here's how I did it:
I noticed that, while you didn't have access to the image from the normal xkcd page, the image url is always in the html file.
So what I did is I fetched the index.html, then ran through it to find the image url, then wget'd the image from that url. Pretty simple, right?
So this all ties together like this: you input a certain number of comics, the program runs through pages xkcd.com/1/ through xkcd.com/N/, downloads and parses the .htmls, adds the url to a list of urls, then once it's done it downloads all the images based on the urls and saves them in a folder.
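Here's a minimal sketch of the two core steps: building the page urls and pulling the image url out of a page. Big caveat: the regex assumes the comic image sits inside an element with id="comic", and the sample html is invented for the demo, so treat both as assumptions rather than what xkcd really serves. The names page_urls and find_image_url are made up too.

```python
import re


def page_urls(n):
    """Return the comic page urls for comics 1 through n."""
    return ["https://xkcd.com/%d/" % i for i in range(1, n + 1)]


# Assumed markup: the comic <img> lives inside <div id="comic">.
IMG_PATTERN = re.compile(r'<div id="comic">.*?<img[^>]*?src="([^"]+)"', re.DOTALL)


def find_image_url(html):
    """Return the first image url found in the comic div, or None."""
    match = IMG_PATTERN.search(html)
    if match is None:
        return None
    url = match.group(1)
    if url.startswith("//"):   # protocol-relative url, add a scheme
        url = "https:" + url
    return url


# Invented sample page, only here so the sketch runs without network access.
sample_html = """
<html><body>
<div id="comic">
<img src="//imgs.example.com/comics/sample.png" alt="Sample" />
</div>
</body></html>
"""

print(page_urls(3))
print(find_image_url(sample_html))
```

Once you have the list of image urls, each one can be handed to wget in turn.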
The only problem I've had with the whole procedure is that some .htmls would download in code form (binary? hex? probably hex) and thus I couldn't read the image url off of them. Page 3 was giving me a lot of shit - I tried wget -v and it worked on 3 but not on 2. After more testing, I saw that the same page sometimes downloaded fine and sometimes didn't.
So, finally, firstly because I'm just bored and secondly because it's almost 3 am and I'm really tired, I just worked around the problem by having the user input a maximum number of tries, so that wget will try to get the right page N times and if it doesn't succeed it'll just move on. I tried having it try forever, but some pages like #16 I think just won't download properly. Anyway, with something like 100 tries I found out you lose like 2-3 comics out of 30, so it's not that bad - AND they are enumerated so in the end you know which ones you lost.
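The "give up after N tries" idea can be sketched like this. It's a toy version under stated assumptions: fetch_with_retries and download_all are hypothetical names, and fake_fetch_comic is a stand-in that pretends to be the flaky download so the example runs without touching the network.

```python
def fetch_with_retries(fetch, max_tries):
    """Call fetch() up to max_tries times; return the first non-None result."""
    for _ in range(max_tries):
        result = fetch()
        if result is not None:
            return result
    return None   # every attempt failed, so give up and move on


def download_all(numbers, fetch_comic, max_tries):
    """Try each comic number; return (found, lost)."""
    found, lost = {}, []
    for n in numbers:
        result = fetch_with_retries(lambda: fetch_comic(n), max_tries)
        if result is None:
            lost.append(n)     # keep track of which comics were lost
        else:
            found[n] = result
    return found, lost


calls = {}


def fake_fetch_comic(n):
    """Pretend download: comic 2 never works, the others work on try 3."""
    calls[n] = calls.get(n, 0) + 1
    if n == 2 or calls[n] < 3:
        return None
    return "image-url-for-%d" % n


found, lost = download_all([1, 2, 3], fake_fetch_comic, max_tries=5)
print(sorted(found))   # comics that eventually downloaded
print(lost)            # comics that were skipped
```

Because the lost numbers are collected in a list, you can see at the end exactly which comics are missing.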
Anyways, I'm off to bed now. But do try this. It's a fun exercise.
P.S.: You can download my script here (and you need to have wget and Python 2 installed, obviously).
So, anyway, this time I'm gonna introduce to you another kind of "variable": lists.
Lists are like normal variables, except that they can store more than one value at a time. They are close to what other programming languages, like C, call arrays. Anyway, here's a simple way to picture it: if variables are boxes, then lists are bookcases; they can hold many things.
Here are some simple examples of lists:
    mylist = []  # an empty list
    anotherlist = [1, 2, 3]
    yetanotherlist = [True, "ff", "Hi, 2!", 4378]
Now you can store as many things as you wish in one variable. Cool, huh?
Imagine you were making a game: you could easily keep the player's inventory in a list, rather than making many different variables (which quickly gets hard to manage).
But, before continuing, let's see some other examples of lists:
    x = 4
    mylist = [x, 5, 15]
    fancylist = [2, [3, 4], 5]  # a list inside a list, aka a "nested list"...list-ception!
As you can see, you can put pretty much anything inside a list, including another list.
Hold on, though...this "list" thing is cool and all, but how can I access its elements? How can I add new ones, or delete the ones I don't need?
Well, that's quite simple! Let me show you:
    mylist = [1, 2, 3]
    print mylist
    print mylist[0]
    print mylist[2]
And here's my output: (the $ thing is just my command line prompt, pay attention to what comes after it)
(also, from now on I will call the file I'm working on "test.py")
    $ python test.py
    [1, 2, 3]
    1
    3
Whoah, just wait a second. What the hell happened at line 3? Why do I need to ask for the 0th element to get the first one?!
Well, as you saw, the syntax of getting a list element is "list[element_number]". The thing about lists, though, is that they are "zero-indexed". That means that we're naming the elements starting from the number 0 instead of 1. So, the 1st element is the 0th, the 2nd is the 1st, the 3rd is the 2nd and so on.
[Tip: To easily remember this, just subtract one from the position of the list element you want to access. Example: element 3 - 1 = 2. Thus, the 3rd element is accessed by the command "mylist[2]".]
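A quick sketch of zero-indexing in action, reusing the lists from earlier. The same rule works for a list inside a list: you just index twice. (The print calls use parentheses so the snippet runs on Python 2 and Python 3 alike.)

```python
mylist = [1, 2, 3]
fancylist = [2, [3, 4], 5]

first = mylist[0]               # the 1st element lives at index 0
third = mylist[3 - 1]           # 3rd element: position 3 minus 1 = index 2
inner = fancylist[1]            # the 2nd element here is itself a list
inner_first = fancylist[1][0]   # index twice to reach inside the inner list

print(first)        # 1
print(third)        # 3
print(inner)        # [3, 4]
print(inner_first)  # 3
```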
You can also add or change elements just by assigning values to certain positions in the list:
    menu = ["eggs", "tomatoes", "chicken"]
    print menu
    menu[2] = "steak"
    print menu  # see the difference?
    menu.append("potatoes")  # let's add another element to the menu
    print menu  # woohoo!
Here's what you should get:
    $ python test.py
    ['eggs', 'tomatoes', 'chicken']
    ['eggs', 'tomatoes', 'steak']
    ['eggs', 'tomatoes', 'steak', 'potatoes']
So, there's some new stuff: apparently, to change the value of an element you just assign to it the value of something else (as you saw in line 3). Also, to add a new element to the list, you use a "list method". Methods are neat little functions (aka: bunch of code that is repeatedly used and does the same stuff) that can only be used on things like lists, strings etc. You can find all list methods and how to use them in the Python docs.
Anyway, all "append" does is add another element to the end of the list. What if you want to delete an element, though?
    menu = ["eggs", "tomatoes", "chicken"]
    print menu
    menu.remove("eggs")  # goodbye eggies
    print menu
    special_ingredient = menu.pop()
    # %r prints the value's repr(), its "raw" form (sometimes it's useful to see what data type the output is)
    print "What's left: %r" % menu
    print "Special ingredient: %s" % special_ingredient
    $ python test.py
    ['eggs', 'tomatoes', 'chicken']
    ['tomatoes', 'chicken']
    What's left: ['tomatoes']
    Special ingredient: chicken
And that's how to delete elements from lists! Note that "pop()" not only removes the (last) element, but also returns it; that means we can store that element in a variable, as I did with "special_ingredient".
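Two details worth poking at, shown here as a small sketch: "remove" only deletes the first matching element, and "pop" can take an index if you don't want the last one. The menu with two "eggs" in it is made up for the demo, and the print calls use parentheses so it runs on Python 2 and 3 alike.

```python
menu = ["eggs", "tomatoes", "eggs", "chicken"]

menu.remove("eggs")     # removes only the FIRST "eggs"
print(menu)             # ['tomatoes', 'eggs', 'chicken']

first = menu.pop(0)     # pop(0) removes and returns the element at index 0
print(first)            # tomatoes
print(menu)             # ['eggs', 'chicken']

try:
    menu.remove("steak")            # not in the list...
except ValueError:
    print("no steak on the menu")   # ...so remove() raises ValueError
```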
So, go ahead and try out some other methods! Get to know lists as much as you can!
Thank you for reading this tutorial! I'll see you next time, goodbye!