TIL How to do web scraping / crawling

Grandma (on my mother’s side) passed away this Monday. She had been ill for some time…she was getting better, but her condition took a turn for the worse, and she passed away.

Went to 연천 yesterday to bury her ashes. May she rest in peace.

Today was my first day back at work after the funeral. I was gone for three days, but boy has a lot happened in that time. Over the weekend before the funeral, I was tasked with working on Korean measure words and how to extract them from our corpus data. I tried out some regex scripts, but never really got to review them together at work.

Today I got a new task: web scraping / crawling. I’m not sure if those two words are synonymous. At any rate, I was eager to learn something new (I’m always learning…does one ever stop?)

I was given some reference material and code to learn and work off of. It was some code for scraping data off of a search query from Daum. It did take me more than half of my day to dig in and figure out what was going on with the code.

I was glad to learn about Beautiful Soup, however. I had heard and read about the (weird-sounding) library/module, but never had a chance to check it out. I’m learning something new every day, and more often than not, I feel overwhelmed. But I’m working to push through, keep learning, and not get too discouraged. I keep reminding myself that it hasn’t been too long since I began coding in earnest, and that it takes time to get my skills up to a decent level. I’m only starting out.

That shouldn’t be my excuse though.

I also learned about the requests module, which I could use to make HTTP requests and pull down web pages. It was cool to use commands like the following to easily grab the HTML source of a page:

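from bs4 import BeautifulSoup
import requests

# Fetch the page and hand its HTML to Beautiful Soup (the 'lxml' parser must be installed)
r = requests.get('http://www.bbc.com/news/world-us-canada-40816708')
data = r.text
soup = BeautifulSoup(data, 'lxml')

# Print the target of every link; .get() returns None for <a> tags without an href
for link in soup.find_all('a'):
    print(link.get('href'))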

I found myself looking through a ton of HTML code. First time in a long time. I first dabbled in HTML and CSS back in high school, when I learned a bit at school. It was cool using the Chrome Developer tools to see which parts of the HTML code corresponded to which section of the webpage.

I’ve still got some ways to go to be able to web scrape with confidence, but I’m glad to see that I’ve made at least some progress so far. I want to share some links that I’ve found helpful.

Web Scraping

This page is all in Korean, but I learned a few things from it. The author explains how BS4’s get_text() method works.
http://hurderella.tistory.com/108
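
To make get_text() a bit more concrete, here is a minimal sketch of how I understand it. The HTML string is something I made up for illustration (it is not from the linked page), and I'm using Python's built-in 'html.parser' so that lxml isn't needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, for illustration only
html = "<div><h1>Title</h1><p>First <a href='/x'>link</a></p></div>"
soup = BeautifulSoup(html, "html.parser")

# get_text() joins every piece of text under the tag, with no separator by default
raw_text = soup.get_text()

# A separator plus strip=True trims each piece and joins them with a space
clean_text = soup.get_text(" ", strip=True)

# It also works on a single tag
link_text = soup.find("a").get_text()

print(repr(raw_text))    # 'TitleFirst link'
print(repr(clean_text))  # 'Title First link'
print(repr(link_text))   # 'link'
```

The separator version seems like the safer default when scraping, since otherwise text from neighbouring tags gets glued together.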

Using the requests module

https://code.tutsplus.com/tutorials/using-the-requests-module-in-python--cms-28204

Multi-line print options in Python.

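I saw the optional end='' argument used inside the print function, but didn’t know exactly what it did. By default, print adds a newline after its output; end replaces that newline with whatever string you pass in. This really showed me how it works.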

for i in range(3):
    print(i, end='')
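
To see the difference side by side, here is a small sketch that captures what print writes with different end values. The render helper is a name I made up for this example; it isn't part of the code I was given.

```python
import io
from contextlib import redirect_stdout


def render(**print_kwargs):
    """Capture what print() writes for 0, 1, 2 with the given keyword arguments."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        for i in range(3):
            print(i, **print_kwargs)
    return buffer.getvalue()


default_output = render()        # end defaults to "\n", so one number per line
joined_output = render(end="")   # nothing is added after each number
spaced_output = render(end=" ")  # a space is added instead of a newline

print(repr(default_output))  # '0\n1\n2\n'
print(repr(joined_output))   # '012'
print(repr(spaced_output))   # '0 1 2 '
```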

The meaning of “__main__”

The following if statement was in the code, and some googling helped me understand how this works:

if __name__ == "__main__":

https://stackoverflow.com/questions/419163/what-does-if-name-main-do
https://docs.python.org/3/library/__main__.html
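
Here is a tiny sketch of the idea as I understand it. The function names scrape and how_was_i_loaded are hypothetical; they just stand in for whatever the real script does.

```python
def scrape():
    """Hypothetical stand-in for the real scraping work."""
    return "scraping..."


def how_was_i_loaded():
    # __name__ is "__main__" when the file is run directly,
    # and the module's own name when it is imported from another file.
    if __name__ == "__main__":
        return "run directly"
    return "imported as " + __name__


if __name__ == "__main__":
    # This part runs for `python this_file.py`, but not for `import this_file`.
    print(how_was_i_loaded())
    print(scrape())
```

So the guard lets the same file work both as a script and as a module that other code can import without kicking off the scraping.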

Use of ‘global’ keyword for variables

https://stackoverflow.com/questions/4693120/use-of-global-keyword-in-python
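
A minimal sketch of what the keyword does, using a made-up page counter (the names pages_scraped, count_page, and read_only are my own, not from the reference code):

```python
pages_scraped = 0  # module-level (global) variable


def count_page():
    # Without this declaration, the += below would raise UnboundLocalError,
    # because assigning to a name inside a function makes it local by default.
    global pages_scraped
    pages_scraped += 1


def read_only():
    # Just reading a global needs no declaration.
    return pages_scraped


count_page()
count_page()
print(read_only())  # 2
```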

urllib module

https://docs.python.org/3/library/urllib.html
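
One part of urllib that looks handy for search-query scraping is urllib.parse, which builds and takes apart URLs without touching the network. Below is a rough sketch; the search.example.com address and the q parameter are placeholders I invented, not the actual Daum URL from the code I was given.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical search endpoint and parameter name, for illustration only
BASE_URL = "https://search.example.com/search"


def build_search_url(query):
    """Return BASE_URL with the query percent-encoded as a `q` parameter."""
    return BASE_URL + "?" + urlencode({"q": query})


url = build_search_url("web scraping")
print(url)  # https://search.example.com/search?q=web+scraping

# Take the URL apart again
parts = urlparse(url)
print(parts.netloc)           # search.example.com
print(parse_qs(parts.query))  # {'q': ['web scraping']}
```

urlencode also percent-encodes non-ASCII text such as Korean search terms, so the query string stays valid.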

Use of .format

I had no idea you could do something like this! It feels quite similar to using %s.

print("First Module's Name: {}".format(__name__))
First Module's Name: __main__

https://pyformat.info/
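
To compare the two styles for myself, here is a short sketch. The values are made up, and I hard-coded the module name so the output is predictable.

```python
module_name = "__main__"  # stand-in value, hard-coded for this example
count = 3

# Old %-style formatting
old_style = "Module %s found %d links" % (module_name, count)

# str.format fills each {} in order
new_style = "Module {} found {} links".format(module_name, count)

# Placeholders can also be named, or numbered to reorder the arguments
by_name = "Module {name} found {n} links".format(name=module_name, n=count)
reordered = "{1} links in {0}".format(module_name, count)

print(old_style)                          # Module __main__ found 3 links
print(new_style == old_style == by_name)  # True
print(reordered)                          # 3 links in __main__
```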
