TIL how to install packages using Anaconda Navigator

To do my web crawling, I started using Selenium, a browser automation tool with Python bindings that is often used for web crawling.

I installed it from my command prompt by doing ‘pip install selenium’, and Selenium was working just fine in PyCharm and Python shell. But when I tried doing ‘import selenium’ in a Jupyter notebook, I kept getting a module not found error.

It turned out to be a Python path issue. In short, Selenium had already been installed, but Jupyter could not import Selenium because it wasn’t pointing to the path where Selenium had been installed.
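
One way to check for this kind of mismatch is to ask the running interpreter where it lives and whether it can find the module. The following is only a minimal sketch: describe_environment is a hypothetical helper name of my own, not part of Selenium or Jupyter.

```python
import importlib.util
import sys


def describe_environment(module_name):
    """Report which interpreter is running and whether it can see a module."""
    # find_spec returns None when a top-level module cannot be found
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


info = describe_environment("selenium")
print(info["interpreter"])   # path of the Python that is running this code
print(info["module_found"])  # False means this Python cannot see Selenium
```

Running the same snippet in the Python shell and in a Jupyter notebook cell and comparing the interpreter paths should show whether the two are using different Python installations.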

The usual digging led me to this helpful thread, but in the end I couldn’t get Selenium to work on my Jupyter notebook by following the instructions provided.

Fortunately, an intern told me about Anaconda Navigator, a GUI-based application that can be used to install packages into the (virtual) environment running Jupyter Notebook. As long as you have Anaconda installed, you just have to run the command below to install Navigator:

So I tried searching for the Selenium package on Anaconda Navigator, but the search returned no results.



After doing some digging, I came across this site that had a piece of code I could run to get Selenium:

conda install -c conda-forge selenium

After running this command in my command prompt, I got Selenium to work in a Jupyter notebook!

Now I could use Selenium and Chrome Driver in a Jupyter notebook just fine.


TIL how to indent within WordPress code snippet without plugin

After tons and tons of googling, and almost giving up, I finally found out how to do two things:

  1. Add code snippets / blocks to WordPress posts with syntax highlighting
  2. Indent lines inside those snippets

All of this without installing any plugins!

To begin the code snippet / block, use the following code:

And to indent, add the ASCII code for <tab> which is :

The above code will be rendered as follows:

for x in y:
	print(x)

TIL to use replace() to eliminate commas in numbers

I learned something so simple that it boggles the mind why I didn’t get it the first time. Well, I guess it goes to show how much of a novice I still am. While doing some web scraping/crawling, I needed to grab a string of number+text shown on the webpage (the string shown in the red box in the image below):

I had to grab that string because it indicated the total number of articles within that sub-directory.

I needed a way to tell my for loop to iterate over n pages/articles, and I thought that number would make the job easier.

So I grabbed the string, which looks like ‘2,443건’. Now I needed to do two things:
1. Strip away the final text ‘건’, which is a measure word for counting articles, incidents, etc.
2. Remove the comma, since int() cannot parse a number string that contains commas

Being the tyro that I am, I didn’t know what would be the best way to do #1. Fortunately, I got some help from someone, who suggested that I try doing the following:

Which, of course, worked!!! I had learned that [:-1] on a string captures every character up to, but not including, the final one, but this was my first time actually using it in practice. It served its purpose, and I was glad to learn something new.
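
The suggested snippet is not shown in this post, so the following is only a sketch of the [:-1] slice described above. The value '443건' is a made-up three-digit example, and the variable names are my own.

```python
count_text = '443건'            # hypothetical string scraped from the page

# Slice from the start up to, but not including, the last character
number_part = count_text[:-1]

print(number_part)              # 443
print(int(number_part) + 1)     # 444, so it now behaves like a number
```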

Now on to no. #2. The above example is a three-digit number, so there are no commas, but in the real web crawling example that I did, the number returned was 2,443, which of course has a comma. I had to find a way to strip the comma and return just the number.

A quick google search led me to multiple Stack Overflow pages that addressed the very issue I had. The solution was surprisingly simple: just use the replace() string method to replace the comma with an empty string.

So I went about coding it up, and the following is what I did first:

asdf = '2,443건'
asdf.replace(',','')  # assumed from the explanation below: the result is not stored anywhere

But when I ran this code, the result I got was still ‘2,443건’. The comma still remained!

I didn’t get what was going on, and I enlisted the help of someone much more skilled at coding than I am. He was quick to point out why I was still getting ‘2,443건’, comma and all: Python strings are immutable, so replace() does not change the original string. It returns a new string, and that return value must be stored in a variable. So I did the following to rectify this and it worked:

asdf = '2,443건'
asdf = asdf.replace(',','')

Voilà! It’s all in the details…
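
Putting the two steps together, here is a minimal sketch of turning the scraped string into an integer. parse_article_count is a hypothetical helper name, and it assumes the string always ends in exactly one non-digit character such as ‘건’.

```python
def parse_article_count(text):
    """Convert a string such as '2,443건' into the integer 2443."""
    without_suffix = text[:-1]                  # drop the trailing measure word
    digits = without_suffix.replace(',', '')    # replace() returns a new string
    return int(digits)


total = parse_article_count('2,443건')
print(total)         # 2443
print(total // 10)   # 244, plain integer arithmetic now works
```

The resulting integer can then be handed to range() to control how many pages the for loop visits.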

Once I managed to do #2, I got the code to work and successfully scraped exactly that number of pages/articles from the sub-directory in question. That felt good.

But when I ran the code on a sub-directory with a lot more articles, the code kept running and ended up scraping nearly double the amount of the actual articles in the sub-directory. I suppose this means that somewhere along the line, the code is scraping the same content over and over again. Need to do some code review to see where things are going wrong.

Meanwhile, I’m going to be looking into recursive calls to do this kind of web scraping on a broader level.

TIL differences between reading in files using Python

In passing, a colleague at work was looking at my code and explained how the three file-reading methods, read(), readline(), and readlines(), are different.

But I was focusing intently on the code and didn’t quite catch what the differences were. So afterwards, I did some googling and figured it out. [This page](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3) lays it out nice and clear. I’ll also post this link under Resources –> NLP and Data Science.

In short, the following are the differences:
* f.read() –> returns the entire contents of the file as a single string
* f.readline() –> reads and returns the next line of the file each time it is called
* f.readlines() –> returns a list of lines from the file, where each item in the list is a line
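
To see the three methods side by side, here is a small sketch that writes a throwaway file and reads it back three ways. The file name days.txt and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'days.txt')   # hypothetical sample file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Monday\nTuesday\nWednesday\n')

    # read(): the whole file as one string
    with open(path, encoding='utf-8') as f:
        whole_text = f.read()

    # readline(): one line per call, newline included
    with open(path, encoding='utf-8') as f:
        first_line = f.readline()
        second_line = f.readline()

    # readlines(): every line, collected into a list
    with open(path, encoding='utf-8') as f:
        all_lines = f.readlines()

print(repr(whole_text))   # 'Monday\nTuesday\nWednesday\n'
print(repr(first_line))   # 'Monday\n'
print(all_lines)          # ['Monday\n', 'Tuesday\n', 'Wednesday\n']
```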

From that same article, something to bear in mind:

Something to keep in mind when you are reading from files, once a file has been read using one of the read operations, it cannot be read again. For example, if you were to first run days_file.read() followed by days_file.readlines() the second operation would return an empty string. Therefore, anytime you wish to read from a file you will have to first open a new file variable.
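
A quick sketch of that behaviour, with two small caveats from trying it out: the second call, readlines(), comes back as an empty list rather than an empty string, and the same file object can be reused after rewinding it with seek(0). The file name and contents are again made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'days.txt')   # hypothetical sample file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Monday\nTuesday\n')

    with open(path, encoding='utf-8') as days_file:
        first_read = days_file.read()          # consumes the whole file
        second_read = days_file.readlines()    # nothing left to read
        days_file.seek(0)                      # move back to the start
        after_rewind = days_file.readlines()   # the lines are available again

print(repr(first_read))   # 'Monday\nTuesday\n'
print(second_read)        # []
print(after_rewind)       # ['Monday\n', 'Tuesday\n']
```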

TIL How to do web scraping / crawling

Grandma (on my mother’s side) passed away this Monday. She had been ill for some time…she was getting better, but her condition took a turn for the worse, and she passed away.

Went to 연천 yesterday to bury her ashes. May she rest in peace.

Today was my first day back at work after the funeral. I was gone for three days, but boy has a lot happened in that time. Over the weekend before the funeral, I was tasked with working on Korean measure words and how to extract them from our corpus data. I tried out some regex scripts, but never really got to review them together at work.

Today I got a new task: web scraping / crawling. I’m not sure if those two words are synonymous. At any rate, I was eager to learn something new (I’m always learning…does one ever stop?)

I was given some reference material and code to learn and work off of. It was some code for scraping data off of a search query from Daum. It did take me more than half of my day to dig in and figure out what was going on with the code.

I was glad to learn about Beautiful Soup, however. I had heard and read about the weird-sounding library/module, but never had a chance to check it out. I’m learning something new every day, and more often than not, I feel overwhelmed. But I’m working to push through to keep learning and not get too discouraged. I keep reminding myself that it hasn’t been too long since I began coding in earnest, and that it takes time to get my skills up to a decent level. I’m only starting out.

That shouldn’t be my excuse though.

I also learned about the requests module, which I could use to make HTTP requests. It was cool to use commands like the following to easily grab the source code of HTML pages:

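from bs4 import BeautifulSoup
import requests

# Fetch the page and keep its HTML source as text
r = requests.get('http://www.bbc.com/news/world-us-canada-40816708')
data = r.text

# Parse the HTML (the 'lxml' parser has to be installed separately)
soup = BeautifulSoup(data, 'lxml')

# Print the target of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'))  # assumed loop body; the original line is missing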

I found myself looking through a ton of HTML code. First time in a long time. I first dabbled in HTML and CSS back in high school, when I learned a bit at school. It was cool using the Chrome Developer tools to see which parts of the HTML code corresponded to which section of the webpage.

I’ve still got some ways to go to be able to web scrape with confidence, but I’m glad to see that I’ve made at least some progress so far. I want to share some links that I’ve found helpful.

Web Scraping

This page is all in Korean, but I learned some things. The author explains how the BS4 method get_text() works.

Using the requests module


Multi-line print options in Python.

I saw the end='' option argument used inside the print function, but didn’t know exactly what it did. This really showed me how this works.

for i in range(3):
	print(i, end='')
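
To make the effect of end visible, here is a small sketch that collects what print() writes for a few different end values. render is a hypothetical helper, and the io.StringIO buffer is only there so the result can be inspected as a string.

```python
import io


def render(end):
    """Print 0, 1, 2 into a buffer using the given end string."""
    buffer = io.StringIO()
    for i in range(3):
        print(i, end=end, file=buffer)
    return buffer.getvalue()


print(repr(render('\n')))   # '0\n1\n2\n'  (the default: one item per line)
print(repr(render('')))     # '012'        (nothing added after each item)
print(repr(render(' ')))    # '0 1 2 '     (a space after each item)
```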

The meaning of "__main__"
The following if statement was in the code, and some googling helped me understand how this works:

if __name__ == "__main__":
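
As I understand it, the check is true only when the file is run directly and false when the file is imported from another module. A minimal sketch, where main is just a conventional name I picked:

```python
def main():
    """Work that should only happen when this file is run as a script."""
    return "running as a script"


# __name__ is "__main__" when the file is executed directly,
# and the module's own name when it is imported elsewhere.
if __name__ == "__main__":
    print(main())
```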


Use of ‘global’ keyword for variables


urllib module


Use of .format

I had no idea you could do something like this! It feels quite similar to using %s.

print("First Module's Name: {}".format(__name__))
First Module's Name: __main__
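
For comparison, a short sketch of the same message built with format() and with %s. The value of module_name is hard-coded here purely for illustration.

```python
module_name = "__main__"   # stand-in for the real __name__ value

with_format = "First Module's Name: {}".format(module_name)
with_percent = "First Module's Name: %s" % module_name

print(with_format)    # First Module's Name: __main__
print(with_percent)   # First Module's Name: __main__
```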


TIL about .format

I was working with a Korean language corpus to extract what are called measure words.

The corpus had lines of text that consisted of product name, numbers, quantities, and measure words.

But the lines were all jumbled up, and I needed a way to add a \n at the end of each line, if only to be able to make the resulting text more human readable.

At first, I just ran a code block like the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i)  # assumed body; the original line is missing

But the result was not satisfactory. So I tried the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, \n)

The output of this code was nothing.

I realized what I had done, and added single quotes around \n like so:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, '\n')

But then I got an error like the below:

TypeError: write() takes exactly one argument (2 given)

After that, I concluded that I couldn’t add another argument to the method write(), so I thought I would try the join() method.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = '\n'.join(i)

But doing that returned results like the following:




So I tried putting something else inside the quotes. Just to see what the output would be.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = ';'.join(i)

I got the following output:

a;b;c;d;e;f;g; 1;2;f;g;a;d;

So I decided to do some more digging and came across this page that talked about using the format() method! I made some changes to my code as below, and I finally got the results I wanted 🙂

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write('{}\n'.format(i))  # assumed body; the original line is missing

Someone more advanced in Python skills could have just told me what to do from the start. But I guess there are things you learn while trying different things and getting different error messages… Here’s to making lots more errors!
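
Looking back, the join() attempts misbehaved because calling join() on a single string walks over its characters, so the separator ends up between every letter. The sketch below shows that, plus two ways of writing one item per line. The contents of captured are made up, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

captured = ['사과 3개', '물 2병', '종이 5장']   # hypothetical extracted lines

# join() on one string inserts the separator between its characters
per_character = ';'.join(captured[0])
print(per_character)   # 사;과; ;3;개

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'outfile.txt')

    # Option 1: format each item with a trailing newline
    with open(out_path, 'w', encoding='utf-8') as outf:
        for line in captured:
            outf.write('{}\n'.format(line))

    with open(out_path, encoding='utf-8') as inf:
        written = inf.read()

# Option 2: join the whole list (not each item) with newlines
joined = '\n'.join(captured) + '\n'

print(written == joined)   # True
```

Either option gives one item per line; joining the list builds the full text in memory first, which is fine for a small corpus.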

TIL that I need to keep up

Since my coding skills are not yet good enough, I do need to keep up and improve. The things I can currently do are pretty limited in scope, and I certainly hope to improve a lot faster. I intended to stay at the office a little after work and practice and learn, but my brother called asking when I’m going to be home. Turns out we were eating 삼겹살 (samgyeopsal) tonight, and I sure as hell wasn’t going to miss that. Again.

After dinner, I worked on the measure word extractor some more. I essentially started from scratch. I was so surprised that I could focus a lot better, working on this at home. Need to concentrate more at work. I did some googling and found out how to list the directory properly using os.listdir.

I got to the part where I needed to run a for loop on the files in the directory list, read each file, and put the lines into a list, but it just wouldn’t work. Lots of googling later, I came across an online book called Python for Informatics, which looked like a useful book.
(EDIT: There seems to be an updated version of the book. Check it out here. It referenced Think Python, another excellent free resource. In fact, I’ve been thinking about picking up a book on Python to read. I of course know that merely reading a book on coding won’t make me improve all that much. But the point is that I can’t watch stuff like Udemy on the bus to work, and when I get home, I have other stuff to do. I wanted to get an overview of Python and one way to do that is to just read through a book or two. My intention isn’t to remember everything; rather, I just want to be exposed to what Python is in its entirety. Just a sweeping overview, so I get what I need to learn further.)

Well, I guess today I learned a few things about the os module. I think working through that Python for Informatics will definitely be helpful in learning more.
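
For future reference, here is a minimal sketch of the loop I was fighting with: list a directory, read each file, and collect the lines into one list. One common stumbling block, which may or may not be what tripped me up, is that os.listdir returns bare file names, so they need to be joined with the directory path before opening. read_all_lines and the sample file names are hypothetical.

```python
import os
import tempfile


def read_all_lines(directory):
    """Collect the lines of every file in a directory into a single list."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)   # listdir gives names, not full paths
        if not os.path.isfile(path):
            continue                           # skip sub-directories
        with open(path, encoding='utf-8') as f:
            lines.extend(line.rstrip('\n') for line in f)
    return lines


# Usage with two throwaway files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(text)
    collected = read_all_lines(tmp_dir)

print(collected)   # ['one', 'two', 'three']
```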