TIL how to indent within WordPress code snippet without plugin

After tons and tons of googling, and almost giving up, I finally found out how to do two things:

  1. Add code snippets / blocks to WordPress posts with syntax highlighting
  2. How to indent lines

All of this without installing any plugins!

To begin the code snippet / block, use the following code:

And to indent, add the ASCII code for <tab> which is :

The above code will be rendered as follows:

for x in y:
	print x

TIL to use replace() to eliminate commas in numbers

I learned something so simple that it boggles the mind why I didn’t get it the first time. Well, I guess it goes to show how much of a novice I still am. While doing some web scraping/crawling, I needed to grab a string of number + text shown on the webpage (the string shown in the red box in the image below):

I had to grab that string because it indicated the total number of articles within that sub-directory.

I needed a way to tell my for loop to iterate over n pages/articles, and I thought that number would make the job easier.

So I grab the string, which looks like ‘2,443건’. Now I needed to do two things:
1. Strip away the final text ‘건’, which is a measure word for counting articles, incidents, etc.
2. Remove the comma as Python cannot process the commas in numbers as we do

Being the tyro that I am, I didn’t know what would be the best way to do #1. Fortunately, I got some help from someone, who suggested that I try doing the following:

Which, of course, worked!!! I had learned about using [:-1] on strings to capture the range of characters up to, but not including, the final character, but it was my first time actually using it in practice! It served its purpose, and I was glad to learn something new.

Now on to no. #2. The above example is a three-digit number, so there are no commas, but in the real web crawling example that I did, the number returned was 2,443, which of course has a comma. I had to find a way to strip the comma and return just the number.

A quick Google search led me to multiple Stack Overflow pages that addressed the very issue I had. The solution was surprisingly simple: just use the replace() string method to replace the comma with an empty string.

So I went about coding it up, and the following is what I did first:

asdf = '2,443건'
asdf.replace(',','')
asdf

But when I ran this code, the result I got was still ‘2,443건’. The comma still remained!

I didn’t get what was going on, and I enlisted the help of someone much more skilled at coding than I am. He was quick to point out why I was still getting ‘2,443건’, comma and all: replace() does not change the original string (Python strings are immutable). It returns a new string, so the return value must be stored in a variable. So I did the following to rectify this and it worked:

asdf = '2,443건'
asdf = asdf.replace(',','')
asdf

Voilà! It’s all in the details…
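
Putting both steps together, the whole conversion might look like the sketch below. The variable names are made up for illustration, and it assumes the string always ends in exactly one measure-word character.

```python
raw = '2,443건'

# Step 1: drop the final character (the measure word) with a slice.
# Step 2: remove the thousands separator; replace() returns a new string.
digits = raw[:-1].replace(',', '')

# Convert to an integer so it can drive a loop, e.g. range(total_articles).
total_articles = int(digits)
print(total_articles)
```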

Once I managed to do #2, I got the code to work and successfully scraped exactly that number of pages/articles from the sub-directory in question. That felt good.

But when I ran the code on a sub-directory with a lot more articles, the code kept running and ended up scraping nearly double the amount of the actual articles in the sub-directory. I suppose this means that somewhere along the line, the code is scraping the same content over and over again. Need to do some code review to see where things are going wrong.

Meanwhile, I’m going to be looking into recursive call to do this kind of web scraping on a broader level.

TIL that I’ve been living like a coward

I think too much. But I don’t know how it can be helped. I want to stop thinking sometimes. Just shut off my brain and just be in the moment. Really.

After all these years, I think I’ve just been living like a coward. I’ve always loved languages. That was always my thing. But I never envisioned that I could live a life doing languages. I would often think that my ideal kind of life would be if someone would pay me to just learn languages. But I don’t think I ever even tried to live out my dream life.

When you’re young, you can try different things and fail, but so what? You just get back up again. I remember Luca Lampariello saying that he was split between choosing a career in engineering and one in languages. After some deliberation, he ended up choosing languages. And look at what he does. I mean I don’t know how he feels about his life or if he wishes things were different, but I guess in my eyes, he’s living doing what he loves doing.

Right now, I’m working on learning NLP and data science, but in my free time, I just keep thinking about languages. Shouldn’t that mean something? Why don’t I just strike out on my own and try earning a living just from languages? All I do is dream about it, follow Instagram channels, blogs, YouTubers. Until when? Am I just going to let it ride til I’m old with more responsibilities than now??

Truth be told, I’m scared. I’m a coward. I’m scared to take on so much responsibility on my own to do something so….uncertain. How would I pay my bills? How would I save money? Won’t I make a fool of myself?

On top of that, I think there’s something in me that thinks that that kind of life or career is not a real career. I think I keep wanting to find something of a proper career, whatever that is supposed to be.

Or there are other excuses like…I’m too old or it’s time to move onto other more important things. But seeing as how I am still into languages even after all these years, isn’t that something? At any rate, I should be realistic. Dreams don’t pay the bills. At least undeveloped ones don’t.

TIL How to do web scraping / crawling

Grandma (on my mother’s side) passed away this Monday. She had been ill for some time…she was getting better, but her condition took a turn for the worse, and she passed away.

Went to 연천 yesterday to bury her ashes. May she rest in peace.

Today was my first day back at work after the funeral. I was gone for three days, but boy has a lot happened in that time. Over the weekend before the funeral, I was tasked with working on Korean measure words and how to extract them from our corpus data. I tried out some regex scripts, but never really got to review them together at work.

Today I got a new task: web scraping / crawling. I’m not sure if those two words are synonymous. At any rate, I was eager to learn something new (I’m always learning…does one ever stop?)

I was given some reference material and code to learn and work off of. It was some code for scraping data off of a search query from Daum. It did take me more than half of my day to dig in and figure out what was going on with the code.

I was glad to learn about Beautiful Soup, however. I had heard and read about the (weird-sounding) library/module, but never had a chance to check it out. I’m learning something new every day, and more often than not, I feel overwhelmed. But I’m working to push through to keep learning and not get too discouraged. I keep reminding myself that it hasn’t been too long since I began coding in earnest, and that it takes time to get my skills up to a decent level. I’m only starting out.

That shouldn’t be my excuse though.

I also learned about the requests module, which I could use to make HTTP requests. It was cool to use commands like the following to easily grab the source code of HTML pages:

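from bs4 import BeautifulSoup
import requests

# Fetch the page and keep its HTML source as text
r = requests.get('http://www.bbc.com/news/world-us-canada-40816708')
data = r.text

# Parse the HTML (the 'lxml' parser needs the lxml package installed)
soup = BeautifulSoup(data, 'lxml')

# Print the target of every link; get() gives None if an <a> has no href
for link in soup.find_all('a'):
    print(link.get('href'))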

I found myself looking through a ton of HTML code. First time in a long time. I first dabbled in HTML and CSS back in high school, when I learned a bit at school. It was cool using the Chrome Developer tools to see which parts of the HTML code corresponded to which section of the webpage.

I’ve still got some ways to go to be able to web scrape with confidence, but I’m glad to see that I’ve made at least some progress so far. I want to share some links that I’ve found helpful.

Web Scraping

This page is all in Korean, but I learned some things. He explains how the BS4 method get_text() works.
http://hurderella.tistory.com/108

Using the requests module

https://code.tutsplus.com/tutorials/using-the-requests-module-in-python--cms-28204

Multi-line print options in Python.

I saw the end='' option argument used inside the print function, but didn’t know exactly what it did. This really showed me how this works.

for i in range(3):
    print(i, end='')
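
To see the effect of end more clearly, here is a small sketch that captures what print() writes for a few different end values. The render() helper is made up for this illustration; it sends the output to an in-memory buffer instead of the screen so the results can be compared.

```python
import io

def render(end):
    """Print 0, 1, 2 using the given `end` string and return what was written."""
    buf = io.StringIO()
    for i in range(3):
        print(i, end=end, file=buf)
    return buf.getvalue()

no_break = render('')    # nothing is appended after each number
spaced = render(' ')     # a space is appended after each number
default = render('\n')   # same as calling print(i) with no end argument

print(repr(no_break), repr(spaced), repr(default))
```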

The meaning of "__main__"
The following if statement was in the code, and some googling helped me understand how this works:

if __name__ == "__main__":

https://stackoverflow.com/questions/419163/what-does-if-name-main-do
https://docs.python.org/3/library/main.html
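
As a minimal sketch of how that if statement is typically used (the main() function here is just an illustration, not the code I was given): the guarded block runs when the file is executed directly, and is skipped when the file is imported from another module.

```python
def main():
    # Whatever the script should do when run directly goes here.
    return "running as a script"

# __name__ is "__main__" only when this file is the entry point,
# e.g. `python myscript.py`; on import it is the module's name instead.
if __name__ == "__main__":
    print(main())
```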

Use of ‘global’ keyword for variables

https://stackoverflow.com/questions/4693120/use-of-global-keyword-in-python
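
A small sketch of what the keyword does, with made-up names: without the global statement, assigning to counter inside the function would create a new local variable (and counter += 1 would raise UnboundLocalError).

```python
counter = 0  # module-level (global) variable

def increment():
    # Tell Python that `counter` refers to the module-level name,
    # so the assignment below rebinds it instead of creating a local.
    global counter
    counter += 1

increment()
increment()
print(counter)
```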

urllib module

https://docs.python.org/3/library/urllib.html

Use of .format

I had no idea you could do something like this! It feels quite similar to using %s.

print("First Module's Name: {}".format(__name__))
First Module's Name: __main__

https://pyformat.info/
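
For comparison, here is a short sketch showing .format next to the older %s style. The module_name value is hard-coded purely for illustration.

```python
module_name = "__main__"  # stand-in value for illustration

# New-style formatting: {} is replaced by the argument to format().
with_format = "First Module's Name: {}".format(module_name)

# Old-style formatting: %s is replaced by the value after the % operator.
with_percent = "First Module's Name: %s" % module_name

print(with_format)
print(with_format == with_percent)
```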

TIL about .format

I was working with a Korean language corpus to extract what are called measure words.

The corpus had lines of text that consisted of product name, numbers, quantities, and measure words.

But the lines were all jumbled up, and I needed a way to add a \n at the end of each line, if only to be able to make the resulting text more human readable.

At first, I just ran a code block like the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i)

BTW I don’t know why the code block is showing the div tags…

But the result was not satisfactory. So I tried the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, \n)

The output of this code was nothing.

I realized what I had done, and added single quotes around \n like so:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, '\n')

But then I got an error like the below:

TypeError: write() takes exactly one argument (2 given)

After that, I concluded that I couldn’t add another argument to the method write(), so I thought I would try the join() method.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = '\n'.join(i)
        outf.write(final_sent)

But doing that returned results like the following:

a
b
c
d

1
adf
gadf

.
.
.

So I tried putting something else inside the quotes. Just to see what the output would be.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = ';'.join(i)
        outf.write(final_sent)

I got the following output:

a;b;c;d;e;f;g; 1;2;f;g;a;d;
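
The outputs above are consistent with how join() treats its argument: it iterates over whatever it is given, and iterating over a single string yields its characters one by one. A minimal sketch with made-up data:

```python
line = 'abcd'

# Joining a single string puts the separator between its characters.
per_char = ';'.join(line)

# Joining a list of strings puts the separator between whole items.
per_item = ';'.join(['abcd', '1adf'])

print(per_char)
print(per_item)
```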

So I decided to do some more digging and came across this page that talked about using the format() method! I made some changes to my code as below, and I finally got the results I wanted 🙂

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write('{}\n'.format(i))
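
For what it's worth, format() is not the only way to get one item per line. Since write() takes a single string, concatenating the newline also works when the items are already strings. The sketch below uses made-up sample data and a temporary directory so it can be run anywhere.

```python
import os
import tempfile

captured = ['apple 3 boxes', 'water 2 bottles']  # hypothetical sample lines

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')

with open(path, 'w', encoding='utf-8') as outf:
    for line in captured:
        # Build ONE string (item + newline) and pass that to write().
        outf.write(line + '\n')

with open(path, encoding='utf-8') as inf:
    written = inf.read()

print(written)
```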

Someone more advanced in Python skills could have just told me what to do from the start. But I guess there are things you learn while trying different things and getting different error messages…Here’s to making lots more errors!

TIL that I need to keep up

Since my coding skills are not yet good enough, I do need to keep up and improve. The things I can currently do are pretty limited in scope, and I certainly hope to improve a lot faster. I intended to stay at the office a little after work and practice and learn, but my brother called asking when I’m going to be home. Turns out we were eating 삼겹살 (samgyeopsal) tonight, and I sure as hell wasn’t going to miss that. Again.

After dinner, I worked on the measure word extractor some more. I essentially started from scratch. I was so surprised that I could focus a lot better, working on this at home. Need to concentrate more at work. I did some googling and found out how to list the directory properly using os.listdir.

I got to the part where I needed to run a for loop on the files in the directory list, read each file, and put the lines into a list, but it just wouldn’t work. Lots of googling later, I came across an online book called Python for Informatics, which looked like a useful book.
(EDIT: There seems to be an updated version of the book. Check it out here.) It referenced Think Python, another excellent free resource. In fact, I’ve been thinking about picking up a book on Python to read. I of course know that merely reading a book on coding won’t make me improve all that much. But the point is that I can’t watch stuff like Udemy on the bus to work, and when I get home, I have other stuff to do. I wanted to get an overview of Python and one way to do that is to just read through a book or two. My intention isn’t to remember everything; rather, I just want to be exposed to what Python is in its entirety. Just a sweeping overview, so I get what I need to learn further.

Well, I guess today I learned a few things about the os module. I think working through that Python for Informatics will definitely be helpful in learning more.
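
For reference, the loop I was stuck on (list the directory, read each file, collect the lines) can be sketched roughly as follows. This is not my actual extractor code; the function name and sample files are made up. One detail that is easy to miss is that os.listdir returns bare file names, so each one has to be joined with the directory path before opening it.

```python
import os
import tempfile

def read_all_lines(directory):
    """Collect the lines of every regular file in `directory` into one list."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)  # listdir gives names, not paths
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, encoding='utf-8') as f:
            lines.extend(line.rstrip('\n') for line in f)
    return lines

# Small demo with two throwaway files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
    with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        f.write(text)

all_lines = read_all_lines(demo_dir)
print(all_lines)
```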

TIL that I’ve just been wingin’ it for too long

This thought was just floating through my head, but I wanted to capture it and put it in writing, so here it is.

In university, I liked linguistics, and I did quite well in the subject. Did I do well because I liked the subject matter?

I felt like even though I didn’t study that much, I just got it. It just made sense to me.

In contrast, when I would come across some topic or concept in my other classes, say in my business classes, that I found difficult, I wouldn’t try as hard to get it.

So this got me thinking. Did I just do better in my linguistics classes because I could wing it and still get by with it?

And I feel it was the same with many other things. I could get by with stuff. But that streak won’t last. You might think, “Duh. What did you expect?” Well, I do need a reality check I think.

At any rate, that got me thinking. I must really like to just do what I want to do. Like, I can’t be forced to do things I don’t want to do. And I don’t like to be taken out of my comfort zone and be exposed to unfamiliar material, concepts, situations, etc.

But this is something I need to work on.

TIL how to use Vim

Today I discovered an awesome interactive Vim tutorial. You can check it out here.

It’s been two days since I started learning bash scripting. I had to start learning it because I joined the speech recognition team at work, and this team does a lot of work on black screens every day.

I’ve been trying to understand and interpret a chunk of bash code using various online resources. I certainly am making progress, but definitely not at the rate I would like to be.

At any rate, I began playing around with Vim, and also found out about Vimium for Chrome.

I installed the plugin at the office today, and have been playing around with it since. I thought it was so cool that you could control everything with just the keyboard. I’m still trying to memorize the keyboard shortcuts, but I’m enjoying the process.

TIL difference between supervised and unsupervised learning

TIL the difference between supervised learning and unsupervised learning. It’s about time. I mean, I work in an NLP/machine learning team at a startup that uses machine learning to make things. I ought to know it by now. I guess it helped that I heard these terms thrown about left and right. We all learn at our own pace.

I came across the Wikipedia page for statistical classification, and this short paragraph just jumped out at me:

In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

Now I understood why we had used terms like ‘clustering’ at work. When we grouped together sentences, it was an unsupervised procedure – clustering – based on cosine similarity. If we had used some sort of heuristics to label a subset of the data, then trained a model on those labels to classify the rest of the data, that would have been supervised learning.
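
To make the "measure of similarity" part concrete, here is a minimal sketch of cosine similarity between two sentences. It assumes a plain bag-of-words representation (word counts after splitting on whitespace), which is only one possible choice and not necessarily what we used at work.

```python
import math
from collections import Counter

def cosine_similarity(sent_a, sent_b):
    """Cosine similarity between the word-count vectors of two sentences."""
    va, vb = Counter(sent_a.split()), Counter(sent_b.split())
    # Dot product over the words of the first sentence; a Counter
    # returns 0 for words that are missing from the second one.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

same = cosine_similarity("the cat sat", "the cat sat")
partial = cosine_similarity("the cat sat", "the dog sat")
disjoint = cosine_similarity("the cat sat", "a dog ran")
print(same, partial, disjoint)
```

Sentences whose similarity is above some threshold could then be put in the same group, with no labels involved, which is what makes the procedure unsupervised.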