TIL how to indent within WordPress code snippet without plugin

After tons and tons of googling, and almost giving up, I finally found out how to do two things:

  1. Add code snippets / blocks to WordPress posts with syntax highlighting
  2. Indent lines inside those snippets

All of this without installing any plugins!

To begin the code snippet / block, use the following code:

And to indent, add the ASCII code for <tab> which is :

The above code will be rendered as follows:

for x in y:
	print(x)

TIL to use replace() to eliminate commas in numbers

I learned something so simple that it boggles the mind why I didn’t get it the first time. Well, I guess it goes to show how much of a novice I still am. While doing some web scraping/crawling, I needed to grab a string of number+text shown on the webpage (the string shown in the red box in the image below):

I had to grab that string because it indicated the total number of articles within that sub-directory.

I needed a way to tell my for loop to iterate over n pages/articles, and I thought that number would make the job easier.

So I grabbed the string, which looks like ‘2,443건’. Now I needed to do two things:
1. Strip away the final text ‘건’, which is a measure word for counting articles, incidents, etc.
2. Remove the comma, since Python’s int() cannot parse a number string with commas in it the way we read it (int('2,443') raises a ValueError)

Being the tyro that I am, I didn’t know what would be the best way to do #1. Fortunately, I got some help from someone, who suggested that I try doing the following:

Which, of course, worked!!! I had learned that [:-1] on a string captures every character up to, but not including, the final one, but it was my first time actually using it in practice! It served its purpose, and I was glad to learn something new, of course.
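
The suggested snippet is not shown above, so here is only a minimal sketch of the [:-1] idea. The sample string '443건' is a hypothetical three-digit count, not the actual scraped value.

```python
# Hypothetical scraped text: a three-digit count followed by the measure word
count_text = '443건'

# [:-1] keeps everything up to, but not including, the last character
number_part = count_text[:-1]

# With no commas in the way, int() can parse the remaining digits directly
count = int(number_part)
print(count)
```

Slicing like this only works when the unit is exactly one character long, which holds for ‘건’ but would need adjusting for longer measure words.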

Now on to no. #2. The above example is a three-digit number, so there are no commas, but in the real web crawling example that I did, the number returned was 2,443, which of course has a comma. I had to find a way to strip the comma and return just the number.

A quick Google search led me to multiple Stack Overflow pages that addressed the very issue I had. The solution was surprisingly simple: just use the replace() string method to replace the comma with an empty string.

So I went about coding it up, and the following is what I did first:

asdf = '2,443건'
asdf.replace(',','')
asdf

But when I ran this code, the result I got was still ‘2,443건’. The comma still remained!

I didn’t get what was going on, and I enlisted the help of someone much more skilled at coding than I am. He was quick to point out that I was still getting ‘2,443건’, comma and all, because strings in Python are immutable: replace() doesn’t change the original string but returns a new one, and that return value has to be stored in a variable. So I did the following to rectify this and it worked:

asdf = '2,443건'
asdf = asdf.replace(',','')
asdf

Voilà! It’s all in the details…
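
Putting the two steps together, the whole conversion might look something like the sketch below. The helper name parse_count is made up for illustration, and it assumes the count always ends in a single-character measure word.

```python
def parse_count(raw):
    """Turn a scraped count such as '2,443건' into an int (hypothetical helper)."""
    # Step 1: drop the trailing one-character measure word
    without_unit = raw[:-1]
    # Step 2: replace() returns a NEW string, so the result must be kept
    without_commas = without_unit.replace(',', '')
    return int(without_commas)


total = parse_count('2,443건')
print(total)

# The loop can then use the number directly, e.g. for n in range(total): ...
```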

Once I managed to do #2, I got the code to work and successfully scraped exactly that number of pages/articles from the sub-directory in question. That felt good.

But when I ran the code on a sub-directory with a lot more articles, the code kept running and ended up scraping nearly double the amount of the actual articles in the sub-directory. I suppose this means that somewhere along the line, the code is scraping the same content over and over again. Need to do some code review to see where things are going wrong.

Meanwhile, I’m going to be looking into recursive calls to do this kind of web scraping on a broader level.

TIL that I’ve been living like a coward

I think too much. But I don’t know how it can be helped. I want to stop thinking sometimes. Just shut off my brain and just be in the moment. Really.

After all these years, I think I’ve just been living like a coward. I’ve always loved languages. That was always my thing. But I never envisioned that I could live a life doing languages. I would often think that my ideal kind of life would be if someone would pay me to just learn languages. But I don’t think I ever even tried to live out my dream life.

When you’re young, you can try different things and fail, but so what? You just get back up again. I remember Luca Lampariello saying that he was split between choosing a career in engineering and one in languages. After some deliberation, he ended up choosing languages. And look at what he does. I mean I don’t know how he feels about his life or if he wishes things were different, but I guess in my eyes, he’s living doing what he loves doing.

Right now, I’m working on learning NLP and data science, but in my free time, I just keep thinking about languages. Shouldn’t that mean something? Why don’t I just strike out on my own and try earning a living just from languages? All I do is dream about it, follow Instagram channels, blogs, YouTubers. Until when? Am I just going to let it ride til I’m old with more responsibilities than now??

Truth be told, I’m scared. I’m a coward. I’m scared to take on so much responsibility on my own to do something so….uncertain. How would I pay my bills? How would I save money? Won’t I make a fool of myself?

On top of that, I think there’s something in me that thinks that that kind of life or career is not a real career. I think I keep wanting to find something of a proper career, whatever that is supposed to be.

Or there’s other excuses like…I’m too old or it’s time to move onto other more important things. But seeing as how I am still into languages even after all these years, isn’t that something? At any rate, I should be realistic. Dreams don’t pay the bills. At least undeveloped ones don’t.

TIL differences between reading in files using Python

In passing, a colleague at work was looking at my code and explained how the three file-reading methods, read(), readline(), and readlines(), are different.

But I was focusing intently on the code and didn’t quite catch what the differences were. So afterwards, I did some googling and figured it out. [This page](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3) lays it out nice and clear. I’ll also post this link under Resources –> NLP and Data Science.

In short, the following are the differences:
* f.read() –> returns the entire contents of the file as a single string
* f.readline() –> reads the file one line at a time, returning the next line on each call
* f.readlines() –> returns a list of lines from the file, where each item in the list is a line
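
To see the three side by side, here is a small sketch. The file name days.txt and its contents are made up for the example, and the file is written to a temporary folder so the snippet runs on its own.

```python
import os
import tempfile

# Hypothetical sample file with three lines
path = os.path.join(tempfile.mkdtemp(), 'days.txt')
with open(path, 'w') as f:
    f.write('Monday\nTuesday\nWednesday\n')

with open(path) as f:
    whole = f.read()         # one string holding the entire file

with open(path) as f:
    first = f.readline()     # just the first line, newline included
    second = f.readline()    # calling it again gives the next line

with open(path) as f:
    lines = f.readlines()    # a list with one string per line

print(repr(whole))
print(repr(first), repr(second))
print(lines)
```

Note that the lines keep their trailing '\n', so a strip() or rstrip('\n') is usually needed before working with them.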

From that same article, something to bear in mind:

Something to keep in mind when you are reading from files, once a file has been read using one of the read operations, it cannot be read again. For example, if you were to first run days_file.read() followed by days_file.readlines() the second operation would return an empty string. Therefore, anytime you wish to read from a file you will have to first open a new file variable.
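
A quick sketch of that behavior, again with a made-up days.txt in a temporary folder. Strictly speaking, readlines() after read() gives back an empty list rather than an empty string, and besides reopening the file, calling f.seek(0) also moves back to the start.

```python
import os
import tempfile

# Hypothetical sample file
path = os.path.join(tempfile.mkdtemp(), 'days.txt')
with open(path, 'w') as f:
    f.write('Monday\nTuesday\n')

with open(path) as f:
    first_pass = f.read()          # consumes the whole file
    second_pass = f.readlines()    # nothing left to read, so this is empty
    f.seek(0)                      # rewind to the beginning of the file
    third_pass = f.readlines()     # now the lines are available again

print(repr(first_pass))
print(second_pass)
print(third_pass)
```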

TIL How to do web scraping / crawling

Grandma (on my mother’s side) passed away this Monday. She had been ill for some time…she was getting better, but her condition took a turn for the worse, and she passed away.

Went to 연천 yesterday to bury her ashes. May she rest in peace.

Today was my first day back at work after the funeral. I was gone for three days, but boy has a lot happened in that time. Over the weekend before the funeral, I was tasked with working on Korean measure words and how to extract them from our corpus data. I tried out some regex scripts, but never really got to review them together at work.

Today I got a new task: web scraping / crawling. I’m not sure if those two words are synonymous. At any rate, I was eager to learn something new (I’m always learning…does one ever stop?)

I was given some reference material and code to learn and work off of. It was some code for scraping data off of a search query from Daum. It did take me more than half of my day to dig in and figure out what was going on with the code.

I was glad to learn about Beautiful Soup, however. I had heard and read about the (weird-sounding) library/module, but never had a chance to check it out. I’m learning something new every day, and more often than not, I feel overwhelmed. But I’m working to push through to keep learning and not get too discouraged. I keep reminding myself that it hasn’t been too long since I began coding in earnest, and that it takes time to get my skills up to a decent level. I’m only starting out.

That shouldn’t be my excuse though.

I also learned about the requests module, which lets me make HTTP requests and pull down pages. It was cool to use commands like the following to easily grab the source code of HTML pages:

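import requests
from bs4 import BeautifulSoup

# Download the page and keep its HTML source as text
r = requests.get('http://www.bbc.com/news/world-us-canada-40816708')
data = r.text

# Parse the HTML (the 'lxml' parser needs the lxml package installed)
soup = BeautifulSoup(data, 'lxml')

# Print the target of every link; href=True skips <a> tags without an href
for link in soup.find_all('a', href=True):
    print(link.attrs['href'])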

I found myself looking through a ton of HTML code. First time in a long time. I first dabbled in HTML and CSS back in high school, when I learned a bit at school. It was cool using the Chrome Developer tools to see which parts of the HTML code corresponded to which section of the webpage.

I’ve still got some ways to go to be able to web scrape with confidence, but I’m glad to see that I’ve made at least some progress so far. I want to share some links that I’ve found helpful.

Web Scraping

This page is all in Korean, but I learned some things. He explains how the BS4 get_text() method works.
http://hurderella.tistory.com/108

Using the requests module

https://code.tutsplus.com/tutorials/using-the-requests-module-in-python--cms-28204

Multi-line print options in Python.

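I saw the end='' keyword argument used inside the print function, but didn’t know exactly what it did. By default print() finishes with a newline; end='' replaces that newline with an empty string, so the next print continues on the same line. This really showed me how this works.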

for i in range(3):
    print(i, end='')  # prints 012 on a single line
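
To compare a few values of end, here is a small sketch. The output is captured into a string with redirect_stdout only so that the differences are easy to inspect; that part is just scaffolding for the example.

```python
import io
from contextlib import redirect_stdout

def printed(end):
    """Return what three print() calls produce for a given end value."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        for i in range(3):
            print(i, end=end)
    return buf.getvalue()

default_output = printed('\n')   # the default: one number per line
joined_output = printed('')      # nothing added after each number
spaced_output = printed(' ')     # a space after each number

print(repr(default_output))
print(repr(joined_output))
print(repr(spaced_output))
```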

The meaning of "__main__"

The following if statement was in the code, and some googling helped me understand how this works:

if __name__ == "__main__":

https://stackoverflow.com/questions/419163/what-does-if-name-main-do
https://docs.python.org/3/library/__main__.html
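
A minimal sketch of how that guard is typically used. The function name main and its return value are made up for illustration.

```python
def main():
    """Hypothetical entry point for the script."""
    return 'running as a script'


# __name__ is "__main__" only when this file is run directly
# (e.g. python myscript.py). When the file is imported from another
# module, __name__ is the module's own name, so this block is skipped.
if __name__ == "__main__":
    print(main())
```

This way the functions in the file can be imported and reused elsewhere without the script part kicking off on import.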

Use of ‘global’ keyword for variables

https://stackoverflow.com/questions/4693120/use-of-global-keyword-in-python

urllib module

https://docs.python.org/3/library/urllib.html

Use of .format

I had no idea you could do something like this! It feels quite similar to using %s.

print("First Module's Name: {}".format(__name__))
First Module's Name: __main__
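
For comparison, here is a small sketch of the same string built three ways: with .format, with the older %s style, and with an f-string (Python 3.6+). The variable name is a stand-in for __name__ so the result is the same wherever the snippet runs.

```python
# Stand-in value for __name__, fixed here so the output is predictable
name = '__main__'

with_format = "First Module's Name: {}".format(name)
with_percent = "First Module's Name: %s" % name
with_fstring = f"First Module's Name: {name}"

print(with_format)
print(with_format == with_percent == with_fstring)
```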

https://pyformat.info/

TIL about .format

I was working with a Korean language corpus to extract what are called measure words.

The corpus had lines of text that consisted of product name, numbers, quantities, and measure words.

But the lines were all jumbled up, and I needed a way to add a \n at the end of each line, if only to be able to make the resulting text more human readable.

At first, I just ran a code block like the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i)

BTW I don’t know why the code block is showing the div tags…

But the result was not satisfactory. So I tried the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, \n)

The output of this code was nothing: a bare \n outside of quotes is not valid Python syntax, so nothing got written.

I realized what I had done, and added single quotes around \n like so:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, '\n')

But then I got an error like the below:

TypeError: write() takes exactly one argument (2 given)

After that, I concluded that I couldn’t add another argument to the method write(), so I thought I would try the join() method.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = '\n'.join(i)
        outf.write(final_sent)

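But since each i is itself a string, join() treats it as a sequence of characters and puts a newline between every single one, which returned results like the following: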

a
b
c
d

1
adf
gadf

.
.
.

So I tried putting something else inside the quotes. Just to see what the output would be.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = ';'.join(i)
        outf.write(final_sent)

I got the following output:

a;b;c;d;e;f;g; 1;2;f;g;a;d;

So I decided to do some more digging and came across this page that talked about using the format() method! I made some changes to my code as below, and I finally got the results I wanted 🙂

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write('{}\n'.format(i))
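
For what it’s worth, here is a sketch of two other ways that should give the same file: plain string concatenation, and calling join() on the whole list rather than on each string. The contents of captured are hypothetical sample lines, and the file goes to a temporary folder so the snippet runs on its own.

```python
import os
import tempfile

# Hypothetical stand-in for the captured corpus lines
captured = ['사과 3개', '연필 2자루', '책 5권']

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')

# Option 1: write() takes one argument, so build a single string first
with open(path, 'w', encoding='utf-8') as outf:
    for item in captured:
        outf.write(item + '\n')

with open(path, encoding='utf-8') as inf:
    written = inf.read()

# Option 2: join the LIST of lines, not the characters of one line
joined = '\n'.join(captured) + '\n'

# Joining a single string splits it into characters, as seen earlier
per_character = '\n'.join('abc')

print(written == joined)
print(repr(per_character))
```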

Someone more advanced in Python skills could have just told me what to do from the start. But I guess there are things you learn while trying different things and getting different error messages…Here’s to making lots more errors!