A different take on the future: 未來

Coming into the office today, this thought – or rather, this word – popped into my head.

The word future in Korean – 미래 – is merely the surface representation of the Hanja (Chinese characters) 未來 (みらい in Japanese; wèilái in Mandarin Chinese). And I started thinking about the meaning of the individual characters.

The first character 未 is often used as the first syllable of a two-syllable word in Korean to denote ‘not’ or ‘un-’, effectively negating or diminishing the meaning of the following syllable. I would guess that the character is used similarly in Japanese and Chinese, although I cannot say with certainty, as I am neither well-read in nor a native speaker of those languages.

Some examples of this character in use are as follows:
미흡 (未洽) – inadequate, insufficient
미숙 (未熟) – immature

And the second character of our word at the outset is 來, meaning ‘to come’. Now comes the kicker. In my mind, I got thinking about the semantic communion of these two characters, side by side. Perhaps a literal translation could be rendered as ‘un-come’ or ‘not-come’. A more convenient construal – and philosophical – could be ‘that which has not yet come’.

This is the more natural meaning that occurred to me as I was walking down a quiet alleyway to the office on a Thursday during Chuseok (Korean Thanksgiving) week.

And I kind of liked it because it has something of a philosophical meaning to it. Perhaps even a deterministic ring to it. The future, in the Chinese, is something that has not come yet. In time, it will come.

As soon as I got into the office, I looked up the etymology of the word ‘future’. Dictionary.com and the online Oxford English dictionary trace the English word back to Latin futūrus, which apparently means ‘about to be’ and is the future participle of the verb ‘to be’, esse. The online Oxford English dictionary also notes that this future participle comes from the stem fu-, which ultimately goes back to a base meaning ‘grow, become’. So in that sense, the connotation of the word future in English can be construed as ‘about to grow’ or ‘about to become’. Of course, these are all my own conjectures and interpretations, and they are open to correction.

In all, it intrigues me to ponder the different, yet similar, ways in which Latin and Chinese captured in writing the concept of what comes next in the changing states of our lives. The Latin seems to say that with the passing of time, we grow or we become: we change with the times, and hopefully for the better. And the Chinese seems to say that the future is something that has not yet come, but inevitably will, so long as time continues its relentless march. The Latin has a sort of optimistic outlook on what the future holds, while I feel that the Chinese gives off a sense of mystery and curious pondering about what will befall me going forward. I like this kind of thing. It feels to me like etymology with sensibilities. And etymology is, after all, a history of a people’s outlook on life, so long as words can be said to capture the sense of what people thought and felt at the time.


TIL how to install packages using Anaconda Navigator

To do my web crawling, I started using Selenium, a browser automation tool with Python bindings that is often used for web crawling.

I installed it from my command prompt by doing ‘pip install selenium’, and Selenium was working just fine in PyCharm and Python shell. But when I tried doing ‘import selenium’ in a Jupyter notebook, I kept getting a module not found error.

It turned out to be a Python path issue. In short, Selenium had already been installed, but Jupyter could not import Selenium because it wasn’t pointing to the path where Selenium had been installed.
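A quick way to confirm this kind of mismatch, offered as a sketch rather than what I actually ran at the time: print which interpreter the current session is using and whether it can locate the package. The helper name `describe_environment` is my own invention; `sys.executable` and `importlib.util.find_spec` are standard library.

```python
import importlib.util
import sys


def describe_environment(package_name):
    """Report which interpreter is running and whether it can find a package."""
    spec = importlib.util.find_spec(package_name)
    return {
        "interpreter": sys.executable,
        "package_found": spec is not None,
        "package_location": spec.origin if spec is not None else None,
    }


# Run this in both the Jupyter notebook and the Python shell and compare.
print(describe_environment("selenium"))
```

If the two sessions report different interpreters, the package was probably installed for one of them and not the other.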

The usual digging led me to this helpful thread, but in the end I couldn’t get Selenium to work on my Jupyter notebook by following the instructions provided.

Fortunately, an intern told me about Anaconda Navigator, a GUI-based application that can be used to install packages for the (virtual) environment running Jupyter notebook. As long as you have Anaconda installed, you just have to run the command below to install Navigator:

So I tried searching for the Selenium package on Anaconda Navigator, but the search returned no results.

After doing some digging, I came across this site that had a command I could run to get Selenium:

conda install -c conda-forge selenium

After running this command in my command prompt, I got Selenium to work in Jupyter notebook!

Now I could use Selenium and Chrome Driver in a Jupyter notebook just fine.

TIL great explanation of named entity recognition (NER)

I came across a great explanation of named entity recognition (NER).

Since my startup is building a chatbot as well, I’ve heard the term NER being tossed around, and this link helped to clarify what exactly NER is.

Below is a definition of NER from the site:

NER is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

I figured that NER will be very important for enabling a chatbot to derive semantic information from a conversation.

This hierarchy below shows a possible categorization of entities:

The four general types of entities in the image are as follows:

  • numerals: numbers

  • patterns: strings that have predictable patterns and thus can be captured using regular
    expressions (e.g. email addresses, phone numbers, etc.)

  • temporal: time and date

  • textual: strings that have been pre-defined in a dictionary (e.g. names of cities, user-location, etc.)
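To make the "patterns" and "numerals" categories a bit more concrete, here is a minimal sketch using regular expressions. The two patterns and the function name are my own, deliberately simple assumptions; they are not taken from the site, and real-world patterns need far more care.

```python
import re

# Deliberately simple, hypothetical patterns for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def find_pattern_entities(message):
    """Return the email addresses and numbers found in a message."""
    emails = EMAIL_PATTERN.findall(message)
    # Remove emails first so digits inside an address are not counted as numbers.
    remainder = EMAIL_PATTERN.sub(" ", message)
    numbers = NUMBER_PATTERN.findall(remainder)
    return {"email": emails, "number": numbers}


print(find_pattern_entities("Send 2 tickets to kim@example.com by 5.30"))
```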

Some important parameters required for entity detection:

  • message: the actual message from user

  • entity name: If a dictionary is used, this entity name is important.

  • structured value: value grabbed from structured text

  • fallback value: this is the value that is returned in the event that detection logic fails to grab any value from either structured value or message.

  • bot message: previous message from the bot or agent.
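As I understand the parameters above, the fallback value only comes into play when nothing can be detected from the structured value or the message. The sketch below shows one possible ordering. The function name, the city list, and the choice to check the structured value before the message are all my own assumptions, and the bot message parameter is left out to keep it short.

```python
def detect_city(message, structured_value=None, fallback_value=None,
                known_cities=("Seoul", "Busan", "Incheon")):
    """Pick a city using the order: structured value, message, fallback."""
    # 1. Trust a structured value (e.g. from a form field) when it is valid.
    if structured_value in known_cities:
        return structured_value
    # 2. Otherwise look for a dictionary entry inside the free-text message.
    for city in known_cities:
        if city.lower() in message.lower():
            return city
    # 3. Nothing detected: return the fallback value.
    return fallback_value


print(detect_city("I live in busan"))
print(detect_city("hello", fallback_value="Seoul"))
```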

TIL how to indent within WordPress code snippet without plugin

After tons and tons of googling, and almost giving up, I finally found out how to do two things:

  1. Add code snippets / blocks to WordPress posts with syntax highlighting
  2. Indent lines within those snippets

All of this without installing any plugins!

To begin the code snippet / block, use the following code:

And to indent, add the ASCII code for <tab> which is :

The above code will be rendered as follows:

for x in y:
	print(x)

TIL to use replace() to eliminate commas in numbers

I learned something so simple that it boggles the mind why I didn’t get it the first time. Well, I guess it goes to show how much of a novice I still am. While doing some web scraping/crawling, I needed to grab a string of number + text shown on the webpage (the string shown in the red box in the image below):

I had to grab that string because it indicated the total number of articles within that sub-directory.

I needed a way to tell my for loop to iterate over n pages/articles, and I thought that number would make the job easier.

So I grabbed the string, which looked like ‘2,443건’. Now I needed to do two things:
1. Strip away the final text ‘건’, which is a measure word for counting articles, incidents, etc.
2. Remove the comma, since Python cannot convert a string containing commas into a number directly

Being the tyro that I am, I didn’t know what would be the best way to do #1. Fortunately, I got some help from someone, who suggested that I try doing the following:

Which, of course, worked!!! I had learned that [:-1] on a string captures every character up to, but not including, the final one, but this was my first time actually using it in practice. It served its purpose, and I was glad to learn something new.
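For reference, here is a minimal illustration of that slice. The value is a made-up one in the same shape as the scraped string, not the exact code I was given.

```python
count_text = "443건"            # hypothetical value shaped like the scraped string
number_part = count_text[:-1]   # everything except the final character
print(number_part)              # 443
```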

Now on to #2. The above example is a three-digit number, so there are no commas, but in the real web crawling example, the number returned was 2,443, which of course has a comma. I had to find a way to strip the comma and return just the number.

A quick Google search led me to multiple Stack Overflow pages that addressed the very issue I had. The solution was surprisingly simple: just use the replace() string method to replace the comma with an empty string.

So I went about coding it up, and the following is what I did first:

asdf = '2,443건'
asdf.replace(',','')
asdf

But when I ran this code, the result I got was still ‘2,443건’. The comma still remained!

I didn’t get what was going on, and I enlisted the help of someone much more skilled at coding than I am. He was quick to point out why I was still getting ‘2,443건’, comma and all: replace() does not change the original string. Python strings are immutable, so replace() returns a new string, and that return value must be stored in a variable. So I did the following to rectify this and it worked:

asdf = '2,443건'
asdf = asdf.replace(',','')
asdf

Voilà! It’s all in the details…
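Putting steps #1 and #2 together, a small helper along these lines turns the scraped string into an integer that a for loop can use. The function name is mine, and it assumes the string always ends in exactly one non-digit character such as ‘건’.

```python
def parse_article_count(count_text):
    """Convert a string like '2,443건' into the integer 2443."""
    # Drop the trailing measure word, then remove the thousands separator.
    digits = count_text[:-1].replace(",", "")
    return int(digits)


total = parse_article_count("2,443건")
print(total)  # 2443
```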

Once I managed to do #2, I got the code to work and successfully scraped exactly that number of pages/articles from the sub-directory in question. That felt good.

But when I ran the code on a sub-directory with a lot more articles, the code kept running and ended up scraping nearly double the amount of the actual articles in the sub-directory. I suppose this means that somewhere along the line, the code is scraping the same content over and over again. Need to do some code review to see where things are going wrong.
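I haven’t found the cause yet, but one way to guard against duplicates while I look, assuming each article has a unique URL, is to remember what has already been seen. This is only a sketch with made-up URLs, not the fix for my actual code.

```python
def unique_in_order(article_urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for url in article_urls:
        if url in seen:
            continue  # already scraped this one
        seen.add(url)
        unique.append(url)
    return unique


# Hypothetical URLs standing in for scraped article links.
scraped = ["https://example.com/a/1", "https://example.com/a/2", "https://example.com/a/1"]
print(unique_in_order(scraped))
```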

Meanwhile, I’m going to look into recursive calls for doing this kind of web scraping on a broader level.

TIL that I’ve been living like a coward

I think too much. But I don’t know how it can be helped. I want to stop thinking sometimes. Just shut off my brain and just be in the moment. Really.

After all these years, I think I’ve just been living like a coward. I’ve always loved languages. That was always my thing. But I never envisioned that I could live a life doing languages. I would often think that my ideal kind of life would be if someone would pay me to just learn languages. But I don’t think I ever even tried to live out my dream life.

When you’re young, you can try different things and fail, but so what? You just get back up again. I remember Luca Lampariello saying that he was split between choosing a career in engineering and one in languages. After some deliberation, he ended up choosing languages. And look at what he does. I mean I don’t know how he feels about his life or if he wishes things were different, but I guess in my eyes, he’s living doing what he loves doing.

Right now, I’m working on learning NLP and data science, but in my free time, I just keep thinking about languages. Shouldn’t that mean something? Why don’t I just strike out on my own and try earning a living just from languages? All I do is dream about it, follow Instagram channels, blogs, YouTubers. Until when? Am I just going to let it ride til I’m old with more responsibilities than now??

Truth be told, I’m scared. I’m a coward. I’m scared to take on so much responsibility on my own to do something so….uncertain. How would I pay my bills? How would I save money? Won’t I make a fool of myself?

On top of that, I think there’s something in me that thinks that that kind of life or career is not a real career. I think I keep wanting to find something of a proper career, whatever that is supposed to be.

Or there’s other excuses like…I’m too old or it’s time to move onto other more important things. But seeing as how I am still into languages even after all these years, isn’t that something? At any rate, I should be realistic. Dreams don’t pay the bills. At least undeveloped ones don’t.

TIL differences between reading in files using Python

In passing, a colleague at work was looking at my code and explained how the three file-reading methods, read(), readline(), and readlines(), differ.

But I was focusing intently on the code and didn’t quite catch what the differences were. So afterwards, I did some googling and figured it out. [This page](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3) lays it out nice and clear. I’ll also post this link under Resources –> NLP and Data Science.

In short, the following are the differences:
* f.read() –> returns the entire contents of the file as a single string
* f.readline() –> reads the file line by line, returning one line per call
* f.readlines() –> returns a list of lines from the file, where each item in the list is a line
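Here is a small self-contained sketch of the three methods side by side. The file contents are made up, and a temporary file is used so the example leaves nothing behind.

```python
import os
import tempfile

# Create a small throwaway file to read from (the contents are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Monday\nTuesday\nWednesday\n")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    whole_text = f.read()        # one string with the entire contents

with open(path, encoding="utf-8") as f:
    first_line = f.readline()    # just the first line, newline included

with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()    # list of lines, each ending in a newline

os.remove(path)
print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```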

From that same article, something to bear in mind:

Something to keep in mind when you are reading from files, once a file has been read using one of the read operations, it cannot be read again. For example, if you were to first run days_file.read() followed by days_file.readlines() the second operation would return an empty string. Therefore, anytime you wish to read from a file you will have to first open a new file variable.
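When I tried this out myself, two details stood out. After read() has consumed the file, readlines() gives back an empty list (read() would give an empty string). And as an alternative to reopening the file, seek(0) moves the read position back to the start. A minimal sketch, with the variable name borrowed from the quote and made-up contents:

```python
import tempfile

with tempfile.TemporaryFile("w+", encoding="utf-8") as days_file:
    days_file.write("Monday\nTuesday\n")
    days_file.seek(0)                     # rewind after writing

    first_pass = days_file.read()         # consumes the whole file
    second_pass = days_file.readlines()   # nothing left to read
    days_file.seek(0)                     # move back to the start
    third_pass = days_file.readlines()    # the lines are available again

print(second_pass)  # []
print(third_pass)   # ['Monday\n', 'Tuesday\n']
```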

TIL How to do web scraping / crawling

Grandma (on my mother’s side) passed away this Monday. She had been ill for some time…she was getting better, but her condition took a turn for the worse, and she passed away.

Went to 연천 yesterday to bury her ashes. May she rest in peace.

Today was my first day back at work after the funeral. I was gone for three days, but boy has a lot happened in that time. Over the weekend before the funeral, I was tasked with working on Korean measure words and how to extract them from our corpus data. I tried out some regex scripts, but never really got to review them together at work.

Today I got a new task: web scraping / crawling. I’m not sure if those two words are synonymous. At any rate, I was eager to learn something new (I’m always learning…does one ever stop?)

I was given some reference material and code to learn and work off of. It was some code for scraping data off of a search query from Daum. It did take me more than half of my day to dig in and figure out what was going on with the code.

I was glad to learn about Beautiful Soup, however. I had heard and read about the (weird-sounding) library/module, but never had a chance to check it out. I’m learning something new every day, and more often than not, I feel overwhelmed. But I’m working to push through to keep learning and not get too discouraged. I keep reminding myself that it hasn’t been too long since I began coding in earnest, and that it takes time to get my skills up to a decent level. I’m only starting out.

That shouldn’t be my excuse though.

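I also learned about the requests module, which lets me make HTTP requests and pull down pages. It was cool to use commands like the following to easily grab the HTML source of a page:

from bs4 import BeautifulSoup
import requests

# Fetch the page and keep its HTML source as text
r = requests.get('http://www.bbc.com/news/world-us-canada-40816708')
data = r.text

# Parse the HTML (the 'lxml' parser requires the lxml package to be installed)
soup = BeautifulSoup(data, 'lxml')

# Print the target of every link; href=True skips <a> tags that have no href
for link in soup.find_all('a', href=True):
	print(link.attrs['href'])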

I found myself looking through a ton of HTML code. First time in a long time. I first dabbled in HTML and CSS back in high school, when I learned a bit at school. It was cool using the Chrome Developer tools to see which parts of the HTML code corresponded to which section of the webpage.

I’ve still got some ways to go to be able to web scrape with confidence, but I’m glad to see that I’ve made at least some progress so far. I want to share some links that I’ve found helpful.

Web Scraping

This page is all in Korean, but I learned some things. He explains how the BS4 method get_text() works.
http://hurderella.tistory.com/108

Using the requests module

https://code.tutsplus.com/tutorials/using-the-requests-module-in-python--cms-28204

Multi-line print options in Python.

I saw the end='' keyword argument used inside the print function, but didn’t know exactly what it did. This really showed me how this works.

for i in range (3):
	print(i, end='')
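To see the effect more clearly, the sketch below prints the same three numbers with different end values and shows the raw text that results. The helper `render` is just something I made up for the comparison; the `end` and `file` arguments are standard parts of print().

```python
import io


def render(end):
    """Print 0, 1, 2 with the given end string and return what was written."""
    out = io.StringIO()
    for i in range(3):
        print(i, end=end, file=out)
    return out.getvalue()


print(repr(render("\n")))   # '0\n1\n2\n'  (the default: one number per line)
print(repr(render("")))     # '012'        (numbers run together on one line)
print(repr(render(", ")))   # '0, 1, 2, '  (any string can follow each item)
```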

The meaning of __name__ == "__main__"
The following if statement was in the code, and some googling helped me understand how this works:

if __name__ == "__main__":

https://stackoverflow.com/questions/419163/what-does-if-name-main-do
https://docs.python.org/3/library/__main__.html
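My understanding after reading those pages, as a minimal sketch (the function name and message are made up): the block under the if statement only runs when the file is executed directly, not when it is imported from another file.

```python
def main():
    """Entry point used only when this file is run directly."""
    return "running as a script"


if __name__ == "__main__":
    # True when the file is run directly (python this_file.py).
    # When the file is imported instead, __name__ holds the module's own name,
    # so this block is skipped.
    print(main())
```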

Use of ‘global’ keyword for variables

https://stackoverflow.com/questions/4693120/use-of-global-keyword-in-python

urllib module

https://docs.python.org/3/library/urllib.html

Use of .format

I had no idea you could do something like this! It feels quite similar to using %s.

print("First Module's Name: {}".format(__name__))
First Module's Name: __main__

https://pyformat.info/

TIL about .format

I was working with a Korean language corpus to extract what are called measure words.

The corpus had lines of text that consisted of product name, numbers, quantities, and measure words.

But the lines were all jumbled up, and I needed a way to add a \n at the end of each line, if only to be able to make the resulting text more human readable.

At first, I just ran a code block like the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i)

BTW I don’t know why the code block is showing the div tags…

But the result was not satisfactory. So I tried the following:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, \n)

The output of this code was nothing.

I realized what I had done, and added single quotes around \n like so:

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write(i, '\n')

But then I got an error like the below:

TypeError: write() takes exactly one argument (2 given)

After that, I concluded that I couldn’t add another argument to the method write(), so I thought I would try the join() method.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = '\n'.join(i)
        outf.write(final_sent)

But doing that returned results like the following:

a
b
c
d

1
adf
gadf

.
.
.
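Looking back, the reason is that join() iterates over whatever it is given. Each i in my loop was a single string, so join() walked through it character by character and put the separator between the characters. A small sketch with made-up values:

```python
line = "abcd"
joined_chars = "\n".join(line)            # a string is iterated character by character
joined_items = "\n".join(["abcd", "1"])   # a list is iterated item by item

print(repr(joined_chars))  # 'a\nb\nc\nd'
print(repr(joined_items))  # 'abcd\n1'
```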

So I tried putting something else inside the quotes. Just to see what the output would be.

with open('outfile.txt', 'w') as outf:
    for i in captured:
        final_sent = ';'.join(i)
        outf.write(final_sent)

I got the following output:

a;b;c;d;e;f;g; 1;2;f;g;a;d;

So I decided to do some more digging and came across this page that talked about using the format() method! I made some changes to my code as below, and I finally got the results I wanted 🙂

with open('outfile.txt', 'w') as outf:
    for i in captured:
        outf.write('{}\n'.format(i))
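For what it’s worth, format() is not the only way to get the newline in. As far as I can tell, the three variants below write the same text. The sample data is made up, and io.StringIO stands in for the output file so the sketch runs without touching the disk.

```python
import io

captured = ["apple 3 boxes", "water 2 bottles"]  # hypothetical stand-in data

# In-memory stand-ins for open('outfile.txt', 'w')
with_concat = io.StringIO()
with_fstring = io.StringIO()
with_print = io.StringIO()

for i in captured:
    with_concat.write(i + "\n")     # string concatenation
    with_fstring.write(f"{i}\n")    # f-string (Python 3.6+)
    print(i, file=with_print)       # print() adds the newline itself

# All three buffers hold identical text.
print(with_concat.getvalue() == with_fstring.getvalue() == with_print.getvalue())
```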

Someone more advanced in Python skills could have just told me what to do from the start. But I guess there are things you learn while trying different things and getting different error messages…Here’s to making lots more errors!