I learned something so simple that it boggles the mind why I didn’t get it the first time. Well, I guess it goes to show how much of a novice I still am. While doing some web scraping/crawling, I needed to grab a string made up of a number plus text shown on the webpage (the string shown in the red box in the image below).
I needed a way to tell my for loop to iterate over n pages/articles, and I thought that number would make the job easier.
So I grabbed the string, which looks like ‘2,443건’. Now I needed to do two things:
1. Strip away the final text ‘건’, which is a measure word for counting articles, incidents, etc.
2. Remove the comma, since Python’s int() cannot parse a number string containing thousands separators the way we read them.
Being the tyro that I am, I didn’t know what would be the best way to do #1. Fortunately, I got some help from someone, who suggested that I try doing the following:
Which, of course, worked!!! I had learned that [:-1] on a string captures every character up to, but not including, the final one, but this was the first time I actually used it in practice! It served its purpose, and I was glad to learn something new.
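For reference, here is a minimal sketch of the slicing idea described above. The three-digit string '443건' is a made-up stand-in (the exact example I was given is not reproduced here), and the variable names are only for illustration.

```python
# Hypothetical three-digit example; not the exact string from the suggestion.
count_text = '443건'

# [:-1] keeps everything up to, but not including, the last character.
number_part = count_text[:-1]
print(number_part)  # 443

# The original string is left as it was; slicing returns a new string.
print(count_text)  # 443건
```

Slicing with [:-1] assumes the suffix is exactly one character long, which holds for ‘건’ but would need adjusting for a longer unit.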
Now on to #2. The above example is a three-digit number, so there are no commas, but in the real web crawling example that I did, the number returned was 2,443, which of course has a comma. I had to find a way to strip the comma and return just the number.
A quick Google search led me to multiple Stack Overflow pages that addressed the very issue I had. The solution was surprisingly simple: just use the replace() string method to replace the comma with an empty string.
So I went about coding it up, and the following is what I did first:
asdf = '2,443건'
asdf.replace(',', '')
asdf
But when I ran this code, the result I got was still ‘2,443건’. The comma still remained!
I didn’t get what was going on, so I enlisted the help of someone much more skilled at coding than I am. He was quick to point out why I was still getting ‘2,443건’, comma and all: Python strings are immutable, so replace() does not change the original string. It returns a new string, and that return value must be stored in a variable. So I did the following to rectify this, and it worked:
asdf = '2,443건'
asdf = asdf.replace(',', '')
asdf
Voilà! It’s all in the details…
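Putting both steps together, a small helper could look something like the sketch below. The function name parse_count and its suffix parameter are my own inventions for illustration, not part of the original scraper, and it assumes the string always has the form digits-with-commas followed by the unit.

```python
def parse_count(text, suffix='건'):
    """Convert a string such as '2,443건' to the integer 2443."""
    # Drop the trailing measure word if it is present.
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    # replace() returns a new string, which is passed straight to int().
    return int(text.replace(',', ''))


raw = '2,443건'
raw.replace(',', '')     # return value discarded, so raw is unchanged
print(raw)               # 2,443건
print(parse_count(raw))  # 2443
```

Converting to int at the end matters because the for loop needs an actual number to count up to, not a string of digits.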
Once I managed to do #2, I got the code to work and successfully scraped exactly that number of pages/articles from the sub-directory in question. That felt good.
But when I ran the code on a sub-directory with a lot more articles, it kept running and ended up scraping nearly double the number of articles actually in the sub-directory. I suppose this means that somewhere along the line, the code is scraping the same content over and over again. I need to do some code review to see where things are going wrong.
Meanwhile, I’m going to be looking into recursive calls to do this kind of web scraping on a broader level.
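One way to check the duplicate theory would be to keep track of what has already been seen. This is only a sketch under the assumption that each article can be identified by its URL; the function name unique_in_order and the example.com URLs are made up and do not come from my actual scraper.

```python
def unique_in_order(urls):
    """Return urls with repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Made-up URLs standing in for scraped article links.
scraped = [
    'https://example.com/a/1',
    'https://example.com/a/2',
    'https://example.com/a/1',
]
unique = unique_in_order(scraped)
print(len(scraped), len(unique))  # 3 2
```

If the two counts differ on real data, the same articles are being collected more than once, which would point the code review at the paging logic.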