Migrating a blog from QQ QZone to WordPress

Recently I made a website for Sophie (smoosophie.com) and scraped her blog from QQ’s QZone.
QQ has an API that can be used for QZone, but it is hard to develop against: it is poorly documented and hard to use for a native English speaker. Although I can read Chinese (slowly), reading technical terms is something I still have to work on.

Workflow screenshots

Anyhow, I decided to practice JavaScript and scrape it manually. Here are the steps I took:

Screen Shot 2016-04-03 at 18.43.17

This is what the QZone page looks like.

Screen Shot 2016-04-03 at 18.45.12

This is running the fetching script.

Screen Shot 2016-04-03 at 18.44.22

After running the fetching script, it adds all of the blog links to the page as individual iframes. Because each of these loads is a network request, I manually wait until all of them have loaded. The JavaScript console stops scrolling and printing once everything has completely loaded.
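The manual wait could probably be replaced by counting "load" events. This is only a sketch, not what I actually ran: `whenAllLoaded` is a hypothetical helper name, and it assumes the listeners are attached right after the iframes are created, before any of them has finished loading. The load event also says nothing about content that QZone fills in later by script, so a short extra pause may still be needed.

```javascript
// Sketch: call onAllLoaded once every frame has fired its "load" event.
// `frames` is an array of objects with addEventListener/removeEventListener
// (iframes in the browser). whenAllLoaded is a hypothetical helper name.
function whenAllLoaded(frames, onAllLoaded) {
  var remaining = frames.length;
  if (remaining === 0) {
    onAllLoaded();
    return;
  }
  frames.forEach(function (frame) {
    frame.addEventListener("load", function handler() {
      // Count each frame only once, even if it reloads later.
      frame.removeEventListener("load", handler);
      remaining -= 1;
      if (remaining === 0) {
        onAllLoaded();
      }
    });
  });
}
```

In the console this would be called as whenAllLoaded(iframesArray, function () { console.log("all loaded"); }) immediately after the fetching script.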

Screen Shot 2016-04-03 at 18.45.31

Here are all the frames after loading. Then I run the scraper JavaScript, which scrapes each iframe for its blog content and appends the posts to the current page.

Screen Shot 2016-04-03 at 18.45.28

I manually copy and paste them into CSV files.

Screen Shot 2016-04-03 at 18.46.21

This is the first line of a csv entry.

Screen Shot 2016-04-03 at 18.46.40

Then I wrote a Ruby script to add these CSV files to a WordPress blog.

Screen Shot 2016-04-03 at 18.48.05

Here’s what running it looks like.

Screen Shot 2016-04-03 at 18.48.43

Source code

Run fetcher.js on the blog page.

// fetcher.js
// JavaScript to paste into the Chrome console to fetch the posts.

var ul = document.createElement("UL"); // Create a <ul> node to hold the results
ul.setAttribute("id", "myList");
document.body.appendChild(ul);
var s = "|^@^|"; // separator
var iframesArray = [];
var messagesArray = [];
// The post list lives inside the "tblog" iframe.
var posts = document.getElementById("tblog").contentWindow.document.getElementById("listArea").getElementsByTagName("li");
for (var i = 0; i < posts.length; i++) {
  var postTitle = posts[i].getElementsByClassName("article")[0].textContent;
  var postLink = posts[i].getElementsByClassName("article")[0].getElementsByTagName("a")[0].href;
  // http://user.qzone.qq.com/765591203/blog/1400775512
  var entryDate = posts[i].getElementsByClassName("list_op")[0].childNodes[0].textContent;
  var readCount = posts[i].getElementsByClassName("list_op")[0].childNodes[2].textContent;
  // Metadata for this post; scraper.js appends the post body later.
  var message = i + s + postTitle + s + entryDate + s + readCount + s + postLink + s;
  messagesArray.push(message);
  // Load the post in its own iframe so it can be scraped once loaded.
  var iframe = document.createElement("iframe");
  iframe.src = postLink;
  document.body.appendChild(iframe);
  iframesArray.push(iframe);
}

Once you have waited long enough for every iframe to load, run this to scrape the iframes and append the results to the current page. From there you can copy and paste them into a CSV file.

// scraper.js
// Execute this when everything seems to have loaded.
for (var i = 0; i < iframesArray.length; i++) {
  var contWin = iframesArray[i].contentWindow;
  var win = contWin.document;
  // Scraping: the post body lives inside the nested "tblog" iframe.
  var blogsection = win.getElementById("tblog").contentWindow.document;
  var postTexts = blogsection.getElementById("blogDetailDiv").children;
  var postString = "";

  // Some posts have no child elements, so fall back to the raw HTML.
  if (postTexts.length == 0) {
    postString = blogsection.getElementById("blogDetailDiv").innerHTML;
  }
  // Posts that start with <br> are plain text with line breaks; keep the HTML as-is.
  else if (postTexts[0].tagName == "BR") {
    var brString = blogsection.getElementById("blogDetailDiv").innerHTML;
    postString = brString; // .replace(/<br>/g, "\n")
  }
  // Otherwise wrap each child element in its own paragraph.
  else {
    for (var p = 0; p < postTexts.length; p++) {
      postString += "<p>" + postTexts[p].innerHTML + "\n&nbsp;</p>";
    }
  }

  postString = postString.trim();
  if (postString.length == 0) {
    postString = "ERROR parsing!";
  }

  // Metadata from fetcher.js plus the scraped body make one record.
  var message = messagesArray[i] + postString;
  console.log(message);
  var node = document.createElement("LI"); // Create a <li> node
  var textnode = document.createTextNode(message); // Create a text node
  node.appendChild(textnode); // Append the text to <li>
  document.getElementById("myList").appendChild(node); // Append <li> to <ul> with id="myList"
  // document.body.removeChild(iframe)
  // win.body.parentNode.removeChild(win.body)
}
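
Each record is just six fields glued together with the "|^@^|" separator, which I picked because commas and quotes show up in titles and post bodies. A rough sketch of that format follows. The helper names `joinRecord` and `splitRecord` are hypothetical, and flattening line breaks to spaces is my assumption about what keeps one record on one line, not something the scripts above do.

```javascript
var SEPARATOR = "|^@^|"; // same separator string as in fetcher.js

// Join fields into one record line. Line breaks are flattened to spaces
// (an assumption) and fields that contain the separator are rejected.
function joinRecord(fields) {
  return fields.map(function (field) {
    var text = String(field).replace(/\r?\n/g, " ");
    if (text.indexOf(SEPARATOR) !== -1) {
      throw new Error("field contains the separator: " + text);
    }
    return text;
  }).join(SEPARATOR);
}

// Split a record line back into its fields. split() with a string
// argument matches the separator literally, not as a regex.
function splitRecord(line) {
  return line.split(SEPARATOR);
}
```

For example, splitRecord(joinRecord([0, "Title, with comma", "2014-05-22", "12", "http://example.com/blog/1", "<p>body</p>"])) gives back six fields with the comma in the title untouched (the values here are made up).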

Once the CSV files are created, post them to the WordPress blog.

# poster.rb
require 'rubypress'
require 'csv'
require 'sanitize'
require 'time' # needed for Time.parse

wp = Rubypress::Client.new(:host => "smoosophie.com",
                           :username => "sophie",
                           # Set WP_PASSWORD in your shell; don't hard-code the password.
                           :password => ENV.fetch("WP_PASSWORD"),
                           # :use_ssl => true,
                           :retry_timeouts => true)

Dir["*.csv"].each do |name|
  puts "============= #{name} ============="
  # Split on the custom separator; a NUL quote character effectively disables quoting (what a hack).
  csv = CSV.read(name, :col_sep => "|^@^|", :quote_char => "\x00")
  csv.each do |row|
    id, title, pubdate, readcount, link, text = row
    # Keep only line breaks, paragraphs and links from the scraped HTML.
    cleantext = Sanitize.clean(text, :elements => ['br', 'p', 'a'])
    puts("#{id}-#{title}")
    # puts("#id #{id}, title #{title}, pubdate #{pubdate}, readcount #{readcount}, link #{link}, text #{text}")
    begin
      retries ||= 0
      puts "try ##{retries}"
      sleep(1) # be gentle with the server
      wp.newPost(:blog_id => "0", # 0 unless using WP Multi-Site, then use the blog id
                 :content => {
                   :post_status => "publish",
                   :post_date => Time.parse(pubdate),
                   :post_content => cleantext,
                   :post_title => title,
                   :post_author => 1, # 1 if there only is admin user, otherwise user's id
                   :terms_names => {:category => ['QQZone'], :post_tag => ['QQZone'] }
                 })
    rescue
      # Try each post up to three times, then move on.
      retry if (retries += 1) < 3
    end
  end
end

 

Neural Style

Inspiration comes from https://github.com/jcjohnson/neural-style.
Because installing all the required toolchains on OS X 10.11.3 is a bit challenging, here are my installation steps.

 cd workspace/
 git clone git@github.com:jcjohnson/neural-style.git
 git clone https://github.com/torch/distro.git ~/torch --recursive
 cd ~/torch; bash install-deps;

This will fail a few times if you have already installed some of the dependencies in different versions. I needed to fiddle around with unlinking things.

 brew unlink qt
 brew linkapps qt
 brew link --overwrite wget
 bash install-deps;
 brew unlink cmake
 bash install-deps;
 brew unlink imagemagick
 brew unlink brew-cask
 bash install-deps;

Anyhow, make sure the install-deps script finishes without errors, otherwise you’ll be missing dependencies.

 ./install.sh

This succeeds. It tells you to source the torch-activate script, but I’m using a non-standard shell (fish), so I had to adjust the fish config instead.

 . /Users/jason/torch/install/bin/torch-activate
 th #checking this exists in path, and it doesn't
 luarocks install image
 source ~/.profile
 . ~/.profile
 vim ~/.bashrc
 subl /Users/jason/torch/install/bin/torch-activate
 subl ~/.config/fish/config.fish
 th #now it exists

The change to config.fish that was necessary (for my user, my paths) was:

# . /Users/jason/torch/install/bin/torch-activate
set LUA_PATH '/Users/jason/.luarocks/share/lua/5.1/?.lua;/Users/jason/.luarocks/share/lua/5.1/?/init.lua;/Users/jason/torch/install/share/lua/5.1/?.lua;/Users/jason/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/jason/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua'
set LUA_CPATH '/Users/jason/.luarocks/lib/lua/5.1/?.so;/Users/jason/torch/install/lib/lua/5.1/?.so;./?.so;/usr/local/lib/lua/5.1/?.so;/usr/local/lib/lua/5.1/loadall.so'
set PATH /Users/jason/torch/install/bin $PATH
set LD_LIBRARY_PATH /Users/jason/torch/install/lib $LD_LIBRARY_PATH
set DYLD_LIBRARY_PATH /Users/jason/torch/install/lib $DYLD_LIBRARY_PATH
set LUA_CPATH '/Users/jason/torch/install/lib/?.dylib;'$LUA_CPATH

Convert the bash syntax to fish syntax by replacing “export NAME=value” with “set NAME value” and each “:” with a space.
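
That replacement rule is mechanical enough to script. Here is a small sketch of it, where `bashExportToFish` is a hypothetical helper name. It only handles simple one-line `export NAME=value` statements, and it replaces every ":" blindly, so a value that legitimately contains a colon (a URL, say) would need handling by hand.

```javascript
// Sketch: convert one bash line of the form `export NAME=value` to fish.
// Lines that are not simple exports are returned unchanged.
function bashExportToFish(line) {
  var match = /^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$/.exec(line);
  if (!match) {
    return line;
  }
  var name = match[1];
  var value = match[2];
  // fish lists are separated by spaces where bash path lists use ":".
  var fishValue = value.split(":").join(" ");
  return "set " + name + " " + fishValue;
}
```

For example, "export PATH=/Users/jason/torch/install/bin:$PATH" becomes "set PATH /Users/jason/torch/install/bin $PATH". Note that fish only exports a variable to child processes when `set` is given the `-x` flag, so that flag may be needed for variables that are not already exported.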
I have an Nvidia graphics card, so I downloaded and installed CUDA.

Cuda Preferences

Then we can continue installing dependencies.

 brew install protobuf
 luarocks install loadcaffe
 luarocks install torch
 luarocks install nn

I found that having Xcode 7 means the clang compiler is too new and not supported by cutorch and cunn. The error you would see is this:

nvcc fatal   : The version ('70002') of the host compiler ('Apple clang') is not supported

Sometimes the error messages are garbled, presumably because the build runs in parallel. I downloaded Xcode 6.4 and replaced my Xcode 7 with it:

cd /Applications
sudo mv Xcode.app/ Xcode7.app
sudo mv Xcode\ 2.app/ Xcode.app # this is Xcode 6.4 when you install it
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
clang -v 
#Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
#Target: x86_64-apple-darwin15.3.0
#Thread model: posix

But now I get another fatal issue:

/usr/local/cuda/include/common_functions.h:65:10: fatal error: 'string.h' file not found
#include <string.h>

This seems to be a known issue (cutorch issue 241). It can be resolved by running

xcode-select --install

This gets me a little further. Now the issue seems to be related to torch.

make[2]: *** No rule to make target `/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Accelerate.framework', needed by `lib/THC/libTHC.dylib'.  Stop.
make[2]: *** Waiting for unfinished jobs....
[ 64%] Building C object lib/THC/CMakeFiles/THC.dir/THCGeneral.c.o
[ 69%] Building C object lib/THC/CMakeFiles/THC.dir/THCAllocator.c.o
[ 71%] Building C object lib/THC/CMakeFiles/THC.dir/THCStorage.c.o
[ 76%] Building C object lib/THC/CMakeFiles/THC.dir/THCTensorCopy.c.o
[ 76%] Building C object lib/THC/CMakeFiles/THC.dir/THCStorageCopy.c.o
[ 76%] Building C object lib/THC/CMakeFiles/THC.dir/THCTensor.c.o
/tmp/luarocks_cutorch-scm-1-3748/cutorch/lib/THC/THCGeneral.c:633:7: warning: absolute value function 'abs' given an
      argument of type 'long' but has parameter of type 'int' which may cause truncation of value [-Wabsolute-value]
  if (abs(state->heapDelta) < heapMaxDelta) {
      ^
/tmp/luarocks_cutorch-scm-1-3748/cutorch/lib/THC/THCGeneral.c:633:7: note: use function 'labs' instead
  if (abs(state->heapDelta) < heapMaxDelta) {
      ^~~
      labs
1 warning generated.
make[1]: *** [lib/THC/CMakeFiles/THC.dir/all] Error 2
make: *** [all] Error 2

Error: Build error: Failed building.

The build is looking for the MacOSX10.11 SDK, but this Xcode only has the 10.10 one. Since I only just updated to OS X 10.11, I presume the frameworks in 10.10 are close enough, so this hack to make Accelerate.framework appear should be OK:

ln -s "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk" "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk"

Finally I try to install cutorch, with success.

luarocks install cutorch
luarocks install cunn #this installs fine as well, it didn't before

Everything should be good to go. But nothing ever goes smoothly.

jason@jmbp15-nvidia ~/w/neural-style (master)> 
th neural_style.lua -style_image IMG_2663.JPG -content_image IMG_2911.JPG 
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded models/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5715/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
/Users/jason/torch/install/bin/luajit: /Users/jason/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-5715/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
	[C]: in function 'resize'
	/Users/jason/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
	/Users/jason/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
	/Users/jason/torch/install/share/lua/5.1/nn/Module.lua:123: in function 'type'
	/Users/jason/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
	/Users/jason/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
	/Users/jason/torch/install/share/lua/5.1/nn/Module.lua:123: in function 'cuda'
	neural_style.lua:76: in function 'main'
	neural_style.lua:500: in main chunk
	[C]: in function 'dofile'
	...ason/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x0109a0dd50

I guess I ran out of GPU memory? It seems to be the issue described at https://github.com/jcjohnson/neural-style/issues/150. My pictures aren’t that small, I guess, so I’ll just resize them.

jason@jmbp15-nvidia ~/w/neural-style (master)> ls -lh
total 5480
-rw-r--r--@ 1 jason  staff   2.4M 11 Mar 21:39 IMG_2663.JPG
-rw-r--r--@ 1 jason  staff   235K 11 Mar 21:38 IMG_2911.JPG
-rw-r--r--  1 jason  staff   9.1K 11 Mar 20:34 INSTALL.md
-rw-r--r--  1 jason  staff   1.1K 11 Mar 20:34 LICENSE
-rw-r--r--  1 jason  staff    16K 11 Mar 20:34 README.md
drwxr-xr-x  4 jason  staff   136B 11 Mar 20:34 examples
drwxr-xr-x  8 jason  staff   272B 12 Mar 06:24 models
-rw-r--r--  1 jason  staff    16K 11 Mar 20:34 neural_style.lua
jason@jmbp15-nvidia ~/w/neural-style (master)> sips IMG_2663.JPG -Z 680
/Users/jason/workspace/neural-style/IMG_2663.JPG
 [ (kCGColorSpaceDeviceRGB)] ( 0 0 0 1 )
  /Users/jason/workspace/neural-style/IMG_2663.JPG
jason@jmbp15-nvidia ~/w/neural-style (master)> sips IMG_2911.JPG  -Z 680
/Users/jason/workspace/neural-style/IMG_2911.JPG
 [ (kCGColorSpaceDeviceRGB)] ( 0 0 0 1 )
  /Users/jason/workspace/neural-style/IMG_2911.JPG
jason@jmbp15-nvidia ~/w/neural-style (master)> ls -lh
total 2224
-rw-r--r--  1 jason  staff   105K 12 Mar 06:27 IMG_2663.JPG
-rw-r--r--  1 jason  staff   64K 12 Mar 06:28 IMG_2911.JPG
-rw-r--r--  1 jason  staff   9.1K 11 Mar 20:34 INSTALL.md
-rw-r--r--  1 jason  staff   1.1K 11 Mar 20:34 LICENSE
-rw-r--r--  1 jason  staff    16K 11 Mar 20:34 README.md
drwxr-xr-x  4 jason  staff   136B 11 Mar 20:34 examples
drwxr-xr-x  8 jason  staff   272B 12 Mar 06:24 models
-rw-r--r--  1 jason  staff    16K 11 Mar 20:34 neural_style.lua
jason@jmbp15-nvidia ~/w/neural-style (master)> 

Well, resizing didn’t work; it still runs out of memory. By default `th neural_style.lua` uses the nn backend, so I tried the cudnn backend instead, but that doesn’t work either.

jason@jmbp15-nvidia ~/w/neural-style (master)> 
th neural_style.lua -style_image IMG_2663.JPG -content_image IMG_2911.JPG  -backend cudnn
nil	
/Users/jason/torch/install/bin/luajit: /Users/jason/torch/install/share/lua/5.1/trepl/init.lua:384: /Users/jason/torch/install/share/lua/5.1/trepl/init.lua:384: /Users/jason/torch/install/share/lua/5.1/cudnn/ffi.lua:1279: 'libcudnn (R4) not found in library path.
Please install CuDNN from https://developer.nvidia.com/cuDNN
Then make sure files named as libcudnn.so.4 or libcudnn.4.dylib are placed in your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)

stack traceback:
	[C]: in function 'error'
	/Users/jason/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
	neural_style.lua:64: in function 'main'
	neural_style.lua:500: in main chunk
	[C]: in function 'dofile'
	...ason/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x0106769d50

So I go to https://developer.nvidia.com/rdp/cudnn-download and download cudnn-7.0-osx-x64-v4.0-prod.tgz and follow their install guide:

PREREQUISITES

    CUDA 7.0 and a GPU of compute capability 3.0 or higher are required.

ALL PLATFORMS

    Extract the cuDNN archive to a directory of your choice, referred to below as <installpath>.
    Then follow the platform-specific instructions as follows.

LINUX

    cd <installpath>
    export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH

    Add <installpath> to your build and link process by adding -I<installpath> to your compile
    line and -L<installpath> -lcudnn to your link line.

OS X

    cd <installpath>
    export DYLD_LIBRARY_PATH=`pwd`:$DYLD_LIBRARY_PATH

    Add <installpath> to your build and link process by adding -I<installpath> to your compile
    line and -L<installpath> -lcudnn to your link line.

WINDOWS

    Add <installpath> to the PATH environment variable.

    In your Visual Studio project properties, add <installpath> to the Include Directories
    and Library Directories lists and add cudnn.lib to Linker->Input->Additional Dependencies.

Opening the tgz file gives me a cuda folder. I just need to add it to my library path.

jason@jmbp15-nvidia ~/workspace> cd cuda/
jason@jmbp15-nvidia ~/w/cuda> set DYLD_LIBRARY_PATH (pwd) $DYLD_LIBRARY_PATH
jason@jmbp15-nvidia ~/w/cuda> echo $DYLD_LIBRARY_PATH
/Users/jason/workspace/cuda /Users/jason/torch/install/lib
jason@jmbp15-nvidia ~/w/cuda [127]> tree 
.
├── cd
├── include
│   └── cudnn.h
└── lib
    ├── libcudnn.4.dylib
    ├── libcudnn.dylib -> libcudnn.4.dylib
    └── libcudnn_static.a

2 directories, 5 files
jason@jmbp15-nvidia ~/w/cuda> set LD_LIBRARY_PATH (pwd)/lib/ $LD_LIBRARY_PATH
jason@jmbp15-nvidia ~/w/cuda> echo $LD_LIBRARY_PATH
/Users/jason/workspace/cuda/lib/ /Users/jason/torch/install/lib

Couchsurfing

Recently I hosted two couchsurfers. It has been a very rewarding experience, and I think it has opened me up to meeting new people. Couchsurfing is a tool that solves the problem of not meeting enough interesting people. In our day-to-day grind, it is difficult to make new friends.

This summarizes my initial experience with hosting couchsurfing guests.

I had two guests, Satoshi and Sophie. I selected both for their entrepreneurial drive, and both could be considered seasoned travellers. I’ve had a couchsurfing account since 2012 but had never used it, because I feared it would be unsafe. As a travelling couchsurfer you are outside your native country and on a tight schedule, so there is little you can do if something goes wrong, and any legal process would take extremely long. I decided to start as a host because I have the option to turn down guests and I am on familiar territory. I thought this would make a good transition into becoming a full-fledged couchsurfer. Anyway, I selected these guests based on their profiles. I don’t host just anyone, only people I see a high chance of becoming friends with. This reduces the safety risk and is a more efficient use of time than hosting random strangers.

Satoshi comes from Tokyo and travels a lot. He is currently taking time off from work and school, and has 6 more months left of his undergraduate degree in business. He’s been through a wide range of situations, including being homeless. I admire his courage. He’s travelled a good part of the world. He’s also tried and failed to create a startup. This tells me he’s a safe choice to host and that we’ll have a lot in common. I’d love to make a friend in Tokyo to visit in the future. I learned a lot about Japanese culture and can’t wait to actually go and experience it one day.

Sophie comes from Shanghai and is taking this trip as a vacation. She normally works full time in Shanghai as a headhunter. She has a wonderfully cheerful personality that I could best describe as 小燕子 in 还珠格格, both in looks and personality. I admire her bravery even more than Satoshi’s because she travels as a single girl, and she’s probably only 1.56m tall. And here I am worrying about whether it’d be safe for me to couchsurf. Silly. I found her to be very driven and goal-oriented. Her lively personality would be a great asset to any company. There are a lot of traits to admire.

This is just the start of the couchsurfing experience for me. I’m sure many adventures await me, and I can’t wait to spend more time meeting interesting people around the world!

Projects I want to work on

Lately (as of Feb 16, 2016) I have had a few projects in mind that I want to dig deeper into. I just returned from a trip to China for Chinese New Year.

  • Family Tree – Document, to the best of my knowledge, who I am from the perspective of who my family were.
  • Violin – A friend of my father gifted me a violin when I went on my trip to China. I intend to start practicing. Perhaps I’ll look into going to local Meetups.
  • Shipping Info – An elementary school friend of my father runs a shipping company in China. He intends to do an IPO (Initial Public Offering). To make the company worth more, he wants to build an e-commerce website so it will be more competitive. I need to explore similar web platforms and provide some references.
  • Make Friends – Since moving to California full time, I have mostly spent time settling down and getting used to the flow. It is as if I were in first-year university, going into an environment I’m not comfortable with. It is as if I were in Switzerland, too. The difference here is that there are relatively fewer networking events.

Computer Science Topics

Background

I’ve been keeping track of a list of interesting articles to read. Right now they are scattered across Google Docs, Pocket, bookmarks, and PDF files I’ve downloaded. I’ll combine them all into this list, without any explicit ordering, although the topics at the top should be more interesting. I’ll also explain why I have them on my list.

List of projects I want to build

Software Architecture

  • Different GUI architectures.
    • http://martinfowler.com/eaaDev/uiArchs.html
    • Unlike school assignments, any time one builds a customer-facing product, the user interface needs to be considered. Building the UI is not trivial work, so knowing good design patterns is important. This article provides a quick overview.

Machine Learning

  • Deep Learning for Computer Vision
    • https://getpocket.com/a/read/1094687772
    • Wouldn’t it be cool to feed your computer a video and have it learn about the things that appear in it? You can train the computer to do this. GPU technology has advanced quite a bit since its invention during the 90s. It just so happens that neural networks are coming back into popularity. This article is about leveraging two of the best platforms we have today: Matlab and Nvidia cuDNN (CUDA Deep Neural Network).
  • Question Answering on a computer
    • http://googleresearch.blogspot.ca/2015/11/computer-respond-to-this-email.html
    • This article came around as I was working on a project with my professor for artificial intelligence. Our goal was to try and recommend previously given answers based on a similar sounding question.

How this list came to be

My usual conversations with friends tend to drift towards computer science topics. They are just interesting problems to discuss. In particular, I have a few friends whom I always strive to learn from, and I admire how they keep talking about these things. I must do my part and keep up.

Why I need this list

Computer science is a wide field, and I’ve only opened the door and had a glimpse into that world. I have a bunch of interesting things I want to build, and I’ve been keeping them on the back burner for far too long. This list will motivate me by serving as a reminder of just how much exploration is possible.