Migrating WordPress content to Hugo

2021-04-26

I migrated some of my past WordPress articles to the lesson pages of this site, which is built with Hugo.

The steps I took were:

Set up the ExitWP tool
Export WordPress data
Data preprocessing
Convert WordPress files to Markdown
Move image files into the Hugo content directory
Convert Hugo metadata in articles

1. Set up the ExitWP tool

The Hugo documentation suggests several options for migrating from WordPress to Hugo. I first tried wordpress-to-hugo-exporter, but it looked like it would require a lot of work on my AWS Bitnami WordPress environment. The ExitWP tool, on the other hand, worked fine in my environment.

I set up ExitWP mostly by following the Getting Started instructions on the ExitWP page.

1.1. Clone ExitWP

$ cd ~/repo
$ git clone https://github.com/wooni005/exitwp-for-hugo.git

1.2. Create a Python 2 virtualenv

Because this tool supports only Python 2, and my client PC already has many Python 2/3 virtual environments with their own dependencies, I created a fresh virtual environment for it.

$ sudo pip2 install virtualenv
$ python2 -m virtualenv ~/env_py27

1.3. Install dependencies with pip

$ cd ~/repo/exitwp-for-hugo
$ source ~/env_py27/bin/activate
(env_py27) $ pip install -r pip_requirements.txt 
...
Successfully built html5lib beautifulsoup4 PyYAML html2text
Installing collected packages: six, html5lib, beautifulsoup4, PyYAML, html2text
Successfully installed PyYAML-3.10 beautifulsoup4-4.2.0 html2text-3.200.3 html5lib-1.0b1 six-1.15.0

2. Export WordPress data

On the WordPress administration page, choose Tools > Export > All content.

You can then download an XML file with a name like hommalab.WordPress.2021-04-23.xml.

Copy the XML file to the wordpress-xml directory so that the tool can read it.

cp -p hommalab.WordPress.2021-04-23.xml ~/repo/exitwp-for-hugo/wordpress-xml

3. Data preprocessing

Check that the XML is well-formed with the xmllint command.

(env_py27)$ cd ~/repo/exitwp-for-hugo
(env_py27)$ xmllint ./wordpress-xml/hommalab.WordPress.2021-04-23.xml

In my case, there were some errors. I corrected the offending tag, or removed the entire line when it was not needed.

parser error : CData section not finished
parser error : Opening and ending tag mismatch: encoded line xxx and ul (or li)
- These are caused by converted XML tags that are unclosed or mismatched.
parser error : PCDATA invalid Char value 8
Extra content at the end of the document
- These are caused by a language-related format mismatch; character value 8 is a control character that XML 1.0 does not allow.

Once the errors are resolved, xmllint writes the entire XML content to standard output.
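For the "PCDATA invalid Char value" errors, the offending control characters could probably also be stripped in bulk rather than line by line. The following is only a sketch, not what I actually ran: the function name strip_xml_control_chars is made up for illustration, and it assumes a POSIX shell with tr.

```shell
#!/bin/sh
# Hypothetical helper (not part of ExitWP): remove the control characters
# that XML 1.0 forbids, i.e. everything below 0x20 except tab, LF and CR.
# Reads stdin and writes the cleaned text to stdout.
strip_xml_control_chars() {
    # LC_ALL=C makes tr work on bytes, so multi-byte UTF-8 text is untouched.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example usage (the output path is an assumption):
#   strip_xml_control_chars < hommalab.WordPress.2021-04-23.xml > /tmp/cleaned.xml
```

This only helps with stray control characters; unclosed or mismatched tags still have to be fixed by hand.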

4. Convert WordPress files to Markdown

4.1. Run the tool

Run ./exitwp.py

(env_py27)$ pwd
~/repo/exitwp-for-hugo
(env_py27)$ ./exitwp.py
...
writing............................................................................
done
(env_py27)$ deactivate
$ 

4.2. Check the output

After the command finishes, you can check the generated files in the build directory.

$ pwd
~/repo/exitwp-for-hugo
$ cd build/hugo/www.homma-lab.com/
$ ls
_posts		about		lesson		png-eps-and-svg	sample-page
$ ls _posts
2020-05-08-unit01-introduction.markdown
2020-05-19-unit02-analog-digital.markdown
2020-05-27-unit03-e382b3e383b3e38394e383a5e383bce382bfe381aee6a78be68890.markdown
2020-06-23-unit04-e68385e5a0b1e9809ae4bfa1e3838de38383e38388e383afe383bce382af.markdown
2020-07-22-unit05-e382a4e383b3e382bfe383bce3838de38383e38388e381a8e382bbe382ade383a5e383aae38386e382a3.markdown
2020-07-29-unit06-e89197e4bd9ce6a8a9.markdown
2020-08-25-unit07-1.markdown
2020-08-28-unit07-2.markdown
2020-08-31-unit07-3.markdown
2020-09-03-unit07-4.markdown
2020-09-14-unit07-5.markdown
2020-09-23-unit7_image.markdown
2020-09-30-unit07-6.markdown
2020-09-30-unit07-7.markdown
2020-10-06-unit07-8.markdown
2020-10-07-unit07-9.markdown
2020-10-07-unit7-10.markdown
2020-10-21-unit8-01.markdown
2020-11-02-unit8-02.markdown
2020-11-24-unit8-03.markdown
2020-11-25-2ndsemesterexam.markdown
2021-01-12-unit09.markdown
2021-01-20-unit10-small-computer.markdown
2021-01-27-unit11.markdown

At the top of each .markdown file, there is a front matter header for Hugo posts.

$ head 2020-05-08-unit01-introduction.markdown
---
author: user
date: 2020-05-08 11:14:42+00:00
draft: false
title: Unit01 Introduction
type: post
url: /2020/05/08/unit01-introduction/
categories:
- lesson
---

5. Move image files into the Hugo content directory

Because the export function on the WordPress admin screen does not export image files, I fetched all the images from the AWS server.

5.1. SSH to AWS EC2

ssh -i "xxx.pem" ubuntu@"xxxxxxx".ap-northeast-1.compute.amazonaws.com
Welcome to Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1105-aws x86_64)
*** System restart required ***
     ___ _ _                   _
    | _ |_) |_ _ _  __ _ _ __ (_)
    | _ \ |  _| ' \/ _` | '  \| |
    |___/_|\__|_|_|\__,_|_|_|_|_|

*** Welcome to the Bitnami WordPress 5.3-0 ***
  *** Documentation:  https://docs.bitnami.com/aws/apps/wordpress/ ***
  ***                 https://docs.bitnami.com/aws/ ***
  *** Bitnami Forums: https://community.bitnami.com/ ***

#######################################################
###    For frequently used commands, please run:    ###
###         sudo /opt/bitnami/bnhelper-tool         ###
#######################################################

Last login: Fri May  8 06:50:17 2020 from 113.156.93.49
bitnami@xxxxx:~$ ls
apps  bitnami_credentials  htdocs  stack  tools

5.2. Archive uploaded image files

bitnami@xxx $ cd ~/apps/wordpress/htdocs/wp-content/uploads/
bitnami@xxx $ tar cvf wp-images.tar 2019 2020 2021
bitnami@xxx $ zip wp-images.tar.zip wp-images.tar
bitnami@xxx $ ls wp-images.tar.zip 
wp-images.tar.zip 
bitnami@xxx $ exit
$

You should check the size of the archive, for example with ls -lart.
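The separate tar and zip steps could probably be combined, since tar can gzip-compress while it archives. A minimal sketch, assuming a tar that supports the z option (GNU and BSD tar both do); the function name archive_uploads and the .tar.gz file name are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: archive and gzip-compress the given directories in
# one step, then print the archive size in bytes as a quick sanity check.
# Usage: archive_uploads OUTPUT.tar.gz DIR...
archive_uploads() {
    out=$1
    shift
    tar czf "$out" "$@" || return 1
    # wc -c prints the byte count; tr strips the padding some systems add.
    wc -c < "$out" | tr -d ' '
}

# Example usage on the server (directory names taken from the step above):
#   cd ~/apps/wordpress/htdocs/wp-content/uploads/
#   archive_uploads wp-images.tar.gz 2019 2020 2021
```

An archive made this way would be unpacked on the client with tar xzf instead of unzip followed by tar xvf.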

5.3. Download files and unzip

Download the archived image files from the local client.

$ cd ~/Downloads
$ mkdir uploaded_images
$ cd uploaded_images
$ scp -i "xxx.pem" ubuntu@"xxxxxxx".ap-northeast-1.compute.amazonaws.com:/home/bitnami/apps/wordpress/htdocs/wp-content/uploads/wp-images.tar.zip ./
                        100%   80MB   6.4MB/s   00:12
$ ls wp-images.tar.zip
wp-images.tar.zip
$ unzip wp-images.tar.zip  
$ tar xvf wp-images.tar
$ ls
2019			2020			2021

5.4. Convert image file paths and rename files

Consolidate the file locations.

$ pwd
~/repo/exitwp-for-hugo
$ mkdir images
$ mkdir images/wp_kg
$ cd ~/Downloads/uploaded_images
$ mv 2019 2020 2021 ~/repo/exitwp-for-hugo/images/wp_kg/
$ ls
_posts			images			png-eps-and-svg		about			lesson			sample-page
$ cd _posts
$ mkdir lesson
$ mv *unit07* *unit09* *unit10* *unit11* ./lesson/

In the .markdown files, the image URLs differ from what my Hugo site expects:

Image URL in the converted WordPress files (.markdown): http://www.homma-lab.com/wp-content/uploads/
Image URL in my Hugo site files (.md): ../../images/lesson/

So I rewrote the URLs with sed, and used awk to give each converted file the .md extension.

$ for v1 in *.markdown
> do
> sed -e 's|http://www\.homma-lab\.com/wp-content/uploads|../../images/lesson|g' "${v1}" > "$(echo "${v1}" | awk -F . '{print $1 ".md"}')"
> done

$ ls
2020-08-25-unit07-1.markdown			2020-09-14-unit07-5.md				2020-10-07-unit07-9.markdown
2020-08-25-unit07-1.md				2020-09-23-unit07_image.markdown		2020-10-07-unit07-9.md
2020-08-28-unit07-2.markdown			2020-09-23-unit07_image.md			2021-01-12-unit09.markdown
2020-08-28-unit07-2.md				2020-09-30-unit07-6.markdown			2021-01-12-unit09.md
2020-08-31-unit07-3.markdown			2020-09-30-unit07-6.md				2021-01-20-unit10-small-computer.markdown
2020-08-31-unit07-3.md				2020-09-30-unit07-7.markdown			2021-01-20-unit10-small-computer.md
2020-09-03-unit07-4.markdown			2020-09-30-unit07-7.md				2021-01-27-unit11.markdown
2020-09-03-unit07-4.md				2020-10-06-unit07-8.markdown			2021-01-27-unit11.md
2020-09-14-unit07-5.markdown			2020-10-06-unit07-8.md
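The awk call above splits on every dot, so a name such as a.b.markdown would be cut down to a.md. That did not matter for my file names, but as a sketch of an alternative, shell parameter expansion removes only the trailing .markdown. The function name md_name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: map NAME.markdown to NAME.md by removing only the
# final ".markdown" suffix, so other dots in the name are preserved.
md_name() {
    printf '%s\n' "${1%.markdown}.md"
}

# Example usage inside the conversion loop (sed expression elided):
#   for f in *.markdown; do
#       sed -e '...' "$f" > "$(md_name "$f")"
#   done
```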

5.5. Move files to the Hugo directory

Then I moved the files to the Hugo content directory.

mv *md ~/repo/gitlab/site/content/lessons/

Because WordPress generates multiple copies of each image in a range of sizes, as shown below, I needed to pick the files that are actually used and place them in the Hugo image directory.

$ ls -l スクリーンショット-2020-10-21-10.29.36*
-rw-r--r--  1 tato  staff   13984 10 21  2020 スクリーンショット-2020-10-21-10.29.36-100x100.png
-rw-r--r--  1 tato  staff   10222 10 21  2020 スクリーンショット-2020-10-21-10.29.36-120x68.png
-rw-r--r--  1 tato  staff   30514 10 21  2020 スクリーンショット-2020-10-21-10.29.36-150x150.png
-rw-r--r--  1 tato  staff   19454 10 21  2020 スクリーンショット-2020-10-21-10.29.36-160x90.png
-rw-r--r--  1 tato  staff   61625 10 21  2020 スクリーンショット-2020-10-21-10.29.36-300x165.png
-rw-r--r--  1 tato  staff   74914 10 21  2020 スクリーンショット-2020-10-21-10.29.36-320x180.png
-rw-r--r--  1 tato  staff  193237 10 21  2020 スクリーンショット-2020-10-21-10.29.36-768x423.png
-rw-r--r--  1 tato  staff  117284 10 21  2020 スクリーンショット-2020-10-21-10.29.36.png

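The paths of the used files, which appear in parentheses such as (../../images/**/**/png), can be extracted with a command like egrep 'png|jpg|gif' ../../_posts/lesson/*md | grep images | awk '{match($0, /\(.*\)/); url=substr($0, RSTART+1, RLENGTH-2); print url}', which pulls the parenthesized URL out of each Markdown image link.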

$ egrep 'png|jpg|gif' ../../_posts/lesson/*md | grep images | awk '{match($0, /\(.*\)/); url=substr($0, RSTART+1, RLENGTH-2); print url}'
../../images/lesson/2020/08/jdk_install.png
../../images/lesson/2020/08/スクリーンショット-2020-08-26-14.29.44-988x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-26-15.16.54-1021x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.17.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.32-1024x576.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.45-1024x566.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-21.32.41-994x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-22.22.41-984x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.34.33-993x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.41.35-983x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.55.08-980x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-2.04.49-994x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-30-11.42.51.png
../../images/lesson/2020/08/スクリーンショット-2020-08-31-14.54.12-988x1024.png
../../images/lesson/2020/09/clock_animation.gif
../../images/lesson/2020/09/joho_06_clock.gif
../../images/lesson/2020/09/export.gif
../../images/lesson/2020/09/export-1.gif
../../images/lesson/2020/09/export-2.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-14-15.22.12-983x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-14-15.26.24-993x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.20.40-1024x722.png
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.22.00.png
../../images/lesson/2020/09/export-3.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.45.42-990x1024.png
../../images/lesson/2020/09/export-4.gif
../../images/lesson/2020/09/export-5.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-30-13.13.02-983x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-30-13.09.34.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-9.35.58.png
../../images/lesson/2020/10/Anpanman_baikinman.gif
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.28.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.37.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.46.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.25.18.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.54.45.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-11.01.31.png
../../images/lesson/2020/10/export-1.gif
../../images/lesson/2020/10/スクリーンショット-2020-10-07-11.42.35-1024x538.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.12.39-1.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.20.43.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.26.30.png
../../images/lesson/2021/01/ProjectName.jpg

So I copied them with the following command.

$ for v1 in $(egrep 'png|jpg|gif' ../_posts/lesson/*md | grep images | awk '{match($0, /\(.*\)/); url=substr($0, RSTART+1, RLENGTH-2); print url}')
> do
> cp ${v1} ~/repo/gitlab/site/content/images/lesson/
> done
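The loop above relies on word splitting, so it would break on paths containing spaces, and it copies the same file once per reference. Below is a sketch of a variant that reads one path per line and removes duplicates first. The function name copy_used_images and its arguments are hypothetical, and it assumes each line holds at most one image link and that your grep supports -h (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper: copy every image referenced as "(...images...)" in
# the given Markdown files into DEST, once per distinct path.
# Usage: copy_used_images DEST FILE.md...
copy_used_images() {
    dest=$1
    shift
    # -h suppresses file-name prefixes. awk prints the text between the
    # first "(" and the last ")" of each line that contains such a pair.
    grep -hE 'png|jpg|gif' "$@" | grep images |
        awk 'match($0, /\(.*\)/) { print substr($0, RSTART+1, RLENGTH-2) }' |
        sort -u |
        while IFS= read -r path; do
            cp "$path" "$dest"/
        done
}

# Example usage (paths taken from the loop above):
#   copy_used_images ~/repo/gitlab/site/content/images/lesson/ ../_posts/lesson/*md
```

As in the original loop, the extracted paths are resolved relative to the current directory, and the files are copied flat into the destination.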

6. Convert Hugo metadata in articles

The URLs of the converted articles follow the WordPress blog URL scheme, like /2020/05/08/unit01-introduction/.

I wanted them to be like /lessons/unit01-introduction/ instead. Since there were only 13 files, I made this change manually this time, consolidating titles and tags as I went.

Finally, the lesson pages were safely migrated from WordPress and now run on Hugo 😉
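For a larger number of files, the url field could likely be rewritten with sed instead of by hand. This is a minimal sketch rather than what I did: it assumes every front matter url has the /YYYY/MM/DD/slug/ shape shown above, the function name rewrite_lesson_url is hypothetical, and titles and tags would still need manual attention.

```shell
#!/bin/sh
# Hypothetical helper: rewrite "url: /YYYY/MM/DD/slug/" to "url: /lessons/slug/".
# Reads a Markdown file on stdin and writes the result to stdout.
# Caveat: any body line that starts with "url: /YYYY/MM/DD/" is rewritten too.
rewrite_lesson_url() {
    sed -E 's|^url: /[0-9]{4}/[0-9]{2}/[0-9]{2}/|url: /lessons/|'
}

# Example usage (the output file name is an assumption):
#   rewrite_lesson_url < 2020-05-08-unit01-introduction.md > unit01-introduction.md
```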