I migrated some of my past WordPress articles to lesson pages on this site, which is built with Hugo.
The steps were:
- Set up the ExitWP tool
- Export WordPress data
- Data preprocessing
- Convert WordPress files to Markdown
- Place image files in the Hugo content directory
- Convert Hugo metadata in the articles
The Hugo site suggests several options for migrating from WordPress to Hugo. I first tried wordpress-to-hugo-exporter, but it looked like it would require a lot of work on my AWS Bitnami WordPress environment. The ExitWP tool, on the other hand, worked fine in my environment.
I set up the ExitWP tool by basically following the Getting Started instructions for ExitWP.
1.1. Clone ExitWP
```shell
$ cd ~/repo
$ git clone https://github.com/wooni005/exitwp-for-hugo.git
```
1.2. Create a Python 2 virtualenv
Because this tool only supports Python 2, and my client PC already has many dependencies spread across Python 2/3 virtual environments, I created a fresh virtual environment.
```shell
$ sudo pip2 install virtualenv
$ python2 -m virtualenv ~/env_py27
```
1.3. Pip install
```shell
$ cd ~/repo/exitwp-for-hugo
$ source ~/env_py27/bin/activate
(env_py27) $ pip install -r pip_requirements.txt
...
Successfully built html5lib beautifulsoup4 PyYAML html2text
Installing collected packages: six, html5lib, beautifulsoup4, PyYAML, html2text
Successfully installed PyYAML-3.10 beautifulsoup4-4.2.0 html2text-3.200.3 html5lib-1.0b1 six-1.15.0
```
2. Export WordPress data
In the WordPress administration page, go to Tools > Export > All content.
You can then download an XML file with a name like hommalab.WordPress.2021-04-23.xml.
Copy the XML file to the directory the tool reads from.
```shell
cp -p hommalab.WordPress.2021-04-23.xml ~/repo/exitwp-for-hugo/wordpress-xml
```
3. Data preprocessing
Check the XML format with the xmllint command.
```shell
(env_py27)$ cd ~/repo/exitwp-for-hugo
(env_py27)$ xmllint ./wordpress-xml/hommalab.WordPress.2021-04-23.xml
```
In my case there were some errors. I corrected the tag, or removed the entire line if it was not necessary.
- parser error : CData section not finished
- parser error : Opening and ending tag mismatch: encoded line xxx and ul (or li)
  - These are caused by converted XML tags that are unfinished or mismatched.
- parser error : PCDATA invalid Char value 8
- Extra content at the end of the document
  - These are caused by a language-related format mismatch (Char value 8, for example, is a backspace control character).
Once the errors are resolved, xmllint writes the entire XML content to standard output.
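For the `PCDATA invalid Char value 8` error, one option is to strip control characters from the export before running xmllint again, instead of editing each line by hand. This is a minimal sketch, assuming the offending bytes are plain ASCII control characters; the helper name `strip_xml_control_chars` is hypothetical and not part of ExitWP.

```shell
#!/usr/bin/env bash
# Remove ASCII control characters that XML 1.0 does not allow
# (everything below 0x20 except tab, newline and carriage return).
# strip_xml_control_chars is a hypothetical helper name.
strip_xml_control_chars() {
  # LC_ALL=C makes tr work byte by byte, so multi-byte UTF-8 text is left alone.
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example: the backspace (value 8) between "a" and "b" is dropped.
printf 'a\010b\n' | strip_xml_control_chars
```

In practice this could be used as `strip_xml_control_chars < export.xml > cleaned.xml` (file names are placeholders), followed by `xmllint --noout cleaned.xml`, which reports errors without echoing the whole document.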
4. Convert WordPress files to Markdown
Run ./exitwp.py.
```shell
(env_py27)$ pwd
~/repo/exitwp-for-hugo
(env_py27)$ ./exitwp.py
...
writing............................................................................
done
(env_py27)$ deactivate
$
```
4.2. Check outcomes
After the command finishes, you can check the generated files in the build directory.
```shell
$ pwd
~/repo/exitwp-for-hugo
$ cd build/hugo/www.homma-lab.com/
$ ls
_posts about lesson png-eps-and-svg sample-page
$ ls _posts
2020-05-08-unit01-introduction.markdown
2020-05-19-unit02-analog-digital.markdown
2020-05-27-unit03-e382b3e383b3e38394e383a5e383bce382bfe381aee6a78be68890.markdown
2020-06-23-unit04-e68385e5a0b1e9809ae4bfa1e3838de38383e38388e383afe383bce382af.markdown
2020-07-22-unit05-e382a4e383b3e382bfe383bce3838de38383e38388e381a8e382bbe382ade383a5e383aae38386e382a3.markdown
2020-07-29-unit06-e89197e4bd9ce6a8a9.markdown
2020-08-25-unit07-1.markdown
2020-08-28-unit07-2.markdown
2020-08-31-unit07-3.markdown
2020-09-03-unit07-4.markdown
2020-09-14-unit07-5.markdown
2020-09-23-unit7_image.markdown
2020-09-30-unit07-6.markdown
2020-09-30-unit07-7.markdown
2020-10-06-unit07-8.markdown
2020-10-07-unit07-9.markdown
2020-10-07-unit7-10.markdown
2020-10-21-unit8-01.markdown
2020-11-02-unit8-02.markdown
2020-11-24-unit8-03.markdown
2020-11-25-2ndsemesterexam.markdown
2021-01-12-unit09.markdown
2021-01-20-unit10-small-computer.markdown
2021-01-27-unit11.markdown
```
At the top of each .markdown file there is a front matter header for Hugo posts.
```shell
$ head 2020-05-08-unit01-introduction.markdown
---
author: user
date: 2020-05-08 11:14:42+00:00
draft: false
title: Unit01 Introduction
type: post
url: /2020/05/08/unit01-introduction/
categories:
- lesson
---
```
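To review the headers of many converted files without the article bodies, the front matter can be printed on its own. This is a small sketch, assuming every file delimits its front matter with two `---` lines as in the output above; `print_front_matter` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print only the YAML front matter (the lines between the first two "---" markers).
# print_front_matter is a hypothetical helper name; it reads files given as
# arguments, or standard input when called without arguments.
print_front_matter() {
  awk '/^---$/ { n++; next } n == 1' "$@"
}

# Example with an inline sample instead of a real exported file.
printf -- '---\ntitle: Unit01 Introduction\nurl: /2020/05/08/unit01-introduction/\n---\nbody text\n' \
  | print_front_matter
```

The counter `n` is 1 only between the first and second marker, so the body after the closing `---` is skipped.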
5. Place image files in the Hugo content directory
Because the export function in the WordPress admin screen does not export any image files, I fetched all the images from the AWS server.
5.1. SSH to AWS EC2
```shell
ssh -i "xxx.pem" ubuntu@"xxxxxxx".ap-northeast-1.compute.amazonaws.com
Welcome to Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1105-aws x86_64)
*** System restart required ***
___ _ _ _
| _ |_) |_ _ _ __ _ _ __ (_)
| _ \ | _| ' \/ _` | ' \| |
|___/_|\__|_|_|\__,_|_|_|_|_|
*** Welcome to the Bitnami WordPress 5.3-0 ***
*** Documentation: https://docs.bitnami.com/aws/apps/wordpress/ ***
*** https://docs.bitnami.com/aws/ ***
*** Bitnami Forums: https://community.bitnami.com/ ***
#######################################################
### For frequently used commands, please run: ###
### sudo /opt/bitnami/bnhelper-tool ###
#######################################################
Last login: Fri May 8 06:50:17 2020 from 113.156.93.49
bitnami@xxxxx:~$ ls
apps bitnami_credentials htdocs stack tools
```
5.2. Archive uploaded image files
```shell
bitnami@xxx $ cd ~/apps/wordpress/htdocs/wp-content/uploads/
bitnami@xxx $ tar cvf wp-images.tar 2019 2020 2021
bitnami@xxx $ zip wp-images.tar.zip wp-images.tar
bitnami@xxx $ ls wp-images.tar.zip
wp-images.tar.zip
bitnami@xxx $ exit
$
```
It is a good idea to check the size of the archive, for example with ls -lart.
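The separate tar and zip steps could probably be combined, since tar can gzip-compress the archive itself. The sketch below runs in a throwaway directory with dummy year folders so nothing on the server is touched; the archive name `wp-images.tar.gz` and the sample file are assumptions, not what I actually used.

```shell
#!/usr/bin/env bash
# Work in a temporary directory with dummy year folders.
workdir=$(mktemp -d)
mkdir -p "$workdir/2019" "$workdir/2020" "$workdir/2021"
touch "$workdir/2020/sample.png"

# -c create, -z gzip-compress, -f archive file; -C changes into workdir first.
tar -czf "$workdir/wp-images.tar.gz" -C "$workdir" 2019 2020 2021

# List the archive contents to confirm what was packed.
tar -tzf "$workdir/wp-images.tar.gz"
```

On the client side such an archive would be unpacked in one step with `tar -xzf wp-images.tar.gz`.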
5.3. Download files and unzip
Download the archived image files from the local client.
```shell
$ cd ~/Downloads
$ mkdir uploaded_images
$ cd uploaded_images
$ scp -i "xxx.pem" ubuntu@"xxxxxxx".ap-northeast-1.compute.amazonaws.com:/home/bitnami/apps/wordpress/htdocs/wp-content/uploads/wp-images.tar.zip ./
100% 80MB 6.4MB/s 00:12
$ ls wp-images.tar.zip
wp-images.tar.zip
$ unzip wp-images.tar.zip
$ tar xvf wp-images.tar
$ ls
2019 2020 2021
```
5.4. Convert image file paths and rename
First, consolidate the file locations.
```shell
$ pwd
~/repo/exitwp-for-hugo
$ mkdir images
$ mkdir images/wp_kg
$ cd ~/Downloads/uploaded_images
$ mv 2019 2020 2021 ~/repo/exitwp-for-hugo/images/wp_kg/
$ ls
_posts images png-eps-and-svg about lesson sample-page
$ cd _posts
$ mkdir lesson
$ mv *unit07* *unit09* *unit10* *unit11* ./lesson/
```
The image URLs in the converted files differ from the ones my Hugo site uses:
- Converted WordPress files (.markdown) image URL: http://www.homma-lab.com/wp-content/uploads/
- My Hugo site files (.md) image URL: ../../images/lesson/
So I rewrote the URLs with sed, and at the same time used awk to give each converted file a .md extension.
```shell
$ for v1 in *.markdown
> do
> sed -e 's/http:\/\/www.homma-lab.com\/wp-content\/uploads/..\/..\/images\/lesson/g' "${v1}" > "$(echo "${v1}" | awk -F . '{print $1 ".md"}')"
> done
```
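The same rewrite can be written a little more readably. This is only a sketch of an alternative, not the command I ran: it uses `|` as the sed delimiter so the slashes need no escaping, and shell parameter expansion instead of awk for the rename. The helper name `rewrite_image_urls` is hypothetical, and the target path is the `../../images/lesson/` listed above.

```shell
#!/usr/bin/env bash
# Rewrite WordPress upload URLs to the relative image path used on the Hugo site.
# rewrite_image_urls is a hypothetical helper name.
rewrite_image_urls() {
  sed -e 's|http://www\.homma-lab\.com/wp-content/uploads|../../images/lesson|g'
}

# Example on a single Markdown image link.
echo '![](http://www.homma-lab.com/wp-content/uploads/2020/08/jdk_install.png)' \
  | rewrite_image_urls
# prints: ![](../../images/lesson/2020/08/jdk_install.png)

# Renaming without awk: strip the ".markdown" suffix and append ".md".
f='2020-08-25-unit07-1.markdown'
echo "${f%.markdown}.md"
# prints: 2020-08-25-unit07-1.md
```

Unlike splitting on the first dot with awk, `${f%.markdown}` removes only the trailing extension, so it would also work for file names that contain other dots.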
```shell
$ ls
2020-08-25-unit07-1.markdown 2020-09-14-unit07-5.md 2020-10-07-unit07-9.markdown
2020-08-25-unit07-1.md 2020-09-23-unit07_image.markdown 2020-10-07-unit07-9.md
2020-08-28-unit07-2.markdown 2020-09-23-unit07_image.md 2021-01-12-unit09.markdown
2020-08-28-unit07-2.md 2020-09-30-unit07-6.markdown 2021-01-12-unit09.md
2020-08-31-unit07-3.markdown 2020-09-30-unit07-6.md 2021-01-20-unit10-small-computer.markdown
2020-08-31-unit07-3.md 2020-09-30-unit07-7.markdown 2021-01-20-unit10-small-computer.md
2020-09-03-unit07-4.markdown 2020-09-30-unit07-7.md 2021-01-27-unit11.markdown
2020-09-03-unit07-4.md 2020-10-06-unit07-8.markdown 2021-01-27-unit11.md
2020-09-14-unit07-5.markdown 2020-10-06-unit07-8.md
```
5.5. Move files to the Hugo directory
Then I moved the files to the Hugo content directory.
```shell
mv *md ~/repo/gitlab/site/content/lessons/
```
WordPress generates multiple files in a range of sizes for each uploaded image, as shown below, so I needed to pick the files that are actually used and place them in the Hugo image directory.
```shell
$ ls -l スクリーンショット-2020-10-21-10.29.36*
-rw-r--r-- 1 tato staff 13984 10 21 2020 スクリーンショット-2020-10-21-10.29.36-100x100.png
-rw-r--r-- 1 tato staff 10222 10 21 2020 スクリーンショット-2020-10-21-10.29.36-120x68.png
-rw-r--r-- 1 tato staff 30514 10 21 2020 スクリーンショット-2020-10-21-10.29.36-150x150.png
-rw-r--r-- 1 tato staff 19454 10 21 2020 スクリーンショット-2020-10-21-10.29.36-160x90.png
-rw-r--r-- 1 tato staff 61625 10 21 2020 スクリーンショット-2020-10-21-10.29.36-300x165.png
-rw-r--r-- 1 tato staff 74914 10 21 2020 スクリーンショット-2020-10-21-10.29.36-320x180.png
-rw-r--r-- 1 tato staff 193237 10 21 2020 スクリーンショット-2020-10-21-10.29.36-768x423.png
-rw-r--r-- 1 tato staff 117284 10 21 2020 スクリーンショット-2020-10-21-10.29.36.png
```
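The resized variants in the listing above all follow the pattern of the original name plus a `-<width>x<height>` suffix. As an illustration of that naming, the sketch below maps a variant back to its original name. It assumes the suffix is always the last part before the extension; `original_image_name` is a hypothetical helper, and an upload whose real name happens to end in something like `-100x100` would be misread.

```shell
#!/usr/bin/env bash
# Strip a trailing "-<width>x<height>" size suffix from an image file name.
# original_image_name is a hypothetical helper name.
original_image_name() {
  printf '%s\n' "$1" | sed -E 's/-[0-9]+x[0-9]+(\.[A-Za-z0-9]+)$/\1/'
}

# A resized variant and the original both map to the same name.
original_image_name 'screenshot-2020-10-21-10.29.36-768x423.png'
original_image_name 'screenshot-2020-10-21-10.29.36.png'
```

In my case the articles reference a mix of originals and resized variants, so I copied whichever file each article actually points to.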
The paths of the files actually used (e.g. ../../images/**/**/png) can be extracted with the following command.
```shell
$ egrep 'png|jpg|gif' ../../_posts/lesson/*md | grep images | awk '{match($0, /\(.*\)/); url=substr($0, RSTART+1, RLENGTH-2); print url}'
../../images/lesson/2020/08/jdk_install.png
../../images/lesson/2020/08/スクリーンショット-2020-08-26-14.29.44-988x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-26-15.16.54-1021x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.17.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.32-1024x576.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-20.23.45-1024x566.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-21.32.41-994x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-28-22.22.41-984x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.34.33-993x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.41.35-983x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-1.55.08-980x1024.png
../../images/lesson/2020/08/スクリーンショット-2020-08-29-2.04.49-994x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-30-11.42.51.png
../../images/lesson/2020/08/スクリーンショット-2020-08-31-14.54.12-988x1024.png
../../images/lesson/2020/09/clock_animation.gif
../../images/lesson/2020/09/joho_06_clock.gif
../../images/lesson/2020/09/export.gif
../../images/lesson/2020/09/export-1.gif
../../images/lesson/2020/09/export-2.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-14-15.22.12-983x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-14-15.26.24-993x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.20.40-1024x722.png
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.22.00.png
../../images/lesson/2020/09/export-3.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-23-10.45.42-990x1024.png
../../images/lesson/2020/09/export-4.gif
../../images/lesson/2020/09/export-5.gif
../../images/lesson/2020/09/スクリーンショット-2020-09-30-13.13.02-983x1024.png
../../images/lesson/2020/09/スクリーンショット-2020-09-30-13.09.34.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-9.35.58.png
../../images/lesson/2020/10/Anpanman_baikinman.gif
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.28.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.37.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.00.46.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.39.03.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.25.18.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-10.54.45.png
../../images/lesson/2020/10/スクリーンショット-2020-10-07-11.01.31.png
../../images/lesson/2020/10/export-1.gif
../../images/lesson/2020/10/スクリーンショット-2020-10-07-11.42.35-1024x538.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.12.39-1.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.20.43.png
../../images/lesson/2021/01/スクリーンショット-2021-01-12-17.26.30.png
../../images/lesson/2021/01/ProjectName.jpg
```
So I copied them with the following command.
```shell
$ for v1 in $(egrep 'png|jpg|gif' ../_posts/lesson/*md | grep images | awk '{match($0, /\(.*\)/); url=substr($0, RSTART+1, RLENGTH-2); print url}')
> do
> cp ${v1} ~/repo/gitlab/site/content/images/lesson/
> done
```
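The extracted list contains the same path several times when an image is referenced more than once, so the loop copies some files repeatedly. That is harmless, but the list can be deduplicated first. This is a sketch of the same pipeline with `sort -u` appended, assuming one image link per line and paths without spaces; `list_image_paths` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Extract image paths from Markdown image links, one per line, without duplicates.
# list_image_paths is a hypothetical helper; it reads the files given as
# arguments, or standard input when called without arguments.
list_image_paths() {
  grep -E 'png|jpg|gif' "$@" \
    | grep images \
    | awk '{ match($0, /\(.*\)/); print substr($0, RSTART + 1, RLENGTH - 2) }' \
    | sort -u
}

# Example: three links, two of them identical, yield two unique paths.
printf '%s\n' \
  '![](../../images/lesson/2020/10/a.png)' \
  '![](../../images/lesson/2020/10/a.png)' \
  '![](../../images/lesson/2020/09/b.gif)' \
  | list_image_paths
```

The output of `list_image_paths ../_posts/lesson/*md` could then feed the copy loop in place of the inline pipeline.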
6. Convert Hugo metadata in the articles
The converted articles kept URLs that follow the WordPress blog URL rule, such as /2020/05/08/unit01-introduction/.
I wanted them to be /lessons/unit01-introduction/ instead. This time I changed them manually, since there were only 13 files, and consolidated titles and tags along the way.
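For a larger number of files, the URL part of this change could be scripted. I did not do this, so treat it as a sketch: it assumes the front matter contains a single `url:` line in the `/YYYY/MM/DD/slug/` form shown earlier, and `rewrite_post_url` is a hypothetical helper name. Titles and tags would still need manual attention.

```shell
#!/usr/bin/env bash
# Rewrite a WordPress-style front matter URL (/YYYY/MM/DD/slug/) to /lessons/slug/.
# rewrite_post_url is a hypothetical helper; only lines starting with "url:" change.
rewrite_post_url() {
  sed -E 's|^url: /[0-9]{4}/[0-9]{2}/[0-9]{2}/([^/]+)/$|url: /lessons/\1/|'
}

echo 'url: /2020/05/08/unit01-introduction/' | rewrite_post_url
# prints: url: /lessons/unit01-introduction/
```

Lines that do not match the pattern, such as `title:` or the article body, pass through unchanged.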
Finally, the lesson pages were safely migrated from WordPress and now run on Hugo 😉