I have a process where I need to copy all the images from a web page. I used to run this process with xmllint, which will process an XML or HTML file and print out the entries you specify. But when my server host provider upgraded their systems, they didn’t include xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.
You may not think Bash can parse data files, but it can with some clever thinking. Bash, like other UNIX shells before it, can parse lines one at a time from a file via the built-in read statement.
By default, the read statement scans a line of data and splits it into fields. Usually, read splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (IFS) value and the end-of-line delimiter (-d).
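As a quick illustration of those two settings, here is a minimal sketch, assuming Bash (the -d option is a Bash extension and is not part of POSIX sh). The sample strings and variable names are made up for this demonstration, and -r is added only to keep backslashes literal.

```shell
#!/bin/bash
# Default behavior: read takes one line and splits it on spaces and tabs.
# Any leftover words land in the last variable.
read -r first second rest <<< "alpha beta gamma delta"
echo "first=$first second=$second rest=$rest"

# Setting IFS for this one read command splits on colons instead.
IFS=':' read -r user shell_path <<< "jhall:/bin/bash"
echo "user=$user shell_path=$shell_path"

# -d changes the character that ends a "line"; here read stops at the semicolon.
read -r -d ';' record <<< "one two;three four"
echo "record=$record"
```

Putting the IFS assignment in front of read changes the separator for that single command only, so the rest of the script keeps the default behavior.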
To parse an HTML file using read, set the IFS to a greater-than symbol (>) and the delimiter to a less-than symbol (<). Each time Bash scans a line, it parses up to the next < (the start of an HTML tag) then splits that data at each > (the end of an HTML tag). This sample code takes a line of input and splits the data into the TAG and VALUE variables:
local IFS='>'
read -d '<' TAG VALUE
Let’s explore how this works. Consider this simple HTML file:
<img src="https://www.cloudsavvyit.com/logo.png"
alt="My logo" />
<p>some text</p>
The first time read parses this file, it stops at the first < symbol. Since < is the first character of this sample input, that means Bash finds an empty string. The resulting TAG and VALUE strings are also empty. But that’s fine for my use case.
The next time Bash reads the input, it gets img src="https://www.cloudsavvyit.com/logo.png"↲alt="My logo" />↲ with a newline right before the alt, and stops before the < symbol on the next line. Then read splits the line at the > symbol, which leaves TAG with img src="https://www.cloudsavvyit.com/logo.png"↲alt="My logo" / and VALUE with just a newline.
The third time read parses the HTML file, it gets p>some text. Bash splits the string at the > resulting in TAG containing p and VALUE with some text.
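To watch those three reads happen, here is a small sketch, assuming Bash. It feeds the sample HTML to read in a loop and prints TAG and VALUE inside brackets so that empty strings and embedded newlines are visible. The html, count, last_tag, and last_value variables are names I made up for the demonstration, and because this loop is not inside a function, it sets IFS as a prefix on the read command rather than with local.

```shell
#!/bin/bash
# The sample HTML, with the newline before alt preserved.
html='<img src="https://www.cloudsavvyit.com/logo.png"
alt="My logo" />
<p>some text</p>'

count=0
# Each pass reads up to the next '<' and splits what it read at the first '>'.
while IFS='>' read -d '<' TAG VALUE ; do
    count=$((count + 1))
    last_tag=$TAG
    last_value=$VALUE
    # Brackets make empty strings and embedded newlines easy to spot.
    printf 'read %d: TAG=[%s] VALUE=[%s]\n' "$count" "$TAG" "$VALUE"
done <<< "$html"

echo "successful reads: $count"
```

The loop ends once read reaches the end of the input without finding another <, so the text after the final < (the closing /p> here) is never printed.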
Now that you understand how to use read, it’s easy to parse a longer HTML file with Bash. Start with a Bash function called xmlgetnext to parse the data using read, since you’ll be doing this again and again in the script. I named my function xmlgetnext to remind me this is a replacement for the Linux xmllint program, but I could have just as easily named it htmlgetnext.
xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}
Now call that xmlgetnext function to parse the HTML file. This is my complete htmltags script:
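#!/bin/bash
# print a list of all html tags
# (uses local and read -d, which need Bash rather than a plain POSIX sh)

xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

# $TAG is left unquoted on purpose so multi-line tags print on one line
cat "$1" | while xmlgetnext ; do echo $TAG ; done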
The last line is the key. It loops through the file using xmlgetnext to parse the HTML, and prints out only the TAG entries. Because $TAG is not quoted, the shell splits its value on the standard field separators (spaces, tabs, and newlines) before echo sees it, and echo prints the resulting words with single spaces between them. So any entries like img src="https://www.cloudsavvyit.com/logo.png"↲alt="My logo" / that contain a newline get printed on a single line, as img src="https://www.cloudsavvyit.com/logo.png" alt="My logo" /.
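The difference between the unquoted and quoted forms is easy to see in isolation. This is a minimal sketch, assuming Bash with the default IFS; the TAG value is a shortened, made-up stand-in for the logo example.

```shell
#!/bin/bash
# A hypothetical TAG value containing an embedded newline.
TAG='img src="logo.png"
alt="My logo" /'

# Unquoted: the shell splits $TAG on spaces, tabs, and newlines,
# and echo prints the resulting words separated by single spaces.
unquoted=$(echo $TAG)

# Quoted: the value reaches echo as one argument, newline intact.
quoted=$(echo "$TAG")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

One caveat with the unquoted form: the shell also applies filename expansion to the split words, so a tag containing a character like * could expand to file names in the current directory.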
To fetch just the list of images, I run the output of this script through grep to only print the lines that have an img tag at the start of the line.
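Here is a sketch of that last step, assuming Bash and a grep pattern of '^img ' (my guess at a reasonable pattern; the trailing space keeps it from matching other tag names that merely begin with img). The sample page and the images variable are invented for the demonstration, and the text is piped in directly instead of being read from a file.

```shell
#!/bin/bash
# Same parsing function as in the htmltags script.
xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

# Hypothetical sample page, used here instead of a real file.
sample='<html><body>
<img src="a.png" alt="first" />
<p>an image follows</p>
<img src="b.png"
alt="second" />
</body></html>'

# Print every tag on one line, then keep only lines that begin with "img ".
images=$(printf '%s\n' "$sample" | while xmlgetnext ; do echo $TAG ; done | grep '^img ')
echo "$images"
```

With the real script saved as htmltags, the equivalent command would look something like ./htmltags page.html | grep '^img ', where page.html is a placeholder file name.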