Using a Perl script to turn Word files into HTML.

So you have written something in Word and want to turn it into an HTML document. What do you do? Word lets you save the file "as HTML", which produces an HTML document directly. However, if your text includes footnotes or endnotes, these would not show up in the new HTML document. A different approach is needed if you want your HTML to include footnotes or endnotes.

One approach is to save the Word file as a text file and then use a Perl script to replace each newline character, \n, with either a break tag or a paragraph tag.
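As a minimal sketch of this idea, the snippet below applies the substitution to a few sample lines held in memory rather than to a file. The sub name mark_newlines and the sample text are made up for illustration. The tag is passed in as an argument, so the same code can insert either <br> or <p>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace the newline at the end of each line
# with the given tag and return the modified copies.
sub mark_newlines {
    my ($tag, @lines) = @_;
    foreach my $line (@lines) {
        $line =~ s/\n/$tag/;
    }
    return @lines;
}

# Sample lines, as they would be read from a text file
my @text = ("First paragraph.\n", "Second paragraph.\n");

# prints: First paragraph.<br>Second paragraph.<br>
print join("", mark_newlines("<br>", @text)), "\n";

# prints: First paragraph.<p>Second paragraph.<p>
print join("", mark_newlines("<p>", @text)), "\n";
```

Because the sub works on copies of the lines, the original @text is left unchanged and can be reused for both calls.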

To do this, you will need a Perl interpreter and a Perl script. In an earlier piece, Implementing Style Sheets with Perl, I introduced a Perl script that performs a substitution across an array of HTML files. Here, I use the same script with one small change. For my attempt to explain how the script works, by all means go to Implementing Style Sheets with Perl. Since it is explained there, I will simply give the instructions for carrying out this particular plan.

1- Open your Word file. Click the File option in the upper menu bar, then scroll down to and select Save As. The Save As dialog box appears; at the bottom you will find the option boxes File name and Save as type. In Save as type, select Text Only. Then press the Save button at the top right of the dialog box.

2- Repeat this process for each of the other files you want to convert. Because the Perl script works on an array of files, you can make text versions of as many Word files as you like. Always note the name of each file, with its .txt extension, and its full path, so that the Perl script can find the files when it executes.

3- Open your Perl interpreter and enter this script:

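#!/usr/bin/perl
use strict;
use warnings;

# Full paths of the text files saved from Word
my @file = ("C:/file/file1.txt",
"C:/file/file2.txt",
"C:/file/file3.txt"
) ;

for (my $i=0; $i<@file; $i++)
{
# Read every line of the current file into @input
unless (open(INPUT, "$file[$i]")) {
die ("Cannot open file $file[$i]\n");
}
my @input = <INPUT>;
close (INPUT);

# Replace the newline at the end of each line with a paragraph tag
foreach my $line (@input) {
$line =~ s/\n/<p>/;
}

# Write the modified lines back over the same file
unless (open(OUTPUT, ">$file[$i]")) {
die ("Cannot open file $file[$i] for writing\n");
}
# No quotes around @input: "@input" would insert a space between lines
print OUTPUT @input;
close(OUTPUT);
}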

4- This script is identical to the one in Implementing Style Sheets with Perl, the only difference being the regular expression:

$line =~ s/\n/<p>/;

This tells the script to replace each newline character, the \n, with a paragraph tag, the <p>. Since a Word file saved as a text file marks the end of each paragraph with a newline character, this is the obvious solution. Newline characters are not seen in the text file as such; they are known simply by the fact that there is... a new line.

If you want, you can keep the newlines in the HTML file with the following regular expression:

$line =~ s/\n/\n<p>/;

This means that every newline is replaced with a newline plus a paragraph tag.

5- The last step in this plan is to take all your newly formatted text files and add the necessary HTML tags: the head tags, the body tags, the title tags, heading tags and any other formatting. The most important work has already been done: the paragraphs are marked, as are the endnotes. The endnotes each begin with a number, which appears in the text as intended. When this is finished, save these text files as HTML files.
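The wrapping described in this last step can also be scripted. What follows is only a sketch under stated assumptions: the sub names wrap_html and text_to_html_file, the example title and the file path are all hypothetical, and the skeleton contains just the head, title and body tags. Headings and any other formatting would still have to be added by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: surround already-tagged text with a minimal
# HTML skeleton (head, title and body tags).
sub wrap_html {
    my ($title, $body) = @_;
    return "<html>\n<head>\n<title>$title</title>\n</head>\n"
         . "<body>\n$body\n</body>\n</html>\n";
}

# Hypothetical helper: read a formatted text file and write the wrapped
# result to a file with the same name but an .html extension.
sub text_to_html_file {
    my ($txt_path, $title) = @_;

    # Derive the output name; refuse anything that is not a .txt file
    (my $html_path = $txt_path) =~ s/\.txt$/.html/;
    die ("Expected a .txt file, got $txt_path\n") if $html_path eq $txt_path;

    open(my $in, "<", $txt_path) or die ("Cannot open file $txt_path\n");
    my $body = join("", <$in>);
    close($in);

    open(my $out, ">", $html_path) or die ("Cannot open file $html_path\n");
    print {$out} wrap_html($title, $body);
    close($out);

    return $html_path;
}

# Example call (path and title are placeholders):
# text_to_html_file("C:/file/file1.txt", "My first document");
```

Writing to a separate .html file, rather than over the text file, leaves the formatted text in place in case the tags need to be adjusted and the wrapping repeated.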