Fangorn for text files

1. Background and documentation

Fangorn was written in Basic by Hugo Besemer (PUDOC, Wageningen, the Netherlands) and Paul Nieuwenhuysen (VUB, Brussels), and was sponsored by UNESCO. Its purpose was to convert downloads from online databases or CD-ROMs to the ISO 2709 record structure, which can be imported into CDS/ISIS.

There are two versions, one from 1989 and one from 1992. The latter is superior: it fixes a few bugs and it can also convert comma-delimited files. It can be obtained from many national or regional CDS/ISIS distributors.

Normally, the program comes with an extensive manual in the form of an ASCII file, "ENGLISH.MAN", and a few examples of input files and specification files. The program may be mentioned in many articles on CDS/ISIS, but there is only one evaluation of it (in Dutch). The CDS/ISIS handbook by A. Buxton and A. Hopkinson describes the program (p. 112-115), but the example of an input file on p. 113 is not the usual kind of input file.

2. General evaluation

Plus Minus

3. Preliminary work

Before the actual conversion you must analyse the input file as described in 2.2.

As Fangorn only accepts ASCII texts or comma delimited files, all other sorts of input must be transformed to one of these two formats. This means that word processor files must be exported in ASCII text form and that dBase files must be copied to a comma delimited file. We shall deal with dBase files later (see chapter 6).

There are many reasons why a text file may be in a word processor format and not in plain ASCII:

  1. Some CD-ROMs offer the possibility to download in WordPerfect or Word format.
  2. Some input material will consist of a list written with a word processor. This can be the case when you have asked publishers for a copy of their printed catalogue. You could also work with the text file which was used to produce a bibliography on paper.
  3. OCR scanners may deliver the text in word processor format.
  4. You may also have converted an ASCII download to word processor format to make some preliminary alterations.
Beware of three pitfalls:
  1. Make sure the word processor has not hyphenated words at the end of a line. These splits must be undone first or they will end up in your CDS/ISIS database, where they will be difficult to remove.
  2. Make sure there are no artificial page breaks, so-called "hard pages", in the text. They will show up as stray characters in the CDS/ISIS database.
  3. Some characters, especially in languages which use accents or non-standard character sets, could be transformed into unusual ASCII signs, such as •. If you are using that kind of material, check for this and make a list. You will need this list after the conversion to locate those signs and to transform them back.
It is advisable to put the program Fangorn, all specification files and all ISO 2709 files Fangorn created, in one directory, e.g.
  C:\FANGORN
In doing so, you will save yourself a lot of time typing path instructions or searching for the right path.

Never test the conversion with very long files (e.g. 1,000 records). As Fangorn is rather slow, this will take a lot of time. Choose a few records from the download to test.

As an example we will use here a download from the ERIC database on CD-ROM (Silver Platter interface). To simplify things we take a record describing a journal article:

   3 of 220  Complete Record Tagged
EN- EJ476110|
AN- HE532132|
TI- What Size Libraries for 2010?|
AU- Matier, Michael^Sidle, C. Clinton|
JN- Planning for Higher Education, ^v21 n4 p9-15 Sum |
PY- 1993|
SN- 0736-0983|
AL- UMI|
LA- English|
DT- EVALUATIVE REPORT (142)^ POSITION PAPER (120)^ JOURNAL ARTICLE (080)|
JA- CIJMAY94|
TA- Administrators^Practitioners|
AB- In planning for the year 2010, Cornell University (New York) found that, 
    although space for library collections must increase, advancing technology 
    would slow the increase required, with most of the collection growth after 
    2000 managed through wider technology or remote storage.^Consolidation of 
    special libraries is also considered.^(MSE)|
DE- Case Studies^College Buildings^*College Libraries^College Planning^
    *Facility Expansion^*Facility Requirements^Higher Education^Information 
    Storage^*Library Collections^*Library Planning^Long Range Planning^Space 
    Utilization^Special Libraries^*Technological Advancement|
ID- *Cornell University NY||
First, print a record like this and number the fields in the margin. Then analyse the record as shown in paragraph 2.2.
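
As a cross-check on the manual analysis, the tagged layout of such a record can also be read mechanically. The following Python sketch is not part of Fangorn. It assumes that a field starts with a two-letter tag followed by "- ", that continuation lines start with blanks, and that "|" closes a field. The name parse_record is made up for this illustration, and the sample is an abridged copy of the record above.

```python
import re

# Assumed layout: "XX- content"; continuation lines are indented; "|" ends a field.
TAG_LINE = re.compile(r"^([A-Z]{2})- (.*)$")

def parse_record(lines):
    """Return a list of (tag, content) pairs for one downloaded record."""
    fields = []
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            fields.append([match.group(1), match.group(2).strip()])
        elif fields and line.startswith(" "):
            # Continuation line: glue it to the current field. No blank is
            # inserted after "^", so an occurrence never starts with a space.
            previous = fields[-1][1]
            glue = "" if previous.endswith("^") else " "
            fields[-1][1] = previous + glue + line.strip()
    # Drop the closing "|" (the last field of a record ends with "||").
    return [(tag, text.rstrip("|").strip()) for tag, text in fields]

sample = [
    "   3 of 220  Complete Record Tagged",
    "EN- EJ476110|",
    "TI- What Size Libraries for 2010?|",
    "DE- Case Studies^College Buildings^*College Libraries^College Planning^",
    "    *Facility Expansion^*Facility Requirements|",
    "ID- *Cornell University NY||",
]

# Number the fields in the order in which they occur, as done by hand above.
for number, (tag, text) in enumerate(parse_record(sample), start=1):
    print(number, tag, text)
```

The header line of the record is skipped because it comes before the first tag; everything else is numbered in the order of occurrence, which is the numbering used in the specification file later on.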

4. Executing the program

The general procedure to convert with Fangorn takes 3 steps:
  1. Instruct Fangorn to make an empty conversion table, a so-called specification file.
  2. Fill in this specification file with an editor.
  3. Go back to Fangorn and instruct the program to convert the input file according to the instructions of the specification file.
First, start the program. You will see this screen:
                                                                               
                                                                                
               FRANÇAIS        => F
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               ENGLISH         => E                                               
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               DOS             => X                                               
                                                                                
                                                                                
                                                                                
                    ?                                                                 
                                                                                
Choose the English version by typing E. This will bring you to this screen:

Note the date of the version in the upper right corner. The right-hand vertical line may look like a line of ž-characters. This depends on the character set of your computer: set 437 (English) produces a line; set 850 (multilingual) produces the ž's. The character set is specified in this line of your AUTOEXEC.BAT:

  MODE CON CODEPAGE SELECT=437
Do not worry: this is only a matter of screen layout; it has no effect on the program. Do not change the AUTOEXEC.BAT for the sake of this line; this may affect other programs.

The first time you are going to convert a specific kind of input file, you must create a new specification file by typing "new". This will produce a new question:

Fill in the maximum number of fields a record has and press Enter. Now, the next question appears. Fill in the name you want to give to your specification file, e.g.

All kinds of names are supported, as long as they are within the limits of the MS-DOS specifications (8 characters before and 3 after the dot), but it is good practice to make them as transparent as possible. In the example, the name means: specification file (extension ".SPE"!) for LISA, with the Silver Platter interface (there could be a LISABS.SPE, which would mean: for LISA with the Bowker-Saur interface). You could use other systems, like:

etc. This can come in handy when you are not making the downloads yourself. After you have given the name of the specification file, the program will say it can be filled in. If you press a key, you will get this screen:

Now you can type X to leave Fangorn or you can make a "visit" to DOS by typing T. You could use this option to start an editor and fill in the specification file; the command EXIT would bring you back to Fangorn. Do not do this, however: leave the program and restart it afterwards. If you use option T, there is a real possibility that your computer will "hang" when you want to start the conversion.

                        SPECIFICATION FILE
Which conversion is specified         :ERIC on CD-ROM (Silver Platter)
Number of fields                      :15
Text indicating start/end of record   :Complete Record Tagged
Tag indicating start/end of record    :
Line indicating start/end of record   :
Two texts indicating start/end of rec.:

Tag in incoming file                  :EN-
Tag in ISO-2709 file                  :1
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :AN-
Tag in ISO-2709 file                  :2
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :TI-
Tag in ISO-2709 file                  :3
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :AU-
Tag in ISO-2709 file                  :4
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :,
Subfield delimiter ISO-2709 file      :ab
Delimiter occurrences incoming file   :^
Texts to replace                      :

Tag in incoming file                  :JN-
Tag in ISO-2709 file                  :5
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :, ^
Subfield delimiter ISO-2709 file      :ab
Delimiter occurrences incoming file   :
Texts to replace                      : Sum@@

Tag in incoming file                  :PY-
Tag in ISO-2709 file                  :6
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :SN-
Tag in ISO-2709 file                  :7
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :AL-
Tag in ISO-2709 file                  :8
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :LA-
Tag in ISO-2709 file                  :9
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :DT-
Tag in ISO-2709 file                  :10
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :JA-
Tag in ISO-2709 file                  :11
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :TA-
Tag in ISO-2709 file                  :12
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :

Tag in incoming file                  :AB-
Tag in ISO-2709 file                  :13
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :^@@ ~~

Tag in incoming file                  :DE-
Tag in ISO-2709 file                  :14
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :^
Texts to replace                      :*@@

Tag in incoming file                  :ID-
Tag in ISO-2709 file                  :15
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :
 
The specification file consists of two parts: In the first block, one line will be filled in:
  Number of fields                      :15
You may fill in the first line. This will help you later to identify which specification file this is, but it is not necessary:
  Which conversion is specified         :ERIC on CD-ROM (Silver Platter)  
On the other hand, you must fill in one of the next lines:
  Text indicating start/end of record   :
  Tag indicating start/end of record    :
  Line indicating start/end of record   :
  Two texts indicating start/end of rec.:

1) Text indicating start/end of record

In some downloads the first line of each record contains a message like:
  1 of 220 Complete Record Tagged
In this case you can fill in:
  Text indicating start/end of record   :Complete Record Tagged
You could fill in more strings which occur at the top of every record, separated by "~~". You would do so if you were to use one specification file for more than one kind of incoming file. The manual gives this example:
  .00/00~~/4/
This would mean: the first line contains ".00/00" OR "/4/". You can use up to five different strings this way. (This would also mean that the field structure of those files is the same...)

2) Tag indicating start/end of record

In the example we do not use the upper line of the record, but the first tag, "TI:". This will probably be the only way to indicate the record in some files and it is by far the most secure one, as will be explained in chapter 5, on pitfalls.

The reason why we use it in the example is that we want to omit everything from field 10 (CC:) up to "TI:" in the following record.

3) Line indicating start/end of record

This is meant for files which e.g. end with a special sign between records, like:
  §
Nothing else may occur on the line. You can fill in 5 different signs, separated by "~~". Personally, I have never seen a file where this would be the case.

4) Two texts indicating start/end of rec.

For a download where every record begins with
  1 of 220 Complete Record Tagged
you can use:
  of~~Complete Record Tagged  
This indicates that both texts should appear at the top of each record.
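
The text-based options can be pictured as a simple line scan. The Python sketch below is an illustration only, not Fangorn's own code: it assumes that a line containing one of the given strings (option 1) or all of them (option 4) opens a new record and is itself discarded. The function name split_records and the parameter require_all are hypothetical, and the first record in the sample download is invented.

```python
def split_records(lines, start_texts, require_all=False):
    """Group lines into records.

    A line opens a new record when it contains any of the start texts,
    or all of them when require_all is True (the "two texts" option).
    """
    test = all if require_all else any
    records = []
    current = None
    for line in lines:
        if test(text in line for text in start_texts):
            current = []               # the marker line itself is not kept
            records.append(current)
        elif current is not None and line.strip():
            current.append(line)       # lines before the first marker are ignored
    return records

download = [
    "   1 of 220  Complete Record Tagged",
    "EN- EJ476109|",                   # invented record, for illustration only
    "TI- First title|",
    "   2 of 220  Complete Record Tagged",
    "EN- EJ476110|",
    "TI- What Size Libraries for 2010?|",
]

# "~~" separates the alternative strings, as in the specification file.
one_text = "Complete Record Tagged".split("~~")
two_texts = "of~~Complete Record Tagged".split("~~")

print(len(split_records(download, one_text)))
print(len(split_records(download, two_texts, require_all=True)))
```

Requiring two texts is stricter than requiring one: a line that merely contains "of" somewhere in an abstract does not start a new record unless "Complete Record Tagged" is on the same line.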

The rest of the text blocks treat the fields, one by one. The simplest form is:

Tag in incoming file                  :EN-
Tag in ISO-2709 file                  :1
(The following is not mandatory)
Spaces in continuation lines          :
Subfield delimiter in incoming file   :
Subfield delimiter ISO-2709 file      :
Delimiter occurrences incoming file   :
Texts to replace                      :
This means: the field begins with the tag "EN-" and it should be numbered as "1" in the ISO 2709 file.

The next line:

  Spaces in continuation lines          :
is meant to skip those fields you do not want to convert. In the example you could eliminate the field ID- by specifying that 4 spaces must come in front of the content of field DE-:

DE- Case Studies^College Buildings^*College Libraries^College Planning^
    *Facility Expansion^*Facility Requirements^Higher Education^Information 
    Storage^*Library Collections^*Library Planning^Long Range Planning^Space 
    Utilization^Special Libraries^*Technological Advancement|
ID- *Cornell University NY||

Spaces in continuation lines          :4
in the specification for "Description" this would mean: every field which does not begin with 19 blanks is rubbish. So, if the next field you specify is "Frequency", this would leave out the field "Language". Personally, I find it easier to convert all fields, although this may take more time; I omit the fields I do not want in the FST for importing the data.

The next line deals with subfields. In field 4, the author's family name and first name are separated by a comma. We can instruct Fangorn to put one in subfield a and the other in subfield b this way:


Subfield delimiter in incoming file   :,
Subfield delimiter ISO-2709 file      :ab
The result will be:
  ^aMatier^bMichael       
The field JN- contains the name of the journal and the bibliographic location in the journal. The two are separated by ", ^". So we fill in for field 5:
  Subfield delimiter in incoming file   :, ^
  Subfield delimiter ISO-2709 file      :ab
Repeatable fields are treated in the next line:
  Delimiter occurrences incoming file   :
The source file contains author fields like this:
  AU- Matier, Michael^Sidle, C. Clinton| 
So we fill in "^" as the delimiter which signals a next author:
  Delimiter occurrences incoming file   :^
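
The combined effect of the occurrence delimiter and the subfield delimiters can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Fangorn's actual algorithm: the field is first cut into occurrences, each occurrence is then cut into subfields, and the parts are labelled with the letters given for the ISO 2709 file. The function name convert_field and its parameters are hypothetical.

```python
def convert_field(content, sub_in=None, sub_out="", occ_delim=None):
    """Split a field into occurrences, then mark the subfields with ^a, ^b, ...

    sub_in    : subfield delimiter in the incoming file, e.g. ","
    sub_out   : subfield codes for the ISO 2709 file, e.g. "ab"
    occ_delim : delimiter between occurrences in the incoming file, e.g. "^"
    """
    occurrences = content.split(occ_delim) if occ_delim else [content]
    result = []
    for occurrence in occurrences:
        if sub_in:
            parts = [part.strip() for part in occurrence.split(sub_in)]
            # Pair each part with the next subfield code; parts beyond the
            # last code are dropped in this sketch.
            occurrence = "".join("^" + code + part
                                 for code, part in zip(sub_out, parts))
        result.append(occurrence.strip())
    return result

authors = convert_field("Matier, Michael^Sidle, C. Clinton",
                        sub_in=",", sub_out="ab", occ_delim="^")
journal = convert_field("Planning for Higher Education, ^v21 n4 p9-15",
                        sub_in=", ^", sub_out="ab")
print(authors)
print(journal)
```

Splitting on the occurrence delimiter before the subfield delimiter matters here: the comma inside each name must only be interpreted after "^" has separated the two authors.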
Scientific articles with many authors may cause a problem when the names take more than one line. In this example (which I take from the Fangorn manual):
Au:  Zdybiewska, Maria; Trawinski, Miroslaw; Niederlinska-Sryczek, 
        Maria; Jarawk, Marek; Andrzejuk, Zdzislawa
the first line ends with ";". This will produce an output like this:
^aNiederlinska-Stryczek
^aMaria
^aJarawk^bMarek
^aAndrzejuk^bZdzislawa
This can be avoided when you specify the delimiter as:
   Delimiter occurrences incoming file   :;~~S
which means that the field can be split up. This will not be necessary if every line ends with the delimiter:

The last line allows you to replace strings:

Texts to replace                      :

The syntax of the replacement commands is:

text to be replaced @@ replacement text ~~ next text to be replaced @@ replacement text
etc. You can replace up to 5 strings this way.

If you only want to omit a string, it is sufficient to use this command:

   Texts to replace                      : Sum@@
which means: omit " Sum". If you want to change only one string, you could write:
   Texts to replace                      : Sum@@With abstract
In field 13 (AB-) we want to replace every ^ by a space. To do so, you must end the command with ~~; otherwise Fangorn will just omit the ^ and paste the preceding and following words together:
   Texts to replace                      :^@@ ~~
which means: replace ^ with a space.
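
The replacement syntax can be modelled as follows. This Python sketch only mirrors the notation described above ("@@" between the old and the new text, "~~" between commands); it assumes that blanks inside a command are significant and that commands are applied in the order written. The function names are made up, and Fangorn's own limit of five strings is not enforced here.

```python
def parse_replacements(spec):
    """Turn 'old@@new~~old2@@new2' into a list of (old, new) pairs."""
    pairs = []
    for command in spec.split("~~"):
        if not command:
            continue                   # a trailing "~~" leaves an empty command
        # A command without "@@" raises ValueError here.
        old, new = command.split("@@", 1)
        pairs.append((old, new))
    return pairs

def apply_replacements(text, spec):
    """Apply every replacement command to the text, in the order given."""
    for old, new in parse_replacements(spec):
        text = text.replace(old, new)
    return text

print(parse_replacements(" Sum@@"))
print(apply_replacements("v21 n4 p9-15 Sum", " Sum@@"))
print(apply_replacements("remote storage.^Consolidation is considered.^(MSE)",
                         "^@@ ~~"))
```

In this model the closing "~~" is what preserves the blank in "^@@ ~~": the replacement text is everything between "@@" and "~~", so the space is part of it.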

When all fields have been specified, restart Fangorn. Fill in

In the left bottom corner the program will indicate which record is being treated. When the end of the file is reached, three more lines will appear at the bottom of the screen and a little music tune will play:

If an error occurs in the specification file, Fangorn will display an error message, indicating the number of the corrupt line:

Leave Fangorn and correct this line of the specification file with your editor.

5. Pitfalls

Fangorn is by far the best conversion program for CDS/ISIS, but there are still a few bugs in it:
  1. When you have indicated more than 5 strings to be replaced in the specification of one field, the program will run for minutes without any result.
  2. Files with many fields, which contain a lot of strings to be changed may cause the program to "hang".
  3. Some error messages are totally incomprehensible. They appear to be Basic error messages, like "Error 5" etc. This is e.g. the case when you forget to add the "@@" after the string to be replaced in the line "Texts to replace".
  4. If you hit a key other than the ones listed at the bottom of the screen, the program starts the conversion all over again. This can happen when you use a screen saver. If the screen saver is active when the conversion is over, only hit X.
  5. If you have made a visit to DOS, using the option T, to fill in your specification file, only use option "A (other spec.)" to begin the conversion. You will have to fill in the name of the specification file again. If you use option "M (conversion with spec.)", you will get "Error 64" and the computer will "hang". (Option M is meant to do a new conversion with the same specification file.)
  6. If you use the option "Text indicating start/end of record", Fangorn will produce one record more than there are in the input file. This empty record may be filled partially by your import FST, which can fill in some default values. You can eliminate it in the ISO 2709 file by means of an editor: it is the short series of numbers ending in "#~" in the first line of the file. An alternative method is to delete it after import into the test database.
  7. Remember to alter the record separator in the import worksheet to ~ (Alt-126) instead of #. Otherwise CDS/ISIS will produce a nonsense record at best.
  8. When you have changed the specification file a lot, it is possible that Fangorn produces an error message you cannot explain at all. Most probably the specification file is then corrupted. Start all over again: make a new specification file and fill it in.
  9. You could add a text block describing one field to the specification file, using the copy-and-paste techniques of your editor. This will probably be the case when you discover, while filling in the specification file, that you have miscounted the number of fields. When you do so, make sure you change the number of fields in the first block, describing the record; otherwise Fangorn will only treat as many fields as specified there.
  10. If you change "n" or "on" into something that also ends in "n" or "on", the computer hangs or produces "Error 14". E.g. if you wanted to change:
    
       JN- Planning for Higher Education, ^v21 n4 p9-12 Sum | 
    
       into:
    
       JN- Planning for Higher Education, ^v21, n4 p9-12 Sum | 
    
       by this command:
    
       
       Texts to replace                      : n@@, n
    
    the computer will hang.
  11. Be aware of the order in which you want Fangorn to replace strings. In this field:
       AU- John Smith and  Alan Jones
    you cannot at the same time change the space into "^a" and " and " into "%".
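
The point about ordering in item 11 can be made concrete with a small Python sketch. It assumes that replacements are applied one after another in the order written, which is how a plain sequence of string substitutions behaves; whether Fangorn works exactly this way internally is not documented here. The helper name replace_in_order is hypothetical and the field is written with single blanks.

```python
def replace_in_order(text, pairs):
    """Apply (old, new) replacements strictly one after another."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

field = "John Smith and Alan Jones"

# Blanks first: every blank is gone before " and " is looked for,
# so the second replacement no longer finds anything.
spaces_first = replace_in_order(field, [(" ", "^a"), (" and ", "%")])

# " and " first: the separator is converted while its blanks still exist.
and_first = replace_in_order(field, [(" and ", "%"), (" ", "^a")])

print(spaces_first)
print(and_first)
```

Under this assumption, the longer and more specific string has to be replaced before the shorter one that it contains.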

6. Recommendations

Fangorn has some special features which you can use to your advantage:
  1. It was the intention of the authors of Fangorn that the line:
       Tag in ISO-2709 file                  :
    would contain the tag number of the target CDS/ISIS file. Do not fill in the specification file that way; number the fields one by one as they occur in the input file. You will not always be able to use the target numbering for every field, and the result is a mixture of source and target numbers, which is very confusing. You can deal with the target numbering in the FST for importing the ISO 2709 file into your test database, e.g.
       200 0 "^a"v2
    However, if you have an input file whose structure matches the structure of your database, you could use the tag numbering of your database directly.
  2. Fangorn allows you to create comment lines in the specification file. The comment can be placed between two other lines; it can even take more than one line. It must always begin with /* and end with */, e.g.
       Tag in incoming file                  :TI:
       /* this is the translated title; 
       the original is in the next field */
       Tag in ISO-2709 file                  :1
       (The following is not mandatory)
       Spaces in continuation lines          :
       Subfield delimiter in incoming file   :
       Subfield delimiter ISO-2709 file      :
       Delimiter occurrences incoming file   :
       Texts to replace                      :
    
    Never begin a comment on the same line as another entry, like this:
       Tag in incoming file                  :TI: /* translated title */
    
  3. Years ago it was standard to place two spaces after a full stop. This is no longer done and it makes abstracts unnecessarily long. Fangorn can eliminate the second space. Put three "~" signs (Alt-126) after the line "Subfield delimiter in incoming file":
       Tag in incoming file                  :AB:
       Tag in ISO-2709 file                  :7
       (The following is not mandatory)
       Spaces in continuation lines          :
       Subfield delimiter in incoming file   :~~~
       Subfield delimiter ISO-2709 file      :
       Delimiter occurrences incoming file   :
       Texts to replace                      :
    
    Do not worry about leading or trailing blanks; they will be omitted, as well as "|" (ASCII character 124).
  4. Some databases list keywords or descriptors one under the other, without any delimiter, e.g.
       KW: databases
           information retrieval
            library software
    
    You can tell Fangorn to consider every line as a new occurrence; the program must take the "carriage return" as a delimiter:
       Delimiter occurrences incoming file   :~CR~
    
  5. In this example, where
       SO: Library-and-Archival-Security; 11 (1) 1991, 1-42. illus. refs
    should be converted into
       Library and Archival Security^b11^c1^d1991^e1-42^fillus. refs
    by means of this statement:
       Texts to replace                      :; @@^b~~(@@^c~~) @@^d~~, @@^e~~. @@^f
    the first subfield will have no subfield delimiter. You can add a subfield delimiter this way:
       Tag in incoming file                  :SO:
       Tag in ISO-2709 file                  :4
       (The following is not mandatory)
       Spaces in continuation lines          :
       Subfield delimiter in incoming file   :
       Subfield delimiter ISO-2709 file      :a
       Delimiter occurrences incoming file   :
       Texts to replace                      :; @@^b~~(@@^c~~) @@^d~~, @@^e~~. @@^f
    The field will now begin with subfield delimiter ^a:
       ^aLibrary and Archival Security^b11^c1^d1991^e1-42^fillus. refs
    Since an FST can address the first subfield without naming it, this is not necessary. The FST can contain this line:
       1200 0 "^a"v4^*
    instead of:
       1200 0 "^a"v4^a
    Do not fill in the line "Subfield delimiter in incoming file" without specifying the delimiter in the ISO 2709 file, because this will only put "^" between the subfields.
  6. Fangorn allows you to run the program from the command line, i.e. without the menu. To do so you must type this command at the DOS prompt:
       fangorn l:e s:eric.spe i:eric.txt o:eric.iso
    which means:
       - l:e            : use the English version
       - s:eric.spe     : specification file "eric.spe"
       - i:eric.txt     : input file "eric.txt"
       - o:eric.iso     : output file "eric.iso"
    
    You may also use path statements in the command line, e.g.
       fangorn l:e s:eric.spe i:\down\eric.txt o:\isis\work\eric.iso
    This will produce this screen for the duration of the conversion:

© Piet de Keyser, 1998
