1. Introduction

Despite its venerable age, the Sprach- und Sachatlas Italiens und der Südschweiz (AIS) (Linguistic and Ethnographic Atlas of Italy and Southern Switzerland) is still one of the most useful instruments in linguistic geography and in the study of Romance and Italian dialectology. Moreover, the atlas proves to be an inexhaustible mine of ethnographic information.
The AIS was conceived by the Swiss linguists Karl Jaberg and Jakob Jud and published in eight volumes between 1928 and 1940 [1], together with an introductory book [2]; an index [3], now rather difficult to find, followed only in 1960. In the field, data collection was carried out between 1919 and 1925 in 306 places of southern Switzerland and northern and central Italy by Paul Scheuermeier, in 81 places of southern Italy by Gerhard Rohlfs, and in 20 Sardinian places by Max Leopold Wagner. In the following years, until 1935, Scheuermeier continued the work in northern Italy in several stages, to improve the ethnographic aspects of the AIS [4].
NavigAIS is the first high-resolution digital version of the AIS. The processing and interface programs were written in Matlab© version R2006a.
The digital atlas comes with search and navigation software, which is indispensable for exploring quickly and comfortably the 1705 maps contained in the atlas (Fig. 1).

The software was designed as a stand-alone application, but also as a preliminary tool for a current ISTC project that aims to acquire all of the roughly 1 M dialectal words in the atlas.

Figure 1: AIS original page. On the right, a magnified detail (see the digital version in Fig. 3).

2. AIS Acquisition

There are good reasons to want a digital acquisition of the AIS. First of all, there is the obvious need to exploit adequately the enormous quantity of information contained in the atlas. The index volume published in 1960 does not allow a lemma, or part of a lemma, to be searched for conveniently. Indeed, because of the usual time and space limits of publication, it could list only some prototypical dialectal forms, not all of the roughly 1 M words present in the AIS. Moreover, the AIS maps are hard to read, because the words are indexed only by an identification number, without any place name. Retrieving a single lemma across 8 volumes and 1705 pages can take an eternity compared with a digital search engine.

On the other hand, digital acquisition of the atlas is not easy, because the pages of the AIS volumes are quite large: 44x58 cm, i.e. 17.3x22.8 in (about A2 size) (Fig. 1).

Until just a few years ago, it was not possible to find devices able to scan documents of A1 or A2 size at the required resolution, nor was it easy to transfer the data quickly and conveniently. Nor was it conceivable to acquire 1705 maps on a normal A4, or even A3, scanner and then reassemble the different parts of each map.

So the author had to spend some weeks searching for a possible solution, asking publishers, printers and photographers, without results.

By a lucky chance, a friend told the author that the Padua municipal archives had at their disposal a Zeutschel OS 10000 colour scanner, which could work at 600 dpi, supported the A1 format and provided a book cradle (Fig. 2) (http://zeutschel.de): the perfect machine for the AIS project!

This kind of scanner can acquire a double A2 page: the book is placed on 2 balanced plates, which press the pages uniformly against the glass scanning surface. At 600 dpi resolution, acquiring a scan, running a mask contrast filter and storing the data took about 4 min. The total time spent was about 100 hours, divided into 20 working days.
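To give a feeling for the data volume involved, the following back-of-the-envelope sketch estimates the raw size of the scans. It assumes, purely for illustration, that each single page is stored as an uncompressed 24-bit RGB image; the actual storage format of the scanner output is not stated in the text, and all the names in the code are hypothetical.

```python
# Rough estimate of the raw scan size, under the stated assumptions.
CM_PER_INCH = 2.54
DPI = 600                 # scanning resolution quoted in the text
PAGE_CM = (44, 58)        # page size quoted in the text
BYTES_PER_PIXEL = 3       # assumption: uncompressed 24-bit RGB
N_PAGES = 1705            # number of maps quoted in the text

# Pixel dimensions of one page at 600 dpi.
width_px, height_px = (round(cm / CM_PER_INCH * DPI) for cm in PAGE_CM)

page_bytes = width_px * height_px * BYTES_PER_PIXEL
total_gb = page_bytes * N_PAGES / 1e9   # decimal gigabytes

# Ratio to the 2.72 GB final size quoted in Section 3.5.
ratio = total_gb / 2.72

print(width_px, height_px, round(total_gb), round(ratio))
```

Under these assumptions a page is roughly 10400 x 13700 pixels and the whole atlas amounts to several hundred GB, which is of the same order as the reduction factor of about 270 reported in Section 3.5.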

Figure 2: Zeutschel OS 10000 colour scanner.

3. AIS Image Elaboration

The entire processing was divided into 5 steps, so that we could process the entire AIS in one shot or run each stage separately:

Image rotation (3.1).
Image enhancement (3.2).
Image cropping (3.3).
Text/background separation and noise reduction (3.4).
Foreground and background components saving (3.5).

3.1 Image Rotation

The first processing step must correct the page rotation that is inevitable in the scanning process. This is important for map visualization, and even more essential for the subsequent text recognition task. To do this in an optimal way, the program exploits the orange borders present on all the AIS pages (except the prefaces) (Fig. 1). First of all, we must separate, on the basis of the orange colour, the background and the rectangular frame from the text. Then we extract the image edges using the Roberts method of approximation to the derivative: the edges are placed at the points where the gradient of the input matrix is maximum. The rotation angles of the frame sides can then be computed with a Radon transform, which has a remarkable capacity to extract lines and curves from very noisy images. The Radon transform works by projecting (i.e. summing up) the image intensity along a line whose inclination angle varies over a specific range (in our case between -2° and +2° in increments of 0.02°).
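The angle search described above can be illustrated with a minimal sketch. It is not the Matlab code used for NavigAIS: it is a simplified, Radon-style projection written in plain Python, which assumes that the edge detector has already produced a list of (x, y) edge pixels belonging to one nearly vertical frame side. The function name, the synthetic test line and the sign convention of the angle are all illustrative assumptions.

```python
import math
from collections import Counter

def estimate_skew(edge_points, lo=-2.0, hi=2.0, step=0.02):
    """Return the angle (degrees) whose projection has the sharpest peak."""
    n_steps = int(round((hi - lo) / step))
    best_angle, best_peak = lo, -1
    for k in range(n_steps + 1):
        angle = lo + k * step
        t = math.radians(angle)
        c, s = math.cos(t), math.sin(t)
        # Project every edge pixel onto the normal of a line tilted by
        # `angle` from the vertical. Pixels lying on such a line share the
        # same offset, so they pile up in one bin of the projection.
        profile = Counter(math.floor(x * c - y * s) for x, y in edge_points)
        peak = max(profile.values())
        if peak > best_peak:
            best_angle, best_peak = angle, peak
    return best_angle

# Synthetic data: one frame side, 2000 pixels tall, tilted by 0.5 degrees.
true_angle = 0.5
slope = math.tan(math.radians(true_angle))
frame_side = [(round(100.5 + y * slope), y) for y in range(2000)]

estimated = estimate_skew(frame_side)
print(estimated)
```

The projection is sharpest when the candidate angle matches the tilt of the frame side, because only then do all the edge pixels fall on (almost) the same offset. On a real page the peak is more pronounced than in this small example, since the frame sides are several thousand pixels long.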

Figure 3: NavigAIS. The main window shows the dialectal lemmas (black), the identification numbers of the AIS points (red) and the regional border lines (red). The names of the investigation places are overlaid in blue. In the top-left box is the Overview window, which allows moving around the whole map. The top toolbar shows the zoom and print buttons and the buttons for moving from one point to another. On the right is the word and point search window.

3.2 Image Enhancement

We then need to prepare the map for the next step by adjusting the image contrast. Here the intensity values of the rotated image are mapped towards lower values to produce darker colours.
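One common way to weight intensities towards lower values is a gamma mapping with an exponent greater than 1. The sketch below shows that idea only; the text does not say which mapping or which parameter was actually used, so both the function and the value gamma = 2.0 are assumptions made for illustration.

```python
def darken(image, gamma=2.0):
    """Map intensities in [0, 1] towards darker values (gamma > 1)."""
    return [[pixel ** gamma for pixel in row] for row in image]

# Black and white are unchanged, mid-grey becomes darker.
sample = [[0.0, 0.5, 1.0]]
darkened = darken(sample)
print(darkened)
```

With this kind of mapping the extremes of the intensity range are preserved, while intermediate tones such as faint ink strokes are pushed towards black.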

3.3 Image Cropping

The procedure automatically crops the image to the minimum possible size, with sufficient intelligence to avoid including part of the facing page. Moreover, all the maps are aligned in the same way so that, in a future version of the navigator, we can use a single background for all the different maps.
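The core of such a cropping step can be sketched as a bounding-box computation on a binary mask of the frame. This is only an illustration under simplifying assumptions: the mask is taken to contain the frame of a single page and nothing else, so the logic that rejects the facing page is not reproduced here, and the function names are hypothetical.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right) of the non-zero region of a mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop(image, box):
    """Cut the image down to the given bounding box (inclusive limits)."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]

# Tiny stand-in for a frame mask: 1 marks pixels of the page frame.
frame_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

box = bounding_box(frame_mask)
cropped = crop(frame_mask, box)
print(box, len(cropped), len(cropped[0]))
```

Because every map is cut at its own frame, the cropped images share the same origin, which is what makes a common background for all maps possible.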

Figure 4: Left side detail of an AIS captured frame.

3.4 Separating the Foreground and Background Image Components

In this stage the orange background is isolated and then subtracted from the whole image. We run a median filter on the resulting image, i.e. a nonlinear transformation used to reduce so-called “salt and pepper” noise. The advantage of a median filter is that it is more effective than other algorithms (e.g. convolution) when the aim is to reduce the noise while preserving the edges. We repeat the same process on the output image to obtain the final foreground component.
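The edge-preserving behaviour of the median filter can be seen in a small sketch. It uses a 3x3 window with replicated borders, which is an assumption: the window size and border handling used for the AIS images are not given in the text.

```python
def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    result = []
    for r in range(height):
        out_row = []
        for c in range(width):
            # Indices are clamped, so border pixels reuse their nearest
            # neighbours instead of reading outside the image.
            window = sorted(
                image[min(max(r + dr, 0), height - 1)]
                     [min(max(c + dc, 0), width - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
            out_row.append(window[4])  # middle of the 9 sorted values
        result.append(out_row)
    return result

# A vertical edge (left half 0, right half 1) with one "salt" pixel
# in the dark half and one "pepper" pixel in the bright half.
clean = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
noisy = [row[:] for row in clean]
noisy[2][1] = 1
noisy[2][4] = 0

filtered = median_filter_3x3(noisy)
print(filtered == clean)
```

Both isolated pixels disappear while the vertical edge stays exactly where it was. A linear averaging filter would instead smear the noise into its neighbours and blur the edge.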

3.5 Saving the Compressed Foreground and Background Image Components

The two final matrices, containing the image text and the background, are logical masks made only of 0s and 1s. For historical reasons, Matlab© stores a logical variable in 8 bits instead of just one, wasting a lot of space. It was therefore necessary to write a compression routine that forces this kind of data to be stored in a single bit per element. With this expedient, the entire AIS shrinks to 2.72 GB (about 270 times smaller than the scanned images) and fits on a single DVD.
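The idea of storing one mask element per bit can be sketched as follows. This is a generic bit-packing example in Python, not the Matlab routine written for NavigAIS, whose details are not described in the text; the function names and the most-significant-bit-first layout are assumptions.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 values per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            # The first value of each group goes to the most significant bit.
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)

def unpack_bits(packed, n_bits):
    """Recover the first n_bits values from a packed byte string."""
    return [(packed[i // 8] >> (7 - i % 8)) & 1 for i in range(n_bits)]

mask = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
packed = pack_bits(mask)
restored = unpack_bits(packed, len(mask))
print(len(mask), len(packed), restored == mask)
```

The number of elements has to be stored alongside the packed data, because the last byte may be only partly used. Packing alone gives a factor of 8 over one byte per element.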

4. NavigAIS Navigation Software

Like the image processing tools, the navigation software, called NavigAIS, was written in Matlab©, using its graphical user interface (GUI) facilities.

NavigAIS is composed of 3 windows (Fig. 3). The main window displays the AIS maps at the desired magnification. The dialectal words are shown in black on a white background, while the identification numbers of the AIS points and the regional borders are in red. The names of the investigation places are optionally overlaid in blue. To move around the AIS map, we provide an overview window, which is a miniature of the entire map (left box of Fig. 3 & 5): a blue rectangle marks the current position on the map and can be dragged to bring the desired zone into the main window. At the top of both windows there is a toolbar with some pushbuttons. They allow: a) zooming in and out of the map, b) moving from one point to another in sequential or predefined order, and c) printing the image.

The third window offers some search facilities on the Italian index words and on the place names (right box of Fig. 3). It is possible to select a desired list of points and explore them sequentially. This functionality was created for the next OCR step, which requires each lemma to be correctly identified.

Thanks to the criteria adopted in programming the software, a map is displayed in less than 3 s on an Intel Core Duo 2.26 GHz (Photoshop© takes 14 s for an image of the same size).

5. Future Developments

As mentioned above, we plan to acquire the entire AIS. This is not a trivial task, because of the number of AIS lemmas (about 1 M words) but also because of the complexity of the AIS diacritic levels, which prevents the use of current OCR systems without a long, accurate training phase. Even if we consider only the (simple?) task of checking (and sometimes correcting) the 1 M lemmas, at a very approximate rate of 10 s/lemma, the total time required could amount to about 2800 working hours!
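The order of magnitude of this estimate can be checked with a few lines of arithmetic. The lemma count and the checking rate are the approximate figures given in the text; the 8-hour working day is an additional assumption introduced here only to express the result in days.

```python
N_LEMMAS = 1_000_000       # approximate number of lemmas (from the text)
SECONDS_PER_LEMMA = 10     # very approximate checking rate (from the text)
HOURS_PER_DAY = 8          # assumption: length of a working day

hours = N_LEMMAS * SECONDS_PER_LEMMA / 3600
days = hours / HOURS_PER_DAY
print(round(hours), round(days))
```

Under these assumptions the manual check alone would occupy one person for well over a working year, which is why a carefully trained OCR stage is needed.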

6. Acknowledgements

The author would like to thank Alberto Zamboni and Maria Teresa Vigolo of the “Dipartimento di Discipline Linguistiche, Comunicative e dello Spettacolo”, Padua University, for providing assistance and the AIS volumes, and Valeria Pavone and Alessandro Businaro, Padua Municipality, for kindly making the Zeutschel scanner available. Many thanks to Alberto Benin for assistance with creating the site.

7. References

[1] Jaberg, K. and Jud, J., “Sprach- und Sachatlas Italiens und der Südschweiz”, Vol. 1-8, Zofingen, Bern, 1928-1940. [Reprint edition by: Kraus Reprint, Nendeln, Liechtenstein, 1971-1981. Kraus Reprint, New York.]
[2] Jaberg, K. and Jud, J., “Der Sprachatlas als Forschungsinstrument. Kritische Grundlegung und Einführung in den Sprach- und Sachatlas Italiens und der Südschweiz”, Halle, 1928. [It. translation: “Ais - Atlante Linguistico ed Etnografico dell'Italia e della Svizzera Meridionale. Vol. 1: Fondamenti Critici e Introduzione. Vol. 2: Scelta di Carte Commentate”, Unicopli, Milano, 1988.]
[3] Jaberg, K., “Index zum Sprach- und Sachatlas Italiens und der Südschweiz: Ein propädeutisches etymologisches Wörterbuch der italienischen Mundarten”, Bern, Stämpfli, 1960.
[4] Scheuermeier, P., “Bauernwerk in Italien, der italienischen und rätoromanischen Schweiz”, Vol. 1-2, Erlenbach-Zürich, Eugen Rentsch Verlag, 1943. [It. translation: Scheuermeier, P., “Il lavoro dei contadini”, Longanesi, Milano, 1980.]