Notwithstanding its venerable age, the Sprach- und Sachatlas Italiens und der Südschweiz (AIS; Linguistic and Ethnographic Atlas of Italy and Southern Switzerland) is still one of the most useful instruments in linguistic geography and in the study of Romance and Italian dialectology. Moreover, the atlas turns out to be an inexhaustible mine of ethnographic information.
Figure 1: Original AIS page. On the right, a magnified detail (see the digital version in Fig. 3).
There are good reasons to want a digital acquisition of the AIS. First of all, there is the obvious need to exploit adequately the enormous quantity of information contained in the atlas. The index volume published in 1960 does not allow the reader to search for a lemma, or part of a lemma, in a convenient way. Indeed, owing to the usual limits of time and space in publication, it could list only some prototypical dialectal forms, not all the 680000 words present in the AIS. Moreover, the AIS maps suffer from a lack of legibility, because the words are indexed only with an identifier number, without any place name. Retrieving a single lemma by hand from 8 volumes and 1705 pages can take an eternity compared with a digital search engine.
Figure 2: Zeutschel OS 10000 colour scanner.
The entire processing chain was divided into 5 steps, so that we could process the whole AIS in one pass or run each stage separately:
The first step in the processing chain corrects the page rotation that is inevitable in the scanning process. This is important for the map visualization, and essential for the subsequent text recognition task. To do this reliably, the program exploits the orange borders present on all AIS pages except the prefaces (Fig. 1). First, we separate the background and the rectangular frame from the text on the basis of the orange colour. Then we extract the image edges using the Roberts method, an approximation to the derivative: edges are located at the points where the gradient of the input matrix is maximum. The rotation angles of the frame sides can then be computed with a Radon transform, which has the remarkable capacity to extract lines and curves from very noisy images. The Radon transform works by projecting (i.e. summing up) the image intensity onto a line whose inclination angle varies over a specific range (in our case between -2° and +2°, in increments of 0.02°).
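The rotation estimate described above can be illustrated with a minimal pure-Python sketch. It is not the Matlab code used for the AIS: the function names `roberts_edges` and `estimate_skew`, the synthetic test image and the "highest peak" criterion are assumptions made for illustration, and the colour-based separation of the orange frame is skipped (the sketch starts from an already binarized frame side). The angle scan uses the range and increment quoted in the text.

```python
import math

def roberts_edges(img):
    """Roberts cross gradient magnitude of a 2-D list of intensities."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x] - img[y + 1][x + 1]   # first diagonal difference
            gy = img[y][x + 1] - img[y + 1][x]   # second diagonal difference
            out[y][x] = math.hypot(gx, gy)
    return out

def estimate_skew(edges, max_deg=2.0, step_deg=0.02):
    """Scan angles in [-max_deg, +max_deg] and return the one whose
    projection profile has the highest peak (edge pixels best aligned)."""
    pts = [(x, y, v) for y, row in enumerate(edges)
           for x, v in enumerate(row) if v > 0]
    if not pts:
        return 0.0
    steps = int(round(2 * max_deg / step_deg))
    best_deg, best_peak = 0.0, -1.0
    for i in range(steps + 1):
        deg = -max_deg + i * step_deg
        c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
        profile = {}
        for x, y, v in pts:
            # Offset of the line tilted by `deg` that passes through (x, y);
            # pixels on the same tilted line share the same offset.
            r = round(y * c - x * s)
            profile[r] = profile.get(r, 0.0) + v
        peak = max(profile.values())
        if peak > best_peak:
            best_deg, best_peak = deg, peak
    return round(best_deg, 2)

# Synthetic example (hypothetical data): one frame side, 800 px long,
# tilted by 1 degree with respect to the image rows.
W, H, TRUE_DEG = 800, 40, 1.0
image = [[0] * W for _ in range(H)]
for x in range(W):
    image[round(10 + x * math.tan(math.radians(TRUE_DEG)))][x] = 1

estimated = estimate_skew(roberts_edges(image))
print(estimated)
```

Because the line is rasterized on a pixel grid, the recovered angle is only approximate; longer frame sides give a sharper peak and therefore a more precise estimate.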
Figure 3: NavigAIS - In the main window we can see the dialectal lemmas (black), the identification numbers of the AIS points (red) and the regional border lines (red). The names of the investigation places are overlaid in blue. The box at the top left is the Overview window, which allows the user to move around the whole map. The top toolbar shows the zoom and print buttons, and the buttons for moving from one point to another. On the right is the word and point search window.
We then need to prepare the map for the next step by adjusting the image contrast. In this case the intensity values of the rotated image are weighted towards lower values to produce darker colours.
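One common way to weight intensities towards lower values is a power-law (gamma) mapping. The text does not say which mapping was actually used, so the sketch below is only an assumption: the function name `darken` and the default exponent are hypothetical, and intensities are taken to be normalized to the range [0, 1].

```python
def darken(pixels, gamma=1.5):
    """Map normalized intensities p in [0, 1] to p ** gamma.

    With gamma > 1 every mid-tone is pushed towards 0 (darker),
    while pure black (0) and pure white (1) are left unchanged.
    """
    if gamma < 1:
        raise ValueError("gamma must be >= 1 to darken the image")
    return [p ** gamma for p in pixels]

row = [0.0, 0.5, 1.0]          # hypothetical normalized intensities
print(darken(row, gamma=2.0))
```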
The procedure automatically crops the image to the minimum possible size, with sufficient intelligence to avoid including part of the contiguous page. Moreover, all the maps are aligned in the same way, so that, in a future version of the navigator, we can use a single background for all the different maps.
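The core of such a cropping step can be sketched as a bounding-box computation on a binary mask. This is a simplified illustration, not the procedure used for the AIS: the names `bounding_box` and `crop` are hypothetical, and the mask is assumed to contain only the frame of the current page, so the logic that excludes the contiguous page is not shown.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero region, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop(img, box):
    """Cut the image down to the given inclusive bounding box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]

# Hypothetical 4x4 mask whose nonzero pixels occupy the central 2x2 block.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
box = bounding_box(mask)
print(box, crop(mask, box))
```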
Figure 4: Left side detail of an AIS captured frame.
In this stage the orange background is isolated and then subtracted from the whole image. We run a median filter on the resulting image, i.e. a nonlinear transformation used to reduce so-called “salt and pepper” noise. The advantage of a median filter is that it is more effective than other algorithms (e.g. convolution) when the aim is to reduce noise while preserving edges. We repeat the same process on the output image to obtain the final foreground component.
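The behaviour of the median filter can be seen in a small pure-Python sketch. The 3x3 window, the border handling (the window is simply clipped at the image border) and the function name `median_filter3` are assumptions made for illustration; the window size used for the AIS is not stated in the text.

```python
from statistics import median_low

def median_filter3(img):
    """3x3 median filter on a 2-D list; windows are clipped at the borders.

    median_low is used so that the output of a binary image stays binary
    even when a clipped window holds an even number of pixels.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median_low(window)
    return out

# Hypothetical 7x8 image: a vertical edge between columns 3 and 4.
clean = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(7)]
noisy = [row[:] for row in clean]
noisy[3][1] = 1   # isolated "salt" pixel in the dark half
noisy[3][6] = 0   # isolated "pepper" pixel in the bright half

filtered = median_filter3(noisy)
print(filtered == clean)
```

In this example the two isolated pixels disappear while the vertical edge stays exactly where it was; an averaging convolution would instead smear both the noise and the edge.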
The two final matrices, containing the image text and the background, are logical masks made only of 0s and 1s. For historical reasons, Matlab© stores a logical variable in 8 bits instead of one, wasting a lot of space. It was therefore necessary to write a compression routine that forces this kind of data to be stored in a single bit per value. With this expedient, the entire AIS shrinks to 2.72 GB (about 270 times smaller than the scanned images) and fits on a single DVD.
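The idea of storing one mask value per bit can be sketched as follows. The original compression routine is not described in the text, so the functions `pack_bits` and `unpack_bits` and the most-significant-bit-first layout are assumptions; the sketch shows only the 8-to-1 packing itself, not the whole reduction reported above.

```python
def pack_bits(mask):
    """Pack a flat sequence of 0/1 values into bytes, 8 values per byte.

    The first value goes into the most significant bit of the first byte;
    the last byte is zero-padded if len(mask) is not a multiple of 8.
    """
    out = bytearray((len(mask) + 7) // 8)
    for i, bit in enumerate(mask):
        if bit:
            out[i >> 3] |= 0x80 >> (i & 7)
    return bytes(out)

def unpack_bits(data, n):
    """Recover the first n 0/1 values from data produced by pack_bits."""
    return [(data[i >> 3] >> (7 - (i & 7))) & 1 for i in range(n)]

mask = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]   # hypothetical 10-pixel mask
packed = pack_bits(mask)                 # 2 bytes instead of 10
print(len(packed), unpack_bits(packed, len(mask)) == mask)
```

The number of values must be stored alongside the packed bytes, since the padding bits in the last byte are otherwise indistinguishable from data.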
Like the image processing tools, the navigation software, called NavigAIS, was written in Matlab©, as a Graphical User Interface (GUI) application.
As stated above, we plan to acquire the entire AIS. This is not a trivial task, given the number of AIS lemmas (about 1 M words) and the complexity of the AIS diacritic levels, which prevents the use of current OCR systems without a long, accurate training phase. If we consider only the (simple?) task of checking (and sometimes correcting) the 1 M lemmas, at a very approximate rate of 10 s/lemma, the total time required could amount to about 2800 working hours!
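The arithmetic behind this estimate is straightforward. The symbols N (number of lemmas), t (checking time per lemma) and T (total time) are introduced here only for illustration:

```latex
\[
T = N \cdot t
  = 10^{6}\ \text{lemmas} \times 10\ \frac{\text{s}}{\text{lemma}}
  = 10^{7}\ \text{s}
  = \frac{10^{7}}{3600}\ \text{h}
  \approx 2778\ \text{h}
\]
```

This is of the same order as the figure quoted above and, given how rough the 10 s/lemma rate is, should be read only as an order of magnitude.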
The author would like to thank Alberto Zamboni and Maria Teresa Vigolo of the “Dipartimento di Discipline Linguistiche, Comunicative e dello Spettacolo”, Padua University, for providing assistance and the AIS volumes, and Valeria Pavone and Alessandro Businaro, Padua Municipality, for kindly making the Zeutschel scanner available. Many thanks to Alberto Benin for assistance with the creation of the site.
 Jaberg, K. and Jud, J., “Sprach- und Sachatlas Italiens und der Südschweiz”, Vol.1-8, Zofingen, Bern, 1928-1940. [Reprint edition by: Kraus Reprint, Nendeln, Liechtenstein, 1971-1981. Kraus Reprint, New York].