Aligning Images and Text in a Digital Library

Jack Hessel & David Mimno

The volume of image data available from digital libraries has increased significantly and offers a promising new domain for both computer vision (CV) and digital humanities (DH). Computer vision offers a way to extend DH beyond text, while images derived from cultural heritage materials provide more complicated training data for CV than standard labeled-image datasets. In this work, we train machine learning algorithms to match images from book scans with the text on the pages surrounding those images.

Using 400K images collected from 65K volumes published between the 14th and 20th centuries and released to the public domain by the British Library (see footnote 1), we build information retrieval systems capable of cross-modal retrieval, i.e., searching images using text and vice versa. To achieve high cross-modal retrieval performance, algorithms must learn correspondences between text and images. We refer to this process as aligning the data.

We encountered several issues in constructing the dataset. First, the volumes span many languages, so we used an automatic language-detection tool (see footnote 2) to keep only the English text/image pairs. Second, a small number of volumes contained a very large proportion of the images, so we sampled at most ten images from each volume. Finally, to ensure our evaluations were difficult and fair, we employed volume-level holdout, so that all test-time images and text were sampled from unseen volumes. We evaluate over 10 cross-validation splits. Each split contains 69K image/text training pairs and 5K image/text testing pairs, where each of the 5K test pairs is sampled from a different one of 5K held-out volumes.
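The per-volume cap and the volume-level holdout described above can be sketched as follows. This is a minimal illustration and not the code used in this work: the record layout (a dict with a hypothetical `volume_id` field), the function names, and the seeding scheme are all assumptions made for the example.

```python
import random
from collections import defaultdict


def group_by_volume(pairs):
    """Group image/text pairs by the (hypothetical) 'volume_id' field."""
    by_volume = defaultdict(list)
    for pair in pairs:
        by_volume[pair["volume_id"]].append(pair)
    return by_volume


def cap_per_volume(pairs, max_per_volume=10, seed=0):
    """Keep at most max_per_volume randomly chosen pairs from each volume."""
    rng = random.Random(seed)
    by_volume = group_by_volume(pairs)
    capped = []
    for volume_id in sorted(by_volume):
        items = by_volume[volume_id]
        if len(items) > max_per_volume:
            items = rng.sample(items, max_per_volume)
        capped.extend(items)
    return capped


def volume_holdout_split(pairs, n_test_volumes, seed=0):
    """Hold out whole volumes: train and test never share a volume.

    One pair is drawn from each held-out volume, so the test set has
    exactly n_test_volumes pairs.
    """
    rng = random.Random(seed)
    by_volume = group_by_volume(pairs)
    test_volumes = set(rng.sample(sorted(by_volume), n_test_volumes))
    train = [p for p in pairs if p["volume_id"] not in test_volumes]
    test = [rng.choice(by_volume[v]) for v in sorted(test_volumes)]
    return train, test
```

Calling `volume_holdout_split` with ten different seeds would give ten splits in the spirit of the cross-validation setup above, although the exact procedure used to generate the splits is not specified here.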

To represent images, we used deep learning features from a pretrained 50-layer residual convolutional network [3], which outperform simple color and edge detection features. To represent text, we compared a variety of competitive methods, ranging from unigram indicator vectors to paragraph vectors (PV) [4] and latent Dirichlet allocation (LDA) [2]. To align text and images, we compared a nearest-neighbor baseline (NN) to parametric approaches, including a least-squares mapping from image to text features (IM2TX) and a version of deep canonical correlation analysis (DCCA) [1, 5]. We achieve significant performance gains over random retrieval, demonstrating that images and their associated text can be mapped into a shared space (Table 1). Furthermore, the parametric approaches improve over the nearest-neighbor baseline, indicating that learning from both modalities jointly captures patterns that processing each modality independently does not. This approach may enable scholars to search for concepts across and between textual and visual spaces.
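Once images and text have been mapped into a shared space, cross-modal retrieval reduces to ranking candidates by similarity to a query. The sketch below scores such a ranking with cosine similarity and recall@k. Both choices are assumptions made for illustration, since the similarity function and evaluation metric are not specified here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_of_match(query, candidates, true_index):
    """1-based rank of candidates[true_index] by decreasing similarity to query.

    Ties are resolved optimistically: only strictly more similar
    candidates push the true match down the ranking.
    """
    sims = [cosine(query, c) for c in candidates]
    target = sims[true_index]
    return 1 + sum(1 for s in sims if s > target)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = sum(
        1
        for i, query in enumerate(queries)
        if rank_of_match(query, candidates, i) <= k
    )
    return hits / len(queries)
```

Here `queries` would hold the shared-space vectors for one modality (for example, test images) and `candidates` the vectors for the other (their paired texts), aligned by index. Under this metric, a system that ranks N candidates uniformly at random has an expected recall@k of k/N, which is the kind of random reference point that learned alignments are compared against.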

1 https://data.bl.uk/digbks/

2 https://github.com/shuyo/language-detection

References

[1] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML (3), pages 1247–1255, 2013.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[4] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.

[5] W. Wang, R. Arora, K. Livescu, and N. Srebro. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 688–695. IEEE, 2015.