The Defence Science and Technology Laboratory (Dstl) has launched a new, free version of the Baleen tool for building data processing pipelines.
It said that Baleen 3 can process a folder containing thousands of Word documents and PDFs, extract every e-mail address and phone number in them, and store the results in a database.
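As a rough illustration of that kind of job, the sketch below extracts e-mail addresses and phone numbers from text and stores them in a database. It is not Baleen code and does not use the Baleen or Annot8 APIs: the function names, the table layout, the simplified regular expressions and the choice of SQLite are all assumptions made for the example, and it takes already-extracted text rather than parsing Word or PDF files.

```python
import re
import sqlite3

# Simplified patterns for illustration only; real e-mail and phone
# formats vary far more than these expressions allow for.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def extract_entities(text):
    """Return (kind, value) pairs for each e-mail address and phone number found."""
    found = [("email", value) for value in EMAIL_RE.findall(text)]
    found += [("phone", value) for value in PHONE_RE.findall(text)]
    return found


def store_entities(conn, source, entities):
    """Write the extracted values to a hypothetical 'entities' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities (source TEXT, kind TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?, ?)",
        [(source, kind, value) for kind, value in entities],
    )
    conn.commit()


def process_documents(documents, conn):
    """Run extraction and storage over a mapping of document name to text."""
    for name, text in documents.items():
        store_entities(conn, name, extract_entities(text))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    sample = {"a.txt": "Contact jane.doe@example.org or call +44 20 7946 0958."}
    process_documents(sample, connection)
    for row in connection.execute("SELECT source, kind, value FROM entities"):
        print(row)
```

In a real pipeline the text would first have to be pulled out of each Word document or PDF, which this sketch leaves out.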
Its predecessor, Baleen 2, was one of the first open source projects by Dstl, an executive agency of the Ministry of Defence. It offered users the ability to search, process and collate data, and has been used across government, and by industry and academia.
The tool lets users build a bespoke chain of ‘processors’ that extract information from unstructured data such as text documents and images.
Baleen 3 can also find and extract images within the documents it processes, perform optical character recognition to find text within those images, translate that text into English, and then run machine learning models to find mentions of people within those images.
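The idea of a chain of processors can be sketched in a few lines. The example below is a conceptual illustration only, not the Baleen or Annot8 API: the `Item` class, the processor functions and `run_pipeline` are hypothetical names, and a simple capitalised-word heuristic stands in for the machine learning step. Optical character recognition and translation are not implemented here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Item:
    """A unit of data passing through the chain, plus annotations added along the way."""
    content: str
    annotations: List[Tuple[str, str]] = field(default_factory=list)


# A processor takes an item, enriches or transforms it, and hands it on.
Processor = Callable[[Item], Item]


def normalise_whitespace(item: Item) -> Item:
    """Collapse runs of whitespace so later processors see clean text."""
    item.content = " ".join(item.content.split())
    return item


def tag_person_like_names(item: Item) -> Item:
    """Stand-in for a trained model: tag two consecutive capitalised words."""
    for match in re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", item.content):
        item.annotations.append(("person", match))
    return item


def run_pipeline(item: Item, processors: List[Processor]) -> Item:
    """Apply each processor in order, passing the result to the next one."""
    for processor in processors:
        item = processor(item)
    return item


if __name__ == "__main__":
    result = run_pipeline(
        Item("  Report   by Alice Smith and  Bob Jones. "),
        [normalise_whitespace, tag_person_like_names],
    )
    print(result.content)
    print(result.annotations)
```

Because each processor shares the same signature, steps can be added, removed or reordered without changing the rest of the chain, which is the property that makes this style of pipeline straightforward to extend.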
New use cases
Baleen 3 supports components developed within the Annot8 framework, which makes it easy to extend to cover new use cases and provide additional functionality. A large number of components are already available within the Annot8 framework, including some previously developed by Dstl.
The organisation said that support for the existing Baleen 2 project will be withdrawn, and it is encouraging all users to move to Baleen 3 where possible.
It said the new version is built on top of newer technologies and will be easier to maintain and deploy.
Baleen 3 was co-developed by Committed Software and Dstl staff working on the Augmenting the Analyst project, and is available to download from GitHub.