Taxonomy development and classification

Improve search with automatic classification augmented by rich taxonomies

Key Benefits

The client obtained greatly enhanced search results that benefitted thousands of users.
The index was improved with tags from Flare and client taxonomies.

Project Outline Challenge

A large organisation that specialised in the management and distribution of journals, conference papers and technical papers required its content to be mapped to a new custom taxonomy. The taxonomy was complex, and the large number of records made manual mapping impractical. In addition, the client requested that an evergreen process be established to classify new documents as they became available.

Flare’s Approach

The client supplied over 87,000 PDF and XML files of conference proceedings and journal articles to Flare for auto-classification. The PDF files, containing the full text of the documents, and the XML files, containing the abstract titles, authors and selected metadata, were analysed using Flare’s AutoTag tool, which has been designed and refined over five years to classify documents automatically based on their contents. The client documents were crawled and then routed through AutoTag to produce outputs in the form of the client’s taxonomic terms. The terms were also weighted so that only the most relevant ones were taken forward into the client’s external-facing website.
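The weighting step can be illustrated with a minimal sketch. The source does not describe how AutoTag computes weights, so this example assumes a weight is simply a term's share of all keyword hits in a document; the function name, the threshold and the cap on terms are all hypothetical.

```python
def select_top_terms(term_counts, min_weight=0.2, max_terms=5):
    """Keep only the most relevant taxonomy terms for one document.

    term_counts maps a taxonomy term to its number of keyword hits.
    The weight used here (share of total hits) is an assumption for
    illustration, not AutoTag's actual scoring.
    """
    total = sum(term_counts.values())
    if total == 0:
        return []
    # Normalise raw counts into weights between 0 and 1.
    weighted = [(term, count / total) for term, count in term_counts.items()]
    # Drop weakly supported terms.
    kept = [(term, weight) for term, weight in weighted if weight >= min_weight]
    # Highest weight first; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:max_terms]


selected = select_top_terms({"Acoustics": 6, "Signal processing": 3, "Optics": 1})
selected_terms = [term for term, _ in selected]
```

With these example counts, "Optics" accounts for only a tenth of the hits and falls below the threshold, so only the two stronger terms would be passed on to the website.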

Special Note on AutoTag

AutoTag is an automated tagging system designed to examine a set of documents, analyse their contents and tag them with relevant metadata. The tool speeds up the long-winded process of examining a huge mass of material (reports, maps, etc.) and then tagging it according to the linked taxonomies. The current version of AutoTag can interrogate many file types, including common MS Office files and PDF, as well as image files such as JPG, GIF and TIF.

Text is extracted from each file and an algorithm is employed to search for keywords described in the linked taxonomies. In addition to the text contained within each file, any additional potential sources of metadata are also interrogated, e.g. file paths or metadata held in MS Office files. In the case of images, AutoTag uses an OCR module to extract text.
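The keyword search described above can be sketched roughly as follows. This is an illustration only, not Flare's algorithm: the taxonomy contents, the function names and the whole-word matching rule are assumptions, and text extraction (including OCR) is assumed to have already happened.

```python
import re
from collections import Counter

# Hypothetical taxonomy: each term lists the keywords that signal it.
TAXONOMY = {
    "Geophysics": ["seismic", "gravity survey"],
    "Drilling": ["drill bit", "wellbore"],
}


def count_keyword_hits(text, taxonomy):
    """Count whole-word, case-insensitive keyword hits per taxonomy term."""
    lowered = text.lower()
    hits = Counter()
    for term, keywords in taxonomy.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            hits[term] += len(re.findall(pattern, lowered))
    # Report only the terms that matched at least once.
    return {term: n for term, n in hits.items() if n > 0}


def tag_document(body_text, file_path, taxonomy):
    """Search the extracted text plus the file path for taxonomy keywords."""
    # Turn path separators and punctuation into spaces so that
    # folder and file names can be matched like ordinary words.
    path_text = re.sub(r"[\\/_.\-]+", " ", file_path)
    return count_keyword_hits(body_text + " " + path_text, taxonomy)


tags = tag_document(
    "Seismic data from the wellbore. More seismic lines.",
    "reports/drilling/drill_bit_wear.pdf",
    TAXONOMY,
)
```

Here the file path contributes a hit for "drill bit" that the body text alone would have missed, which is the reason for interrogating extra metadata sources.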

The power of AutoTag is now available in O365 through Flare’s Ava plug-in, and clients can also access AutoTag via the Sirus API.