Taxonomy development and classification

Taxonomy Development & Auto-classification

Project Description/Issue

A large organisation that specialises in managing and making available thousands of journals, conference papers and technical papers required its content to be mapped to a new specialised taxonomy. The taxonomy was complex, and undertaking this task manually wasn’t an option due to the number of records. The client also requested that a process be established to classify new documents as they became available.

Flare’s Approach

The client supplied PDF and XML files for over 87,000 documents, conference proceedings and journal articles for Flare to auto-classify using its ‘AutoCat’ tool. The PDF files contained the full text of the documents and the XML files contained the abstract titles, authors and selected other metadata. Flare’s AutoCat tool (see later notes for a fuller description) has been designed and refined over five years to automatically classify documents based on their contents. The client documents were crawled and then routed through the AutoCat tool to produce outputs in the form of the client’s taxonomic terms. The taxonomic terms were also weighted so that only the most relevant terms were taken forward into the client’s external-facing website.
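The weighting step described above can be illustrated with a short Python sketch. It is not AutoCat's code: the function name `filter_terms`, the 0 to 1 weight scale, the default threshold of 0.5 and the example terms are all assumptions made for illustration.

```python
def filter_terms(weighted_terms, threshold=0.5, max_terms=None):
    """Keep only taxonomy terms whose weight meets the threshold.

    weighted_terms: dict mapping a taxonomy term to its weight
                    (assumed here to lie between 0 and 1).
    threshold:      hypothetical cut-off; the real value is not given.
    max_terms:      optional cap on how many terms are carried forward.
    """
    kept = [(term, weight) for term, weight in weighted_terms.items()
            if weight >= threshold]
    # Highest weight first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if max_terms is not None:
        kept = kept[:max_terms]
    return kept


# Hypothetical weights for one document.
example = {"Structural engineering": 0.9, "Hydrology": 0.2, "Materials": 0.5}
print(filter_terms(example))
```

With these example weights, only "Structural engineering" and "Materials" would be passed on to the website; "Hydrology" falls below the cut-off and is dropped.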

Benefits/Outcomes

The resulting outputs contained both Flare and client taxonomic terms above a weighted threshold value. After being checked for consistency and accuracy, the results were incorporated into the client’s website so that users can search via keywords or phrases. This marks a step change in the way that users can search for relevant information on the website.

Special Note on AutoCat

AutoCat is an automated cataloguing system designed to examine a series of documents, analyse their contents and then catalogue them accordingly. The goal is to speed up the long-winded process of examining a huge mass of materials (reports, maps, etc.) and then cataloguing this material according to the linked taxonomies. The current version of AutoCat can interrogate many file types, including common MS Office files, PDF, and image files such as JPG, GIF and TIF. Text is extracted from each file and an algorithm is employed to search for keywords described in the linked taxonomies. In addition to the text contained within each file, any other potential sources of metadata are also interrogated, e.g. file paths or metadata held in MS Office files. In the case of images, AutoCat uses an OCR module to extract text.
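The keyword search described above might look something like the following Python sketch. AutoCat's actual algorithm is not detailed here, so this assumes the simplest possible approach: counting whole-word keyword matches in the extracted text and in any extra metadata strings (such as a file path), then turning the counts into relative weights. The names `classify` and `normalise`, the taxonomy layout and the sample data are hypothetical, and text extraction and OCR are assumed to have already happened.

```python
import re
from collections import Counter


def normalise(text):
    """Lower-case the text and turn punctuation, slashes, etc. into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def classify(text, taxonomy, extra_sources=()):
    """Weight each taxonomy term by how often its keywords occur.

    text:          extracted document text.
    taxonomy:      dict mapping a taxonomy term to a list of keywords
                   (an assumed layout for the linked taxonomy).
    extra_sources: other metadata strings, e.g. file paths.
    Returns a dict of term -> share of all keyword matches.
    """
    combined = normalise(" ".join([text, *extra_sources]))
    counts = Counter()
    for term, keywords in taxonomy.items():
        for keyword in keywords:
            needle = normalise(keyword)
            if not needle:
                continue
            # \b ensures whole-word matches only.
            pattern = r"\b" + re.escape(needle) + r"\b"
            counts[term] += len(re.findall(pattern, combined))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items() if n > 0}


# Hypothetical taxonomy and document.
taxonomy = {
    "Hydrology": ["groundwater", "aquifer"],
    "Geology": ["sandstone"],
}
weights = classify(
    "Groundwater flow in the sandstone aquifer.",
    taxonomy,
    extra_sources=["reports/hydrology/groundwater_survey.pdf"],
)
print(weights)  # {'Hydrology': 0.75, 'Geology': 0.25}
```

In this example the file path contributes a second "groundwater" match, which shows how metadata outside the document body can shift the weighting.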