Keywise: Gain rapid data insight using auto-extracted keywords and key-phrases

In this digital age, vast quantities of data are generated and stored on company systems. Even with the most robust filing systems in place, dark and unstructured data builds up and knowledge is effectively “lost”. Traditional data management approaches, which prescribe the generation of numerous fields of contextual metadata for input to a database or content services platform, remain the gold standard. However, these methods are costly and time-consuming, and are therefore an insurmountable barrier for many. In search of a pragmatic solution, Merlin Datawise have developed Keywise, a robust, intelligent, in-house utility that extracts keywords and keyphrases from any collection of files. It mines unstructured text data to create knowledge and context, the key ingredients for improved insight, which in turn provides competitive advantage and drives better decisions and value generation. Used appropriately, it is simple and pragmatic, with a better cost/value ratio than full indexing.

All businesses face the challenge of dark and unstructured data. Dark data are data assets that are unanalysed, used once only, or unknown; they are generated and stored during regular business activities and exist without context or metadata. Unstructured data has no pre-defined data model or is not organised in a pre-defined manner. It is typically text-heavy, with unexpected irregularities and ambiguities that make it difficult to analyse using traditional programs. Unstructured data comprises up to about 80% of all data in enterprises, of which 90% remains unanalysed and dark. These large volumes of data are effectively useless as knowledge, and unlocking and shedding light on this dark, unstructured mass of data is a daunting challenge.

To find inspiration for a pragmatic solution, Merlin Datawise turned to the internet, the biggest data ocean on the planet. The internet allows users to recover information using search engines: simple, effective, and driven by simple and sensible words and terms, with no complex mapping of metadata to categories. And it works! The internet works on keywording, so why can’t our content systems? They can! Apart from a few mandatory items of bibliographic metadata (author, title, year), the best metadata we can add to data assets in the absence of anything else are keywords. Therefore, extracting keywords and keyphrases from documents is THE place to start unlocking the value in unstructured data.

In a fictitious scenario, a new project investigating hydrocarbon potential in limestone reservoirs requires a data-gathering exercise. The organisation holds a large amount of uncatalogued and unstructured data in a shared filesystem, a “bag of files”. Keywise can be run against this “bag of files” and the results reviewed for useful words and terms. An unimaginatively named file called 895786.pdf, buried ten directories deep, has been highlighted. To optimise keyword identification, Keywise can run several different identification methods; Table 1 shows the keywords and keyphrases returned from this (real*) file. Although noise is present in the results, 68-92% of the returned words and phrases reveal critical knowledge about the file content.

As can be seen from the above example, Keywise provides rapid retrieval of meaningful information from a ‘bag of files’ where file location or existence is unknown and file naming cannot be relied upon. Who would have known that a documented resource about fractured, vuggy and brecciated limestone reservoirs was hiding ten directories deep in a file called 895786.pdf? Is it any better than Windows Search (or similar)? Yes! Windows Search is blind to the importance of a word or term in a document, whereas Keywise results are there because they are algorithmically important to the document. In addition, Keywise identifies and logs the location of .pdf files that contain image-only data and are therefore potential candidates for OCR, which could release even more useful contextual insight. This .pdf image discovery process is, in itself, something that would take many man-days to execute traditionally.
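
To make the idea of a word being “algorithmically important” concrete, here is a minimal sketch of one widely used scoring scheme, TF-IDF, which ranks a word higher when it is frequent in one file but rare across the rest of the collection. This is an illustration only: the methods Keywise actually uses are not described here, and the function names, the tiny stop-word list and the sample texts below are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "to", "for", "with", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(documents, name, k=5):
    """Return the k highest-scoring words of one file, scored by TF-IDF
    against the whole collection (a dict of file name -> extracted text)."""
    tokens = {doc_name: tokenize(text) for doc_name, text in documents.items()}
    n_docs = len(tokens)
    # Document frequency: in how many files does each word occur at least once?
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    counts = Counter(tokens[name])
    total = sum(counts.values())
    scores = {
        # Term frequency multiplied by a smoothed inverse document frequency.
        word: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:k]


# Hypothetical "bag of files": file names mapped to their extracted text.
bag_of_files = {
    "895786.pdf": "Fractured limestone reservoirs contain vugs and breccias. "
                  "Vugs and fractures control flow in limestone reservoirs.",
    "minutes.txt": "The meeting reviewed the budget and the schedule for the project.",
    "memo.txt": "The project budget is under review. The schedule is unchanged.",
}

print(top_keywords(bag_of_files, "895786.pdf", k=3))
```

A plain text search would treat every matching word alike, whereas a score of this kind surfaces the terms that distinguish one file from its neighbours, which is the behaviour described above.
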
If you are interested in knowing more, visit our Keywise webpage, contact requests@merlin-datawise.co.uk or call +44 (0)1684 540091. We can even process a couple of documents as a taster of the value you can release.

Table 1: Keywords returned from ‘895786.pdf’ using three of the numerous extraction methods available in Keywise. Useful terms are grouped and highlighted in green, then sorted by calculated importance.

*Barros-Galvis, N., Villaseñor, P., & Samaniego, F. (2015). Analytical Modeling and Contradictions in Limestone Reservoirs: Breccias, Vugs, and Fractures. Journal of Petroleum Engineering, 2015, 1-28. doi:10.1155/2015/895786.