With the growing volume of scanned, user-entered and online unstructured text available (freely or otherwise), the ability to extract quality meaning from the content can make the difference between a viable business and a sideshow amusement.
The traditional first approach to extracting terms from a document has been Regular Expressions. If the documents are consistently formatted and structured, this can be adequate for extracting data. However, if the documents are unstructured, or if accuracy in solving the problem lies at the core of the business, Regular Expressions will prove inadequate.
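As a rough illustration of that contrast, here is a minimal Python sketch. The sample texts, both patterns and both function names are invented for this example; they are assumptions, not taken from any particular project. The first pattern relies on a consistent "Name:" label, while the second tries to guess names in free text from capitalisation alone.

```python
import re

# Assumed layout for the structured case: every name sits on its own "Name: ..." line.
LABELLED_NAME = re.compile(r"^Name:\s*(.+)$", re.MULTILINE)

# Naive guess for free text: a run of two or more capitalised words.
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_labelled_names(text):
    """Return the value of every 'Name:' field in consistently formatted text."""
    return LABELLED_NAME.findall(text)


def extract_capitalised_runs(text):
    """Return every run of capitalised words, whether or not it is a person."""
    return CAPITALISED_RUN.findall(text)


structured = "Name: Jane Smith\nName: Robert Brown"
unstructured = (
    "Yesterday in New York, Jane Smith met with bell hooks at Goldman Sachs."
)

print(extract_labelled_names(structured))
# ['Jane Smith', 'Robert Brown']
print(extract_capitalised_runs(unstructured))
# ['New York', 'Jane Smith', 'Goldman Sachs']
```

On the free text, the capitalisation pattern returns a place and a company alongside the one real match, and it misses the lower-case name entirely. Tightening the pattern for one of these cases tends to break it for another.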
The classic example of this challenge is the extraction of person names from a document. A good developer will then seek help from the community, with an appeal like the following:
Although those threads are somewhat dated, the commentary shows that there were no clear answers at the time the questions were posed.
Enter Natural Language Processing (NLP). NLP is a rapidly expanding field in which tools and datasets that were traditionally proprietary and expensive now have open-source alternatives. This means that, when confronted with language parsing problems that are core to the business, NLP is now a viable alternative.
NLP uses statistical methods in addition to pattern matching to obtain better matches. Statistical methods, while not guaranteeing exact matches, promote better accuracy in larger datasets. This is because the heuristics employed in statistical matching can identify matches that pattern matching may not have anticipated given a smaller data set.
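To make the idea concrete, here is a deliberately tiny Python sketch of a statistical tagger. It is a toy under stated assumptions, not how any real NLP toolkit works: the training sentences, the "PER"/"O" labels, the function names and the 0.5 threshold are all invented for illustration. The model simply learns how often the token following a given word was annotated as a person.

```python
from collections import Counter

# Invented training data: each token is tagged "PER" (person) or "O" (other).
TRAINING = [
    [("Mr.", "O"), ("Smith", "PER"), ("said", "O"), ("hello", "O")],
    [("Dr.", "O"), ("Jones", "PER"), ("arrived", "O")],
    [("Mr.", "O"), ("Brown", "PER"), ("left", "O")],
    [("The", "O"), ("bank", "O"), ("said", "O"), ("no", "O")],
]


def train(sentences):
    """Estimate P(next token is a person | previous word) from counts."""
    person_counts = Counter()
    total_counts = Counter()
    for sentence in sentences:
        previous = "<s>"  # marker for the start of a sentence
        for token, tag in sentence:
            total_counts[previous] += 1
            if tag == "PER":
                person_counts[previous] += 1
            previous = token
    return {word: person_counts[word] / total
            for word, total in total_counts.items()}


def tag(tokens, model, threshold=0.5):
    """Tag a token as a person when its preceding word usually introduces one."""
    tagged = []
    previous = "<s>"
    for token in tokens:
        probability = model.get(previous, 0.0)  # unseen context: assume not a person
        tagged.append((token, "PER" if probability > threshold else "O"))
        previous = token
    return tagged


model = train(TRAINING)
print(tag(["Mr.", "okonkwo", "spoke"], model))
# [('Mr.', 'O'), ('okonkwo', 'PER'), ('spoke', 'O')]
```

The name in the last line never appeared in the training data and is not even capitalised, yet it is tagged as a person because of the context it appears in. Nothing guarantees the tag is right, which is the trade-off described above.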
To facilitate training statistical models for large data sets, Ground Truth corpora are employed. These are collections of documents assembled and annotated by the NLP community to provide a baseline for the development of statistical models.
An example of such a corpus, commonly used for identifying persons, locations and organizations in English-language documents, is the CoNLL-2003 dataset, whose English portion is drawn from a collection of Reuters news stories. The closer the writing style of the Ground Truth is to the target documents, the more effective the model will be.
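For readers who have not seen an annotated corpus, the following Python sketch reads a small sample laid out in the style of the CoNLL-2003 files: one token per line, whitespace-separated columns with the entity tag last, and a blank line between sentences. The sample sentences are invented for this example and are not taken from the dataset, and the function names are hypothetical.

```python
# Invented sample in CoNLL-2003 style columns: token, POS tag, chunk tag, entity tag.
SAMPLE = """\
-DOCSTART- -X- O O

Jane NNP B-NP B-PER
Smith NNP I-NP I-PER
visited VBD B-VP O
Paris NNP B-NP B-LOC
. . O O

Acme NNP B-NP B-ORG
hired VBD B-VP O
her PRP B-NP O
. . O O
"""


def read_sentences(text):
    """Split the text into sentences of (token, entity_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group consecutive tagged tokens into (text, type) entities."""
    entities, words, label = [], [], None
    for token, tag in sentence:
        entity_type = tag.split("-", 1)[1] if "-" in tag else None
        starts_new = tag.startswith("B-") or entity_type != label
        if words and starts_new:
            entities.append((" ".join(words), label))
            words, label = [], None
        if entity_type is not None:
            words.append(token)
            label = entity_type
    if words:
        entities.append((" ".join(words), label))
    return entities


for sentence in read_sentences(SAMPLE):
    print(extract_entities(sentence))
# [('Jane Smith', 'PER'), ('Paris', 'LOC')]
# [('Acme', 'ORG')]
```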
In conjunction with the Ground Truth corpora, it is also important that a subset of the target documents be analysed (annotated) by hand to tune the effectiveness of the model. Multiple iterations can then occur, with new terms identified by each iteration feeding back into the model.
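One common way to measure each iteration against the hand-annotated subset is with precision, recall and F1. The Python sketch below assumes the model output and the hand annotations have both been reduced to sets of name strings; the example sets and the function name are invented for illustration.

```python
def score(predicted, gold):
    """Return (precision, recall, f1) for predicted names against hand annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: names annotated by hand versus names the model returned.
gold = {"Jane Smith", "bell hooks", "Robert Brown"}
predicted = {"Jane Smith", "New York", "Robert Brown", "Goldman Sachs"}

precision, recall, f1 = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.67 f1=0.57

# Names the model missed are candidates to feed into the next iteration.
print(sorted(gold - predicted))
# ['bell hooks']
```

Tracking these three numbers across iterations shows whether feeding new terms back into the model is actually improving it, or merely trading missed names for false matches.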
Much more comprehensive (some might say overwhelming) resources can be found on the Association for Computational Linguistics home page.
Both projects have great introductory pages and manuals:
Higher-level techniques for extracting meaning from documents have traditionally been the realm of expensive commercial software packages and PhDs, and have taken years to implement with significant consulting expenditure.
Whilst specialized training in Computational Linguistics is still highly valuable, the gap between traditional regular expression parsing and NLP techniques has narrowed sufficiently that projects with zero licensing cost and a timeframe of months rather than years are now viable.