Building Document Library Keywords with Natural Language Processing

No need to read all the documents

Posted by Aaron Christmas on May 30, 2018

Build your Enterprise Term Store with Natural Language Processing

SharePoint is a great tool for document management because it makes storing and sharing documents so easy. So easy, in fact, that it's common to build up huge document libraries with years' worth of information. Generally, that knowledge is promptly "forgotten" after its initial use.

Well, this sad situation was exactly the case with a site collection used by a department within my company. To make a long story short, I had to start utilizing this "knowledge" more and more. I needed to be able to quickly find the applicable information in the mass of proposals and solicitations the company had created over more than a decade. As anyone who has read some of my past articles knows, natural language processing has become a new tool for me to play with. So of course I thought, let's build some metadata in 3 steps…

Step 1: Parse the document

I was actually writing against an up-to-date SharePoint Online instance (given my usual clients, this is not the norm), and I wondered whether there would be more built-in integration with document content. Since there was none that I could find, I ended up using a set of JSZip-based libraries (JSZip, JSZipUtils, and Docxtemplater) to parse the document. Here is one of the main functions of interest:

    JSZipUtils.getBinaryContent(this.url, function (error, content) {
        if (error) {
            docThis.error = error;
            return;
        }

        // A .docx file is a zip archive; the main content lives in word/document.xml
        docThis.zipContent = new JSZip(content);
        var documentxml = docThis.zipContent.file("word/document.xml");
        var strDocumentxml = documentxml.asText();
        docThis.file.contentAsString = strDocumentxml;
        docThis.file.contentAsXml = $(strDocumentxml);

        // Hand the same zip to Docxtemplater for any later manipulation
        docThis.docxtemplater = new Docxtemplater();
        docThis.docxtemplater.loadZip(docThis.zipContent);

        loaded();
    });
                    
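Once document.xml is in hand, the actual prose lives in the w:t (text run) elements of the XML. As a rough sketch of how that text can be pulled out with jQuery (the helper name here is my own illustration, not part of JSZip or Docxtemplater):

    // Hypothetical helper: flatten every <w:t> text run into one string.
    // Assumes the XML was parsed with $(strDocumentxml) as above.
    function getDocumentText(contentAsXml) {
        var chunks = [];
        // jQuery needs the namespace colon escaped; the plain "t" is a fallback
        contentAsXml.find("w\\:t, t").each(function () {
            chunks.push($(this).text());
        });
        return chunks.join(" ");
    }

    var plainText = getDocumentText(docThis.file.contentAsXml);

That plain text is what gets shipped off to the NLP engine in step 2.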

Step 2: Process the text

For me this ended up being a RESTful interface to a language processing engine that would read documents and respond with semantically significant terms and keywords for each document. This was accomplished with some open natural language processing tools, modeling, and a scoring algorithm. The NLP engine utilized some information extraction techniques, particularly Named Entity Recognition and part-of-speech tagging, as well as a lexical database, to figure out what the text 'means'. I am still experimenting with the best way of dissecting meaningful keywords from the text; of course, I was erring toward speed of response.
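
The endpoint and response shape below are placeholders of my own (your engine's API will differ), but the client-side call ends up being a plain POST of the document text:

    // Sketch of the REST call; the URL and response fields are hypothetical.
    function extractKeywords(documentText, done) {
        $.ajax({
            url: "https://nlp.example.com/api/keywords",
            method: "POST",
            contentType: "application/json",
            data: JSON.stringify({ text: documentText }),
            success: function (response) {
                // e.g. response.keywords = [{ phrase: "...", score: 0.87 }, ...]
                done(null, response.keywords);
            },
            error: function (xhr, status, err) {
                done(err);
            }
        });
    }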

Step 3: Add to Term Store and List

Utilizing the Office 365 JavaScript API (JSOM), I added the terms and phrases from step 2. The part that took some time for me to work out was the incantation necessary to add phrases to a multi-value enterprise managed metadata column on a list item.

    var value = item.get_item(fieldName);
    var terms = new Array();

    if (txField.get_allowMultipleValues()) {

        // Keep the terms already set on the item...
        var enumerator = value.getEnumerator();
        while (enumerator.moveNext()) {
            var tv = enumerator.get_current();
            terms.push(tv.get_wssId() + ";#" + tv.get_label() + "|" + tv.get_termGuid());
        }

        // ...then append the new term; the -1 WssId tells SharePoint to
        // resolve the hidden taxonomy list entry itself
        terms.push("-1;#" + term + "|" + termId);
        var termValueString = terms.join(";#");
        var termValues = new SP.Taxonomy.TaxonomyFieldValueCollection(context, termValueString, txField);
        txField.setFieldValueByValueCollection(item, termValues);

        // Commit the change (the callbacks are placeholders)
        item.update();
        context.executeQueryAsync(onSuccess, onFailure);
    }
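
The list side is only half the story, though; the term has to exist in the term store first. A minimal JSOM sketch, assuming SP.Taxonomy.js is loaded and you know your term set GUID (termSetId, onSuccess, and onFailure are placeholders):

    // Create the keyword as a term in the default site collection term store
    var session = SP.Taxonomy.TaxonomySession.getTaxonomySession(context);
    var termStore = session.getDefaultSiteCollectionTermStore();
    var termSet = termStore.getTermSet(new SP.Guid(termSetId));

    var termId = SP.Guid.newGuid();
    var newTerm = termSet.createTerm(term, 1033, termId); // 1033 = English (US)

    context.load(newTerm);
    context.executeQueryAsync(onSuccess, onFailure);

With the term created, its GUID is exactly what the "-1;#" + term + "|" + termId string above expects.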