Natural Language Processing and SharePoint

Wait, what? As in reading what people are saying on my SharePoint site?

Yes, Natural Language Processing (NLP) refers to how computers and humans interact; more specifically, how a computer program can come to understand what we humans are saying, whether we’re writing a sonnet or a scathing review on Netflix.

Many companies turn to NLP techniques to better understand the needs, motivations and problems of their users. Netflix, for example, is well-known for its movie recommendation system, which uses NLP to determine what people are saying about a film or television show. Other sites and applications use NLP to help identify shills, so they can hide reviews from people who post fake praise in order to boost a product’s ranking.

Let me whet your appetite, and show you what sorts of things you can do when you apply some of the latest techniques and tools devoted to improving the state of NLP.

Tools

First, the software. There are two primary languages used in most NLP tasks: Java and Python. That’s not to say there’s no C# or Swift or anything like that – there’s plenty – but the overwhelming majority of libraries meant to provide NLP functionality are designed to be consumed from Java or from Python.

Snakes on a … program

I’ll be using Python for today’s post. Why’s that, you say? Four out of five data scientists may agree that statistics are mostly made up on the spot, but many of them use Python (and a really amazing tool called Jupyter Notebook, formerly known as IPython Notebook) when they’re doing data science.

The fastest, simplest way to get set up for Windows, Mac and Linux users is a distribution of Python called Anaconda. It’s pretty amazing, actually. It bundles together hundreds of the most popular Python packages used for science, analytics, engineering and mathematics, and sets up all the necessary plumbing to run Python with those packages either from the command line or via an interactive web-based editor called a Jupyter Notebook. In fact, this whole article was written as a Jupyter Notebook, so once you have it installed, you can even try some of these examples out yourself.

It’s words … in SPACE

While Anaconda comes with so many amazing Python packages for doing science, there’s one more NLP-specific package that I’ve been using for over a year now called spaCy. It’s super-fast for processing English language text, and extracting lots of relevant information about it. If only you could’ve had this when you were stuck doing sentence diagrams back in your school days:
[Image: sentence diagram]

Let’s get digital

First, I’m going to get some code up in here. After spaCy is installed and a Jupyter Notebook is running, I fire off the first, all-important line of code:

# Hello, world!

Just kidding.

from IPython.display import display, HTML  # lets notebook code render HTML
import spacy
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None means never truncate long strings

First, because this post is also available as one of those aforementioned Jupyter Notebook files, I import some modules that let my Python code write HTML to the screen.

The second line? That tells Python that I’m going to be using a module called spacy, which oddly enough, is where all of spaCy’s magic happens.

The third line? Who doesn’t like pandas? Just kidding. It’s actually a Python package called Pandas, a sort of data-scientist-super-Excel for Python users. But that’s a topic for another post. I don’t like typing pandas all the time, so I gave it the shorthand pd. I also set an option on pandas so that it doesn’t truncate super-long strings in its output.

Speaking of magic, it’s time to get my magic wand ready…

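# Current spaCy releases load a trained pipeline by name; en_core_web_sm is
# the small English model, which has to be installed separately.
nlp = spacy.load("en_core_web_sm")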

This one line can take a little while to run. It’s loading in the brain of the spaCy package, and calling the brain “nlp”. Like, almost literally, a brain. A bunch of memories, anyway. Memories that the developer has trained using huge bodies of text in order to extract meaning, parts of speech, named entities (think: proper nouns, among other things like dates, ordinals, etc.), word association, and root forms, among other things. Luckily, the brain only needs to be loaded in at the beginning of the program.

Now, let’s casually gloss over (for now) how we got this data out of SharePoint (because, honestly, there are so many ways you could’ve done that, and everyone has their favorites, and besides, the sequel to this post will give you a real, live Python package to make it easy), but let’s just say that your AdventureWorks business has really taken off, and your SharePoint-based e-commerce site has been collecting reviews of your products. What might a typical review look like? Glad you asked! Here’s one I slightly altered in order to show off just how cool this whole spaCy thing is.

review = '''The Road-550-W from Adventure Works Cycles is everything it's advertised to be. Finally, a quality bike that is actually built for a woman and provides control and comfort in one neat package. The top tube is shorter, the suspension is weight-tuned and there's a much shorter reach to the brake levers. All this adds up to a great mountain bike that is sure to accommodate any woman's anatomy. In addition to getting the size right, the saddle is incredibly comfortable. Attention to detail is apparent in every aspect from the frame finish to the careful design of each component. Each component is a solid performer without any fluff. The designers clearly did their homework and thought about size, weight, and funtionality throughout. And at less than 19 pounds, the bike is manageable for even the most petite cyclist. We had 5 riders, including my good friend Dr. Joseph A. Bicycle, and his wife, Mrs. Jane Bicycle, take the bike out for a spin and really put it to the test. The results were consistent and very positive. Our testers loved the manuverability and control they had with the redesigned frame on the 550-W. A definite improvement over the 2002 design. Four out of five testers listed quick handling and responsivness were the key elements they noticed. Technical climbing and on the flats, the bike just cruises through the rough. Tight corners and obstacles were handled effortlessly. The fifth tester was more impressed with the smooth ride. The heavy-duty shocks absorbed even the worst bumps and provided a soft ride on all but the nastiest trails and biggest drops. The shifting was rated superb and typical of what we've come to expect from Adventure Works Cycles. On descents, the bike handled flawlessly and tracked very well. The bike is well balanced front-to-rear and frame flex was minimal. In particular, the testers noted that the brake system had a unique combination of power and modulation.  
While some brake setups can be overly touchy, these brakes had a good amount of power, but also a good feel that allows you to apply as little or as much braking power as is needed. Second is their short break-in period. We found that they tend to break-in well before the end of the first ride; while others take two to three rides (or more) to come to full power. On the negative side, the pedals were not quite up to our tester's standards. Just for fun, we experimented with routine maintenance tasks. Overall we found most operations to be straight forward and easy to complete. The only exception was replacing the front wheel. The maintenance manual that comes with the bike say to install the front wheel with the axle quick release or bolt, then compress the fork a few times before fastening and tightening the two quick-release mechanisms on the bottom of the dropouts. This is to seat the axle in the dropouts, and if you do not do this, the axle will become seated after you tightened the two bottom quick releases, which will then become loose. It's better to test the tightness carefully or you may notice that the two bottom quick releases have come loose enough to fall completely open. And that's something you don't want to experience while out on the road! The Road-550-W frame is available in a variety of sizes and colors and has the same durable, high-quality aluminum that AWC is known for. At a MSRP of just under $1125.00, it's comparable in price to its closest competitors and we think that after a test drive you'l find the quality and performance above and beyond . You'll have a grin on your face and be itching to get out on the road for more. While designed for serious road racing, the Road-550-W would be an excellent choice for just about any terrain and any level of experience. It's a huge step in the right direction for female cyclists and well worth your consideration and hard-earned money.'''

In Python, the triple-single-quote ''' is used as the wrapper for a string that takes up multiple lines, or one that may have single or double quotes inside it. That way I didn’t have to “escape” the single quotes used in the contractions from the original review text. Just in case you were curious. Here’s the review printed out in one long blockquote.

display(HTML("<blockquote>{}</blockquote>".format(review)))

The Road-550-W from Adventure Works Cycles is everything it’s advertised to be. Finally, a quality bike that is actually built for a woman and provides control and comfort in one neat package. The top tube is shorter, the suspension is weight-tuned and there’s a much shorter reach to the brake levers. All this adds up to a great mountain bike that is sure to accommodate any woman’s anatomy. In addition to getting the size right, the saddle is incredibly comfortable. Attention to detail is apparent in every aspect from the frame finish to the careful design of each component. Each component is a solid performer without any fluff. The designers clearly did their homework and thought about size, weight, and funtionality throughout. And at less than 19 pounds, the bike is manageable for even the most petite cyclist. We had 5 riders, including my good friend Dr. Joseph A. Bicycle, and his wife, Mrs. Jane Bicycle, take the bike out for a spin and really put it to the test. The results were consistent and very positive. Our testers loved the manuverability and control they had with the redesigned frame on the 550-W. A definite improvement over the 2002 design. Four out of five testers listed quick handling and responsivness were the key elements they noticed. Technical climbing and on the flats, the bike just cruises through the rough. Tight corners and obstacles were handled effortlessly. The fifth tester was more impressed with the smooth ride. The heavy-duty shocks absorbed even the worst bumps and provided a soft ride on all but the nastiest trails and biggest drops. The shifting was rated superb and typical of what we’ve come to expect from Adventure Works Cycles. On descents, the bike handled flawlessly and tracked very well. The bike is well balanced front-to-rear and frame flex was minimal. In particular, the testers noted that the brake system had a unique combination of power and modulation. 
While some brake setups can be overly touchy, these brakes had a good amount of power, but also a good feel that allows you to apply as little or as much braking power as is needed. Second is their short break-in period. We found that they tend to break-in well before the end of the first ride; while others take two to three rides (or more) to come to full power. On the negative side, the pedals were not quite up to our tester’s standards. Just for fun, we experimented with routine maintenance tasks. Overall we found most operations to be straight forward and easy to complete. The only exception was replacing the front wheel. The maintenance manual that comes with the bike say to install the front wheel with the axle quick release or bolt, then compress the fork a few times before fastening and tightening the two quick-release mechanisms on the bottom of the dropouts. This is to seat the axle in the dropouts, and if you do not do this, the axle will become seated after you tightened the two bottom quick releases, which will then become loose. It’s better to test the tightness carefully or you may notice that the two bottom quick releases have come loose enough to fall completely open. And that’s something you don’t want to experience while out on the road! The Road-550-W frame is available in a variety of sizes and colors and has the same durable, high-quality aluminum that AWC is known for. At a MSRP of just under $1125.00, it’s comparable in price to its closest competitors and we think that after a test drive you’l find the quality and performance above and beyond . You’ll have a grin on your face and be itching to get out on the road for more. While designed for serious road racing, the Road-550-W would be an excellent choice for just about any terrain and any level of experience. It’s a huge step in the right direction for female cyclists and well worth your consideration and hard-earned money.

So that review sure looks great! Seems a little too enthusiastic, but did include a negative as well. Let’s see what sorts of things we can extract from this totally-not-fictional review.

First, in order to look for meaning in all those words, we need to let that nlp brain have a read:

doc = nlp(review)

It’s a quick read.

Natural Language Processing systems really like their language to be broken down into sentences. Now, that can be a long, involved process. You can’t just separate sentences based on punctuation. What would Dr. Bicycle et al. say about that? Luckily, spaCy knows how to handle such things. Here is the list of sentences that it found. Well, not all of them. Just five of them, starting with the ninth.

sentences = [sentence.orth_ for sentence in doc.sents]
print("There were {} sentences found. Here's a sample:".format(len(sentences)))
pd.DataFrame(sentences[8:13])
There were 38 sentences found. Here's a sample:
0
0 And at less than 19 pounds, the bike is manageable for even the most petite cyclist.
1 We had 5 riders, including my good friend Dr. Joseph A. Bicycle, and his wife, Mrs. Jane Bicycle, take the bike out for a spin and really put it to the test.
2 The results were consistent and very positive.
3 Our testers loved the manuverability and control they had with the redesigned frame on the 550-W.
4 A definite improvement over the 2002 design.
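
To see why punctuation alone isn’t enough, here’s a minimal sketch of the naive approach, using only Python’s standard library and a short excerpt of the review. The naive_split function is my own throwaway helper for illustration, not anything spaCy provides, and it is deliberately simplistic.

```python
import re

# A short excerpt from the review above (two real sentences).
text = ("We had 5 riders, including my good friend Dr. Joseph A. Bicycle, "
        "and his wife, Mrs. Jane Bicycle, take the bike out for a spin. "
        "The results were consistent and very positive.")


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace.

    Hypothetical helper: it knows nothing about titles or initials.
    """
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]


naive_sentences = naive_split(text)
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive splitter chops those two sentences into five fragments, because it treats the periods in “Dr.”, “A.” and “Mrs.” as sentence endings. That’s exactly the sort of mistake a trained sentence segmenter is meant to avoid.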

Alright! Let’s see how well spaCy does with showing me noun phrases, which can help when determining the topic of a text – especially when combined with more advanced weighting algorithms like TF-IDF. I’m also showing the “head” of the phrase – i.e. the word that connects that noun phrase to another part of the review. I’ll show just a subset.

nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]
print("There were {} noun phrases found. Here's a sample:".format(len(nounphrases)))
pd.DataFrame(nounphrases[8:18])
There were 155 noun phrases found. Here's a sample:
0 1
0 one neat package in
1 The top tube is
2 the suspension is
3 a much shorter reach 's
4 the brake levers to
5 a great mountain bike to
6 any woman’s anatomy accommodate
7 addition In
8 the size getting
9 the saddle is
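
Since I mentioned TF-IDF, here’s a rough sketch of the idea in plain Python. The three tiny “reviews” below are made-up lists of noun phrases, not real spaCy output, and the formula is the plain unsmoothed variant (term frequency times the log of inverse document frequency). Real libraries use slightly different weightings, so treat this as an illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each inner list stands for the noun phrases
# extracted from one review. These are invented for illustration.
docs = [
    ["the bike", "the saddle", "the bike", "the brake levers"],
    ["the bike", "the price"],
    ["the bike", "the saddle", "the shipping"],
]


def tf_idf(docs):
    """Score each phrase in each document with unsmoothed TF-IDF."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each phrase.
    df = Counter(phrase for doc in docs for phrase in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            phrase: (count / len(doc)) * math.log(n_docs / df[phrase])
            for phrase, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
# The highest-scoring phrase in the first review.
top_phrase = max(scores[0], key=scores[0].get)
print(top_phrase)
```

A phrase that turns up in every review (“the bike”) scores zero, while one that is distinctive to a single review (“the brake levers”) floats to the top. That’s why the weighting is handy for working out what a particular text is actually about.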

How about those so-called “entities”?

entities = list(doc.ents)
print("There were {} entities found".format(len(entities)))
There were 19 entities found

Seems like a lot, doesn’t it? But we can tell that there are several “entities” in there that don’t seem to be the “proper nouns” we’re expecting. I’ll grab just the ones that are organizations (ORG, in the code) or people (PERSON).

orgs_and_people = [entity.orth_ for entity in entities if entity.label_ in ['ORG','PERSON']]
pd.DataFrame(orgs_and_people)
0
0 Adventure Works Cycles
1 Joseph A. Bicycle
2 Jane Bicycle
3 Adventure Works Cycles
4 AWC

That’s better. And we’re only scratching the surface. This sort of linguistic analysis can be used to identify brands being discussed on your SharePoint sites, identify subject matter experts based on the content they submit, and even enforce governance rules by identifying posted material that isn’t meant to be shared publicly.

Even that list is just the tip of the iceberg whose surface this article has only scratched.
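
As one small sketch of the brand-spotting idea, here’s how the organizations and people from the output above could be tallied with nothing but the standard library. The ALIASES table and the count_mentions helper are my own assumptions for illustration; spaCy doesn’t know on its own that “AWC” and “Adventure Works Cycles” are the same company.

```python
from collections import Counter

# The ORG and PERSON entities listed in the output above.
orgs_and_people = [
    "Adventure Works Cycles",
    "Joseph A. Bicycle",
    "Jane Bicycle",
    "Adventure Works Cycles",
    "AWC",
]

# Hypothetical alias table mapping short forms to a canonical name.
ALIASES = {"AWC": "Adventure Works Cycles"}


def count_mentions(names, aliases):
    """Count mentions after folding known aliases into one canonical name."""
    return Counter(aliases.get(name, name) for name in names)


mentions = count_mentions(orgs_and_people, ALIASES)
for name, count in mentions.most_common():
    print(name, count)
```

Run across a whole library of reviews rather than a single one, a tally like this gives a first rough picture of which brands and people come up most often.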

Hungry for more?

I’ll be delivering a session at SPTechCon in Austin, TX – on my birthday, even! – called “Data (and text) Mining SharePoint for Fun and Profit”, where I’ll demonstrate several more clever and useful techniques, with real SharePoint data, that showcase the power of data mining and natural language processing. You should totally plan to be there.

I will also have Part 2 of what I hope will be a five-part series ready for you to read shortly after I return from SPTechCon.

Thanks for reading!

P.S. You can download the Jupyter Notebook, if you’re comfortable getting that sort of thing running!


This article was originally written and posted for SPTechReport on January 27, 2016.