(Level: Beginner)

“The beginning of wisdom is to call things by their proper name.” – Confucius

Hello there, we apologize for the delay in publishing this article. The last two weeks have been pretty hectic.

Now that you are equipped with the basics of text processing, it is high time we moved on to some NLP-specific concepts. This week’s article is about Named Entities, as the title suggests. You will understand what they are, why they are important, and how to identify them.

What are Named Entities?

Andy and Zoe are friends. Andy is working on an NLP project that identifies Named Entities (NEs) in a given text, and Zoe wants to help him out. So Andy gives her a text file that contains information about Apple Inc. and asks her to manually identify the NEs in that text. But Zoe doesn’t know what an NE is, so Andy explains it to her. Here is the text:

Apple is a technology company headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software, and online services. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 to develop and sell personal computers. iPhone and iPad are popular products from Apple.

Can you identify the NEs in the above text? If not, no problemo. Let us start off by defining what an entity is. An entity is a thing that has some meaning in the real world. A car, a person, and a country are all entities.

An NE is simply an entity that has a name. Examples include persons, organizations, locations, products, etc. So Andy, Antarctica, Microsoft, and iPhone are all NEs. Simple enough?

So in the above text, we have the following NEs:

Apple - Organization
Cupertino - Location
California - Location
Steve Jobs - Person
Steve Wozniak - Person
Ronald Wayne - Person
iPhone - Product
iPad - Product

So, that’s that. The identification of these NEs in text is called Named Entity Recognition (NER). The NLP community also commonly treats expressions such as dates, times, and monetary amounts as NEs.
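One way to picture what NER produces for the Apple text is a list of (text, label) pairs. The snippet below is only a sketch of that idea: the variable and function names are made up for illustration, and the labels simply mirror the list above.

```python
from collections import defaultdict

# The NEs from the Apple text as (text, label) pairs.
# Names such as annotated_entities are illustrative, not part of any library.
annotated_entities = [
    ("Apple", "Organization"),
    ("Cupertino", "Location"),
    ("California", "Location"),
    ("Steve Jobs", "Person"),
    ("Steve Wozniak", "Person"),
    ("Ronald Wayne", "Person"),
    ("iPhone", "Product"),
    ("iPad", "Product"),
]

def group_by_label(entities):
    """Group entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_by_label(annotated_entities)
print(entities_by_label["Person"])  # ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne']
```

Grouping by label like this makes it easy to ask questions such as "which people are mentioned in this text?".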

Why should I perform NER?

Let us sit back for a moment and think about one of the most common things that we do every day: answering questions. In fact, that’s exactly what we are doing right now. For the most part, answers to questions are based on facts. Facts are truthful pieces of information about entities. And facts, more often than not, contain NEs.

Consider the following question-answer pairs:

Q: What is the capital of Germany?
A: Berlin

Q: Who wrote "To Kill a Mockingbird"?
A: Harper Lee

Q: When was Albert Einstein born?
A: 14 March 1879

Q: Where does Andy live?
A: London

As you may have noticed, many of the answers to questions that are based on facts (and are not subjective) contain NEs. Answering such questions is called factoid question-answering. Now, if you were to create a question-answering bot, it would be impossible to manually store the answer to every possible question in the universe!

Given that there is so much information on the Web, the most efficient way would be to crawl this information automatically and store facts in a database. Let us say that your machine is reading some text on the Web. The NEs in that text will tell it what the article is about. Important relationships between those NEs can then be identified and saved, and later used to answer questions. As we saw in the previous example about Apple, NEs in the text often provide the most useful information.

How do I identify NEs?

Identifying NEs is by no means trivial. An interesting observation is that most NEs begin with a capital letter. Could we use that as a hint? Sure. But that alone won’t suffice: the first word of every sentence is capitalized too, and people may not always follow this convention, especially on social media.
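To see why capitalization alone falls short, here is a minimal sketch of that heuristic. The function name capitalized_candidates and the two sample sentences are made up for illustration; the tokenizer is the same simple letters-only regex used in this series.

```python
import re

def capitalized_candidates(text):
    """Return every token that starts with an uppercase letter."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    return [token for token in tokens if token[0].isupper()]

formal = "It was founded by Steve Jobs in April."
informal = "steve jobs founded apple"

print(capitalized_candidates(formal))    # ['It', 'Steve', 'Jobs', 'April']
print(capitalized_candidates(informal))  # []
```

The first sentence produces a false positive ("It" is capitalized only because it starts the sentence), while the second, written the way people often type on social media, yields nothing at all.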

One of the standard approaches to identifying NEs is to use large dictionaries, which are nothing but text files that contain NEs. This is simple to implement, but it is by no means exhaustive or complete. Take, for instance, names of people. There are so many possible names, many unique to their countries of origin. It would be nearly impossible to capture all possible names successfully.

However, for some of the most common NEs, like country and city names, an exhaustive dictionary can be created pretty easily. In fact, many such dictionaries already exist on the Web. Because of its simplicity, we will use this technique in today’s tutorial. In due course, once we learn more advanced concepts like Part-of-Speech tagging, Dependency Parsing, and some other Machine Learning techniques, we will go on to build a more sophisticated and accurate NER system.

In today’s coding example, we will look at identifying a specific type of NE: person names. Given some input text, we need to identify the person names in it. To solve this problem, we can begin by tokenizing the input text and then comparing each word in the tokenized text with the NE dictionary. If there is a match, it is an NE.

First, go to our GitHub repo here to download male.txt and female.txt, which are two text files that contain about 3000 male names and 5000 female names respectively. We will load those files in our program into two lists. You will be able to enter some text, and it will identify person names from that.

#Identify person names in a given piece of text, using dictionaries.
import re
regex = "[a-zA-Z]+"													#Defining regex for tokenizer.

with open("female.txt") as f:
    female_names = f.readlines()									#Read names line by line and store them in a list.
female_names = [w.strip() for w in female_names]					#Remove \n or newline character from the end of each name.

with open("male.txt") as f:
    male_names = f.readlines()
male_names = [w.strip() for w in male_names]

person_names = female_names + male_names							#Concatenate both the lists to form one big list. Python cool 🙂
text = input("Enter some text: ")
tokenized_words = re.findall(regex,text)
ne = [word for word in tokenized_words if word in person_names]		#You have seen the use of this last week. Check if given word is NE.
print("NEs in the given text are: ")
print(ne)															#Print the list of identified names.

Here is a sample execution:

Enter some text: Gwen, Mark, Sara and Tim are friends. Also, Andy and Jessica are siblings.
NEs in the given text are: 
['Gwen', 'Mark', 'Sara', 'Tim', 'Andy', 'Jessica']

As you can observe, it does a pretty good job of identifying person names. Of course, feel free to tinker with the code to identify names that may not start with an uppercase character, and maybe even add dictionaries of other NEs like countries, cities, and organizations. Remember, our technique here can only identify an NE if it is present in our person names dictionary, and is thus by no means exhaustive. Happy NE hunting!
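If you want a starting point for that tinkering, here is one possible sketch. It assumes tiny in-memory dictionaries (the entries and the names DICTIONARIES and find_entities are made up for illustration) in place of the name files, matches tokens regardless of case, and reports which dictionary each match came from.

```python
import re

# Tiny stand-in dictionaries, stored in lowercase; illustrative entries only.
# In practice each set could be loaded from a text file, one name per line.
DICTIONARIES = {
    "Person": {"andy", "zoe", "jessica"},
    "Country": {"germany", "india"},
    "City": {"london", "berlin"},
}

def find_entities(text, dictionaries=DICTIONARIES):
    """Return (token, label) pairs for tokens found in any dictionary, ignoring case."""
    found = []
    for token in re.findall(r"[a-zA-Z]+", text):
        for label, names in dictionaries.items():
            if token.lower() in names:  # Set membership is a fast lookup.
                found.append((token, label))
    return found

print(find_entities("andy moved from Berlin to london, and Zoe stayed in Germany."))
# [('andy', 'Person'), ('Berlin', 'City'), ('london', 'City'), ('Zoe', 'Person'), ('Germany', 'Country')]
```

Keep in mind that ignoring case is a trade-off: it catches "andy" and "london", but it can also flag ordinary words that happen to double as names, such as "mark" or "will", if they are in the dictionary.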

Now that you are familiar with NEs, feel free to check out the state-of-the-art NER system, Stanford NER. You can find it here. Enter some text in the box that says “Text to annotate”, choose “named entities” in “Annotations”, and click on the Submit button. It will identify NEs for you. Some of the categories of NEs that it identifies include: Person, Organization, Location, Date, Time.