New York Times
December 30, 2002
Taming the Task of Checking for Terrorists' Names
By SARAH MILSTEIN
When presented with a document like a passport or credit card, certain federal agencies and some private-sector companies, like airlines and insurance companies, are required by law to check whether the name on the document is also on watch lists of suspected terrorists and their supporters. It sounds pretty simple. But it can be perilously complicated.
Take, for instance, the name "Abd al-Rahman," which can be a given name or a surname, depending on its culture of origin. When transliterated from the Arabic into Latin characters, the name has three parts, the first two of which are prefixes for Rahman, meaning "slave of" or "servant of."
But when an English speaker hears the name, it tends to sound like "Abdurrahman." A person writing it down based simply on how it sounds could easily spell it as one word and in a way that shares few characters with the transliterated version.
In some databases, "Abd" or "al" might be stored in fields for first or middle names, or the name could be written as "Abdurrahman" or "Abdurahman" all in one field, or "Abdul Rahman" in two fields. There are endless variations, even assuming no typos.
And so, on lists with thousands or millions of entries, names of any ethnic origin can get lost when they are recorded with one spelling and searched for with another.
Altering a foreign name just slightly on a passport could allow even a most-wanted person to evade authorities at airports and other border crossings, according to John C. Hermansen, a co-founder of Language Analysis Systems, a company in Herndon, Va., that makes name-searching software and consults for the federal government.
"There are over 200 ways 'Mohammed' is spelled in our alphabet, but there's only one way in Arabic," said Dr. Hermansen. "You have to find all of them, if the guy is coming across the border."
Whatever the alphabet used, variations in name structures can also pose problems. Some cultures have surname prefixes - like "de la" in some Romance languages - while others do not. Databases can capture those variations in multiple fields or they can lump all parts of a name into one field. There is no standard, and watch lists have many formats.
Moreover, a person entering data may not know how to read a name. Is "Van" a surname prefix? In Dutch it is, Dr. Hermansen noted, but in Vietnamese, it would probably be the entire surname.
To help identify potential terrorists, government agencies rely heavily on the Interagency Border Inspection System. Known as IBIS, it is a vast database of information on suspect individuals, businesses, vehicles, aircraft and vessels. IBIS is derived from the combination of dissimilar databases kept by the United States Customs Service, the Immigration and Naturalization Service, the State Department and 21 other federal agencies. A single name - particularly a transcribed, transliterated or mistyped name - can easily disappear in such a system.
Businesses also have a stake in name-searching. Since the Sept. 11 attacks, the Treasury Department's Office of Foreign Assets Control has issued new regulations compelling companies in industries like banking and tourism to check the office's list of potential terrorists, or face hefty fines. And all airlines flying into the United States must now check passenger information against IBIS for all passengers and crew members.
One company, Intelligent Search Technology, in Brewster, N.Y., has developed a name-searching program that has been used by businesses that are subject to the Office of Foreign Assets Control's regulations.
Language Analysis Systems, meanwhile, has received a government contract on a "sole source justification" basis, meaning no other company had the expertise to bid on the contract. The C.I.A. and National Security Agency have been using the company's products for about two years, Dr. Hermansen said, and the Customs Service recently completed a pilot project with the company's software for the agency's intranet.
There are a few ways computers can tackle the name-search problem. A simple database search, called a key-based search, looks for an exact string of characters and tries to find a match. But if the stored spelling does not match the search string exactly, the process is futile.
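The weakness of an exact search can be seen in a minimal Python sketch. The one-entry watch list and the function name exact_match are hypothetical, chosen only for illustration; the spellings are the ones discussed earlier in the article.

```python
# Hypothetical one-entry watch list, stored with the transliterated spelling.
WATCH_LIST = {"Abd al-Rahman"}


def exact_match(name, entries):
    """Return True only if the name appears in entries character for character."""
    return name in entries


# Spellings of the same name mentioned in the article.
variants = ["Abd al-Rahman", "Abdurrahman", "Abdurahman", "Abdul Rahman"]

# Keep only the variants that the exact search actually finds.
hits = [v for v in variants if exact_match(v, WATCH_LIST)]
print(hits)
```

Only the spelling that was stored is found; the other three variants slip through, which is the failure the article describes.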
A more sophisticated search, using what is called fuzzy logic, can look for similar strings and return the closest matches. But that can return many inaccurate results. Researchers at Language Analysis Systems have found, for example, that looking for names with similar consonant sounds, a longstanding method of name analysis, could find these matches for the name Criton: Courtmanche, Corradino and Cortinez.
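The article does not name the consonant-based method the researchers tested. As an assumption, the sketch below uses Soundex, a long-established phonetic code that keeps a name's first letter and encodes the consonants that follow as three digits. It is meant only to show how such a method can lump dissimilar names together, not to reproduce the company's analysis.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code can suppress a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Pad with zeros or truncate so the code is always four characters.
    return (first + "".join(digits) + "000")[:4]


for name in ("Criton", "Courtmanche", "Corradino", "Cortinez"):
    print(name, soundex(name))
```

Under this scheme all four names receive the code C635, so a search for Criton would return the other three as matches even though they look quite different on the page.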
Language Analysis Systems has created NameClassifier, a program that attempts to identify the cultural origins of a name, and NameHunter, which searches for names with the same linguistic parameters, such as whether affixes like "y" and "de la" are used. NameHunter uses a pair-wise strategy, a technique that accounts for dissimilarities and random errors by basing its search on a series of comparisons, looking for shared properties.
"'Maria De la Cruz Vasco de Gamma' has a lot more different things going on than a short Chinese name," said Dr. Hermansen. "If you can discover what culture are you looking at, you can apply the best computational techniques."
Intelligent Search Technology's system, NameSearch, also ranks results. But rather than identify the culture of a language, NameSearch employs a key-based strategy that uses multiple keys for each name - taking affixes into account, for example, or various phonetic spellings and nicknames. The search is then performed using a series of comparisons that determine how similar a name is to the keys.
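The general idea of giving each name several keys and ranking candidates by how many keys they share can be sketched in Python. This is not NameSearch's actual algorithm, which the article does not detail. The affix list, the three key types and the overlap score are all assumptions made for illustration.

```python
import re

# Hypothetical affix list; a real system would use a much larger,
# culture-specific set.
AFFIXES = {"de", "la", "al", "el", "van", "bin"}


def keys(name):
    """Build a small set of search keys for one name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    joined = "".join(tokens)                                 # all parts run together
    stripped = "".join(t for t in tokens if t not in AFFIXES)  # affixes dropped
    skeleton = re.sub(r"[aeiou]", "", joined)                # consonants only
    skeleton = re.sub(r"(.)\1+", r"\1", skeleton)            # collapse doubled letters
    return {k for k in (joined, stripped, skeleton) if k}


def score(query, candidate):
    """Share of keys the two names have in common (Jaccard overlap, 0 to 1)."""
    a, b = keys(query), keys(candidate)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query, candidates):
    """Return the candidates ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


print(rank("Maria De la Cruz", ["John Smith", "Maria Cruz", "Maria Delacruz"]))
```

Here "Maria Delacruz" ranks first because it shares two keys with the query, "Maria Cruz" ranks second on the strength of the affix-stripped key alone, and "John Smith" shares none.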
Another company, Basis Technology, in Cambridge, Mass., has developed a language analyzer, based on phonics, for translating Latinized names back into their original alphabets so they can be searched against lists in the original languages.
Joel Ross, vice president of military and intelligence services for Basis, noted that "Qaddafi," for example, can be spelled at least 60 ways in the Latin alphabet. To match a name on a watch list, the Basis system takes a Latinized name and compares it to the company's own transcription scheme, so that it will match with the one Arabic version of the name.
The future of name-searching, according to the companies working on it, is not in watch lists, but in sifting through huge quantities of digital documents, like those that might be found on terrorists' computers or intercepted online.
"Going through massive amounts of text and identifying when there's a name, city, corporation," is the next challenge, said Richard Wagner, the president of Intelligent Search Technology. "That's where this industry is headed."