So much for the “anonymity” of the web: Researchers have developed new internet security tools that reportedly can determine a person’s gender, level of education, and whether two pieces of writing originated from the same person–all from a typing sample of a few hundred words.
The technology has important implications for helping to catch plagiarists or uncover pedophiles in chat rooms, its creators say–though others aren’t so sure.
In 2003, researchers from the Illinois Institute of Technology and Bar-Ilan University in Israel developed a way to guess a person’s gender from his or her word usage, based on a Bayesian network that uses weighted word frequencies and parts of speech. In short, the researchers found, men and women don’t write the same way.
A simplified version of their work was used to create the Gender Genie, an internet site that used an algorithm to determine a writer’s gender.
Expanding on the technology used to create the Gender Genie, Neal Krawetz, a computer security consultant and researcher, has built an internet security tool called the Gender Guesser.
The Gender Guesser is not 100 percent accurate; Krawetz, who owns a safe-computing consulting and research company called Hacker Factor Solutions, says its accuracy is between 60 and 70 percent but adds that he’s working on methods to increase its effectiveness.
The technique for guessing an author’s gender is one of the easier tests, Krawetz said; the software looks at both formal writing (fiction, nonfiction, essays, news reports) and informal writing (blogs, chat-room messages, and so on). For an informal test, the system checks for 22 different words, and for a formal test, 35 words. The system uses a small subset of common words in the English language. Krawetz weights words such as “actually” and “too” as female, and words such as “now” and “something” as male.
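The word-weighting idea described above can be sketched in a few lines of Python. This is only an illustration, not Krawetz’s actual code: the article names just four of the words and which way they lean, so the numeric weights, the function names, and the simple “higher total wins” rule are all assumptions made for the example.

```python
import re

# Hypothetical weights: the article names these four words and the
# direction they lean, but the numeric values are invented here.
FEMALE_WEIGHTS = {"actually": 1.0, "too": 1.0}
MALE_WEIGHTS = {"now": 1.0, "something": 1.0}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_gender(text):
    """Return (female_score, male_score, guess) for a writing sample."""
    female = 0.0
    male = 0.0
    for word in tokenize(text):
        female += FEMALE_WEIGHTS.get(word, 0.0)
        male += MALE_WEIGHTS.get(word, 0.0)
    if female > male:
        guess = "female"
    elif male > female:
        guess = "male"
    else:
        guess = "unknown"  # tie, or none of the listed words appeared
    return female, male, guess


if __name__ == "__main__":
    print(score_gender("Actually, I liked that one too."))
```

A real tool would presumably use the full lists of 22 or 35 words and tuned weights; with only four words, most short samples would come back as “unknown.”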
Another idea central to the Gender Guesser is that men are more likely to use adverbs and proper names, while women are more likely to use personal pronouns, Krawetz said.
The Gender Guesser also reportedly can indicate if a person is American or European based on vocabulary size.
“Americans hate to hear this, but they have a small vocabulary size–someone in England of the same age and background will have two to three times [their] vocabulary,” Krawetz said. “Because we have a small vocabulary, we use these 22 words over and over. If someone’s European, they’ll use these particular words less often.”
Krawetz said his technology could be used to uncover pedophiles on internet sites and in chat rooms, and he said he’s used the technology in chat rooms to do just that, with surprising results.
“I was in a chat room, and one person claimed to be a young child, and [this person was] chatting up a storm. When I did an analysis of the text, I found out he had a college degree. When I called the guy on it, it turns out he was an undercover police officer, and he was very interested to know how I identified him,” Krawetz said.
The Gender Guesser can help law enforcement in online situations, he said, because the methods many internet predators use to hide their identities, such as going through online “relays” to change their computer IP address or purposely using poor grammar and misspelled words, do not affect how the software works.
“Internet predators can change their spelling and some word selections, but it is extremely difficult to change habitual behavior long-term. Their words and vocabulary range will identify themselves,” Krawetz said.
Krawetz said he’s had interest from law-enforcement officials and is working on fine-tuning the technology, adding: “If you can start up a small dialog with [someone], maybe 100 or 200 words, you can put together a picture of their vocabulary range, which can tell you whether they have a high school or post-high school education, their nationality, and [you] can really start building a profile.”
Nancy Willard, executive director of the Center for Safe and Responsible Internet Use at the University of Oregon, said the software might have some use for helping to identify pedophiles–but only in cases where online predators are being deceptive about their age or gender.
Many online predators are up front about their age and gender, Willard said, calling it a “myth” that predators masquerade as innocent teens.
“In a study of actual incidents, researchers at the Crimes Against Children Research Center found that deception was rare,” she said. “All of the teens who met with online predators did so knowing that they were interacting with an adult and intending to engage in sex with that adult. The only place where deception appeared to play a role was the deception that the predator actually cared for them.”
In addition to the Gender Guesser, which operates from a simple algorithm, Krawetz has developed other tools that he says can help enhance online profiling efforts.
A keyboard profiler uses the patterns of how people bang on their keyboard–such as how they enter random letters into an online text field–to determine if they are right-handed or left-handed.
It is “scarily accurate,” Krawetz said, and the tool “can even determine things like ergonomics, if you’re sitting too close to the monitor.” The keyboard profiler reportedly can help in determining nationality, too, because if users are typing on a French keyboard, they will drum on the keys in a different order than they would on a U.S. keyboard, he said.
Another tool, the sentence and word-preference analysis, takes the words a person uses and makes a histogram of the word frequency. “People have words they depend on and won’t use words they don’t know,” Krawetz said. Analyzing two writing samples reportedly can help determine whether they were written by the same person.
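A rough sketch of how a word-frequency histogram might be used to compare two samples follows. The article does not say how Krawetz’s tool compares histograms, so the cosine-similarity measure used here, along with the function names, is an assumption chosen for illustration; it returns a value near 1 when two samples lean on the same words and 0 when they share none.

```python
import math
import re
from collections import Counter


def word_histogram(text):
    """Count how often each word appears in a writing sample."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(hist_a, hist_b):
    """Compare two histograms: 0.0 means no shared words, 1.0 means
    the same words in the same proportions."""
    shared = set(hist_a) & set(hist_b)
    dot = sum(hist_a[w] * hist_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample cannot be compared
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = word_histogram("I really think this is really the one")
    second = word_histogram("I really think that was the one")
    print(cosine_similarity(first, second))
```

A similarity score alone does not prove common authorship; any threshold for calling two samples a match would have to be set by testing on writing whose authors are known.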
Krawetz also came up with a tool that measures punctuation frequency. All of these tools work independently of each other, but he said his long-term plans are to incorporate the tools into one mechanism.
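Measuring punctuation frequency could look something like the sketch below. The article gives no details of Krawetz’s tool, so the choice to report each mark as a rate per 1,000 characters, and the function name, are assumptions made for the example.

```python
import string
from collections import Counter


def punctuation_frequency(text):
    """Return each ASCII punctuation mark's rate per 1,000 characters."""
    total = len(text)
    if total == 0:
        return {}
    counts = Counter(ch for ch in text if ch in string.punctuation)
    # Normalizing by length lets samples of different sizes be compared.
    return {mark: n * 1000 / total for mark, n in counts.items()}


if __name__ == "__main__":
    print(punctuation_frequency("Well... maybe; maybe not!"))
```

Note that `string.punctuation` covers only ASCII marks, so curly quotes and typographic dashes would need to be added for text copied from word processors.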
He said his algorithms can be more reliable than human profiling, because profiling “is an inaccurate system–two people in a field can profile or analyze a person and come up with two different conclusions or personalities,” while two people can run a text sample through the Gender Guesser and come back with identical conclusions.