 Date: 1999-12-16
 
 NSA's Semantic Forests: Schneier's analysis
 
 Bruce Schneier on the technical capabilities of the NSA's
 "Semantic Forests" patent. His closing line: "I'm surprised
 the NSA hasn't classified this work."
 -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-
 The NSA has been patenting, and publishing, technology that
 is relevant to ECHELON.
 
 ECHELON is a code word for an automated global
 interception system operated by the intelligence agencies of
 the U.S., the UK, Canada, Australia and New Zealand.  (The
 NSA takes the lead.) According to reports, it is capable of
 intercepting and processing many types of transmissions,
 throughout the globe.
 
 Over the past few months, the U.S. House of Representatives
 has been investigating ECHELON. As part of these
 investigations, the House Select Committee on Intelligence
 requested documents from the NSA regarding its operating
 standards for intelligence systems like ECHELON that may
 intercept communications of Americans.  To everyone's
 surprise, NSA officials invoked attorney-client privilege and
 refused to disclose the documents.  EPIC has taken the
 NSA to court.
 
 I've seen estimates that ECHELON intercepts as many as 3
 billion communications every day, including phone calls,
 e-mail messages, Internet downloads, satellite transmissions,
 and so on.  The system gathers all of these transmissions
 indiscriminately, then sorts and distills the information
 through artificial intelligence programs.  Some sources have
 claimed that ECHELON sifts through 90% of the Internet's
 traffic.
 
 How does it do it? Read U.S. Patent 5,937,422,
 "Automatically generating a topic description for text and
 searching and sorting text by topic using the same,"
 assigned to the NSA.  Read two papers titled "Text Retrieval
 via Semantic Forests," written by NSA employees.
 
 Semantic Forests, patented by the NSA (the patent does not
 use the name), was developed to retrieve information "on the
 output of automatic speech-to-text (speech recognition)
 systems" and to label text by topic.  It is described as a
 functional software program.
 
 The researchers tested this program on numerous pools of
 data, and improved the test results from one year to the next.
 All this occurred in the window between when the NSA
 applied for the patent, more than two years ago, and when
 the patent was granted this year.
 
 One of the major technological barriers to implementing
 ECHELON is the lack of automatic search tools for voice
 communications.  Computers need to "think" like humans
 when analyzing the often imperfect computer transcriptions of
 voice conversations.
 
 The patent claims that the NSA has solved this problem.
 First, a computer automatically assigns a label, or topic
 description, to raw data.  This system is far more
 sophisticated than previous systems because it labels data
 based on meaning, not on keywords.
 
 Second, the patent includes an optional pre-processing step
 which cleans up text, much of which the agency appears to
 expect will come from human conversations.  This
 pre-processing will remove what the patent calls "stutter
 phrases." These phrases "frequently occurs [sic] in text
 based on speech." The pre-processing step will also remove
 "obvious stop words" such as the article "the."
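To make the pre-processing step concrete, here is a minimal Python sketch of what such a cleanup might look like. It is not the patent's method: the stop-word list, the `max_len` limit, and the rule that a "stutter phrase" is any immediately repeated run of up to three words are all assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; the patent excerpt only names "the" as an example.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def remove_stutters(tokens, max_len=3):
    """Collapse immediately repeated phrases of up to max_len tokens.

    Assumption: a "stutter phrase" is a word or short phrase said twice
    (or more) in a row, as often happens in transcribed speech.
    """
    out = []
    for tok in tokens:
        out.append(tok)
        # If the last n tokens repeat the n tokens before them, drop the repeat.
        for n in range(1, max_len + 1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]
                break
    return out

def preprocess(text):
    """Lowercase, tokenize, drop stutters, then drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = remove_stutters(tokens)
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I I I want want to to go to the the bank"))
print(preprocess("we need to we need to call him"))
```

A rule this crude also collapses legitimate repetitions (for example "that that"), so a real system would need something more careful.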
 
 The invention is designed to sift through foreign language
 documents, either in text, or "where the text may be derived
 from speech and where the text may be in any language," in
 the words of the patent.
 
 The papers go into more detail on the implementation of this
 technology. The NSA team ran the software over several
 pools of documents, some of which were text from spoken
 words (called SDR), and some regular documents. They ran
 the tests over each pool separately.  Some of the text
 documents analyzed appear to include data from "Internet
 discussion groups," though I can't quite determine whether these
 were used to train the software program or to illustrate results.
 
 The "30-document average precision" (whatever that is) on
 one test pool rose significantly in one year, from 19% in 1997
 to 27% in 1998, a gain of eight percentage points.  This shows
 that they're getting better.
 
 It appears that the tests on the pool of speech-to-text
 documents came in at between 20% and 23% accuracy (see
 Tables 5 and 6 of the "Semantic Forests TREC7" paper) at
 the 30-document average.  (A "document" in this definition
 can mean a topic query.  In other words, 30 documents can
 actually mean 30 questions to the database.)
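One possible reading of the "30-document" figure is precision at a cutoff of 30 retrieved documents, averaged over all queries. That reading is an assumption, since the papers' exact definition is not reproduced here. Under it, the number could be computed roughly as in the Python sketch below, where the function names and the toy data are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k

def mean_precision_at_k(runs, k=30):
    """Average precision-at-k over queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Toy data: two queries against a collection of 100 documents.
ranking = list(range(100))
runs = [
    (ranking, set(range(0, 100, 4))),  # 8 relevant documents in the top 30
    (ranking, {0, 1, 2, 3, 4, 5}),     # 6 relevant documents in the top 30
]
print(round(mean_precision_at_k(runs), 4))
```

On this reading, a score of 27% would mean that about 8 of the first 30 documents returned for a typical query were judged relevant.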
 
 It's pretty clear to me that this technology can be used to
 support an ECHELON-like system.  I'm surprised the NSA
 hasn't classified this work.
 
 The Semantic Forest papers:
 
 http://trec.nist.gov/pubs/trec6/papers/nsa-rev.ps
 
 http://trec.nist.gov/pubs/trec7/papers/nsa-rev.pdf
 
 Source
 
 http://www.counterpane.com
 -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-
 - -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-
 edited by Harkank
 published on: 1999-12-16
 comments to office@quintessenz.at
 - -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-