Why does Extractor ignore the META tag in HTML?  

What is the meaning of the numbers associated with each key phrase?  

How can I normalize the score?  

Can I find the frequency of each key phrase in the document?  

Given a sentence such as, "I am not skiing today," why does Extractor select "skiing" as a key phrase instead of "not skiing"?  

I want to use Extractor for automatic document classification. Can you help me?

How can I combine key phrases that were extracted from many different documents?

Can Extractor handle language X?  

Can Extractor handle document format X?  

Can Extractor handle character encoding X?  

How can I generate 100 key phrases?  

When I give a document to Extractor and ask for four key phrases and then take the same document and ask for seven key phrases, the four key phrases are not always a subset of the seven key phrases. Why?  

I want Extractor to generate exactly N highlights (key sentences). I know that I can set the number of key phrases, but how do I set the number of highlights?  

In my input document, I repeatedly use the word "X". It is a very important word, and I use it early and frequently. Yet Extractor does not recognize it as a key phrase. Why?  

In our documents, we have phrases with four or more words. What does Extractor do?

Is there a limit to the number of words in a key phrase?  

I use programming language X. Is there a way to call the Extractor API from within language X?

 


 
     
    

Why does Extractor ignore the META tag in HTML?
The META tag in HTML is used to convey meta-information about the document, for example:

<META HTTP-EQUIV="Expires" CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">
<META HTTP-EQUIV="Reply-to" CONTENT="dsr@w3.org (Dave Raggett)">

Extractor ignores this meta-information. In particular, it does not use the "Keywords" meta-information. Extractor ignores the META tag for two reasons: (1) If you really care about the META tag, then you can easily write your own subroutine to parse it. (2) The META tag is widely abused. It is mainly used as a device for tricking search engines into giving a page a higher ranking in a hit list when a user enters a query. If you search for the word "meta", you will find many web pages that give web authors tips on how to fool search engines by using the META tag.
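If you do want the META keywords, the subroutine mentioned above can be quite short. The following is a minimal sketch in Python, using only the standard library; the class and function names (MetaKeywordParser, meta_keywords) are our own illustrative choices and are not part of the Extractor API. It assumes that the keywords are comma-separated and accepts either the HTTP-EQUIV or the NAME attribute as the label.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the CONTENT of META tags labelled "keywords" (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        label = attributes.get("http-equiv") or attributes.get("name") or ""
        if label.lower() == "keywords":
            # Assumption: keywords are separated by commas.
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.keywords.append(item.strip())


def meta_keywords(html_text):
    """Return the list of keywords found in META tags of the given HTML text."""
    parser = MetaKeywordParser()
    parser.feed(html_text)
    return parser.keywords


page = '<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">'
print(meta_keywords(page))  # ['Nanotechnology', 'Biochemistry']
```

You could then merge these keywords with the key phrases from Extractor, keeping in mind that the META keywords may not be trustworthy.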


 
What is the meaning of the numbers associated with each key phrase?
When you run the sample program "test_api.exe" (or "test_api.bin" on Unix platforms), each key phrase is output with a number after it. These numbers are the scores returned by the API function ExtrGetScoreByIndex(). The score of a phrase is an estimate of its value as a key phrase. Key phrases are ranked in order of descending score. A score can be any positive real number. Scores for long input documents tend to be higher than scores for short documents. The method of calculating the score is described in detail in Learning to Extract Keyphrases from Text. For some applications, it might be desirable to normalize the score.


How can I normalize the score?
For some applications, it might be desirable to normalize the score, so that the scores of key phrases from different documents can be compared. Here are some suggestions for normalization: 

  • Ignore the scores produced by Extractor. Given a large collection of documents (e.g., web pages), score each key phrase by the percentage of documents for which the given key phrase was suggested by Extractor. (Example: "The key phrase 'corporate merger' was generated by Extractor for 45 of the 100 documents. Thus 'corporate merger' has a score of 45%.")
  • Ignore the scores produced by Extractor. Given a large collection of documents (e.g., web pages), score each key phrase by the percentage of documents in which the given key phrase appears somewhere (even if it was not suggested by Extractor). (Example: "The key phrase 'corporate merger' appears somewhere in the body of 45 of the 100 documents. Thus 'corporate merger' has a score of 45%.")
  • Take the score produced by Extractor and normalize it so that it ranges from 0% to 100%, by dividing the score of each key phrase by the score of the first key phrase. (The first key phrase always has the highest score.) (Example: "Extractor suggests three phrases: 'corporate merger' with a score of 50, 'stocks' with a score of 30, and 'bonds' with a score of 10. The normalized scores are 100%, 60%, and 20%, respectively.")
  • Longer documents often seem to have better key phrases than shorter documents. The problem with the previous suggestion is that it ignores the document length. One possibility would be to multiply the normalized score from the previous suggestion by (say) the logarithm of the length of the document (measured in number of words or in bytes). Another possibility would be to sort the document collection by length and increase the scores of the key phrases of each document according to the percentile in which the document appears. (Example: "The key phrase 'corporate merger' appears in document #345. The key phrase has a normalized score of 60%. However, since document #345 is in the top 25% of documents in the collection, according to length, we will boost the score of 'corporate merger' by 20 percentage points, for an adjusted score of 80%.")
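The suggestions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key phrases and scores are assumed to have already been read from Extractor into ordinary Python lists, the function names are our own, and the choice of the natural logarithm for the length adjustment is arbitrary.

```python
import math


def document_frequency_scores(phrases_per_document):
    """Score each phrase by the percentage of documents for which it was suggested.

    phrases_per_document: one list of key phrases per document.
    """
    total = len(phrases_per_document)
    counts = {}
    for phrases in phrases_per_document:
        for phrase in set(phrases):  # count each document at most once
            counts[phrase] = counts.get(phrase, 0) + 1
    return {phrase: 100.0 * n / total for phrase, n in counts.items()}


def normalize_by_top_score(phrases):
    """Divide each score by the score of the first (highest-scoring) key phrase.

    phrases: list of (phrase, score) pairs in Extractor's output order.
    """
    if not phrases:
        return []
    top = phrases[0][1]
    return [(phrase, 100.0 * score / top) for phrase, score in phrases]


def length_adjusted_score(normalized_score, word_count):
    """Multiply a normalized score by the logarithm of the document length."""
    return normalized_score * math.log(word_count)


extracted = [("corporate merger", 50), ("stocks", 30), ("bonds", 10)]
print(normalize_by_top_score(extracted))
# [('corporate merger', 100.0), ('stocks', 60.0), ('bonds', 20.0)]
```

Note that the length-adjusted score no longer lies between 0% and 100%; it is only meaningful for ranking key phrases across documents.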

Can I find the frequency of each key phrase in the document?
Although Extractor calculates the frequency of each key phrase in the input document, the API does not currently enable access to these numbers. If you are using the frequency as an indicator of the importance of the key phrase, then you should consider using the score instead.


Given a sentence such as, "I am not skiing today," why does Extractor select "skiing" as a key phrase instead of "not skiing"?
The intention of Extractor is to capture the main topics that are discussed in the input document. Extractor does not attempt to convey exactly how these topics are discussed. For example, if a document discusses legal issues concerning guns, Extractor might suggest the key phrase "gun law". This key phrase does not indicate whether the document supports strict legal control of guns or opposes any government involvement in gun control. The design of Extractor was based on a study of how authors use key phrases. We have examined several thousand documents with key phrases supplied by their authors. None of the key phrases we have seen so far include the word "not".


I want to use Extractor for automatic document classification. Can you help me?
Automatic document classification is the use of software to sort documents into various pre-defined categories. A similar task is automatic document clustering, in which there are no pre-defined categories, so the software must create the categories by itself. If you want to learn more about automatic document classification and clustering, there is a hypertext Bibliography on Machine Learning Applied to Text. Extractor can be used to generate features for use in feature vectors for machine learning algorithms. (If you are not familiar with this terminology, it should become clear to you as you read the papers in the bibliography.) If you wish to use Extractor to generate feature vectors, we suggest the following approach:

  • Apply Extractor to all of the documents in your sample collection.
  • Take the union of all of the extracted key phrases as the feature set.
  • For each document and each feature, let the value of the feature be the number of times that the given phrase occurs in the given document (regardless of whether Extractor extracted it from the given document).
  • Apply your favorite machine learning algorithm (e.g., decision tree induction, neural network, genetic algorithm, etc.) to the resulting feature vectors.
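As a rough illustration of the middle two steps, the following Python sketch builds the feature set and the feature vectors. It assumes that Extractor has already been run and that its key phrases are available as one list per document; the function names are hypothetical. The occurrence count is a simple case-insensitive substring count, which is cruder than proper tokenization (it will also match inside longer words).

```python
def count_occurrences(text, phrase):
    """Case-insensitive count of non-overlapping occurrences of phrase in text."""
    return text.lower().count(phrase.lower())


def build_feature_vectors(documents, extracted):
    """Build one feature vector per document.

    documents: list of document texts.
    extracted: list of lists of key phrases, one list per document.
    Returns (features, vectors), where vectors[d][f] is the number of times
    features[f] occurs in documents[d], whether or not it was extracted there.
    """
    # The feature set is the union of all extracted key phrases.
    features = sorted({p.lower() for phrases in extracted for p in phrases})
    vectors = [[count_occurrences(doc, f) for f in features] for doc in documents]
    return features, vectors


documents = [
    "Gun law debates. The gun law passed.",
    "Stocks and bonds; stocks rose.",
]
extracted = [["gun law"], ["stocks", "bonds"]]
features, vectors = build_feature_vectors(documents, extracted)
print(features)  # ['bonds', 'gun law', 'stocks']
print(vectors)   # [[0, 2, 0], [1, 0, 2]]
```

The resulting vectors can be passed, together with the category labels, to whichever learning algorithm you prefer.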

How can I combine key phrases that were extracted from many different documents?
For some applications, you may wish to have a list of key phrases that covers a whole collection of documents, where each document has been processed individually by Extractor. If you have no constraints on the size of the list of key phrases, you might simply take the union of all of the phrases as your combined list. To reduce the size of the list slightly, you might keep only one phrase from each group of phrases that share the same stem (e.g., "automobile" and "automobiles"). If you want to substantially reduce the size of the list, then you can assign a normalized score to each key phrase and select the key phrases with the highest normalized scores.
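Here is a minimal Python sketch of this kind of combination. It assumes that each document's key phrases have already been given normalized scores, and it uses a deliberately crude stand-in for a stemmer (lowercasing and stripping a final "s"); a real application would probably use a proper stemming algorithm. All of the names are illustrative.

```python
def crude_stem(phrase):
    """Rough stand-in for a stemmer: lowercase and strip a trailing 's' from longer words."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def combine_key_phrases(per_document, limit=None):
    """Merge key phrases from many documents into one ranked list.

    per_document: one list of (phrase, normalized_score) pairs per document.
    Phrases with the same crude stem are merged, keeping the best-scoring one.
    limit: optional maximum size of the combined list.
    """
    best = {}
    for phrases in per_document:
        for phrase, score in phrases:
            key = crude_stem(phrase)
            if key not in best or score > best[key][1]:
                best[key] = (phrase, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


per_document = [
    [("automobile", 100.0), ("engine", 40.0)],
    [("highways", 90.0), ("automobiles", 80.0)],
]
print(combine_key_phrases(per_document, limit=2))
# [('automobile', 100.0), ('highways', 90.0)]
```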


Can Extractor handle language X?
Extractor currently works with monolingual documents in English, French, Japanese, German, Spanish, or Korean.


Can Extractor handle document format X?
Extractor currently handles plain text, HTML, and email. The HTML filter handles HTML escape sequences for accents and ISO Latin-1 HTML character entities. The email filter handles MIME quoted-printable accents. If you are developing software which must handle other formats, there are several companies that offer conversion modules that can be embedded in your software.


Can Extractor handle character encoding X?
For English, French, German, and Spanish, Extractor currently handles ISO Latin-1, MS-DOS Code Page 437, and Unicode UCS-2 double-byte character codes, using native byte ordering. There is a choice of four Japanese character encodings: JIS, Shift-JIS, EUC-JP, and Unicode UCS-2. There is a choice of three Korean character encodings: EUC-KR, Johab, and Unicode UCS-2.


How can I generate 100 key phrases?
Extractor currently allows the user to specify from 3 to 30 key phrases. For some applications, you may wish to have more key phrases. One solution is to break the document into smaller sections and pass each section to Extractor.

Suppose we gave you a book and asked you to give us a list of key phrases that capture the main topics of the book. When your list approached 30 key phrases, we think you would be struggling to think of more key phrases. It seems likely that there are fewer than 30 "main topics" for most books. Perhaps an average book only has 10 or 15 "main topics", but you could cover each topic with 2 or 3 synonymous key phrases, to yield a total of about 30 key phrases.

On the other hand, if we took any single chapter from the same book, and asked you to give us a list of key phrases that capture the main topics of the chapter, we think the list would be approximately the same size as the list you would give us for the whole book. A key phrase that captures the "main topic" of the chapter might only capture a "minor topic" of the whole book. So the union of the key phrases for each chapter would be a superset of the key phrases for the whole book.

This is why Extractor has a maximum of 30 key phrases per "chunk". If you want more key phrases, then you can break the document into smaller "chunks" and take the union of the key phrases for each individual "chunk". We believe that this strategy will produce a superior list to the strategy of treating the document as a single, homogeneous whole.
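The chunking strategy might be sketched as follows in Python. The extract argument is a hypothetical function that you would supply yourself to wrap your call to Extractor; it is not part of the Extractor API. Splitting by a fixed number of words is also just an assumption made for brevity; splitting at chapter or section boundaries would fit the reasoning above better.

```python
def split_into_chunks(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def key_phrases_for_long_document(text, extract, words_per_chunk=2000):
    """Take the union of the key phrases of each chunk, in order of first appearance.

    extract: hypothetical callable that takes a chunk of text and returns
    the list of key phrases that Extractor produced for that chunk.
    """
    combined = []
    seen = set()
    for chunk in split_into_chunks(text, words_per_chunk):
        for phrase in extract(chunk):
            if phrase.lower() not in seen:  # skip phrases already found in another chunk
                seen.add(phrase.lower())
                combined.append(phrase)
    return combined
```

With a maximum of 30 key phrases per chunk, a document split into four chunks can yield up to 120 distinct key phrases.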

 
When I give a document to Extractor and ask for four key phrases and then take the same document and ask for seven key phrases, the four key phrases are not always a subset of the seven key phrases. Why?
This is explained in detail in Learning to Extract Keyphrases from Text. If it is important for your application that the four key phrases that you get when you ask for four key phrases should be the same as the first four key phrases that you get when you ask for seven key phrases, then ask for seven key phrases but only take the first four. In general, if you currently want M key phrases but you might eventually want N key phrases (where N > M), then ask Extractor for N key phrases, but only take the first M key phrases. Better yet, store all N key phrases, so you can later look up the remaining N - M key phrases instead of running Extractor twice.


I want Extractor to generate exactly N highlights (key sentences). I know that I can set the number of key phrases, but how do I set the number of highlights?
Extractor currently allows the user to specify from 3 to 30 key phrases (key concepts). If you have set the highlight type to allow duplicates, then the number of highlights (key sentences) will be the same as the specified number of key phrases. For each key phrase, there will be a matching highlight, showing the key phrase in context. (There might be fewer highlights than key phrases, if Extractor was not able to find a good highlight for a certain key phrase.) However, if you set the highlight type to remove duplicates, then there will usually be fewer highlights than key phrases (because two or more different key phrases may be best illustrated by the same key sentence). On average, when duplicate highlights are removed, if you specify K key phrases, then you will get approximately N = 0.6 × K highlights. If you require exactly N highlights, with duplicate highlights removed, here are some options:

  • Ask for K = 2 × N key phrases. On average, you will get about 0.6 × 2 × N = 1.2 × N highlights. Set the highlight type to remove duplicates and to sort the highlights by order of appearance in the text. Take the first N highlights as your desired key sentences. If there are more highlights available, ignore them. If there are not enough highlights available, try asking for K = 2.5 × N key phrases. If K = 2.5 × N is greater than 30, then break the document into smaller sections and pass each section to Extractor.
  • Alternatively, ask for K = 3 × N key phrases. Set the highlight type to allow duplicates. In general, you will get 3 × N highlights, with duplicates. The i-th highlight shows the i-th key phrase in the context of a sentence. Find the score of the i-th key phrase and use this score as a measure of the quality of the corresponding highlight. When there are duplicate highlights, score the highlight by the maximum of the scores of each copy of the highlight. Output the top N scoring highlights.
  • Proceed as in the previous suggestion, but when there are duplicate highlights, score the highlight by the sum of the scores of each copy of the highlight. Output the top N scoring highlights.
  • Proceed as in the previous suggestion, but when there are duplicate highlights, score the highlight by the number of copies of the highlight. For example, if there are three copies of a certain sentence, then that sentence gets a score of three. In other words, the score of a highlight is the number of key phrases that it contains. Output the top N scoring highlights.
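The last three options differ only in how the scores of duplicate highlights are combined, so they can share one routine. The Python sketch below assumes that you have already collected, for each key phrase, its highlight and its score into two parallel lists (with an empty string where Extractor found no highlight); the function name and the method argument are our own inventions.

```python
def top_highlights(highlights, scores, n, method="max"):
    """Return the n best highlights after merging duplicates.

    highlights[i] is the key sentence for the i-th key phrase ("" if none).
    scores[i] is the score of the i-th key phrase.
    method: "max", "sum", or "count" (how duplicate highlights are scored).
    """
    combined = {}
    for sentence, score in zip(highlights, scores):
        if not sentence:
            continue  # no highlight was found for this key phrase
        if method == "max":
            combined[sentence] = max(combined.get(sentence, 0.0), score)
        elif method == "sum":
            combined[sentence] = combined.get(sentence, 0.0) + score
        elif method == "count":
            combined[sentence] = combined.get(sentence, 0) + 1
        else:
            raise ValueError("unknown method: " + method)
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:n]


highlights = ["S1", "S2", "S3", "S2", "S3", "S3"]
scores = [50, 30, 28, 25, 10, 5]
print(top_highlights(highlights, scores, 2, "max"))    # ['S1', 'S2']
print(top_highlights(highlights, scores, 2, "sum"))    # ['S2', 'S1']
print(top_highlights(highlights, scores, 2, "count"))  # ['S3', 'S2']
```

As the example shows, the three scoring methods can select different sentences, so it is worth trying each of them on a sample of your own documents.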


In my input document, I repeatedly use the word "X". It is a very important word, and I use it early and frequently. Yet Extractor does not recognize it as a key phrase. Why?
There are several possibilities. First, Extractor ignores words with fewer than three letters. Second, your word "X" might be in the stop word list. Third, your word "X" might be in the stop phrase list. You cannot remove a word from the stop word or stop phrase lists through the API. You will need access to the source code if you wish to remove words or phrases from the stop word or stop phrase lists. You also cannot modify the minimum required word length (three letters) through the API. However, your issue might be addressed by adding "X" to the list of go phrases. A go phrase will be found, when it appears in the input document, even if it also appears in the stop word or stop phrase lists.


In our documents, we have phrases with four or more words. What does Extractor do? Is there a limit to the number of words in a key phrase?
Extractor is designed to extract key phrases with one, two, or three words. We have collected thousands of documents with key phrases supplied by the authors, and authors only create key phrases with four or more words about 5% of the time. When we try to include phrases with four or more words, we can cover a few more of the authors' key phrases, but we also introduce a few more errors. Since there is a net loss, Extractor does not attempt to cover these longer phrases. There are two things you might try, if you really need to capture these longer phrases:

  • If Extractor outputs a phrase of the form "A B C" and a phrase of the form "B C D", then you can conjecture that these are parts of a longer phrase "A B C D", and join them together. For example, "National Research Council" and "Research Council Canada" would be joined to make "National Research Council Canada".
     
  • If you activate the highlights feature (key sentences) and set the highlight feature to mark key phrases in bold, the bold marking will include phrases of four or more words. You can then extract, from the highlights, the phrases that are marked in bold, by writing your own routine to process the output highlights.
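The first of these two ideas is easy to try. Here is a minimal Python sketch, assuming the key phrases are already available as a list of strings; the function name is illustrative. It joins every pair of three-word phrases where the last two words of one match the first two words of the other, ignoring case.

```python
def join_overlapping(phrases):
    """Join "A B C" and "B C D" into the conjectured longer phrase "A B C D"."""
    joined = []
    for left in phrases:
        left_words = left.split()
        if len(left_words) != 3:
            continue
        for right in phrases:
            right_words = right.split()
            if right == left or len(right_words) != 3:
                continue
            # Compare "B C" at the end of left with "B C" at the start of right.
            left_tail = [w.lower() for w in left_words[1:]]
            right_head = [w.lower() for w in right_words[:2]]
            if left_tail == right_head:
                joined.append(" ".join(left_words + right_words[2:]))
    return joined


phrases = ["National Research Council", "Research Council Canada", "stocks"]
print(join_overlapping(phrases))  # ['National Research Council Canada']
```

Since the joined phrase is only a conjecture, you may wish to confirm that it actually occurs in the input document before using it.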

I use programming language X. Is there a way to call the Extractor API from within language X?
The Extractor API is written in ISO/ANSI C. Whatever programming language you use, it is almost certain that there is a way for your language to call an external C program. If you are programming in C or C++, you will have no problems calling Extractor. If you are programming in Java, Perl, Python, or Visual Basic, we have some experience with calling Extractor from these languages. Please contact us for help.