Frequently Asked Questions ...


Why does Extractor ignore the META tag in HTML?
The META tag in HTML is used to convey meta-information about the document, for example:
<META HTTP-EQUIV="Expires" CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">
<META HTTP-EQUIV="Reply-to" CONTENT="dsr@w3.org (Dave Raggett)">
Extractor ignores this meta-information. In particular, it does not use the "Keywords" meta-information. Extractor ignores the META tag for two reasons: (1) If you really care about the META tag, then you can easily write your own subroutine to parse it. (2) The META tag is widely abused. It is mainly used as a device for tricking search engines into giving a page a higher ranking in a hit list when a user enters a query. If you search for the word "meta", you will find many web pages that give web authors tips on how to fool search engines by using the META tag.
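For reason (1), here is a minimal sketch of such a subroutine in Python, using only the standard library's html.parser module. The names MetaKeywordParser and meta_keywords are illustrative and are not part of Extractor. The sketch also accepts NAME="Keywords" in addition to HTTP-EQUIV="Keywords", which is an assumption beyond the example above.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects keywords from META tags labelled "Keywords" (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names and attribute names before calling us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Assumption: treat NAME="Keywords" the same as HTTP-EQUIV="Keywords".
        label = attr.get("http-equiv") or attr.get("name") or ""
        if label.lower() == "keywords":
            for item in attr.get("content", "").split(","):
                item = item.strip()
                if item:
                    self.keywords.append(item)


def meta_keywords(html_text):
    """Return the comma-separated keywords found in the document's META tags."""
    parser = MetaKeywordParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords
```

Calling meta_keywords() on the second META line in the example above would give the two entries "Nanotechnology" and "Biochemistry". Given the abuse described in reason (2), you may want to treat the result as a hint rather than as trusted input.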

 

What is the meaning of the numbers associated with each key phrase?
When you run the sample program "test_api.exe" (or "test_api.bin" for Unix platforms), each key phrase will be output with a number after it. These numbers are the scores returned by the API function ExtrGetScoreByIndex(). The score of a phrase is an estimate of its value as a key phrase. Key phrases are ranked in order of descending score. A score can be any positive real number. Scores for long input documents tend to be higher than scores for short ones. The method of calculating the score is described in detail in Learning to Extract Keyphrases from Text. For some applications, it might be desirable to normalize the score.

 

How can I normalize the score?
For some applications, it might be desirable to normalize the score, so that the scores of key phrases from different documents can be compared. Here are some suggestions for normalization:

 

Can I find the frequency of each key phrase in the document?
Although Extractor calculates the frequency of each key phrase in the input document, the API does not currently enable access to these numbers. If you are using the frequency as an indicator of the importance of the key phrase, then you should consider using the score instead.

 

Given a sentence such as, "I am not skiing today," why does Extractor select "skiing" as a key phrase instead of "not skiing"?
The intention of Extractor is to capture the main topics that are discussed in the input document. Extractor does not attempt to convey exactly how these topics are discussed. For example, if a document discusses legal issues concerning guns, Extractor might suggest the key phrase "gun law". This key phrase does not indicate whether the document supports strict legal control of guns or opposes any government involvement in gun control. The design of Extractor was based on a study of how authors use key phrases. We have examined several thousand documents with key phrases supplied by their authors. None of the key phrases we have seen so far include the word "not".

 

I want to use Extractor for automatic document classification. Can you help me?
Automatic document classification is the use of software to sort documents into various pre-defined categories. A similar task is automatic document clustering, in which there are no pre-defined categories, so the software must create the categories by itself. If you want to learn more about automatic document classification and clustering, there is a hypertext Bibliography on Machine Learning Applied to Text. Extractor can be used to generate features for use in feature vectors for machine learning algorithms. (If you are not familiar with this terminology, it should become clear to you as you read the papers in the bibliography.) If you wish to use Extractor to generate feature vectors, we suggest the following approach:

 

How can I combine key phrases that were extracted from many different documents?
For some applications, you may wish to have a list of key phrases that covers a whole collection of documents, where each document has been processed individually by Extractor. If you have no constraints on the size of the list of key phrases, you might simply take the union of all of the phrases as your combined list. To reduce the size of the list slightly, you might drop words that have the same stem (e.g., "automobile" and "automobiles"). If you want to substantially reduce the size of the list, then you can assign a normalized score to each key phrase and select the key phrases with the highest normalized scores.
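The steps in this answer can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is a list of (phrase, score) pairs per document that you have already collected from Extractor, the normalization divides each score by the highest score in the same document (one possible choice, not necessarily the one we recommend), and crude_stem is a rough stand-in for a real stemmer. All function names are hypothetical.

```python
def crude_stem(phrase):
    """Rough stand-in for a stemmer: lower-case, strip a trailing "s" from longer words."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def combine_key_phrases(documents, limit=None):
    """Merge per-document key phrase lists into one ranked list.

    documents: list of lists of (phrase, score) pairs, one inner list per document.
    limit: optional maximum size of the combined list.
    """
    best = {}  # stem -> (normalized score, phrase that achieved it)
    for phrases in documents:
        if not phrases:
            continue
        # Assumption: normalize by the top score in the same document.
        top = max(score for _, score in phrases)
        for phrase, score in phrases:
            normalized = score / top
            stem = crude_stem(phrase)
            # Keep one phrase per stem, preferring the higher normalized score.
            if stem not in best or normalized > best[stem][0]:
                best[stem] = (normalized, phrase)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    if limit is not None:
        ranked = ranked[:limit]
    return [phrase for _, phrase in ranked]
```

With limit left as None the function returns the stem-deduplicated union. Passing a limit keeps only the phrases with the highest normalized scores.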

 

Can Extractor handle language X?
Extractor currently works with monolingual documents in English, French, Japanese, German, Spanish, or Korean.

 

Can Extractor handle document format X?
Extractor currently handles plain text, HTML, and email. The HTML filter handles HTML escape sequences for accents and ISO Latin-1 HTML character entities. The email filter handles MIME quoted-printable accents. If you are developing software which must handle other formats, there are several companies that offer conversion modules that can be embedded in your software.

 

Can Extractor handle character encoding X?
For English, French, German, and Spanish, Extractor currently handles ISO Latin-1, MS-DOS Code Page 437, and Unicode UCS-2 double-byte character codes, using native byte ordering. There is a choice of four Japanese character encodings: JIS, Shift-JIS, EUC-JP, and Unicode UCS-2. There is a choice of three Korean character encodings: EUC-KR, Johab, and Unicode UCS-2.

 

How can I generate 100 key phrases?
Extractor currently allows the user to specify from 3 to 30 key phrases. For some applications, you may wish to have more key phrases. One solution is to break the document into smaller sections and pass each section to Extractor.

Suppose we gave you a book and asked you to give us a list of key phrases that capture the main topics of the book. When your list approached 30 key phrases, we think you would be struggling to think of more key phrases. It seems likely that there are fewer than 30 "main topics" for most books. Perhaps an average book only has 10 or 15 "main topics", but you could cover each topic with 2 or 3 synonymous key phrases, to yield a total of about 30 key phrases.

On the other hand, if we took any single chapter from the same book, and asked you to give us a list of key phrases that capture the main topics of the chapter, we think the list would be approximately the same size as the list you would give us for the whole book. A key phrase that captures the "main topic" of the chapter might only capture a "minor topic" of the whole book. So the union of the key phrases for each chapter would be a superset of the key phrases for the whole book.

This is why Extractor has a maximum of 30 key phrases per "chunk". If you want more key phrases, then you can break the document into smaller "chunks" and take the union of the key phrases for each individual "chunk". We believe that this strategy will produce a superior list to the strategy of treating the document as a single, homogeneous whole.
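The chunking strategy can be sketched in Python as follows. The extract argument is a hypothetical wrapper that you would write around the Extractor API, taking a chunk of text and a phrase count and returning a list of key phrases; it is not a function that Extractor provides. Splitting on blank lines and grouping 20 paragraphs per chunk are arbitrary assumptions; chapter or section boundaries would be a more natural choice when they are available.

```python
def split_into_chunks(text, paragraphs_per_chunk=20):
    """Split text on blank lines and group the paragraphs into chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]


def key_phrases_by_chunk(text, extract, paragraphs_per_chunk=20, per_chunk=30):
    """Return the union of the key phrases found in each chunk.

    extract(chunk, count) is a hypothetical caller-supplied wrapper around
    Extractor that returns a list of key phrases for one chunk.
    """
    combined = []
    seen = set()
    for chunk in split_into_chunks(text, paragraphs_per_chunk):
        for phrase in extract(chunk, per_chunk):
            key = phrase.lower()  # treat case variants as the same phrase
            if key not in seen:
                seen.add(key)
                combined.append(phrase)
    return combined
```

The combined list preserves the order in which phrases were first found, so phrases from early chunks come first.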

 

When I give a document to Extractor and ask for four key phrases and then take the same document and ask for seven key phrases, the four key phrases are not always a subset of the seven key phrases. Why?
This is explained in detail in Learning to Extract Keyphrases from Text. If it is important for your application that the four key phrases that you get when you ask for four key phrases should be the same as the first four key phrases that you get when you ask for seven key phrases, then ask for seven key phrases but only take the first four. In general, if you currently want M key phrases but you might eventually want N key phrases (where N > M), then ask Extractor for N key phrases, but only take the first M key phrases. Better yet, store all N key phrases, so you can later look up the remaining N - M key phrases instead of running Extractor twice.

 

I want Extractor to generate exactly N highlights (key sentences). I know that I can set the number of key phrases, but how do I set the number of highlights?
Extractor currently allows the user to specify from 3 to 30 key phrases (key concepts). If you have set the highlight type to allow duplicates, then the number of highlights (key sentences) will be the same as the specified number of key phrases. For each key phrase, there will be a matching highlight, showing the key phrase in context. (There might be fewer highlights than key phrases, if Extractor was not able to find a good highlight for a certain key phrase.) However, if you set the highlight type to remove duplicates, then there will usually be fewer highlights than key phrases (because two or more different key phrases may be best illustrated by the same key sentence). On average, when duplicate highlights are removed, if you specify K key phrases, then you will get approximately N = 0.6 × K highlights. If you require exactly N highlights, with duplicate highlights removed, here are some options:
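As an illustration of the arithmetic only, the 0.6 ratio can be inverted to estimate how many key phrases to request for a desired number of highlights, and any surplus highlights can then be dropped. This sketch is an assumption about one workable approach, and the function names are hypothetical. Because 0.6 is an average, the actual number of highlights for a given document may still fall short of N.

```python
MIN_PHRASES = 3   # lower limit on key phrases stated above
MAX_PHRASES = 30  # upper limit on key phrases stated above


def phrases_to_request(wanted_highlights):
    """Estimate K from N using N = 0.6 * K, i.e. K = ceil(5 * N / 3)."""
    # Integer arithmetic avoids floating-point rounding surprises.
    estimate = (5 * wanted_highlights + 2) // 3
    return max(MIN_PHRASES, min(MAX_PHRASES, estimate))


def trim_highlights(highlights, wanted_highlights):
    """Keep at most the first N highlights from the list Extractor returned."""
    return highlights[:wanted_highlights]
```

For N = 6 the estimate is K = 10. For N greater than 18 the estimate exceeds 30 and is clamped to the maximum, so on average fewer than N highlights would be returned in that case.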

 

In my input document, I repeatedly use the word "X". It is a very important word, and I use it early and frequently. Yet Extractor does not recognize it as a key phrase. Why?

There are several possibilities. First, Extractor ignores words with fewer than three letters. Second, your word "X" might be in the stop word list. Third, your word "X" might be in the stop phrase list. You cannot remove a word from the stop word or stop phrase lists through the API. You will need access to the source code if you wish to remove words or phrases from the stop word or stop phrase lists. You also cannot modify the minimum required word length (three letters) through the API. However, your issue might be addressed by adding "X" to the list of go phrases. A go phrase will be found, when it appears in the input document, even if it also appears in the stop word or stop phrase lists.

 

In our documents, we have phrases with four or more words. What does Extractor do? Is there a limit to the number of words in a key phrase?
Extractor is designed to extract key phrases with one, two, or three words. We have collected thousands of documents with key phrases supplied by the authors, and authors only create key phrases with four or more words about 5% of the time. When we try to include phrases with four or more words, we can cover a few more of the authors' key phrases, but we also introduce a few more errors. Since there is a net loss, Extractor does not attempt to cover these longer phrases. There are two things you might try, if you really need to capture these longer phrases:

 

I use programming language X. Is there a way to call the Extractor API from within language X?
The Extractor API is written in ISO/ANSI C. Whatever programming language you use, it is almost certain that there is a way for your language to call an external C program. If you are programming in C or C++, you will have no problems calling Extractor. If you are programming in Java, Perl, Python, or Visual Basic, we have some experience with calling Extractor from these languages. Please contact us for help.
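To illustrate the general mechanism, here is a minimal Python sketch that calls a C function through the standard ctypes module. It deliberately calls strlen() from the C runtime as a stand-in, because the name of the Extractor shared library and the exact function signatures depend on your distribution and are not shown here. Loading the runtime with CDLL(None) assumes a POSIX system, and "msvcrt" assumes Windows.

```python
import ctypes
import os

# Stand-in: load the C runtime rather than the Extractor library itself.
if os.name == "nt":
    libc = ctypes.CDLL("msvcrt")  # assumption: Windows C runtime
else:
    libc = ctypes.CDLL(None)      # symbols already loaded into this process (POSIX)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_string_length(text):
    """Pass a Python string to C as ISO Latin-1 bytes and return its length."""
    return libc.strlen(text.encode("latin-1"))
```

Calling a real Extractor function would follow the same pattern: load the library, declare the argument and return types from the API header, and convert Python strings to bytes in one of the supported encodings.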