The Extractor API allows several documents to be processed simultaneously, using a separate thread for each document. This is useful, for example, when processing web pages. A major bottleneck when downloading web pages is waiting for web servers to respond to requests for pages. One way around this bottleneck is to download several pages simultaneously, using a separate thread to process each page.

Extractor is fully reentrant, to allow multithreading without the use of Win32 services such as semaphores and the EnterCriticalSection and LeaveCriticalSection functions. There should be a one-to-one relationship between threads and DocumentMemory values, so only one thread reads or writes to a given DocumentMemory. On the other hand, there may be a many-to-one relationship between threads and StopMemory values. That is, many threads may simultaneously read one StopMemory.

Most functions that take StopMemory as an argument only read StopMemory; they do not write. This is why many threads can safely access the same StopMemory. However, the functions ExtrAddStopWord and ExtrAddStopPhrase write StopMemory. These two functions should be called (one after the other; not at the same time) before any other threads access StopMemory. If one thread calls ExtrAddStopWord or ExtrAddStopPhrase with a given value of StopMemory while a second thread calls any function with the same value of StopMemory, the memory may become corrupted.


ExtrCreateDocumentMemory

This function creates a block of memory for storing data about a single document. It returns a pointer value that is a unique identifier for this block of memory. This pointer is later passed to any other functions that process the given document.

A document is processed as a sequence of memory blocks, by calling ExtrReadDocumentBuffer. A typical document will involve multiple calls to ExtrReadDocumentBuffer. Each call updates the state of the memory that is reserved for processing the given document, DocumentMemory.

In a typical application with multiple threads, there will be a one-to-one relationship between threads and DocumentMemory values, and also between DocumentMemory values and individual documents. On the other hand, threads may share StopMemory values, depending on whether it makes sense to use the same stop words and stop phrases for all of the documents that are currently being processed.


ExtrCreateStopMemory

This function creates a block of memory for storing stop words and stop phrases. It returns a pointer value in StopMemory that is a unique identifier for this block of memory. This pointer is later passed to any other functions that use the stop words or stop phrases.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.
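The difference between the two kinds of rejection can be illustrated with a small sketch. The helper names below are hypothetical and are not part of the Extractor API; the sketch assumes single-byte phrases whose words are separated by single spaces, and it only mirrors the rule described above, not Extractor's internal implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `phrase` contains `stop_word` as a whole word.
   Words in `phrase` are assumed to be separated by single spaces. */
bool contains_stop_word(const char *phrase, const char *stop_word)
{
    assert(phrase != NULL && stop_word != NULL);
    size_t n = strlen(stop_word);
    const char *p = phrase;
    while (*p != '\0') {
        const char *end = strchr(p, ' ');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len == n && strncmp(p, stop_word, n) == 0)
            return true;            /* one word of the phrase is the stop word */
        p += len;
        if (*p == ' ')
            p++;                    /* skip the separating space */
    }
    return false;
}

/* Hypothetical helper: true only if the whole phrase equals the stop phrase. */
bool matches_stop_phrase(const char *phrase, const char *stop_phrase)
{
    assert(phrase != NULL && stop_phrase != NULL);
    return strcmp(phrase, stop_phrase) == 0;
}
```

With "access" as a stop word, contains_stop_word("information access", "access") is true, so the phrase would be rejected. With "access" as a stop phrase, matches_stop_phrase("information access", "access") is false, so the phrase would be kept.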

Calling ExtrCreateStopMemory will initialize the stop word list with some standard stop words (including "the", for example). The standard list may be extended by calling ExtrAddStopWord or ExtrAddStopPhrase.


ExtrActivateHighlights

A highlight is a key sentence. This function activates the highlight extraction feature for DocumentMemory. By default, it is assumed that the user does not want highlight extraction. ExtrActivateHighlights should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateHighlights is that the functions ExtrGetHighlightListSize and ExtrGetHighlightByIndex will return some highlights selected by Extractor.

Extractor attempts to find one key sentence for each keyphrase that it finds. For a given keyphrase, it is possible that Extractor may not be able to find a good example of a sentence that contains the keyphrase. The function ExtrGetHighlightListSize will return the number of highlights that were generated. This number is always less than or equal to the number of keyphrases that were generated, as given by ExtrGetPhraseListSize.


ExtrActivateHTMLFilter

This function signals that the document DocumentMemory contains HTML tags. By default, it is assumed that the document does not contain HTML tags. ExtrActivateHTMLFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateHTMLFilter is that HTML tags will be parsed. Most tags are ignored, but some tags are used to identify sentence boundaries.

The HTML filter will also convert special symbol codes to the symbols that they represent. For example, "&eacute;" will be converted to "é".


ExtrActivateEmailFilter

This function signals that the document DocumentMemory contains an e-mail header. By default, it is assumed that the document does not contain an e-mail header. ExtrActivateEmailFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateEmailFilter is that the e-mail header will be ignored, except for the "Subject" field.

Many e-mail gateways cannot handle 8 bit character codes. Often 8 bit character codes will be converted to 7 bit codes, for safe mailing. The e-mail filter will convert MIME quoted-printable 7 bit character codes back to 8 bit codes.
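To make the conversion concrete, here is a minimal sketch of quoted-printable decoding as it is commonly defined for MIME: "=XX" (two hexadecimal digits) stands for one 8 bit byte, and "=" at the end of a line is a soft line break. The function names are hypothetical and this is not Extractor's own code; it only illustrates the kind of work the e-mail filter does.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decodes the NUL-terminated quoted-printable string
   `in` into `out`, which must hold at least strlen(in) + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t qp_decode(const char *in, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    while (*in != '\0') {
        if (in[0] == '=' && in[1] == '\r' && in[2] == '\n') {
            in += 3;                              /* soft line break (CRLF) */
        } else if (in[0] == '=' && in[1] == '\n') {
            in += 2;                              /* soft line break (LF)   */
        } else if (in[0] == '=' && hex_value((unsigned char)in[1]) >= 0
                                && hex_value((unsigned char)in[2]) >= 0) {
            out[n++] = (unsigned char)(hex_value((unsigned char)in[1]) * 16
                                     + hex_value((unsigned char)in[2]));
            in += 3;                              /* "=XX" becomes one byte */
        } else {
            out[n++] = (unsigned char)*in++;      /* ordinary 7 bit byte    */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the 7 bit text "caf=E9" decodes to four bytes, the last of which is E9 (hex), the ISO-8859-1 code for "é".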

The e-mail filter understands MIME types. E-mail attachments will be treated according to their MIME types. Keyphrases will be extracted from plain text and HTML attachments. Other types of attachments will be ignored. The HTML filter will be automatically activated if the MIME type indicates that the attachment is HTML. Therefore ExtrActivateHTMLFilter should not be called by the user when processing e-mail.

Note: Activating the e-mail filter with Japanese or Korean text will have no effect.


ExtrDeactivateTextFilter

This function deactivates the plain text filter for DocumentMemory. By default, when the following conditions are met, the input document is assumed to be plain text:

  • the HTML filter has not been activated
  • the email filter has not been activated
  • the language has not been set to Japanese
  • the language has not been set to Korean

When these conditions are met, the plain text filter is activated. The plain text filter will attempt to remove non-textual items from the input document, such as tables and addresses. It will also attempt to use white space to determine the boundaries between titles, section headings, and regular paragraphs. If you do not want the plain text filter to process the input document in these ways, then call ExtrDeactivateTextFilter. Since calling ExtrDeactivateTextFilter will affect how the document is read, it should be called before any calls to ExtrReadDocumentBuffer.

    If the input document contains tabs, the text filter may interpret the lines with tabs as table rows. These lines may be skipped. If you suspect that the text filter is skipping lines that should be processed, then try calling ExtrDeactivateTextFilter.

    Internally, Extractor uses the characters 1D (hex) to mark a phrase boundary and 1E (hex) to mark a sentence boundary. The text filter automatically inserts these characters in a plain text document, by analyzing the white space in the document (i.e., line feeds, blanks, tabs, and carriage returns). For example, if two lines are separated by several line feeds (significant vertical white space), then the text filter will remove the white space and insert a sentence boundary marker. This automatic process works well for most plain text documents, but you may wish to write your own filter for a certain type of input document (e.g., a certain type of word processor file). You can run the document through your own filter program, and then send the resulting plain text to Extractor. In this case, you should call ExtrDeactivateTextFilter, but do not call ExtrActivateHTMLFilter or ExtrActivateEmailFilter. Your filter program can help Extractor by inserting markers for phrase boundaries (1D) and sentence boundaries (1E) in the appropriate places.
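As an illustration of what a custom filter program might do, the sketch below replaces each run of two or more line feeds with a single sentence boundary marker (1E) and turns a lone line feed into a space. The function name and the exact rule are assumptions chosen for this example; the rules used by Extractor's own text filter are not documented here and are likely more elaborate.

```c
#include <assert.h>
#include <stddef.h>

#define PHRASE_BOUNDARY   0x1D   /* marker values taken from the text above */
#define SENTENCE_BOUNDARY 0x1E

/* Hypothetical filter step: copies `in` to `out`, which must hold at least
   strlen(in) + 1 bytes. A run of two or more line feeds becomes one sentence
   boundary marker; a single line feed becomes a space.
   Returns the number of bytes written, not counting the final NUL. */
size_t mark_blank_lines(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    while (*in != '\0') {
        if (*in == '\n') {
            size_t run = 0;
            while (*in == '\n') {      /* measure the run of line feeds */
                run++;
                in++;
            }
            out[n++] = (run >= 2) ? SENTENCE_BOUNDARY : ' ';
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

The output of such a step would then be passed to ExtrReadDocumentBuffer after calling ExtrDeactivateTextFilter, as described above.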


    ExtrSetInputCode

    A call to ExtrSetInputCode sets the document character code that Extractor uses to process the input text buffer. The character code is given by CharCodeID.

    CharCodeID  Character Code  Compatible languages              Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    • JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
    • ASCII
    • Halfwidth katakana

    The supported Korean character sets for all the Korean encodings are:

    • KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
    • KS X 2901 (for UNIX), Johab
    • ASCII

    ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

    Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).
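Because the byte order matters, an application that builds UCS2 buffers by hand may want to check the native ordering of its own platform. The following sketch uses only standard C; the function names are hypothetical and are not part of the Extractor API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if this machine stores the low-order byte first (little endian). */
bool is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);     /* look at the first byte in memory */
    return first == 1;
}

/* Stores one 16-bit UCS2 code unit into two bytes, in native byte order. */
void put_ucs2_native(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    memcpy(out, &unit, sizeof unit);
}
```

On a little endian machine, the character "é" (00E9 hex) is stored as the bytes E9 00; on a big endian machine it is stored as 00 E9.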


    ExtrSetOutputCode

    A call to ExtrSetOutputCode sets the document character code that Extractor uses for the output list of keyphrases. The character code is given by CharCodeID.

    CharCodeID  Character Code  Compatible languages              Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    • JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
    • ASCII
    • Halfwidth katakana

    The supported Korean character sets for all the Korean encodings are:

    • KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
    • KS X 2901 (for UNIX), Johab
    • ASCII

    ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

    Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).


    ExtrSetDocumentLanguage

    A call to ExtrSetDocumentLanguage sets the language that Extractor uses to process the input text buffer. The language is given by LanguageID.

    LanguageID  Language   Description
    0           Automatic  Let Extractor automatically detect the language (for English, French, German, Spanish).
    1           English    Force Extractor to interpret the document as English.
    2           French     Force Extractor to interpret the document as French.
    3           Japanese   Force Extractor to interpret the document as Japanese.
    4           German     Force Extractor to interpret the document as German.
    5           Spanish    Force Extractor to interpret the document as Spanish.
    6           Korean     Force Extractor to interpret the document as Korean.

    ExtrSetNumberPhrases

    This function sets the desired number of output phrases. The default number is seven. This is the number that will be generated on average; the actual number of phrases that are output for a given document may be slightly less or slightly more than the number specified by DesiredNumber. Note that DesiredNumber is only set for the given document DocumentMemory. This is so that several documents may be processed simultaneously, each with a different desired number of keyphrases.

    The DesiredNumber must be between 3 and 30. Values outside of this range will be converted to the closest value inside the range. No error message will be generated when values are out of range.
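The range conversion described above amounts to a simple clamp. The sketch below shows the equivalent arithmetic; the function name is hypothetical, and the limits 3 and 30 are the ones stated in the text.

```c
#include <assert.h>

enum { MIN_PHRASES = 3, MAX_PHRASES = 30 };

/* Hypothetical helper: returns the value that would effectively be used
   for a requested number of phrases, following the rule described above. */
int clamp_desired_number(int desired)
{
    int result = desired;
    if (result < MIN_PHRASES)
        result = MIN_PHRASES;      /* too small: use the lower limit */
    if (result > MAX_PHRASES)
        result = MAX_PHRASES;      /* too large: use the upper limit */
    assert(result >= MIN_PHRASES && result <= MAX_PHRASES);
    return result;
}
```

An application that needs to know when a request was out of range can compare its requested value with the clamped value, since no error message is generated.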

    This function is optional. There is no need to call it unless you wish to override the default value of seven phrases.


    ExtrSetHighlightType

    A highlight is a key sentence. If ExtrActivateHighlights has been called, then Extractor attempts to find one key sentence for each keyphrase that it finds. The ExtrSetHighlightType function sets the type (i.e., style) of highlight that is generated.


    ExtrAddStopWord

    This function adds the string Word to the list of stop words stored in the memory at StopMemory. The stop words are stored in a hash table. It does no harm to try to store the same word twice. It is assumed that Word is in lower case and that Word is a single word (containing no white space).

    Stop words are stored separately for each language. The language is given by LanguageID. ExtrAddStopWord will return a non-zero error code if LanguageID is invalid or if Word contains anything other than lower case characters.

    LanguageID  Language  Description
    1           English   Add the given stop word to the English stop words.
    2           French    Add the given stop word to the French stop words.
    4           German    Add the given stop word to the German stop words.
    5           Spanish   Add the given stop word to the Spanish stop words.
    6           Korean    Add the given stop word to the Korean stop words.

    The character code is given by CharCodeID. Word is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    ExtrAddStopWord should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

    When the stop word list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop words. It may not be necessary to add any extra stop words. That is, it may not be necessary to call ExtrAddStopWord.

    A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

    Note: At this time, you cannot add new stop words for Japanese text. However, you can add new Japanese stop phrases.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrAddStopPhrase

    This function adds the string Phrase to the list of stop phrases stored in the memory at StopMemory. The stop phrases are stored in a hash table. It does no harm to try to store the same phrase twice. It is assumed that Phrase is in lower case. Phrase may be one, two, or three words, separated by a single space.

    Stop phrases are stored separately for each language. The language is given by LanguageID. ExtrAddStopPhrase will return a non-zero error code if LanguageID is invalid or if Phrase contains anything other than lower case characters and spaces.

    LanguageID  Language  Description
    1           English   Add the given stop phrase to the English stop phrases.
    2           French    Add the given stop phrase to the French stop phrases.
    3           Japanese  Add the given stop phrase to the Japanese stop phrases.
    4           German    Add the given stop phrase to the German stop phrases.
    5           Spanish   Add the given stop phrase to the Spanish stop phrases.
    6           Korean    Add the given stop phrase to the Korean stop phrases.

    The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    • JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
    • ASCII
    • Halfwidth katakana

    The supported Korean character sets for all the Korean encodings are:

    • KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
    • KS X 2901 (for UNIX), Johab
    • ASCII

    When the stop phrase list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop phrases. It may not be necessary to add any extra stop phrases. That is, it may not be necessary to call ExtrAddStopPhrase.

    A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrAddGoPhrase

    If the input document was found by issuing a query to a search engine, the user may have a special interest in whether the query terms appear in the document, and the context in which the query terms appear. This can be achieved by calling the function ExtrAddGoPhrase with each of the terms in the query.

    This function adds the string Phrase to the list of go phrases stored in the memory at StopMemory. A go phrase is a phrase that will be treated as if it were a key phrase, if it appears in the input document. Go phrases are stored in a list and each sentence in the input document is scanned for each go phrase in the list. This has two important implications: (1) A large list of go phrases may slow the execution of Extractor. (2) A go phrase in the input document will not be detected if it spans a sentence boundary.

    A go phrase may consist of one or more words or fragments of words. Any character sequence is permitted, except for an empty string. The letters may be in upper or lower case. A go phrase may range from a single character to a full sentence. A go phrase may contain punctuation.

    Go phrases are stored separately for each language. The language is given by LanguageID. ExtrAddGoPhrase will return a non-zero error code if LanguageID is invalid or if CharCodeID is not compatible with LanguageID.

    The following types of matches are supported:

    When go phrases are found in the input document, they will be inserted at the top of the keyphrase list. They will take priority over the regular keyphrases. The length of the keyphrase list will be kept at the value set by ExtrSetNumberPhrases. For each go phrase that is added to the top of the keyphrase list, a regular keyphrase will be deleted from the bottom of the keyphrase list. (Note that Extractor ranks the keyphrases in order of decreasing estimated importance.) A go phrase can be distinguished from a regular keyphrase (a keyphrase generated automatically by Extractor) by its score. All go phrases are given a score of zero, but a regular keyphrase never has a score of zero.
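The effect on the keyphrase list can be pictured with a small model. The structure, the function names, and the use of a floating-point score are all assumptions made for this sketch; they do not describe Extractor's internal data structures. The model keeps the list length fixed, places the go phrase at the requested position with a score of zero, and drops the lowest-ranked regular keyphrase.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one entry in the keyphrase list. */
typedef struct {
    const char *text;
    double score;       /* assumed type; zero marks a go phrase */
} Keyphrase;

/* Inserts `go_phrase` at index `pos` of a list holding `count` entries.
   Entries from `pos` onward move down by one and the last entry is dropped,
   so the list length stays the same. */
void insert_go_phrase(Keyphrase *list, size_t count, size_t pos,
                      const char *go_phrase)
{
    assert(list != NULL && go_phrase != NULL && pos < count);
    memmove(&list[pos + 1], &list[pos],
            (count - pos - 1) * sizeof list[0]);
    list[pos].text = go_phrase;
    list[pos].score = 0.0;
}

/* A go phrase is recognized by its score of zero. */
bool is_go_phrase(const Keyphrase *k)
{
    assert(k != NULL);
    return k->score == 0.0;
}
```

An application that displays keyphrases could use a test like is_go_phrase to present query terms differently from the automatically generated keyphrases.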

    When a go phrase is found, it is inserted into the keyphrase list in exactly the same form as it was given to ExtrAddGoPhrase. This may be different from the form it has in the input document, depending on MatchType.

    If highlights have been activated (by ExtrActivateHighlights), then each go phrase that is found in the input document will have a corresponding highlight. Extractor attempts to find a good sentence to illustrate each go phrase. If bold markup is set (by ExtrSetHighlightType), then the go phrases will be marked in bold within the corresponding highlights. Neighbouring words and characters may also be marked in bold, if they appear to be closely connected to the go phrase.

    A go phrase might appear in the document, and yet not be found by Extractor. If the go phrase spans a sentence boundary, it will not be detected. For example, "home cooking" will not be found in the text "Pasta is popular in our home. Cooking pasta is easy." Also, if the input document is very long, Extractor may not read the full document, since it should be possible to make a good summary without reading the full text. Therefore, if the go phrase only appears at the end of a very long document, it might not be detected by Extractor. Finally, the number of go phrases that will be found is limited by the desired number of keyphrases, set by ExtrSetNumberPhrases. If the number of go phrases in the input document is greater than the desired number of keyphrases, then the go phrases that appear earlier in the text will be given priority.
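The sentence boundary limitation can be reproduced with a simplified model of per-sentence scanning. In this sketch, sentences are assumed to end at ".", "!" or "?", and matching ignores case; both are assumptions made for the illustration, since the actual sentence detection and match types used by Extractor are not described here. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Compares n bytes of a and b, ignoring case. */
static bool equal_ignore_case(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return false;
    }
    return true;
}

/* Hypothetical model: true if `phrase` occurs entirely inside one sentence
   of `text`. A match that would span a sentence terminator is not found. */
bool go_phrase_in_some_sentence(const char *text, const char *phrase)
{
    assert(text != NULL && phrase != NULL);
    size_t plen = strlen(phrase);
    if (plen == 0)
        return false;                          /* empty go phrase not allowed */
    const char *start = text;
    while (*start != '\0') {
        size_t slen = strcspn(start, ".!?");   /* length of this sentence */
        for (size_t i = 0; i + plen <= slen; i++) {
            if (equal_ignore_case(start + i, phrase, plen))
                return true;
        }
        start += slen;
        if (*start != '\0')
            start++;                           /* skip the terminator */
    }
    return false;
}
```

With the text from the example above, "home cooking" is not found because the two words lie in different sentences, while "cooking pasta" is found inside the second sentence.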

    The following languages are supported:

    LanguageID  Language  Description
    1           English   Add the given go phrase to the English go phrases.
    2           French    Add the given go phrase to the French go phrases.
    3           Japanese  Add the given go phrase to the Japanese go phrases.
    4           German    Add the given go phrase to the German go phrases.
    5           Spanish   Add the given go phrase to the Spanish go phrases.
    6           Korean    Add the given go phrase to the Korean go phrases.

    The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    • JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
    • ASCII
    • Halfwidth katakana

    The supported Korean character sets for all the Korean encodings are:

    • KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
    • KS X 2901 (for UNIX), Johab
    • ASCII

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrReadDocumentBuffer

    This function reads the text in the buffer DocumentBuffer and updates the memory at DocumentMemory. The processing of the buffer is affected by StopMemory.

    In a typical application, there will be a series of calls to ExtrReadDocumentBuffer for a given document DocumentMemory. The idea is that the document is read in chunks. A call to ExtrSignalDocumentEnd signals that the last chunk has been sent (the end of the given document has been reached).

    A call to ExtrReadDocumentBuffer will change the memory at DocumentMemory, but the memory at StopMemory will not be modified. If there are multiple threads, each thread will have a unique value for DocumentMemory, but several threads may share StopMemory.

    The buffer DocumentBuffer may contain single-byte or double-byte characters (see ExtrSetInputCode). This is why it is of type void *. The buffer length BufferLength specifies the number of bytes in the buffer, not the number of characters. When the character code (set by ExtrSetInputCode) indicates double-byte characters, BufferLength must be an even number. That is, the end of the buffer is not allowed to divide a double-byte character into two parts.
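When a document is read in fixed-size chunks, the even-length requirement can be met by trimming each chunk to an even number of bytes and carrying the leftover byte into the next chunk. The helper below is a hypothetical sketch of that calculation; its name and parameters are not part of the Extractor API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: given the number of bytes still unread and the
   capacity of the application's buffer, returns how many bytes to pass
   as BufferLength in the next call. For double-byte character codes the
   result is rounded down to an even number, so a character is never split.
   A return value of 0 with double_byte set and remaining == 1 indicates a
   stray trailing byte in the input. */
size_t next_chunk_length(size_t remaining, size_t max_chunk, bool double_byte)
{
    assert(!double_byte || max_chunk >= 2);   /* must fit one character */
    size_t len = remaining < max_chunk ? remaining : max_chunk;
    if (double_byte)
        len -= len % 2;                       /* drop an odd trailing byte */
    return len;
}
```

For example, with a 255-byte buffer and UCS2 input, each call would pass 254 bytes, and the byte that did not fit would begin the next chunk.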

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSignalDocumentEnd

    A call to ExtrSignalDocumentEnd signals that the end of the document has been reached; there will be no further calls to ExtrReadDocumentBuffer with this particular DocumentMemory. This signal triggers the generation of the final list of keyphrases.

    The phrases in the final list of keyphrases are compared with the list of stop phrases in StopMemory and any matching phrases are deleted from the final list of keyphrases. Case is ignored for matching, but otherwise an exact match is required.
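The matching rule (case is ignored, but the match must otherwise be exact) can be sketched as follows. This is a hypothetical model for single-byte strings, using the standard tolower function; it is not Extractor's implementation, which stores stop phrases in a hash table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* True if a and b are equal when case is ignored. */
static bool equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';   /* both must end together */
}

/* Hypothetical model: removes from `phrases` every entry that matches one
   of the `stops`, keeping the order of the rest. Returns the new count. */
size_t remove_stop_phrases(const char **phrases, size_t count,
                           const char *const *stops, size_t stop_count)
{
    assert(phrases != NULL && stops != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        bool stopped = false;
        for (size_t j = 0; j < stop_count && !stopped; j++)
            stopped = equals_ignore_case(phrases[i], stops[j]);
        if (!stopped)
            phrases[kept++] = phrases[i];
    }
    return kept;
}
```

With the stop phrase "access", the keyphrase "Access" would be removed, while "Information Access" would be kept because it is not an exact match.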

    ExtrSignalDocumentEnd should only be called once for a given document DocumentMemory. After ExtrSignalDocumentEnd has been called for a given document, that document has no further need for the stop words and stop phrases stored in StopMemory. Unless there are other documents that will need StopMemory, the memory used by StopMemory may be released after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetPhraseListSize

    The function ExtrGetPhraseListSize returns an integer value that is the number of keyphrases that were generated. If there is an error, PhraseListSize will be set to zero.

    ExtrGetPhraseListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetPhraseListSize should not be called until after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetPhraseByIndex

    A call to ExtrGetPhraseByIndex returns a pointer to a string. The string is phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. Phrases are approximately in order of decreasing quality. ExtrSignalDocumentEnd must be called before ExtrGetPhraseByIndex.

    The string Phrase may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

    The memory where Phrase is stored will be cleared when ExtrClearDocumentMemory is called. The application should copy Phrase into a more permanent location.
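One way to copy a phrase is to duplicate it into memory owned by the application. The sketch below assumes a single-byte, NUL-terminated string; for double-byte output codes the length would have to be determined differently. The function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of a single-byte,
   NUL-terminated phrase, or NULL if memory could not be allocated.
   The caller releases the copy with free(). */
char *copy_phrase(const char *phrase)
{
    assert(phrase != NULL);
    size_t len = strlen(phrase) + 1;      /* include the final NUL */
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, phrase, len);
    return copy;
}
```

The copy remains valid after the document memory has been cleared, because it no longer refers to storage managed by Extractor.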

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetScoreByIndex

    A call to ExtrGetScoreByIndex copies a number into the location given by the pointer. The number is the score assigned to phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. The score of a phrase is an estimate of its value as a keyphrase. Keyphrases are ranked in order of descending score. ExtrSignalDocumentEnd must be called before ExtrGetScoreByIndex.

    This function is optional. There is no need to call it unless you are curious about the score that is assigned to a phrase.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetDocumentLanguage

    A call to ExtrGetDocumentLanguage gets the language of the document. If the language was set by a call to ExtrSetDocumentLanguage, then ExtrGetDocumentLanguage returns the same value that was specified with ExtrSetDocumentLanguage. If Extractor was allowed to guess the language, then ExtrGetDocumentLanguage returns the best guess. LanguageID is passed by reference and is modified in the function.

    LanguageID  Language  Description
    0           Unknown   Extractor was not able to guess, or the language is neither English, French, German, nor Spanish.
    1           English   Extractor guessed English, or English was specified by ExtrSetDocumentLanguage.
    2           French    Extractor guessed French, or French was specified by ExtrSetDocumentLanguage.
    3           Japanese  Japanese was specified by ExtrSetDocumentLanguage.
    4           German    Extractor guessed German, or German was specified by ExtrSetDocumentLanguage.
    5           Spanish   Extractor guessed Spanish, or Spanish was specified by ExtrSetDocumentLanguage.
    6           Korean    Korean was specified by ExtrSetDocumentLanguage.

    This function is optional. There is no need to call it unless you wish to know which language Extractor guessed (English, French, German, or Spanish). Note that language guessing is currently not available for Japanese or Korean.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetHighlightListSize

    The function ExtrGetHighlightListSize returns an integer value that is the number of highlights that were generated. If there is an error, HighlightListSize will be set to zero.

    The number of highlights will be less than or equal to the number of keyphrases. It may be smaller for two reasons. First, when HighlightType is an odd number, Extractor removes any duplicate highlights. Second, there may be keyphrases for which no acceptable highlights were found. Therefore, whatever the value of HighlightType, do not assume that the highlight list size equals the keyphrase list size.

    ExtrGetHighlightListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetHighlightListSize should not be called until after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetHighlightByIndex

    A call to ExtrGetHighlightByIndex returns a pointer to a string. The string is highlight number HighlightIndex. HighlightIndex ranges from zero to HighlightListSize minus one. ExtrSignalDocumentEnd must be called before ExtrGetHighlightByIndex.

    The string Highlight may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

    The memory where Highlight is stored will be cleared when ExtrClearDocumentMemory is called. The application should copy Highlight into a more permanent location.
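
    Copying the highlight into application-owned memory might look like the sketch below. It assumes that the output is a null-terminated single-byte string; double-byte output (see ExtrSetOutputCode) would need a different length calculation. The helper name copy_highlight is hypothetical and is not part of the Extractor API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the Extractor API.
   Assumption: highlight points to a null-terminated single-byte string.
   Returns a heap-allocated copy that remains valid after
   ExtrClearDocumentMemory is called, or NULL if allocation fails.
   The caller releases the copy with free(). */
static char *copy_highlight(const void *highlight)
{
    const char *source = (const char *)highlight;
    size_t length;
    char *copy;

    assert(source != NULL);

    length = strlen(source) + 1;      /* include the terminating null */
    copy = (char *)malloc(length);
    if (copy != NULL) {
        memcpy(copy, source, length);
    }
    return copy;
}
```

    The copy is independent of DocumentMemory, so it can be kept for as long as the application needs it.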

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetDocumentProperties

    A call to ExtrGetDocumentProperties gets various properties of the document. The following properties are currently defined:

    PropID  Description
    1       get the number of words that were read
    2       get the number of non-stop words (content words) that were read
    3       see whether the whole document was read (0 = only the beginning of the document was read; 1 = the whole document was read)

    The desired property is specified by setting PropID. The property value is returned in PropValue.

    The values returned for PropID 1 and 2 depend on the language. For example, a word with an apostrophe counts as two words in French (e.g., "j'ai"), but as one word in English (e.g., "don't"). There are no spaces between words in Japanese, so the values returned for PropID 1 and 2 are rough approximations when the document is in Japanese. If ExtrGetDocumentProperties is called before the language has been determined, the values returned for PropID 1 and 2 will be zero.

    If the document is exceptionally long, Extractor will only read as much of the document as it needs to generate a summary. In this case, PropID 3 will return a value of 0 and PropID 1 and 2 will return values that are less than the actual values for the whole document.
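
    One way to work with these values is to collect them into a small structure after calling ExtrGetDocumentProperties once per PropID. The sketch below only interprets values that have already been retrieved; it does not call Extractor. The constant names, the structure, and the helper functions are all hypothetical and are not part of the Extractor API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the documented PropID values. */
enum {
    PROP_WORDS_READ          = 1,
    PROP_CONTENT_WORDS_READ  = 2,
    PROP_WHOLE_DOCUMENT_READ = 3
};

/* Hypothetical container for the three property values, filled in by
   the application from three separate property queries. */
struct document_properties {
    int words_read;           /* value returned for PropID 1 */
    int content_words_read;   /* value returned for PropID 2 */
    int whole_document_read;  /* value returned for PropID 3: 0 or 1 */
};

/* Returns 1 when the word counts cover the whole document, and 0 when
   Extractor stopped early and the counts are lower bounds only. */
static int counts_are_complete(const struct document_properties *props)
{
    assert(props != NULL);
    return props->whole_document_read == 1;
}

/* Number of stop words read: all words minus non-stop (content) words.
   Like the inputs, this is approximate for Japanese and is zero if the
   language had not yet been determined when the values were fetched. */
static int stop_words_read(const struct document_properties *props)
{
    assert(props != NULL);
    return props->words_read - props->content_words_read;
}
```

    An application that displays word counts could use counts_are_complete to decide whether to label them as partial.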

    This function is optional. There is no need to call it unless you wish to know one or more of the above properties. The function may be called multiple times, in order to get multiple properties.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetErrorMessage

    A call to ExtrGetErrorMessage returns a pointer to a character string. The string contains a short description of the problem, such as "ERROR: Memory allocation error. Out of RAM."


    ExtrClearDocumentMemory

    A call to ExtrClearDocumentMemory will free the memory that was allocated for processing a given document.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrClearStopMemory

    A call to ExtrClearStopMemory will free the memory that was allocated for stop words and stop phrases.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

     
        
        