Package org.htmlparser.util
Class Translate
java.lang.Object
org.htmlparser.util.Translate
Translate numeric character references and character entity references to Unicode characters.
Based on tables found at
http://www.w3.org/TR/REC-html40/sgml/entities.html
Typical usage:
String s = Translate.decode (getTextFromHtmlPage ());
or
String s = "<HTML>" + Translate.encode (getArbitraryText ()) + "</HTML>";
-
Field Summary
Fields
Modifier and Type, Field, Description
protected static final int BREAKPOINT
The dividing point between a simple table lookup and a binary search.
static boolean DECODE_LINE_BY_LINE
If this member is set to true, decoding of streams is done line by line in order to reduce the maximum memory required.
static boolean ENCODE_HEXADECIMAL
If this member is set to true, encoding of numeric character references uses hexadecimal digits instead of decimal digits.
protected static final CharacterReference[] mCharacterList
List of references sorted by character.
protected static final CharacterReference[] mCharacterReferences
Table mapping entity reference kernel to character.
-
Method Summary
Modifier and Type, Method, Description
static void decode(InputStream in, PrintStream out)
Decode a stream containing references.
static String decode(String string)
Decode a string containing references.
static String decode(StringBuffer buffer)
Decode the characters in a string buffer containing references.
static String encode(int character)
Convert a character to a numeric character reference.
static void encode(InputStream in, PrintStream out)
Encode a stream to use references.
static String encode(String string)
Encode a string to use references.
static CharacterReference lookup(char character)
Look up a reference by character.
static CharacterReference lookup(String kernel, int start, int end)
Look up a reference by kernel.
protected static CharacterReference lookup(CharacterReference key)
Look up a reference by kernel.
protected static int lookup(CharacterReference[] array, char ref, int lo, int hi)
Binary search for a reference.
static void main(String[] args)
Numeric character reference and character entity reference to Unicode codec.
-
Field Details
-
DECODE_LINE_BY_LINE
public static boolean DECODE_LINE_BY_LINE
If this member is set to true, decoding of streams is done line by line in order to reduce the maximum memory required.
-
ENCODE_HEXADECIMAL
public static boolean ENCODE_HEXADECIMAL
If this member is set to true, encoding of numeric character references uses hexadecimal digits, e.g. &#x25CB;, instead of decimal digits.
-
mCharacterReferences
protected static final CharacterReference[] mCharacterReferences
Table mapping entity reference kernel to character. This is sorted by kernel when the class is loaded.
-
BREAKPOINT
protected static final int BREAKPOINT
The dividing point between a simple table lookup and a binary search. Characters below the break point are stored in a sparse array allowing direct index lookup.
-
mCharacterList
protected static final CharacterReference[] mCharacterList
List of references sorted by character. The first part of this array, up to BREAKPOINT, is stored in a direct translational table; indexing into the table with a character yields the reference. The second part is dense and sorted by character, suitable for binary lookup.
-
-
Method Details
-
lookup
Binary search for a reference.
Parameters:
array - The array of CharacterReference objects.
ref - The character to search for.
lo - The lower index within which to look.
hi - The upper index within which to look.
Returns:
The index at which the reference was found or at which it is to be inserted.
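The "found or to be inserted" contract can be illustrated with a minimal, self-contained sketch. It searches a plain char[] rather than CharacterReference objects; the class name, method name and sample data are hypothetical and are not part of the library.

```java
// Hypothetical illustration only; not the library's implementation.
class InsertionPointSearchSketch {
    /**
     * Binary search for ref in sorted[lo..hi] (both bounds inclusive).
     * Returns the index of ref if it is present, otherwise the index at
     * which ref would have to be inserted to keep the range sorted.
     */
    static int search(char[] sorted, char ref, int lo, int hi) {
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < ref) {
                lo = mid + 1;      // ref lies in the upper half
            } else if (sorted[mid] > ref) {
                hi = mid - 1;      // ref lies in the lower half
            } else {
                return mid;        // exact match
            }
        }
        return lo;                 // not found: lo is the insertion point
    }

    public static void main(String[] args) {
        char[] table = {'\u0152', '\u0153', '\u0160', '\u0178'};
        System.out.println(search(table, '\u0160', 0, table.length - 1));
        System.out.println(search(table, '\u0155', 0, table.length - 1));
    }
}
```

Because a miss and a hit both return a valid index, a caller of such a method has to compare the element at the returned index with the search key to tell the two cases apart.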
-
lookup
Look up a reference by character. Use a combination of direct table lookup and binary search to find the reference corresponding to the character.
Parameters:
character - The character to be looked up.
Returns:
The entity reference for that character, or null.
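The combination of direct table lookup and binary search might look roughly like the sketch below. The break point value, the two-array layout and the handful of entity names are assumptions chosen for illustration; the real class stores CharacterReference objects in a single array.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the library's implementation.
class HybridLookupSketch {
    // Assumed break point for this sketch.
    static final int BREAKPOINT = 0x100;
    // Sparse part: indexed directly by character code; null means "no entity".
    static final String[] DIRECT = new String[BREAKPOINT];
    // Dense part: characters at or above the break point, sorted ascending,
    // with the matching entity name at the same index in HIGH_NAMES.
    static final char[] HIGH_CHARS = {'\u0152', '\u0153', '\u20AC'};
    static final String[] HIGH_NAMES = {"OElig", "oelig", "euro"};

    static {
        DIRECT['&'] = "amp";
        DIRECT['<'] = "lt";
        DIRECT['>'] = "gt";
        DIRECT['\u00E9'] = "eacute";
    }

    /** Returns the entity name for the character, or null if none is known. */
    static String lookup(char character) {
        if (character < BREAKPOINT) {
            return DIRECT[character];              // constant-time table access
        }
        int index = Arrays.binarySearch(HIGH_CHARS, character);
        return index >= 0 ? HIGH_NAMES[index] : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup('&'));
        System.out.println(lookup('\u20AC'));
        System.out.println(lookup('a'));
    }
}
```

The sparse table trades some memory for constant-time access to the most frequently used low characters, while the sorted dense part keeps the rarely used high range compact.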
-
lookup
Look up a reference by kernel. Use a binary search on the ordered list of known references. Since the binary search returns the position at which a new item should be inserted, we check the references earlier in the list if there is a failure.
Parameters:
key - A character reference with the kernel set to the string to be found. It need not be truncated at the exact end of the reference.
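The fallback to earlier entries after a failed binary search can be sketched as follows. This is a simplified illustration on plain strings, assuming a small hypothetical kernel list; the class and method names are not from the library, and the stopping rule (give up once the first letter changes) is a choice made for this sketch.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the library's implementation.
class KernelPrefixLookupSketch {
    // Small assumed subset of entity names, sorted for binary search.
    static final String[] KERNELS = {"amp", "gt", "lt", "not", "notin", "quot"};

    /**
     * Finds the kernel that matches the start of key. The key may run past
     * the end of the kernel, e.g. "amp;rest" still yields "amp".
     */
    static String lookup(String key) {
        int index = Arrays.binarySearch(KERNELS, key);
        if (index >= 0) {
            return KERNELS[index];                 // exact match
        }
        int insertion = -(index + 1);              // where key would be inserted
        // Any kernel that is a prefix of key sorts before key, so walk backward.
        // An empty key has insertion point 0 and never enters the loop.
        for (int i = insertion - 1; i >= 0; i--) {
            String candidate = KERNELS[i];
            if (key.startsWith(candidate)) {
                return candidate;                  // longest matching prefix
            }
            if (candidate.charAt(0) != key.charAt(0)) {
                break;                             // no further prefix possible
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("amp;rest"));
        System.out.println(lookup("notinx"));
        System.out.println(lookup("xyz"));
    }
}
```

Walking backward from the insertion point meets longer prefixes before shorter ones, so "notinx" resolves to "notin" rather than "not".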
-
lookup
Look up a reference by kernel. Use a binary search on the ordered list of known references. This is not very efficient; use lookup(CharacterReference) instead.
Parameters:
kernel - The string to look up, e.g. "amp".
start - The starting point in the string of the kernel.
end - The ending point in the string of the kernel. This should be the index of the semicolon if it exists, or failing that, at least an index past the last character of the kernel.
Returns:
The reference that matches the given string, or null if it wasn't found.
-
decode
Decode a string containing references. Change all numeric character references and character entity references to Unicode characters.
Parameters:
string - The string to translate.
-
decode
Decode the characters in a string buffer containing references. Change all numeric character references and character entity references to Unicode characters.
Parameters:
buffer - The StringBuffer containing references.
Returns:
The decoded string.
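To make the decoding step concrete, here is a minimal sketch that handles decimal and hexadecimal numeric references plus a few named entities. The five-entry table, the requirement of a terminating semicolon, and the restriction to code points up to 0xFFFF are simplifying assumptions of this sketch, not documented behavior of the class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the library's implementation.
class DecodeSketch {
    // Assumed mini-table; the real class is based on the full HTML 4 entity tables.
    static final Map<String, Character> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", '&');
        ENTITIES.put("lt", '<');
        ENTITIES.put("gt", '>');
        ENTITIES.put("quot", '"');
        ENTITIES.put("eacute", '\u00E9');
    }

    /** Replaces every reference this sketch recognizes; other text is copied as is. */
    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            Character decoded = (semi > i + 1) ? resolve(text.substring(i + 1, semi)) : null;
            if (decoded != null) {
                out.append(decoded.charValue());
                i = semi + 1;                      // skip past the whole reference
            } else {
                out.append(c);                     // ordinary character or unknown reference
                i++;
            }
        }
        return out.toString();
    }

    /** Resolves the non-empty text between '&' and ';', or returns null. */
    static Character resolve(String body) {
        if (body.charAt(0) != '#') {
            return ENTITIES.get(body);             // named entity
        }
        boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        int radix = hex ? 16 : 10;
        String digits = body.substring(hex ? 2 : 1);
        if (digits.isEmpty() || Character.digit(digits.charAt(0), radix) < 0) {
            return null;                           // rejects empty or signed input
        }
        try {
            int code = Integer.parseInt(digits, radix);
            return code <= 0xFFFF ? Character.valueOf((char) code) : null;
        } catch (NumberFormatException e) {
            return null;                           // malformed or too large
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&lt;b&gt; &#65;&#x42; &amp; &bogus;"));
    }
}
```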
-
decode
Decode a stream containing references. Change all numeric character references and character entity references to Unicode characters. If DECODE_LINE_BY_LINE is true, the input stream is broken up into lines, terminated by either carriage return or newline, in order to reduce the latency and maximum buffering memory size required.
Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the decoded stream to.
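The line-by-line mode can be pictured with the sketch below, which reads an ISO-8859-1 stream one line at a time so that only a single line is buffered. The class, the perLine callback and the demo helper are hypothetical; the per-line transformation is a trivial stand-in for real decoding, and this sketch normalizes every line terminator to '\n', which the library may not do.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not the library's implementation.
class LineByLineSketch {
    /** Applies perLine to each input line and writes the result followed by '\n'. */
    static void translate(InputStream in, PrintStream out, UnaryOperator<String> perLine)
            throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        // readLine() accepts "\n", "\r" and "\r\n" as line terminators.
        while ((line = reader.readLine()) != null) {
            out.print(perLine.apply(line));
            out.print('\n');
        }
        out.flush();
    }

    /** Convenience wrapper for in-memory input; replaces only "&amp;" per line. */
    static String demo(String input) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(bytes, true, "ISO-8859-1");
            translate(new ByteArrayInputStream(input.getBytes(StandardCharsets.ISO_8859_1)),
                    out, line -> line.replace("&amp;", "&"));
            return bytes.toString("ISO-8859-1");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo("a&amp;b\r\nc\rd\n"));
    }
}
```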
-
encode
Convert a character to a numeric character reference. Convert a Unicode character to a numeric character reference of the form &#xxxx;.
Parameters:
character - The character to convert.
Returns:
The converted character.
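A minimal sketch of the two numeric forms, decimal and hexadecimal, is shown below. The boolean parameter stands in for the ENCODE_HEXADECIMAL flag, and the lowercase hex digits are simply what Integer.toHexString produces; both are assumptions of this sketch rather than documented behavior.

```java
// Hypothetical illustration only; not the library's implementation.
class NumericReferenceSketch {
    /** Formats a code point as a decimal or hexadecimal numeric character reference. */
    static String encode(int character, boolean hexadecimal) {
        return hexadecimal
                ? "&#x" + Integer.toHexString(character) + ";"
                : "&#" + character + ";";
    }

    public static void main(String[] args) {
        System.out.println(encode(0x25CB, false)); // prints &#9675;
        System.out.println(encode(0x25CB, true));  // prints &#x25cb;
    }
}
```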
-
encode
Encode a string to use references. Change all characters that are not ISO-8859-1 to their numeric character reference or character entity reference.
Parameters:
string - The string to translate.
Returns:
The encoded string.
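The choice between a character entity reference and a numeric character reference can be sketched as below. The five-entry name table is assumed, and so are the rules used here: a character with a known name always becomes an entity reference, and any other character above 0xFF becomes a decimal numeric reference. The exact thresholds used by the class may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the library's implementation.
class EncodeSketch {
    // Assumed mini-table of characters that have a named entity.
    static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('&', "amp");
        NAMES.put('<', "lt");
        NAMES.put('>', "gt");
        NAMES.put('"', "quot");
        NAMES.put('\u20AC', "euro");
    }

    /** Encodes char by char; surrogate pairs are not combined in this sketch. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');      // entity reference
            } else if (c > 0xFF) {
                out.append("&#").append((int) c).append(';');  // numeric reference
            } else {
                out.append(c);                                 // ISO-8859-1, kept as is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a < b & \u20AC \u25CB"));
    }
}
```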
-
encode
Encode a stream to use references. Change all characters that are not ISO-8859-1 to their numeric character reference or character entity reference.
Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the encoded stream to.
-
main
Numeric character reference and character entity reference to Unicode codec. Translate the System.in input into an encoded or decoded stream and send the results to System.out.
Parameters:
args - If args[0] is -encode, perform an encoding on System.in; otherwise perform a decoding.
-