utf8.h File Reference

Various UTF8 related helper functions.

#include <cstdint>
#include <string>

Functions

std::string convertUTF8ToLower (const std::string &input)
 Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.
 
std::string convertUTF8ToUpper (const std::string &input)
 Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.
 
std::string getUTF8CharAt (const std::string &input, size_t pos)
 Returns the UTF8 character found at byte position pos in the input string.
 
uint32_t getUnicodeForUTF8CharAt (const std::string &input, size_t pos)
 Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.
 
uint8_t getUTF8CharNumBytes (char firstByte)
 Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.
 
const char * writeUTF8Char (TextStream &t, const char *s)
 Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.
 
bool lastUTF8CharIsMultibyte (const std::string &input)
 Returns true iff the last character in input is a multibyte character.
 
bool isUTF8CharUpperCase (const std::string &input, size_t pos)
 Returns true iff the input string at byte position pos holds an upper case character.
 
int isUTF8NonBreakableSpace (const char *input)
 Check if the first character pointed at by input is a non-breakable whitespace character.
 
bool isUTF8PunctuationCharacter (uint32_t unicode)
 Check if the given Unicode character represents a punctuation character.
 

Detailed Description

Various UTF8 related helper functions.

See https://en.wikipedia.org/wiki/UTF-8 for details on UTF8 encoding.

Definition in file utf8.h.

Function Documentation

◆ convertUTF8ToLower()

std::string convertUTF8ToLower ( const std::string & input)

Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.

Definition at line 187 of file utf8.cpp.

{
  return caseConvert(input,asciiToLower,convertUnicodeToLower);
}
const char * convertUnicodeToLower(uint32_t code)
static char asciiToLower(char in)
Definition debug.cpp:92
static std::string caseConvert(const std::string &input, char(*asciiConversionFunc)(uint32_t code), const char *(*conversionFunc)(uint32_t code))
Definition utf8.cpp:152

References asciiToLower(), caseConvert(), and convertUnicodeToLower().

Referenced by SearchIndexInfo::add(), Index::addClassMemberNameToIndex(), Index::addFileMemberNameToIndex(), Index::addModuleMemberNameToIndex(), Index::addNamespaceMemberNameToIndex(), AnchorGenerator::generate(), QCString::lower(), FileNameFn::searchKey(), and SearchTerm::termEncoded().

◆ convertUTF8ToUpper()

std::string convertUTF8ToUpper ( const std::string & input)

Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.

Definition at line 192 of file utf8.cpp.

{
  return caseConvert(input,asciiToUpper,convertUnicodeToUpper);
}
const char * convertUnicodeToUpper(uint32_t code)
Definition caseconvert.h:12
static char asciiToUpper(uint32_t code)
Definition utf8.cpp:147

References asciiToUpper(), caseConvert(), and convertUnicodeToUpper().

Referenced by Translator::createNoun(), QCString::upper(), and writeAlphabeticalClassList().

◆ getUnicodeForUTF8CharAt()

uint32_t getUnicodeForUTF8CharAt ( const std::string & input,
size_t pos )

Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.

Definition at line 135 of file utf8.cpp.

{
  std::string charS = getUTF8CharAt(input,pos);
  int len=0;
  return convertUTF8CharToUnicode(charS.c_str(),charS.length(),len);
}
static uint32_t convertUTF8CharToUnicode(const char *s, size_t bytesLeft, int &len)
Definition utf8.cpp:69
std::string getUTF8CharAt(const std::string &input, size_t pos)
Returns the UTF8 character found at byte position pos in the input string.
Definition utf8.cpp:127

References convertUTF8CharToUnicode(), and getUTF8CharAt().

Referenced by AnchorGenerator::generate().
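
The body of convertUTF8CharToUnicode() is not shown on this page, so the following is only a minimal sketch of how a code point can be recovered from a UTF-8 sequence at a byte position. The name decodeUtf8At and its error convention (returning 0 for out-of-range, truncated, or malformed input) are assumptions made for illustration and are not taken from utf8.cpp. The sketch handles sequences of up to 4 bytes and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of utf8.h: decodes the UTF-8 sequence that
// starts at byte position pos. Returns 0 on any error (assumed convention).
static std::uint32_t decodeUtf8At(const std::string &s, std::size_t pos)
{
  if (pos >= s.size()) return 0;
  const unsigned char b0 = static_cast<unsigned char>(s[pos]);
  std::size_t len = 0;
  std::uint32_t code = 0;
  if (b0 < 0x80u)                 { return b0; }                  // 0xxx.xxxx: ASCII
  else if ((b0 & 0xE0u) == 0xC0u) { len = 2; code = b0 & 0x1Fu; } // 110x.xxxx
  else if ((b0 & 0xF0u) == 0xE0u) { len = 3; code = b0 & 0x0Fu; } // 1110.xxxx
  else if ((b0 & 0xF8u) == 0xF0u) { len = 4; code = b0 & 0x07u; } // 1111.0xxx
  else                            { return 0; }                   // not a valid lead byte
  if (s.size() - pos < len) return 0;                             // truncated sequence
  for (std::size_t i = 1; i < len; ++i)
  {
    const unsigned char b = static_cast<unsigned char>(s[pos + i]);
    if ((b & 0xC0u) != 0x80u) return 0;                           // expected 10xx.xxxx
    code = (code << 6) | (b & 0x3Fu);                             // append 6 payload bits
  }
  return code;
}
```

For example, the two bytes 0xC3 0xA9 carry the payload bits 00011 and 101001, which combine to U+00E9.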

◆ getUTF8CharAt()

std::string getUTF8CharAt ( const std::string & input,
size_t pos )

Returns the UTF8 character found at byte position pos in the input string.

The resulting string can be a multi byte sequence.

Definition at line 127 of file utf8.cpp.

{
  if (input.length()<=pos) return std::string();
  int numBytes=getUTF8CharNumBytes(input[pos]);
  if (input.length()<pos+numBytes) return std::string();
  return input.substr(pos,numBytes);
}
uint8_t getUTF8CharNumBytes(char c)
Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.
Definition utf8.cpp:23

References getUTF8CharNumBytes().

Referenced by SearchIndexInfo::add(), Index::addClassMemberNameToIndex(), Index::addFileMemberNameToIndex(), Index::addModuleMemberNameToIndex(), Index::addNamespaceMemberNameToIndex(), Translator::createNoun(), AnchorGenerator::generate(), getUnicodeForUTF8CharAt(), and writeAlphabeticalClassList().
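
Because pos is a byte position, a caller that wants to visit every character has to advance by the byte length of each character it reads. The sketch below illustrates that pattern in a standalone form. The names leadByteLen, charAt, and splitUtf8 are hypothetical stand-ins and not part of utf8.h, and leadByteLen is simplified to the 1 to 4 byte forms.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified lead-byte classifier (1 to 4 byte forms only).
static std::size_t leadByteLen(unsigned char uc)
{
  if ((uc & 0xE0u) == 0xC0u) return 2;
  if ((uc & 0xF0u) == 0xE0u) return 3;
  if ((uc & 0xF8u) == 0xF0u) return 4;
  return 1; // ASCII, or a byte that cannot start a sequence
}

// Stand-in with the same bounds checks as the listing above:
// an empty string signals "out of range" or "truncated sequence".
static std::string charAt(const std::string &input, std::size_t pos)
{
  if (input.size() <= pos) return std::string();
  const std::size_t numBytes = leadByteLen(static_cast<unsigned char>(input[pos]));
  if (input.size() < pos + numBytes) return std::string();
  return input.substr(pos, numBytes);
}

// Splits a string into its characters, advancing by each character's byte length.
static std::vector<std::string> splitUtf8(const std::string &input)
{
  std::vector<std::string> out;
  std::size_t pos = 0;
  while (pos < input.size())
  {
    const std::string ch = charAt(input, pos);
    if (ch.empty()) break; // truncated trailing sequence: stop here
    out.push_back(ch);
    pos += ch.size();
  }
  return out;
}
```

Checking for an empty result before advancing matters: without it, a truncated trailing sequence would leave pos unchanged and the loop would never terminate.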

◆ getUTF8CharNumBytes()

uint8_t getUTF8CharNumBytes ( char firstByte)

Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.

Definition at line 23 of file utf8.cpp.

{
  uint8_t num=1;
  unsigned char uc = static_cast<unsigned char>(c);
  if (uc>=0x80u) // multibyte character
  {
    if ((uc&0xE0u)==0xC0u)
    {
      num=2; // 110x.xxxx: 2 byte character
    }
    if ((uc&0xF0u)==0xE0u)
    {
      num=3; // 1110.xxxx: 3 byte character
    }
    if ((uc&0xF8u)==0xF0u)
    {
      num=4; // 1111.0xxx: 4 byte character
    }
    if ((uc&0xFCu)==0xF8u)
    {
      num=5; // 1111.10xx: 5 byte character
    }
    if ((uc&0xFEu)==0xFCu)
    {
      num=6; // 1111.110x: 6 byte character
    }
  }
  return num;
}

Referenced by detab(), escapeCharsInString(), AnchorGenerator::generate(), getUTF8CharAt(), nextUTF8CharPosition(), updateColumnCount(), and writeUTF8Char().
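
As a rough illustration of how the lead-byte length is typically used, the sketch below counts the characters in a string by hopping from one lead byte to the next. seqLenFromLeadByte is a hypothetical stand-in that follows the same bit masks as the listing above, and countUtf8Chars is an assumed helper name; neither is part of utf8.h. Note that a continuation byte (10xx.xxxx) or the bytes 0xFE and 0xFF yield 1, as in the listing, so malformed input is stepped over one byte at a time.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in mirroring the masks in the listing above.
static std::uint8_t seqLenFromLeadByte(char c)
{
  const unsigned char uc = static_cast<unsigned char>(c);
  if ((uc & 0x80u) == 0x00u) return 1; // 0xxx.xxxx: ASCII
  if ((uc & 0xE0u) == 0xC0u) return 2; // 110x.xxxx
  if ((uc & 0xF0u) == 0xE0u) return 3; // 1110.xxxx
  if ((uc & 0xF8u) == 0xF0u) return 4; // 1111.0xxx
  if ((uc & 0xFCu) == 0xF8u) return 5; // 1111.10xx
  if ((uc & 0xFEu) == 0xFCu) return 6; // 1111.110x
  return 1;                            // continuation byte, 0xFE or 0xFF
}

// Counts characters by advancing pos by the length announced in each lead byte.
static std::size_t countUtf8Chars(const std::string &s)
{
  std::size_t count = 0;
  for (std::size_t pos = 0; pos < s.size(); pos += seqLenFromLeadByte(s[pos]))
  {
    ++count;
  }
  return count;
}
```

The 5 and 6 byte forms come from the original UTF-8 design; RFC 3629 restricts UTF-8 to at most 4 bytes per character, so these lengths only occur for input that is not valid under the current definition.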

◆ isUTF8CharUpperCase()

bool isUTF8CharUpperCase ( const std::string & input,
size_t pos )

Returns true iff the input string at byte position pos holds an upper case character.

Definition at line 218 of file utf8.cpp.

{
  if (input.length()<=pos) return false;
  int len=0;
  // turn the UTF8 character at position pos into a unicode value
  uint32_t code = convertUTF8CharToUnicode(input.c_str()+pos,input.length()-pos,len);
  // check if the character can be converted to lower case, if so it was an upper case character
  return convertUnicodeToLower(code)!=nullptr;
}

References convertUnicodeToLower(), and convertUTF8CharToUnicode().

Referenced by DefinitionImpl::_setBriefDescription().

◆ isUTF8NonBreakableSpace()

int isUTF8NonBreakableSpace ( const char * input)

Check if the first character pointed at by input is a non-breakable whitespace character.

Returns the byte size of the character if there is a match, or 0 if not.

Definition at line 228 of file utf8.cpp.

{
  return (static_cast<unsigned char>(input[0])==0xC2 &&
          static_cast<unsigned char>(input[1])==0xA0) ? 2 : 0;
}

Referenced by detab().
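
The non-zero return value tells the caller how many bytes to skip. The sketch below shows one possible use: replacing each non-breakable space (U+00A0, encoded as 0xC2 0xA0) with a regular space. nbspLen repeats the check from the listing above under a hypothetical name, and replaceNbsp is an assumed helper, not something found in Doxygen. The sketch relies on the input being NUL-terminated: when the first byte is 0xC2 it is not the terminator, so reading the second byte stays in bounds.

```cpp
#include <string>

// Hypothetical stand-in for the check in the listing above.
static int nbspLen(const char *p)
{
  return (static_cast<unsigned char>(p[0]) == 0xC2 &&
          static_cast<unsigned char>(p[1]) == 0xA0) ? 2 : 0;
}

// Replaces every U+00A0 with an ASCII space; all other bytes are copied as-is.
// Processing stops at the first NUL byte, since c_str() is walked as a C string.
static std::string replaceNbsp(const std::string &in)
{
  std::string out;
  const char *p = in.c_str();
  while (*p)
  {
    const int n = nbspLen(p);
    if (n > 0) { out += ' '; p += n; } // skip both bytes of the NBSP
    else       { out += *p; ++p; }     // ordinary byte
  }
  return out;
}
```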

◆ isUTF8PunctuationCharacter()

bool isUTF8PunctuationCharacter ( uint32_t unicode)

Check if the given Unicode character represents a punctuation character.

Definition at line 234 of file utf8.cpp.

{
  bool b = isPunctuationCharacter(unicode);
  return b;
}
bool isPunctuationCharacter(uint32_t code)

References isPunctuationCharacter().

Referenced by AnchorGenerator::generate().

◆ lastUTF8CharIsMultibyte()

bool lastUTF8CharIsMultibyte ( const std::string & input)

Returns true iff the last character in input is a multibyte character.

Definition at line 212 of file utf8.cpp.

{
  // last byte is part of a multibyte UTF8 char if bit 8 is set and bit 7 is not
  return !input.empty() && (static_cast<unsigned char>(input[input.length()-1])&0xC0)==0x80;
}

Referenced by DefinitionImpl::_setBriefDescription().
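
The same continuation-byte test (top two bits equal to 10) can also be used to find where the last character begins, by stepping backwards over continuation bytes until a lead byte is reached. The following is a sketch under that assumption; isContinuationByte and lastCharStart are hypothetical names and not part of utf8.h.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xx.xxxx, which never start a character.
static bool isContinuationByte(char c)
{
  return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Returns the byte position at which the last character of s starts
// (0 for an empty string).
static std::size_t lastCharStart(const std::string &s)
{
  if (s.empty()) return 0;
  std::size_t pos = s.size() - 1;
  while (pos > 0 && isContinuationByte(s[pos]))
  {
    --pos; // still inside a multibyte sequence, keep backing up
  }
  return pos;
}
```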

◆ writeUTF8Char()

const char * writeUTF8Char ( TextStream & t,
const char * s )

Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.

Definition at line 197 of file utf8.cpp.

{
  if (s==nullptr) return nullptr;
  uint8_t len = getUTF8CharNumBytes(*s);
  for (uint8_t i=0;i<len;i++)
  {
    if (s[i]==0) // detect premature end of string (due to invalid UTF8 char)
    {
      len=i;
    }
  }
  t.write(s,len);
  return s+len;
}
void write(const char *buf, size_t len)
Adds an array of characters to the stream.
Definition textstream.h:201

References getUTF8CharNumBytes(), and TextStream::write().

Referenced by HtmlCodeGenerator::codify(), ManCodeGenerator::codify(), RTFCodeGenerator::codify(), HtmlDocVisitor::operator()(), HtmlDocVisitor::writeObfuscatedMailAddress(), and writeXMLCodeString().
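
A typical calling pattern is a loop that handles a few ASCII characters specially and passes everything else through one whole character at a time. The sketch below illustrates this with std::ostream standing in for TextStream. The names leadLen, writeOneChar, and escapeAngleBrackets are assumptions for illustration and do not appear in Doxygen, and leadLen is simplified to the 1 to 4 byte forms. Note that when s points at the terminating NUL, nothing is written and s itself is returned, so the caller has to test *s before each call to avoid looping forever.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical, simplified lead-byte classifier (1 to 4 byte forms only).
static std::size_t leadLen(unsigned char uc)
{
  if ((uc & 0xE0u) == 0xC0u) return 2;
  if ((uc & 0xF0u) == 0xE0u) return 3;
  if ((uc & 0xF8u) == 0xF0u) return 4;
  return 1;
}

// Stand-in for writeUTF8Char() that writes to a std::ostream.
static const char *writeOneChar(std::ostream &os, const char *s)
{
  if (s == nullptr) return nullptr;
  std::size_t len = leadLen(static_cast<unsigned char>(*s));
  for (std::size_t i = 0; i < len; ++i)
  {
    if (s[i] == 0) len = i; // premature end of string: write only what is there
  }
  os.write(s, static_cast<std::streamsize>(len));
  return s + len;
}

// Escapes '<' and '>' and copies all other characters, multibyte ones intact.
static std::string escapeAngleBrackets(const std::string &in)
{
  std::ostringstream os;
  const char *p = in.c_str();
  while (*p) // the caller checks for the terminator, see the note above
  {
    if      (*p == '<') { os << "&lt;"; ++p; }
    else if (*p == '>') { os << "&gt;"; ++p; }
    else                { p = writeOneChar(os, p); }
  }
  return os.str();
}
```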