Various UTF8 related helper functions.

#include <cstdint>
#include <string>

Functions
std::string	convertUTF8ToLower (const std::string &input)
	Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.
std::string	convertUTF8ToUpper (const std::string &input)
	Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.
std::string	getUTF8CharAt (const std::string &input, size_t pos)
	Returns the UTF8 character found at byte position pos in the input string.
uint32_t	getUnicodeForUTF8CharAt (const std::string &input, size_t pos)
	Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.
uint8_t	getUTF8CharNumBytes (char firstByte)
	Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.
const char *	writeUTF8Char (TextStream &t, const char *s)
	Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.
bool	lastUTF8CharIsMultibyte (const std::string &input)
	Returns true iff the last character in input is a multibyte character.
bool	isUTF8CharUpperCase (const std::string &input, size_t pos)
	Returns true iff the input string at byte position pos holds an upper case character.
int	isUTF8NonBreakableSpace (const char *input)
	Check if the first character pointed at by input is a non-breakable whitespace character.
bool	isUTF8PunctuationCharacter (uint32_t unicode)
	Check if the given Unicode character represents a punctuation character.

Detailed Description

Various UTF8 related helper functions.

See https://en.wikipedia.org/wiki/UTF-8 for details on UTF8 encoding.

Definition in file utf8.h.

Function Documentation

◆ convertUTF8ToLower()

std::string convertUTF8ToLower ( const std::string & input )

Converts the input string into a lower case version, also taking into account non-ASCII characters that have a lower case variant.

Definition at line 187 of file utf8.cpp.

{
  return caseConvert(input,asciiToLower,convertUnicodeToLower);
}

References asciiToLower(), caseConvert(), and convertUnicodeToLower().

Referenced by SearchIndexInfo::add(), Index::addClassMemberNameToIndex(), Index::addFileMemberNameToIndex(), Index::addModuleMemberNameToIndex(), Index::addNamespaceMemberNameToIndex(), AnchorGenerator::generate(), QCString::lower(), FileNameFn::searchKey(), SearchTerm::termEncoded(), and HtmlGenerator::writeLabel().

◆ convertUTF8ToUpper()

std::string convertUTF8ToUpper ( const std::string & input )

Converts the input string into an upper case version, also taking into account non-ASCII characters that have an upper case variant.

Definition at line 192 of file utf8.cpp.

{
  return caseConvert(input,asciiToUpper,convertUnicodeToUpper);
}

References asciiToUpper(), caseConvert(), and convertUnicodeToUpper().

Referenced by Translator::createNoun(), QCString::upper(), and writeAlphabeticalClassList().

◆ getUnicodeForUTF8CharAt()

uint32_t getUnicodeForUTF8CharAt	(	const std::string &	input,
		size_t	pos )

Returns the 32-bit Unicode value of the character at byte position pos in the UTF8 encoded input.

Definition at line 135 of file utf8.cpp.

{
  std::string charS = getUTF8CharAt(input,pos);
  int len=0;
  return convertUTF8CharToUnicode(charS.c_str(),charS.length(),len);
}

References convertUTF8CharToUnicode(), and getUTF8CharAt().

Referenced by AnchorGenerator::generate().
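
The body of convertUTF8CharToUnicode() is not shown on this page, so the following is only a minimal sketch of how a UTF8 sequence can be turned into a Unicode value. The names sketchDecodeAt and sketchCheckDecode are hypothetical, only 1 to 4 byte sequences are handled, and overlong encodings are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical decoder (not the doxygen implementation): returns the code
// point that starts at byte position pos, or 0 if pos is out of range or
// the sequence is truncated or malformed.
uint32_t sketchDecodeAt(const std::string &s, size_t pos)
{
  if (pos >= s.size()) return 0;
  unsigned char b0 = static_cast<unsigned char>(s[pos]);
  if ((b0 & 0x80u) == 0x00u) return b0;   // 0xxx.xxxx: plain ASCII
  size_t len = 0;
  uint32_t code = 0;
  if      ((b0 & 0xE0u) == 0xC0u) { len = 2; code = b0 & 0x1Fu; }
  else if ((b0 & 0xF0u) == 0xE0u) { len = 3; code = b0 & 0x0Fu; }
  else if ((b0 & 0xF8u) == 0xF0u) { len = 4; code = b0 & 0x07u; }
  else return 0;                          // continuation or unsupported lead byte
  if (pos + len > s.size()) return 0;     // sequence runs past the end
  for (size_t i = 1; i < len; i++)
  {
    unsigned char b = static_cast<unsigned char>(s[pos + i]);
    if ((b & 0xC0u) != 0x80u) return 0;   // not a 10xx.xxxx continuation byte
    code = (code << 6) | (b & 0x3Fu);     // append the 6 payload bits
  }
  return code;
}

// Hypothetical usage check.
void sketchCheckDecode()
{
  assert(sketchDecodeAt("A", 0) == 0x41u);
  assert(sketchDecodeAt("\xC3\xA9", 0) == 0xE9u);       // U+00E9
  assert(sketchDecodeAt("\xE2\x82\xAC", 0) == 0x20ACu); // U+20AC
}
```

Each continuation byte contributes 6 bits, which is why the accumulated value is shifted left by 6 before the next byte is merged in.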

◆ getUTF8CharAt()

std::string getUTF8CharAt	(	const std::string &	input,
		size_t	pos )

Returns the UTF8 character found at byte position pos in the input string.

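The resulting string can be a multibyte sequence. An empty string is returned if pos is past the end of input or if the sequence starting at pos is truncated.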

Definition at line 127 of file utf8.cpp.

{
  if (input.length()<=pos) return std::string();
  int numBytes=getUTF8CharNumBytes(input[pos]);
  if (input.length()<pos+numBytes) return std::string();
  return input.substr(pos,numBytes);
}

References getUTF8CharNumBytes().

Referenced by SearchIndexInfo::add(), Index::addClassMemberNameToIndex(), Index::addFileMemberNameToIndex(), Index::addModuleMemberNameToIndex(), Index::addNamespaceMemberNameToIndex(), Translator::createNoun(), AnchorGenerator::generate(), getUnicodeForUTF8CharAt(), and writeAlphabeticalClassList().
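
To illustrate the two bounds checks in the listing above, here is a small self-contained sketch. sketchLeadLen, sketchCharAt and sketchCheckCharAt are hypothetical stand-ins written for this example, and the length helper is simplified to 1 to 4 byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified lead-byte length (1 to 4 bytes only).
size_t sketchLeadLen(unsigned char uc)
{
  if ((uc & 0xE0u) == 0xC0u) return 2;
  if ((uc & 0xF0u) == 0xE0u) return 3;
  if ((uc & 0xF8u) == 0xF0u) return 4;
  return 1;
}

// Hypothetical stand-in using the same two bounds checks as the listing above.
std::string sketchCharAt(const std::string &input, size_t pos)
{
  if (input.length() <= pos) return std::string();            // pos past the end
  size_t numBytes = sketchLeadLen(static_cast<unsigned char>(input[pos]));
  if (input.length() < pos + numBytes) return std::string();  // truncated sequence
  return input.substr(pos, numBytes);
}

// Hypothetical usage check.
void sketchCheckCharAt()
{
  assert(sketchCharAt("a\xC3\xA9", 0) == "a");        // single byte
  assert(sketchCharAt("a\xC3\xA9", 1) == "\xC3\xA9"); // two byte character
  assert(sketchCharAt("a\xC3", 1).empty());           // lead byte without its tail
}
```

Because pos is a byte position, pointing it at a continuation byte yields that single byte rather than a whole character; in this sketch sketchCharAt("a\xC3\xA9", 2) returns "\xA9".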

◆ getUTF8CharNumBytes()

uint8_t getUTF8CharNumBytes ( char firstByte )

Returns the number of bytes making up a single UTF8 character given the first byte in the sequence.

Definition at line 23 of file utf8.cpp.

{
  uint8_t num=1;
  unsigned char uc = static_cast<unsigned char>(c); // c is the first byte (firstByte in the signature above)
  if (uc>=0x80u) // multibyte character
  {
    if ((uc&0xE0u)==0xC0u)
    {
      num=2; // 110x.xxxx: 2 byte character
    }
    if ((uc&0xF0u)==0xE0u)
    {
      num=3; // 1110.xxxx: 3 byte character
    }
    if ((uc&0xF8u)==0xF0u)
    {
      num=4; // 1111.0xxx: 4 byte character
    }
    if ((uc&0xFCu)==0xF8u)
    {
      num=5; // 1111.10xx: 5 byte character
    }
    if ((uc&0xFEu)==0xFCu)
    {
      num=6; // 1111.110x: 6 byte character
    }
  }
  return num;
}

Referenced by detab(), escapeCharsInString(), AnchorGenerator::generate(), getUTF8CharAt(), nextUTF8CharPosition(), updateColumnCount(), and writeUTF8Char().
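
The lead-byte tests above can be exercised with a short sketch that counts the characters in a string by hopping from one lead byte to the next. sketchCharNumBytes, sketchCountChars and sketchCheckCount are hypothetical names; the first one mirrors the bit masks of the listing. Note that the 5 and 6 byte forms come from the original UTF-8 design, while RFC 3629 limits UTF-8 to 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in that mirrors the lead-byte masks in the listing above.
uint8_t sketchCharNumBytes(char c)
{
  unsigned char uc = static_cast<unsigned char>(c);
  if ((uc & 0xE0u) == 0xC0u) return 2; // 110x.xxxx
  if ((uc & 0xF0u) == 0xE0u) return 3; // 1110.xxxx
  if ((uc & 0xF8u) == 0xF0u) return 4; // 1111.0xxx
  if ((uc & 0xFCu) == 0xF8u) return 5; // 1111.10xx
  if ((uc & 0xFEu) == 0xFCu) return 6; // 1111.110x
  return 1; // ASCII, or a byte that is not a valid lead byte
}

// Hypothetical helper: counts characters by skipping over each whole sequence.
size_t sketchCountChars(const std::string &s)
{
  size_t count = 0;
  for (size_t pos = 0; pos < s.size(); pos += sketchCharNumBytes(s[pos]))
  {
    ++count;
  }
  return count;
}

// Hypothetical usage check.
void sketchCheckCount()
{
  assert(sketchCharNumBytes('a') == 1);
  assert(sketchCharNumBytes('\xE2') == 3);
  assert(sketchCountChars("a\xC3\xA9\xE2\x82\xAC") == 3); // 6 bytes, 3 characters
}
```

A continuation byte such as 0x80 matches none of the masks, so it is reported as a 1 byte character; callers that may start in the middle of a sequence need their own check for that case.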

◆ isUTF8CharUpperCase()

bool isUTF8CharUpperCase	(	const std::string &	input,
		size_t	pos )

Returns true iff the input string at byte position pos holds an upper case character.

Definition at line 218 of file utf8.cpp.

{
  if (input.length()<=pos) return false;
  int len=0;
  // turn the UTF8 character at position pos into a unicode value
  uint32_t code = convertUTF8CharToUnicode(input.c_str()+pos,input.length()-pos,len);
  // check if the character can be converted to lower case, if so it was an upper case character
  return convertUnicodeToLower(code)!=nullptr;
}

References convertUnicodeToLower(), and convertUTF8CharToUnicode().

Referenced by DefinitionImpl::_setBriefDescription().

◆ isUTF8NonBreakableSpace()

int isUTF8NonBreakableSpace ( const char * input )

Check if the first character pointed at by input is a non-breakable whitespace character.

Returns the byte size of the character if there is a match or 0 if not.

Definition at line 228 of file utf8.cpp.

{
  return (static_cast<unsigned char>(input[0])==0xC2 &&
          static_cast<unsigned char>(input[1])==0xA0) ? 2 : 0;
}

Referenced by detab().
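
The byte pair 0xC2 0xA0 is the UTF8 encoding of U+00A0 (no-break space). Returning the byte size lets a caller both test for the character and skip it in one step. The sketch below shows one possible use, assuming a NUL-terminated input; sketchNbspLen, sketchNbspToSpace and sketchCheckNbsp are hypothetical names and this is not how detab() is implemented.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the check above: 2 if s starts with U+00A0, else 0.
int sketchNbspLen(const char *s)
{
  return (static_cast<unsigned char>(s[0]) == 0xC2 &&
          static_cast<unsigned char>(s[1]) == 0xA0) ? 2 : 0;
}

// Hypothetical usage: replace each no-break space with an ordinary space.
std::string sketchNbspToSpace(const std::string &in)
{
  std::string out;
  const char *p = in.c_str();
  while (*p != 0)
  {
    int n = sketchNbspLen(p);
    if (n > 0) { out += ' '; p += n; } // skip both bytes of U+00A0
    else       { out += *p++; }        // copy any other byte unchanged
  }
  return out;
}

// Hypothetical usage check.
void sketchCheckNbsp()
{
  assert(sketchNbspLen("\xC2\xA0") == 2);
  assert(sketchNbspToSpace("a\xC2\xA0" "b") == "a b");
}
```

Reading s[1] is safe for a NUL-terminated string because it is only evaluated when s[0] is 0xC2, in which case s[1] is at worst the terminator.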

◆ isUTF8PunctuationCharacter()

bool isUTF8PunctuationCharacter ( uint32_t unicode )

Check if the given Unicode character represents a punctuation character.

Definition at line 234 of file utf8.cpp.

{
  bool b = isPunctuationCharacter(unicode);
  return b;
}

References isPunctuationCharacter().

Referenced by AnchorGenerator::generate().

◆ lastUTF8CharIsMultibyte()

bool lastUTF8CharIsMultibyte ( const std::string & input )

Returns true iff the last character in input is a multibyte character.

Definition at line 212 of file utf8.cpp.

{
  // last byte is part of a multibyte UTF8 char if bit 8 is set and bit 7 is not
  return !input.empty() && (static_cast<unsigned char>(input[input.length()-1])&0xC0)==0x80;
}

Referenced by DefinitionImpl::_setBriefDescription().
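
The mask test above recognizes continuation bytes, which always have the form 10xx.xxxx. As an illustration of the same test, the following sketch scans backwards over continuation bytes to find where the last character starts. sketchIsContinuationByte, sketchLastCharStart and sketchCheckLastChar are hypothetical helpers, and the input is assumed to be valid UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if b has the bit pattern 10xx.xxxx of a UTF8 continuation byte.
bool sketchIsContinuationByte(char b)
{
  return (static_cast<unsigned char>(b) & 0xC0u) == 0x80u;
}

// Hypothetical helper: byte offset at which the last character of s starts
// (0 for an empty string).
size_t sketchLastCharStart(const std::string &s)
{
  if (s.empty()) return 0;
  size_t pos = s.size() - 1;
  while (pos > 0 && sketchIsContinuationByte(s[pos])) pos--;
  return pos;
}

// Hypothetical usage check.
void sketchCheckLastChar()
{
  assert(sketchIsContinuationByte('\xA9'));          // tail byte of a sequence
  assert(!sketchIsContinuationByte('\xC3'));         // lead byte
  assert(sketchLastCharStart("caf\xC3\xA9") == 3);   // last character is 2 bytes
  assert(sketchLastCharStart("cafe") == 3);          // last character is 1 byte
}
```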

◆ writeUTF8Char()

const char * writeUTF8Char	(	TextStream &	t,
		const char *	s )

Writes the UTF8 character pointed to by s to stream t and returns a pointer to the next character.

Definition at line 197 of file utf8.cpp.

{
  if (s==nullptr) return nullptr;
  uint8_t len = getUTF8CharNumBytes(*s);
  for (uint8_t i=0;i<len;i++)
  {
    if (s[i]==0) // detect premature end of string (due to invalid UTF8 char)
    {
      len=i;
    }
  }
  t.write(s,len);
  return s+len;
}

References getUTF8CharNumBytes(), and TextStream::write().

Referenced by HtmlCodeGenerator::codify(), ManCodeGenerator::codify(), RTFCodeGenerator::codify(), HtmlDocVisitor::operator()(), HtmlDocVisitor::writeObfuscatedMailAddress(), and writeXMLCodeString().
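
Since the function returns a pointer to the next character, it lends itself to a loop that processes a string one character at a time. The sketch below shows that pattern under some assumptions: std::ostream takes the place of TextStream, the length logic is simplified to 1 to 4 bytes, and sketchWriteChar, sketchBracketChars and sketchCheckWrite are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in: copies one UTF8 character from s to out and returns
// a pointer to the byte after it. std::ostream replaces TextStream here.
const char *sketchWriteChar(std::ostream &out, const char *s)
{
  if (s == nullptr) return nullptr;
  unsigned char uc = static_cast<unsigned char>(*s);
  size_t len = 1;
  if      ((uc & 0xE0u) == 0xC0u) len = 2;
  else if ((uc & 0xF0u) == 0xE0u) len = 3;
  else if ((uc & 0xF8u) == 0xF0u) len = 4;
  for (size_t i = 0; i < len; i++)
  {
    if (s[i] == 0) { len = i; break; } // string ended inside the sequence
  }
  out.write(s, static_cast<std::streamsize>(len));
  return s + len;
}

// Hypothetical usage: wrap every character of s in square brackets.
std::string sketchBracketChars(const char *s)
{
  std::ostringstream out;
  while (*s != 0) // stop at the terminator, see the note below
  {
    out << '[';
    s = sketchWriteChar(out, s);
    out << ']';
  }
  return out.str();
}

// Hypothetical usage check.
void sketchCheckWrite()
{
  assert(sketchBracketChars("a\xC3\xA9") == "[a][\xC3\xA9]");
  assert(sketchBracketChars("a\xC3") == "[a][\xC3]"); // truncated sequence
}
```

When s points at the terminating NUL, the length is cut to 0 and the returned pointer equals s, both in this sketch and in the listing above. The calling loop therefore has to test for the end of the string itself, otherwise it would never advance.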
