Wednesday, March 28, 2012

Strange character transformations

Hi,

I have an ASP.Net website, which allows users to upload a file which is
then inserted into a database.

This is all fine until it reads a line with the string +Anu in it.
It transforms this to this char ɻ (which, if Googled for, is
described as Unicode Character 'LATIN SMALL LETTER TURNED R WITH HOOK'
(U+027B) or, in Phonetics, as a 'Retroflex approximant'.)

Has anyone seen this behaviour before, and does anyone know how to stop it?
The code's simple - here's an example. The ɻ appears in the output
where the input is +Anu - it's transformed before I can touch it!

using (StreamReader sr = new StreamReader(strFile, System.Text.Encoding.UTF7)) {
    string line;
    // Read and display lines from the file until the end of the file is reached.
    while ((line = sr.ReadLine()) != null) {
        Response.Write(line);
    }
}

Regards

Adam

Looks like an encoding issue, alright.
Have you tried using the StreamReader constructor that does not require a
character encoding?

"CyberSpyders@.gmail.com" wrote:

> Hi,
>
> I have an ASP.Net website, which allows users to upload a file which is
> then inserted into a database.
>
> This is all fine until it reads a line with the string +Anu in it.
> It transforms this to this char ɻ (which, if Googled for, is
> described as Unicode Character 'LATIN SMALL LETTER TURNED R WITH HOOK'
> (U+027B) or, in Phonetics, as a 'Retroflex approximant'.)
>
> Has anyone seen this behaviour before, and know how to stop it?
> The code's simple - here's an example. The ɻ appears in the output
> where the input is +Anu - it's transformed before I can touch it!
>
> using (StreamReader sr = new StreamReader(strFile, System.Text.Encoding.UTF7)) {
>     // Read and display lines from the file until the end of the file is reached.
>     while ((line = sr.ReadLine()) != null) {
>         Response.Write(line);
>     }
> }
>
> Regards
>
> Adam


Graven,

I'm not sure how a 4-character string like this could be seen as an
encoding issue, but I will certainly give it a go. Thanks for the
suggestion.

Adam

Graven wrote:

> Try plain Latin-1 encoding. I think it's a Unicode
> normalization issue, but I don't know if StreamReader performs it by
> default.

Larry,

You were spot on - changing to UTF8 stopped this transformation. Thanks.

It's not quite solved my problem though.
The file is a text file, each line being a series of fields delimited by
a particular character, chosen because it was unlikely ever to appear in
the actual data.

Unfortunately, UTF8 encoding strips these characters completely. ASCII
encoding, on the other hand, replaces them with ?

Oh the joy of character encoding.

Regards

Adam

Larry Lard wrote:

> This is why you are seeing what you are seeing. UTF7 encodes characters
> outside the printable 7-bit range using UTF16 then modified base64, with
> + as the indicator mark for this encoding. I haven't checked, but I
> imagine +Anu is the UTF7 encoding of that character. You shouldn't use a
> UTF7 reader to read a file that you don't know for sure was produced by
> a UTF7 writer.
>
> The correct way to read the file depends on what kind of file it is. If
> it is text of an unknown encoding, there is no way to be absolutely
> sure, but UTF8 is a good starting point. If it's binary data, you
> shouldn't be using a TextReader class at all.
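
As a rough illustration of the mechanism Larry describes, the sketch below hand-decodes the modified base64 run that follows the + sign. The class and method names (Utf7Demo, DecodeShiftedRun) are made up for this example, and silently dropping leftover bits is an assumption about how a lenient decoder behaves; it is not the framework's actual implementation.

```csharp
using System;
using System.Text;

public static class Utf7Demo
{
    // Standard base64 alphabet; UTF-7 uses it without '=' padding.
    private const string Base64Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical helper: decodes the characters that follow a '+' in UTF-7.
    // Each character contributes 6 bits; every 16 accumulated bits form one
    // UTF-16 code unit. Leftover bits are dropped (an assumption, see lead-in).
    public static string DecodeShiftedRun(string run)
    {
        var result = new StringBuilder();
        int buffer = 0;    // accumulated bits not yet emitted
        int bitCount = 0;  // number of valid bits currently in buffer

        foreach (char c in run)
        {
            int value = Base64Alphabet.IndexOf(c);
            if (value < 0)
            {
                throw new FormatException("Not a base64 character: " + c);
            }

            buffer = (buffer << 6) | value;
            bitCount += 6;

            if (bitCount >= 16)
            {
                bitCount -= 16;
                // The top 16 bits are the next UTF-16 code unit.
                result.Append((char)((buffer >> bitCount) & 0xFFFF));
                // Keep only the bits that have not been emitted yet.
                buffer &= (1 << bitCount) - 1;
            }
        }

        return result.ToString();
    }

    public static void Main()
    {
        // "Anu" is what follows the '+' in the problem string "+Anu".
        string decoded = DecodeShiftedRun("Anu");
        Console.WriteLine("U+{0:X4}", (int)decoded[0]);

        // The same four bytes read as UTF-8 are left alone.
        byte[] bytes = Encoding.ASCII.GetBytes("+Anu");
        Console.WriteLine(Encoding.UTF8.GetString(bytes));
    }
}
```

Running it prints U+027B and then +Anu. The three base64 characters carry 18 bits, and the first 16 of them are 0x027B, which is the character Adam saw; a UTF-8 decoder treats the same bytes as four ordinary characters.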
