Solved VS2022, how can I read properly unicode from a file into a string? So far everything converts to ansi

sdowney717 · May 19, 2024

Here is what I have done so far:

Dim Fileunicodetype As Object
tester("C:\Temp\MyTest.txt", Fileunicodetype)
Dim content As String
If Fileunicodetype = "System.Text.UTF8Encoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)
If Fileunicodetype = "System.Text.UnicodeEncoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)
If Fileunicodetype = "System.Text.UTF32Encoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF32)

I had to declare it As Object because sr.CurrentEncoding is not a String, so a String variable could not capture its value.

Public Sub tester(ByRef FileName As String, ByRef FileUnicodetype As Object)
'Dim path As String = "c:\temp\MyTestunicodeencoding.txt"
Try
' If File.Exists(path) Then
' File.Delete(path)
' End If

'Use an encoding other than the default (UTF8).
' Dim sw As StreamWriter = New StreamWriter(path, False, New UTF32Encoding())
' Dim sw As New StreamWriter(path, False, New UTF8Encoding())
'Dim sw As New StreamWriter(path, False, New UTF32Encoding())
'Dim sw As New StreamWriter(path, False, New UTF7Encoding())
'Dim sw As New StreamWriter(path, False, New UnicodeEncoding())

' sw.WriteLine("This")
' sw.WriteLine("is some text")
' sw.WriteLine("to test")
' sw.WriteLine("Reading")
' sw.Close()

'********************************************************************
Dim sr As New StreamReader(FileName, True)
Dim Countchars As Integer
Do While sr.Peek() >= 0
'Debug.Write(Convert.ToChar(sr.Read()))
Countchars += 1
If Countchars > 10 Then Exit Do
Loop
Debug.WriteLine(" ")

'Test for the encoding after reading, or at least
'after the first read.

Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
Debug.WriteLine(" ")
FileUnicodetype = sr.CurrentEncoding
sr.Close()
Catch e As Exception

Debug.Print("The process failed: {0}", e.ToString())
Debug.WriteLine(" ")
FileUnicodetype = "EncodingUnknown"

End Try

End Sub

pseymour said:
UnicodeEncoding Class (System.Text)

Represents a UTF-16 encoding of Unicode characters.

learn.microsoft.com

sdowney717 · May 19, 2024

And even that is not enough, because UTF-16 can be big-endian or little-endian.

BitConverter.IsLittleEndian Field (System)

Indicates the byte order ("endianness") in which data is stored in this computer architecture.

learn.microsoft.com

If the file is reported as UnicodeEncoding, do I have to check this as well?
Meaning this line would fail:

If Fileunicodetype = "System.Text.UnicodeEncoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)

sdowney717 · May 19, 2024

I am reading some conflicting information now. This comment says StreamReader auto-detects the Unicode encoding of a file:

@JimMischel & @MarkJ: where does it say that it defaults to UTF-8? All I can see is: The character encoding is set by the encoding parameter, and the buffer size is set to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
– CJ7

Meaning most of this extra coding is not needed.
Maybe I can test this by using that sub to create files in different Unicode encodings and see what happens!

What character encoding is used by StreamReader.ReadToEnd()?

What character encoding is used by StreamReader.ReadToEnd()? What would be the reason to use (b) instead of (a) below? Is there a risk of their being a character encoding problem if (a) is used i...

stackoverflow.com
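For reference, the byte order marks that StreamReader looks for can be listed with Encoding.GetPreamble, the method named in the quoted documentation. This is only a minimal sketch, not code from the thread; the names PreambleDemo and PreambleHex are made up for illustration.

```vbnet
Imports System
Imports System.Text

Module PreambleDemo
    ' Returns the byte order mark (preamble) an encoding writes, as hex text,
    ' or "(none)" when the encoding writes no preamble.
    Function PreambleHex(enc As Encoding) As String
        Dim bom As Byte() = enc.GetPreamble()
        If bom.Length = 0 Then Return "(none)"
        Return BitConverter.ToString(bom)
    End Function

    Sub Main()
        Console.WriteLine("Encoding.UTF8:             " & PreambleHex(Encoding.UTF8))
        Console.WriteLine("New UTF8Encoding():        " & PreambleHex(New UTF8Encoding()))
        Console.WriteLine("Encoding.Unicode:          " & PreambleHex(Encoding.Unicode))
        Console.WriteLine("Encoding.BigEndianUnicode: " & PreambleHex(Encoding.BigEndianUnicode))
        Console.WriteLine("Encoding.UTF32:            " & PreambleHex(Encoding.UTF32))
    End Sub
End Module
```

A file that starts with none of these marks gives StreamReader nothing to detect, so it uses the encoding passed to the constructor, or UTF-8 if none was passed.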

pseymour · May 19, 2024

That’s what I said in post #9.

pseymour · May 19, 2024

sdowney717 said:
I am not opening any old Unicode text file, my program is opening a MARC 21 file which could be Unicode.

Seems you're overcomplicating this.

Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records.

from: Character Sets: UCS/Unicode Environment: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)

If the Leader has an A in position 9, then the file is Unicode (UTF-8). Otherwise, it's MARC-8.

from: MARC 21 Format for Bibliographic Data: lead: Leader (Network Development and MARC Standards Office, Library of Congress)
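The leader check described above could be sketched roughly as follows. This is an illustration only: the module name, the function name, and the two sample leaders are made up, and the comparison ignores case because the post writes "A". Decoding MARC-8 itself is not covered; as far as I know .NET has no built-in MARC-8 encoding.

```vbnet
Imports System
Imports System.Text

Module MarcLeaderCheck
    ' Returns True when leader position 9 (zero-based) flags UCS/Unicode.
    ' Anything else (normally a blank) is treated as MARC-8.
    Function IsUnicodeLeader(leader As Byte()) As Boolean
        ' A MARC 21 leader is 24 bytes long; reject anything shorter.
        If leader Is Nothing OrElse leader.Length < 24 Then Return False
        Return leader(9) = AscW("a"c) OrElse leader(9) = AscW("A"c)
    End Function

    Sub Main()
        ' Made-up sample leaders, 24 characters each.
        Dim unicodeLeader As Byte() = Encoding.ASCII.GetBytes("00714cam a2200205 a 4500")
        Dim marc8Leader As Byte() = Encoding.ASCII.GetBytes("00714cam  2200205 a 4500")

        Console.WriteLine("Leader 1 is Unicode: " & IsUnicodeLeader(unicodeLeader).ToString())
        Console.WriteLine("Leader 2 is Unicode: " & IsUnicodeLeader(marc8Leader).ToString())
    End Sub
End Module
```

In a real program the 24 bytes would come from the start of each record, and a True result means the record can be decoded with Encoding.UTF8.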

sdowney717 · May 19, 2024

pseymour said:
That’s what I said in post #9.

I don't exactly know what I am doing here.

I created 4 files with different Unicode encodings, then tried to read them,
using code like this:

Dim path As String = "c:\temp\MyTestunicodeencoding.txt"

'Dim sw As New StreamWriter(path, False, New UTF7Encoding())
'Dim sw As New StreamWriter(path, False, New UTF8Encoding())
'Dim sw As New StreamWriter(path, False, New UnicodeEncoding())
Dim sw As New StreamWriter(path, False, New UTF32Encoding())
sw.WriteLine("This")
sw.WriteLine("is some text")
sw.WriteLine("to test")
sw.WriteLine("圖圖3解弓月金難手中大")
sw.Close()
Exit Sub

The last 2 output lines from ?filereader below are the files written with UnicodeEncoding and UTF32Encoding.
Using Encoding.Default, only those 2 files were read properly.

Using Encoding.Default did not read the files made with UTF-7 or UTF-8 encoding correctly.

What line of code would auto-detect the Unicode encoding and work for at least UTF-8, UTF-16 and UTF-32, or at least for UTF-8 and UTF-16?
Is Encoding.Default going to read big-endian and little-endian files properly?

Filename = "c:\temp\MyTestunicodeencoding.txt"
Dim FileReader As String
FileReader = My.Computer.FileSystem.ReadAllText(Filename, System.Text.Encoding.Default)

?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "+VxZXFg-3+ieNfE2cIkdGW42JLTi1ZJw-" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "åœ–åœ–3è§£å¼“æœˆé‡‘é›£æ‰‹ä¸å¤§" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "圖圖3解弓月金難手中大" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "圖圖3解弓月金難手中大" & vbCrLf
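One likely explanation for the four results above: New UnicodeEncoding() and New UTF32Encoding() write a byte order mark by default, while New UTF8Encoding() and UTF-7 do not, and ReadAllText only falls back to the encoding it was given when it finds no mark. A rough sketch of that round trip follows. Latin-1 stands in for the ANSI code page so the result does not depend on the machine, and RoundTrips is a made-up helper name.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module BomRoundTrip
    ' Writes content with writeEnc, reads it back with a Latin-1 fallback
    ' (File.ReadAllText still honors a byte order mark if one is present),
    ' and reports whether the text survived unchanged.
    Function RoundTrips(content As String, writeEnc As Encoding) As Boolean
        Dim tempFile As String = Path.GetTempFileName()
        Try
            File.WriteAllText(tempFile, content, writeEnc)
            Dim fallback As Encoding = Encoding.GetEncoding("iso-8859-1")
            Return File.ReadAllText(tempFile, fallback) = content
        Finally
            File.Delete(tempFile)
        End Try
    End Function

    Sub Main()
        Dim sample As String = "圖解弓月"
        Console.WriteLine("UTF-8 without BOM: " & RoundTrips(sample, New UTF8Encoding()).ToString())
        Console.WriteLine("UTF-8 with BOM:    " & RoundTrips(sample, New UTF8Encoding(True)).ToString())
        Console.WriteLine("UTF-16:            " & RoundTrips(sample, New UnicodeEncoding()).ToString())
        Console.WriteLine("UTF-32:            " & RoundTrips(sample, New UTF32Encoding()).ToString())
    End Sub
End Module
```

Only the BOM-less UTF-8 file should fail to round-trip here, which matches the garbled third line in the second ?filereader output.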

sdowney717 · May 19, 2024

pseymour said:
Seems you're overcomplicating this.

from: Character Sets: UCS/Unicode Environment: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)

If the Leader has an A in position 9, then the file is Unicode (UTF-8). Otherwise, it's MARC-8.

from: MARC 21 Format for Bibliographic Data: lead: Leader (Network Development and MARC Standards Office, Library of Congress)

Yes, all I have seen is UTF-8 for MARC 21.

I still need some help to figure out my prior post.

pseymour · May 19, 2024

Can you tell me which version of .NET you're working with? I want to be on the same page.

pseymour · May 19, 2024

Anyhoooo... given the following test files

Code:

./Sample File - ASCII.txt:      ASCII text
./Sample File - UTF-32.txt:     Unicode text, UTF-32, little-endian
./Sample File - UTF-7.txt:      ASCII text
./Sample File - UTF-8.txt:      Unicode text, UTF-8 (with BOM) text
./Sample File - Unicode BE.txt: Unicode text, UTF-16, big-endian text
./Sample File - Unicode LE.txt: Unicode text, UTF-16, little-endian text

The StreamReader class I mentioned in #9 and you mentioned in #23 is the easiest way to handle this.

Code:

For Each filePath As String In IO.Directory.GetFiles("D:\", "*.txt", IO.SearchOption.TopDirectoryOnly)
    Console.WriteLine("reading file ""{0}""", filePath)

    Using reader As New IO.StreamReader(filePath, True)
        Dim contents As String = reader.ReadToEnd()
        Console.WriteLine("file encoding: {0}", reader.CurrentEncoding)
        Try
            Console.OutputEncoding = reader.CurrentEncoding
        Catch
        End Try
        Console.WriteLine(contents)
        reader.Close()
    End Using

    Console.WriteLine(Environment.NewLine & (New String("="c, 50)) & Environment.NewLine)
Next

Code:

reading file "D:\Sample File - ASCII.txt"
file encoding: System.Text.UTF8Encoding
????? Sample Text ???????

==================================================

reading file "D:\Sample File - Unicode BE.txt"
file encoding: System.Text.UnicodeEncoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - Unicode LE.txt"
file encoding: System.Text.UnicodeEncoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - UTF-32.txt"
file encoding: System.Text.UTF32Encoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - UTF-7.txt"
file encoding: System.Text.UTF8Encoding
+doRmL04NYhFO1g- Sample Text +BEgEOARABD4EOgQ4BDk-

==================================================

reading file "D:\Sample File - UTF-8.txt"
file encoding: System.Text.UTF8Encoding
的是不我他 Sample Text широкий

==================================================

Obviously the ASCII and UTF-7 encodings had trouble with the non-English characters.

sdowney717 · May 19, 2024

I have figured out some things.
I can use Notepad++ to change the file encoding, and also to add or leave off the BOM for UTF-8 files.

Then I can test it in VS2022. So yes, I am making some headway.
That app has been super handy for me.

How can I make Notepad to save text in UTF-8 without the BOM?

I have a CSV file with special accents and save it in Notepad by selecting UTF-8 encoding. When I read the file using Java, it reads the BOM characters too. So I want to save this file in UTF-8 fo...

stackoverflow.com

sdowney717 · May 19, 2024

pseymour said:
Can you tell me which version of .NET you're working with? I want to be on the same page.

it's 4.8

sdowney717 · May 19, 2024

So now this code snippet is really working to select how to open Unicode files.
I really do appreciate your help.

tester(FilenameToBreak, Fileunicodetype)
Dim content As String
If Fileunicodetype = "Unicode (UTF-8)" Or Fileunicodetype = "EncodingUnknown" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8) 'utf8
If Fileunicodetype = "Unicode" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.Default) 'utf16
If Fileunicodetype = "Unicode (Big-Endian)" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode) 'utf16
If Fileunicodetype = "Unicode (UTF-32)" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF32) 'or use default

In the sub I have been using, I got rid of As Object, since that was failing; the parameter can be As String. I had to change one line that uses sr.CurrentEncoding to this:

FileUnicodetype = sr.CurrentEncoding.EncodingName 'CurrentEncoding

This gets the encoding very early in the file read, after just 10 loops, a limit I set arbitrarily.

Public Sub tester(ByRef FileName As String, ByRef FileUnicodetype As String)
'will tell you the file unicode encoding

Try
'********************************************************************
Dim sr As New StreamReader(FileName, True)
Dim Countchars As Integer
Do While sr.Peek() >= 0
'Debug.Write(Convert.ToChar(sr.Read()))
Countchars += 1
If Countchars > 10 Then Exit Do
Loop
Debug.WriteLine(" ")

'Test for the encoding after reading, or at least
'after the first read.

' Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)

FileUnicodetype = sr.CurrentEncoding.EncodingName 'CurrentEncoding
sr.Close()
Catch e As Exception

'Debug.Print("The process failed: {0}", e.ToString())

FileUnicodetype = "EncodingUnknown"

End Try

End Sub

sdowney717 · May 19, 2024

And the Notepad++ app gave me a huge boost with a file that showed Swedish chars but kept failing to be read as any kind of Unicode; it gave strange results... I was really stuck and confused. I kept wondering why VS2022 with my coding could not read the string; it always had ???????

THEN I remembered: I had created the file using Windows code page 1252...
When I opened the file in Notepad++ and checked the encoding, it said ANSI !!!
Changing the file encoding to Unicode, all the Swedish chars disappeared into a mess like XE4 XF6 Xe5
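For files like that one, with no byte order mark and possibly not UTF-8 at all, one common approach is to try strict UTF-8 first and fall back to a code page only if the bytes are not valid UTF-8. The sketch below is an assumption about how that could look, not something anyone in the thread posted; DecodeText and the sample bytes are made up.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module Utf8OrFallback
    ' Decodes data as Unicode (BOM-detected, otherwise strict UTF-8). If the
    ' bytes are not valid UTF-8, decodes them with the fallback encoding.
    Function DecodeText(data As Byte(), fallback As Encoding) As String
        ' The second argument (throwOnInvalidBytes) makes bad UTF-8 raise an
        ' exception instead of silently turning into replacement characters.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Using reader As New StreamReader(New MemoryStream(data), strictUtf8, True)
                Return reader.ReadToEnd()
            End Using
        Catch ex As DecoderFallbackException
            Return fallback.GetString(data)
        End Try
    End Function

    Sub Main()
        ' "räksmörgås" as Windows-1252 bytes: ä=E4, ö=F6, å=E5.
        Dim ansiBytes As Byte() = {&H72, &HE4, &H6B, &H73, &H6D, &HF6, &H72, &H67, &HE5, &H73}
        ' GetEncoding(1252) works as-is on .NET Framework 4.8; newer .NET
        ' versions need the code pages provider registered first.
        Console.WriteLine(DecodeText(ansiBytes, Encoding.GetEncoding(1252)))
    End Sub
End Module
```

This is a heuristic: short ANSI text can occasionally also be valid UTF-8, so the fallback is a best guess rather than a guarantee.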

pseymour · May 19, 2024

I'm not sure why you're doing all this read the file, get the encoding, then a bunch of If statements. Just open the file with StreamReader and start doing stuff with the contents.

Edit: Let me try to clarify. You're using StreamReader to crack open the file and peek inside, to determine the encoding. Then based on that, you're opening the file based on that encoding, and taking action of some sort on the contents. But StreamReader already had the file open and was reading it. As you can see from the mini code sample I last posted, it was able to handle files of six different encodings. A couple of the files themselves didn't handle the non-English characters, but that's because they were written with encodings that don't handle that very well.

sdowney717 · May 20, 2024

pseymour said:
I'm not sure why you're doing all this read the file, get the encoding, then a bunch of If statements. Just open the file with StreamReader and start doing stuff with the contents.

Edit: Let me try to clarify. You're using StreamReader to crack open the file and peek inside, to determine the encoding. Then based on that, you're opening the file based on that encoding, and taking action of some sort on the contents. But StreamReader already had the file open and was reading it. As you can see from the mini code sample I last posted, it was able to handle files of six different encodings. A couple of the files themselves didn't handle the non-English characters, but that's because they were written with encodings that don't handle that very well.

Not sure? Absolutely, I am a real newbie here.

I just tried this and it seems to work on 4 Unicode file types.
Is this going to be sufficient? Or should I try to catch an exception?
I get the file name from a file dialog, so the file exists.

Can it really be this easy to just do the following?

Dim sr As New StreamReader(FilenameToBreak, True)
content = sr.ReadToEnd()
sr.Close()
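The same three lines can be wrapped in a Using block so the reader is closed even if the read throws. A minimal sketch, assuming a made-up helper name (ReadTextFile) and a temporary test file rather than a real MARC file:

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ReadWithDetection
    ' Reads a whole text file, letting StreamReader pick the encoding from the
    ' byte order mark. Files without a mark are read as UTF-8.
    Function ReadTextFile(fileName As String) As String
        Using reader As New StreamReader(fileName, True)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        Dim tempFile As String = Path.GetTempFileName()
        Try
            ' Write a big-endian UTF-16 file (with BOM) and read it back.
            File.WriteAllText(tempFile, "широкий", Encoding.BigEndianUnicode)
            Console.WriteLine(ReadTextFile(tempFile))
        Catch ex As IOException
            ' The file can still be locked or removed after the dialog closes.
            Console.WriteLine("Could not read the file: " & ex.Message)
        Finally
            File.Delete(tempFile)
        End Try
    End Sub
End Module
```

Catching IOException around the call is cheap insurance even when the name comes from a file dialog.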

Somehow I did not think it would be that easy. The MS Learn web page had so much info that I found it overwhelming; I just thought it could not be that simple. Why doesn't MS just show the 3 simple lines, at least to let a coder see something working?

StreamReader Constructor (System.IO)

Initializes a new instance of the StreamReader class for the specified stream.

learn.microsoft.com

Going through the extra coding did help me learn some more about Unicode and StreamReader. Maybe more than I really wanted to know.

sdowney717 · May 20, 2024

Dim file As New StreamWriter(FileNameToCreate, False, New UTF8Encoding())

This is how I write to a file using utf8 encoding.
If the encoding is not specified on the DIM, does it assume UTF8?

I process the string called 'content' with a lot of coding.
Been testing and it is writing unicode chars, so I am happy.
For a while nothing was working right, but then I figured out enough to make it function.

file.Write(MarcData)

pseymour · May 20, 2024

I would say that it's likely to be that easy, but there are no guarantees. There really is no such thing as a simple text file. Multiple encodings, languages, code pages, etc. It's a jungle out there.

I'll give a different example. Let's say you're making an app to handle images. I pass you a file with a JPEG extension, but I put a PNG header (signature) at the beginning of the data. What does your app do then? The point is, don't trust anything a user gives you. Verify the contents. Sounds like you're doing that ("I process the string called 'content' with a lot of coding.").

In the case of MARC-21 files, make sure the data you get for each field makes sense for that field, before doing something with it.

sdowney717 said:
Dim file As New StreamWriter(FileNameToCreate, False, New UTF8Encoding())
This is how I write to a file using utf8 encoding.
If the encoding is not specified on the DIM, does it assume UTF8?

It looks like, if you don't pass an encoding to the constructor, it uses UTF-8. So I would say yes.
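A quick way to check this is to look at the bytes a StreamWriter actually produces. The sketch below is illustrative only (WrittenBytesHex is a made-up name) and writes to a MemoryStream instead of a file.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module WriterDefaults
    ' Returns, as hex text, the bytes a StreamWriter produces for content.
    ' Passing Nothing for enc uses the constructor that takes no encoding.
    Function WrittenBytesHex(content As String, enc As Encoding) As String
        Using ms As New MemoryStream()
            Dim writer As StreamWriter =
                If(enc Is Nothing, New StreamWriter(ms), New StreamWriter(ms, enc))
            writer.Write(content)
            writer.Flush()
            Return BitConverter.ToString(ms.ToArray())
        End Using
    End Function

    Sub Main()
        Console.WriteLine("No encoding given:  " & WrittenBytesHex("é", Nothing))
        Console.WriteLine("New UTF8Encoding(): " & WrittenBytesHex("é", New UTF8Encoding()))
        Console.WriteLine("Encoding.UTF8:      " & WrittenBytesHex("é", Encoding.UTF8))
    End Sub
End Module
```

The first two cases write plain UTF-8 with no byte order mark; Encoding.UTF8 adds the EF-BB-BF mark in front. That matters when another program has to guess the encoding of the output file.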
