Solved: VS2022, how can I properly read Unicode from a file into a string? So far everything converts to ANSI.


This is what I have done so far:
Dim Fileunicodetype As Object
tester("C:\Temp\MyTest.txt", Fileunicodetype)
Dim content As String
If Fileunicodetype = "System.Text.UTF8Encoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)
If Fileunicodetype = "System.Text.UnicodeEncoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)
If Fileunicodetype = "System.Text.UTF32Encoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF32)
I had to use As Object because sr.CurrentEncoding, whose value I need to capture, is not a String.
Public Sub tester(ByRef FileName As String, ByRef FileUnicodetype As Object)
    'Dim path As String = "c:\temp\MyTestunicodeencoding.txt"
    Try
        ' If File.Exists(path) Then
        '     File.Delete(path)
        ' End If

        'Use an encoding other than the default (UTF8).
        ' Dim sw As StreamWriter = New StreamWriter(path, False, New UTF32Encoding())
        ' Dim sw As New StreamWriter(path, False, New UTF8Encoding())
        'Dim sw As New StreamWriter(path, False, New UTF32Encoding())
        'Dim sw As New StreamWriter(path, False, New UTF7Encoding())
        'Dim sw As New StreamWriter(path, False, New UnicodeEncoding())

        ' sw.WriteLine("This")
        ' sw.WriteLine("is some text")
        ' sw.WriteLine("to test")
        ' sw.WriteLine("Reading")
        ' sw.Close()

        '********************************************************************
        Dim sr As New StreamReader(FileName, True)
        Dim Countchars As Integer
        Do While sr.Peek() >= 0
            'Debug.Write(Convert.ToChar(sr.Read()))
            Countchars += 1
            If Countchars > 10 Then Exit Do
        Loop
        Debug.WriteLine(" ")

        'Test for the encoding after reading, or at least
        'after the first read.

        Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
        Debug.WriteLine(" ")
        FileUnicodetype = sr.CurrentEncoding
        sr.Close()
    Catch e As Exception

        Debug.Print("The process failed: {0}", e.ToString())
        Debug.WriteLine(" ")
        FileUnicodetype = "EncodingUnknown"

    End Try
End Sub
 

My Computer

System One

  • OS
    windows 11
    Computer type
    PC/Desktop
    Manufacturer/Model
    some kind of old ASUS MB
    CPU
    old AMD B95
    Motherboard
    ASUS
    Memory
    8gb
    Hard Drives
    ssd WD 500 gb
And even that is not enough, because UTF-16 can be big- or little-endian.

If the encoding is reported as UnicodeEncoding, do I have to check the byte order too?
Meaning this line would fail:

If Fileunicodetype = "System.Text.UnicodeEncoding" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)
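
On the big- versus little-endian question, here is a minimal sketch that uses only the standard System.Text types. Both UTF-16 byte orders are instances of UnicodeEncoding, so the type name alone cannot separate them, but the code page and the byte order mark returned by GetPreamble can. The module name EndianCheck is made up for illustration.

```vbnet
Imports System
Imports System.Text

Module EndianCheck
    Sub Main()
        ' Encoding.Unicode is UTF-16 little-endian (code page 1200);
        ' Encoding.BigEndianUnicode is UTF-16 big-endian (code page 1201).
        Dim candidates As Encoding() = {Encoding.Unicode, Encoding.BigEndianUnicode}

        For Each enc As Encoding In candidates
            ' Both entries report the same type name; CodePage and the BOM differ.
            Console.WriteLine("{0}  CodePage={1}  BOM={2}",
                              enc.GetType().FullName,
                              enc.CodePage,
                              BitConverter.ToString(enc.GetPreamble()))
        Next
    End Sub
End Module
```

Comparing CodePage (or EncodingName) rather than the type name is one way to tell the two apart if a branch per encoding is really wanted.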
 

I am reading some conflicting information now. This commenter says StreamReader auto-detects the Unicode file type:
@JimMischel & @MarkJ: where does it say that it defaults to UTF-8? All I can see is: The character encoding is set by the encoding parameter, and the buffer size is set to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
CJ7
Meaning most of this extra coding is not needed.
Maybe I can test this by using that sub to create files with different Unicode encodings and see what happens!

 

That’s what I said in post #9.
 

My Computers

System One System Two

  • OS
    Windows 11 Pro 24H2
    Computer type
    PC/Desktop
    Manufacturer/Model
    Intel NUC12WSHi7
    CPU
    12th Gen Intel Core i7-1260P, 2100 MHz
    Motherboard
    NUC12WSBi7
    Memory
    64 GB
    Graphics Card(s)
    Intel Iris Xe
    Sound Card
    built-in Realtek HD audio
    Monitor(s) Displays
    Dell U3219Q
    Screen Resolution
    3840x2160 @ 60Hz
    Hard Drives
    Samsung SSD 990 PRO 1TB
    Keyboard
    CODE 104-Key Mechanical with Cherry MX Clears
    Antivirus
    Microsoft Defender
  • Operating System
    Linux Mint 21.2 (Cinnamon)
    Computer type
    PC/Desktop
    Manufacturer/Model
    Intel NUC8i5BEH
    CPU
    Intel Core i5-8259U CPU @ 2.30GHz
    Memory
    32 GB
    Graphics card(s)
    Iris Plus 655
    Keyboard
    CODE 104-Key Mechanical with Cherry MX Clears
I am not opening any old Unicode text file, my program is opening a MARC 21 file which could be Unicode.

Seems you're overcomplicating this.

Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records.
from: Character Sets: UCS/Unicode Environment: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)

If the Leader has an "a" (lowercase) in position 09, then the file is Unicode (UTF-8). Otherwise, it's MARC-8.

from: MARC 21 Format for Bibliographic Data: lead: Leader (Network Development and MARC Standards Office, Library of Congress)
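
The Leader check described above could be sketched roughly as follows. This assumes the file begins with a MARC 21 record whose 24-byte Leader sits at offset 0, and it only inspects that first record. The module and function names are hypothetical, not part of any MARC library.

```vbnet
Imports System
Imports System.IO

Module MarcLeaderCheck
    ' Hypothetical helper: True when Leader/09 of the first record is "a" (UCS/Unicode).
    Function IsUnicodeMarc(path As String) As Boolean
        Dim leader(23) As Byte   ' the Leader is 24 bytes, positions 00-23
        Dim total As Integer = 0

        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
            ' Read may return fewer bytes than requested, so loop until the Leader is complete.
            Do While total < leader.Length
                Dim n As Integer = fs.Read(leader, total, leader.Length - total)
                If n = 0 Then Exit Do
                total += n
            Loop
        End Using

        If total < leader.Length Then
            Throw New InvalidDataException("File is shorter than a 24-byte MARC leader.")
        End If

        ' The comparison is case-insensitive here only to be tolerant of sloppy data.
        Return Char.ToLowerInvariant(ChrW(leader(9))) = "a"c
    End Function

    Sub Main(args As String())
        Console.WriteLine(IsUnicodeMarc(args(0)))
    End Sub
End Module
```

A True result would suggest decoding the record as UTF-8; anything else would need MARC-8 handling, which .NET does not provide out of the box.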
 

That’s what I said in post #9.
I don't exactly know what I am doing here.

I created four files with different Unicode encodings, then tried to read them.
The files were written with code like this:

Dim path As String = "c:\temp\MyTestunicodeencoding.txt"


'Dim sw As New StreamWriter(path, False, New UTF7Encoding())
'Dim sw As New StreamWriter(path, False, New UTF8Encoding())
'Dim sw As New StreamWriter(path, False, New UnicodeEncoding())
Dim sw As New StreamWriter(path, False, New UTF32Encoding())
sw.WriteLine("This")
sw.WriteLine("is some text")
sw.WriteLine("to test")
sw.WriteLine("圖圖3解弓月金難手中大")
sw.Close()
Exit Sub
The last two output lines from ?filereader below came from reading the files written with UnicodeEncoding and UTF32Encoding.
Using Encoding.Default, only those two files were read properly.

Using Encoding.Default did not read the files made with UTF-7 or UTF-8 encoding.

What line of code would auto-detect the Unicode encoding and work for at least UTF-8, UTF-16 and UTF-32, or at least for UTF-8 and UTF-16?
Is Encoding.Default going to read big- and little-endian files properly?

Filename = "c:\temp\MyTestunicodeencoding.txt"
Dim FileReader As String
FileReader = My.Computer.FileSystem.ReadAllText(Filename, System.Text.Encoding.Default)
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "+VxZXFg-3+ieNfE2cIkdGW42JLTi1ZJw-" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "圖圖3解弓月金難手ä¸å¤§" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "圖圖3解弓月金難手中大" & vbCrLf
?filereader
"This" & vbCrLf & "is some text" & vbCrLf & "to test" & vbCrLf & "圖圖3解弓月金難手中大" & vbCrLf
 

Yes, all I have seen is UTF-8 for MARC 21.

I still need some help to figure out my prior post.
 

Can you tell me which version of .NET you're working with? I want to be on the same page.
 

Anyhoooo... given the following test files

Code:
./Sample File - ASCII.txt:      ASCII text
./Sample File - UTF-32.txt:     Unicode text, UTF-32, little-endian
./Sample File - UTF-7.txt:      ASCII text
./Sample File - UTF-8.txt:      Unicode text, UTF-8 (with BOM) text
./Sample File - Unicode BE.txt: Unicode text, UTF-16, big-endian text
./Sample File - Unicode LE.txt: Unicode text, UTF-16, little-endian text

The StreamReader class I mentioned in #9 and you mentioned in #23 is the easiest way to handle this.

Code:
For Each filePath As String In IO.Directory.GetFiles("D:\", "*.txt", IO.SearchOption.TopDirectoryOnly)
    Console.WriteLine("reading file ""{0}""", filePath)

    Using reader As New IO.StreamReader(filePath, True)
        Dim contents As String = reader.ReadToEnd()
        Console.WriteLine("file encoding: {0}", reader.CurrentEncoding)
        Try
            Console.OutputEncoding = reader.CurrentEncoding
        Catch
        End Try
        Console.WriteLine(contents)
    End Using

    Console.WriteLine(Environment.NewLine & (New String("="c, 50)) & Environment.NewLine)
Next


Code:
reading file "D:\Sample File - ASCII.txt"
file encoding: System.Text.UTF8Encoding
????? Sample Text ???????

==================================================

reading file "D:\Sample File - Unicode BE.txt"
file encoding: System.Text.UnicodeEncoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - Unicode LE.txt"
file encoding: System.Text.UnicodeEncoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - UTF-32.txt"
file encoding: System.Text.UTF32Encoding
的是不我他 Sample Text широкий

==================================================

reading file "D:\Sample File - UTF-7.txt"
file encoding: System.Text.UTF8Encoding
+doRmL04NYhFO1g- Sample Text +BEgEOARABD4EOgQ4BDk-

==================================================

reading file "D:\Sample File - UTF-8.txt"
file encoding: System.Text.UTF8Encoding
的是不我他 Sample Text широкий

==================================================

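Obviously the ASCII and UTF-7 files had trouble with the non-English characters: ASCII cannot represent them, and StreamReader does not auto-detect UTF-7, so that file was decoded as UTF-8.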
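
Since the detection shown above depends on the byte order mark at the start of each file, it can help to look at those bytes directly. The following is a small sketch, with a made-up module name, that prints the first four bytes of a file passed on the command line.

```vbnet
Imports System
Imports System.IO

Module BomDump
    Sub Main(args As String())
        Dim buffer(3) As Byte
        Dim count As Integer

        Using fs As New FileStream(args(0), FileMode.Open, FileAccess.Read)
            count = fs.Read(buffer, 0, buffer.Length)
        End Using

        ' Common byte order marks:
        '   EF-BB-BF     UTF-8
        '   FF-FE        UTF-16 little-endian
        '   FE-FF        UTF-16 big-endian
        '   FF-FE-00-00  UTF-32 little-endian
        Console.WriteLine(BitConverter.ToString(buffer, 0, count))
    End Sub
End Module
```

A file with none of these prefixes has no BOM, and StreamReader then falls back to the encoding it was given (UTF-8 if none was supplied).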
 

I have figured out some things.
I can use Notepad++ to change the file encoding and also to add or leave off the BOM for UTF-8 files.

Then I can test the result in VS2022. So yes, I am making some headway.
That app has been super handy for me.

 

So now this code snippet is really working to select how to open Unicode files.
I really do appreciate your help.

tester(FilenameToBreak, Fileunicodetype)
Dim content As String
If Fileunicodetype = "Unicode (UTF-8)" Or Fileunicodetype = "EncodingUnknown" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8) 'UTF-8
If Fileunicodetype = "Unicode" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.Unicode) 'UTF-16 little-endian
If Fileunicodetype = "Unicode (Big-Endian)" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.BigEndianUnicode) 'UTF-16 big-endian
If Fileunicodetype = "Unicode (UTF-32)" Then content = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF32) 'UTF-32

Here is the sub I have been using. I got rid of As Object, which was failing; the parameter can be As String. I had to change one line, the one using sr.CurrentEncoding, to this:

FileUnicodetype = sr.CurrentEncoding.EncodingName 'CurrentEncoding

This gets the encoding very early in the file reading, after just 10 loops, a number I set arbitrarily.

Public Sub tester(ByRef FileName As String, ByRef FileUnicodetype As String)
    'will tell you the file unicode encoding
    Try
        '********************************************************************
        Dim sr As New StreamReader(FileName, True)
        Dim Countchars As Integer
        Do While sr.Peek() >= 0
            'Debug.Write(Convert.ToChar(sr.Read()))
            Countchars += 1
            If Countchars > 10 Then Exit Do
        Loop
        Debug.WriteLine(" ")

        'Test for the encoding after reading, or at least
        'after the first read.

        ' Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)

        FileUnicodetype = sr.CurrentEncoding.EncodingName 'CurrentEncoding
        sr.Close()
    Catch e As Exception

        'Debug.Print("The process failed: {0}", e.ToString())

        FileUnicodetype = "EncodingUnknown"

    End Try
End Sub
 

And Notepad++ gave me a huge boost with a file I had that showed Swedish characters but kept failing to be read as any kind of Unicode. It gave strange results... I was really stuck and confused, and kept wondering why my code in VS2022 could not read the string; it always came out as ???????

THEN I remembered: I had created the file using Windows code page 1252...
When I opened the file in Notepad++ and checked the encoding, it said ANSI!!!
When I changed the file encoding to Unicode, all the Swedish characters turned into a mess like xE4 xF6 xE5.
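
For a file that really is Windows-1252 (where bytes E4, F6 and E5 are ä, ö and å), one option is to name that code page explicitly when reading. The sketch below is written under assumptions: the module name is invented, the path comes from the command line, and the provider registration shown in the comment is only believed to be needed on .NET Core / .NET 5 and later, not on .NET Framework.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module AnsiRead
    Sub Main(args As String())
        ' Assumption: on .NET Core / .NET 5+ code page 1252 is only available after
        ' registering the code-pages provider, so uncomment the next line there.
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        Dim ansi As Encoding = Encoding.GetEncoding(1252)

        ' ReadAllText still honors a BOM if one is present; 1252 is used otherwise.
        Dim text As String = File.ReadAllText(args(0), ansi)
        Console.WriteLine(text)
    End Sub
End Module
```

Nothing in the file itself says "this is 1252", so this only works when the code page is known from how the file was created.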
 

I'm not sure why you're doing all this: read the file, get the encoding, then run a bunch of If statements. Just open the file with StreamReader and start doing stuff with the contents.

Edit: Let me try to clarify. You're using StreamReader to crack open the file and peek inside to determine the encoding. Then, based on that, you're opening the file again with that encoding and taking action of some sort on the contents. But StreamReader already had the file open and was reading it. As you can see from the mini code sample I last posted, it was able to open files of six different encodings. A couple of the files didn't show the non-English characters correctly, but that's because they were written with an encoding that can't represent them (ASCII) or that StreamReader doesn't auto-detect (UTF-7).
 
Not sure? Absolutely, I am a real newbie here.

I just tried this and it seems to work on four Unicode file types.
Is this going to be sufficient, or should I try to catch an exception?
I get the file name from a file dialog, so the file exists.

Can it really be this easy, just doing the following?
Dim sr As New StreamReader(FilenameToBreak, True)
content = sr.ReadToEnd()
sr.Close()

Somehow I did not think it would be that easy. The MS learning web page had so much info that I found it overwhelming; I just thought it could not be that simple. Why doesn't MS just show three simple lines, at least to let a coder see something working?

Going through the extra coding did help me learn some more about Unicode and StreamReader. Maybe more than I really wanted to know.
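
On the question of catching exceptions, here is one possible shape for the three-line read, sketched with a Using block so the reader is closed even when something goes wrong. The module and function names are hypothetical, and returning Nothing on failure is just one choice among several.

```vbnet
Imports System
Imports System.IO

Module ReadWithDetection
    ' Hypothetical helper: returns the file's text, or Nothing if it could not be read.
    Function ReadAllDetected(path As String) As String
        Try
            ' True asks StreamReader to look for a byte order mark;
            ' without one it decodes the file as UTF-8.
            Using sr As New StreamReader(path, True)
                Return sr.ReadToEnd()
            End Using
        Catch ex As IOException
            ' Covers a file that is locked, missing, or removed after the dialog closed.
            Console.Error.WriteLine("Could not read {0}: {1}", path, ex.Message)
            Return Nothing
        Catch ex As UnauthorizedAccessException
            Console.Error.WriteLine("No permission to read {0}: {1}", path, ex.Message)
            Return Nothing
        End Try
    End Function

    Sub Main(args As String())
        Dim content As String = ReadAllDetected(args(0))
        If content IsNot Nothing Then Console.WriteLine(content.Length)
    End Sub
End Module
```

Even when the name comes from a file dialog, the file can still be locked by another program or sit on a drive that disappears, which is why the Try block may be worth keeping.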
 

Dim file As New StreamWriter(FileNameToCreate, False, New UTF8Encoding())
This is how I write to a file using UTF-8 encoding.
If the encoding is not specified in the Dim statement, does it assume UTF-8?

I process the string called 'content' with a lot of code.
I have been testing, and it is writing Unicode characters, so I am happy.
For a while nothing was working right, but then I figured out enough to make it function.
file.Write(MarcData)
 

I would say that it's likely to be that easy, but there are no guarantees. There really is no such thing as a simple text file: multiple encodings, languages, code pages, etc. It's a jungle out there. :-)

I'll give a different example. Let's say you're making an app to handle images. I pass you a file with a JPEG extension, but I put a PNG header (signature) at the beginning of the data. What does your app do then? The point is, don't trust anything a user gives you. Verify the contents. Sounds like you're doing that ("I process the string called 'content' with a lot of coding.").

In the case of MARC-21 files, make sure the data you get for each field makes sense for that field, before doing something with it.

Dim file As New StreamWriter(FileNameToCreate, False, New UTF8Encoding())
This is how I write to a file using utf8 encoding.
If the encoding is not specified on the DIM, does it assume UTF8?

It looks like, if you don't pass an encoding to the constructor, it uses UTF-8. So I would say yes.
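
One detail related to the UTF-8 default is whether a byte order mark gets written. As a small sketch (module name invented), the preamble length of each encoding object shows which ones would emit a BOM when passed to StreamWriter:

```vbnet
Imports System
Imports System.Text

Module Utf8BomCheck
    Sub Main()
        ' New UTF8Encoding() is UTF-8 without a BOM, so GetPreamble() is empty.
        Console.WriteLine("New UTF8Encoding():     {0} BOM bytes",
                          (New UTF8Encoding()).GetPreamble().Length)

        ' Encoding.UTF8 and New UTF8Encoding(True) both carry the 3-byte BOM EF-BB-BF.
        Console.WriteLine("Encoding.UTF8:          {0} BOM bytes",
                          Encoding.UTF8.GetPreamble().Length)
        Console.WriteLine("New UTF8Encoding(True): {0} BOM bytes",
                          (New UTF8Encoding(True)).GetPreamble().Length)
    End Sub
End Module
```

So a file written with New UTF8Encoding(), or with no encoding argument at all, would normally have no BOM. That is fine for a reader that assumes UTF-8, but it leaves nothing for BOM-based detection to find.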
 

