I'm working with a couple of binary files and want to parse the UTF-8 strings that exist in them.
I have a function that takes a starting location in the file and returns the string found there:
def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter != None and index != None):
        return file.read(size).explode('0x00000000')[index] #incorrect
    else:
        return file.read(size)
Some strings in the file are separated by 0x00 00 00 00.
Is it possible to split on these, like PHP's explode? I'm new to Python, so any pointers on code improvements are welcome.
Sample file:
48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00 6f 00 72 00 6c 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00
Hello World123
Here I've marked the 00 00 00 00 separator by enclosing it in | bars.
So:
str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'
Similarly:
str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123'
I'm going to assume you are using Python 2 here, but I'll write the code to work on both Python 2 and Python 3.
You have UTF-16 data, not UTF-8. You can read it as binary data and split on the 4 NUL bytes with the str.split() method (bytes.split() on Python 3):
file.read(size).split(b'\x00' * 4)[index]
The resulting data is encoded as UTF-16 little-endian (you may or may not have omitted the UTF-16 BOM at the start); you can decode the data with:
result.decode('utf-16-le')
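This will fail, though, because the split cuts the text off in the wrong place: Python splits on the first 4 NULs it finds, and the first of those is the trailing NUL byte that belongs to the last character of the text.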
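To see the misalignment concretely, here is a minimal sketch that rebuilds the sample bytes from the question in memory. The variable names are illustrative only and are not part of the original code.

```python
# Sample from the question: "Hello World", 4 NUL bytes, "123", all UTF-16-LE.
data = (u'Hello World'.encode('utf-16-le')
        + b'\x00' * 4
        + u'123'.encode('utf-16-le'))

# Naive approach: split the raw bytes on 4 NUL bytes.
parts = data.split(b'\x00' * 4)

# The match starts at the zero high byte of "d", so that byte is consumed
# as part of the separator and both pieces end up with an odd byte count.
first_len = len(parts[0])    # 21 bytes: "Hello World" minus its final NUL
second_len = len(parts[1])   # 7 bytes: a stray leading NUL plus "123"

# An odd number of bytes cannot be valid UTF-16, so decoding fails.
try:
    parts[0].decode('utf-16-le')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

Both pieces are off by one byte, which is why splitting before decoding is unreliable for this data.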
The better idea is to decode to Unicode first, then split on a double NUL codepoint:
file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index]
Putting this together, the function would be:
def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter is not None and index is not None):
        delimiter = delimiter.decode('utf-16-le') # or pass in Unicode
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    else:
        return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
If the file has a BOM at the start, consider opening the file as UTF-16 to begin with:
import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....
and remove the explicit decoding.
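As a rough sketch of the BOM variant, the following writes a small temporary file with a BOM and reads it back in text mode. The file name and the temporary directory are assumptions made for illustration; note also that in text mode read(size) counts characters rather than bytes, so any size argument would need adjusting.

```python
import io
import os
import tempfile

# Hypothetical sample file: the plain 'utf-16' codec writes a BOM first.
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with io.open(path, 'wb') as out:
    out.write(u'Hello World\x00\x00123'.encode('utf-16'))

# Opening with encoding='utf-16' consumes the BOM and decodes for us,
# so no explicit .decode() call is needed. newline='' disables newline
# translation, which binary-derived text should not go through.
with io.open(path, 'r', encoding='utf-16', newline='') as fobj:
    parts = fobj.read().split(u'\x00' * 2)
```

Here the separator is already a Unicode string of two NUL codepoints, matching the 4 NUL bytes on disk.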
Python 2 demo:
>>> from io import BytesIO
>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> fobj = BytesIO(data)
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 0)
u'Hello World'
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 1)
u'123'