Python: split bytes with a hexadecimal delimiter


I'm working with a couple of binary files and want to parse the UTF-8 strings that exist in them.

I have a function that takes the starting location in a file and returns the string found there:

def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter != None and index != None):
        return file.read(size).explode('0x00000000')[index] #incorrect
    else:
        return file.read(size)

Some strings in the file are separated by 0x00 00 00 00. Is it possible to split on these, like PHP's explode? I'm new to Python, so any pointers on code improvements are welcome.

Sample file:

48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00 6f 00 72 00 6c 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00

This is Hello World123; I've marked the 00 00 00 00 separator by enclosing it in | bars.
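To experiment with this sample without the original file, the hex dump can be turned into a bytes object. This is a minimal Python 3 sketch that assumes the dump above is the complete file; the names hex_dump and sample are only illustrative:

```python
# Hex dump from the sample above, with the | markers removed.
hex_dump = (
    "48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00 6f 00 72 00 6c 00 64 00 "
    "00 00 00 00 "
    "31 00 32 00 33 00"
)

# bytes.fromhex() skips the spaces between the byte pairs.
sample = bytes.fromhex(hex_dump)

print(len(sample))  # total number of bytes in the sample
print(sample[:4])   # the first two 2-byte code units
```

Every character occupies two bytes with the second byte set to 00, which is the pattern of UTF-16 little-endian text rather than UTF-8.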

So:

str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'

Similarly:

str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123' 

I'm going to assume you are using Python 2 here, but I'll write the code so that it works on both Python 2 and Python 3.

You have UTF-16 data, not UTF-8. You can read it as binary data and split on the 4 NUL bytes with the str.split() method (bytes.split() on Python 3):

file.read(size).split(b'\x00' * 4)[index] 

The resulting data is encoded as UTF-16 little-endian (you may or may not have omitted the UTF-16 BOM at the start); you can decode the data with:

result.decode('utf-16-le') 

This will fail, though, because the text gets cut off at its last NUL byte: Python splits on the first 4 NULs it finds, and it won't skip the NUL byte that is the second half of the last character of the text.

The better idea is to decode to Unicode first, then split on two Unicode NUL codepoints:
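The failure is easy to reproduce in memory. The sketch below (Python 3, with the sample bytes typed out by hand as an assumption about the file contents) shows that the byte-level split leaves an odd-length first chunk, which the UTF-16 decoder rejects:

```python
data = (
    b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00'
    b'\x00\x00\x00\x00'
    b'1\x002\x003\x00'
)

parts = data.split(b'\x00' * 4)

# 'd' is followed by five NUL bytes in total (its own second byte plus
# the four separator bytes), and split() consumes the first four of them.
print(len(parts[0]))  # odd length, so the last code unit is incomplete

try:
    parts[0].decode('utf-16-le')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print('decode failed:', exc)
```

The leftover NUL ends up at the start of the second chunk, so that chunk is misaligned as well.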

file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index] 
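As a quick check of this approach, here is a minimal Python 3 sketch on the same sample, again with the bytes typed out by hand rather than read from a real file. It also shows why the delimiter shrinks from four bytes to two codepoints:

```python
data = (
    b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00'
    b'\x00\x00\x00\x00'
    b'1\x002\x003\x00'
)

# Four NUL bytes are two 2-byte code units, so two NUL codepoints.
delimiter = (b'\x00' * 4).decode('utf-16-le')
print(len(delimiter))

text = data.decode('utf-16-le')
parts = text.split(delimiter)
print(parts)
```

Note that the decode step needs an even number of bytes, so the start offset and size passed to read() should line up with the 2-byte code units.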

Putting this together, the function would be:

def str_extract(file, start, size, delimiter=None, index=None):
    file.seek(start)
    if delimiter is not None and index is not None:
        delimiter = delimiter.decode('utf-16-le')  # or pass in Unicode
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    else:
        return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
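If you would rather let callers pass the delimiter as either bytes or text, one possible variant is sketched below. This is an illustration, not a required change; it uses an in-memory io.BytesIO object in place of a real file, and the sample bytes are an assumption about the file contents:

```python
import io


def str_extract(fobj, start, size, delimiter=None, index=None):
    """Read size bytes at start, decode as UTF-16-LE, optionally split."""
    fobj.seek(start)
    text = fobj.read(size).decode('utf-16-le')
    if delimiter is None or index is None:
        return text
    if isinstance(delimiter, bytes):
        # A bytes delimiter is decoded the same way as the data.
        delimiter = delimiter.decode('utf-16-le')
    return text.split(delimiter)[index]


data = (
    b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00'
    b'\x00\x00\x00\x00'
    b'1\x002\x003\x00'
)
fobj = io.BytesIO(data)

first = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
second = str_extract(fobj, 0, 0x20, u'\x00\x00', 1)
print(first, second)
```

Reading and decoding once before the branch also removes the duplicated read() call.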

If the file has a BOM at the start, consider opening the file as UTF-16 to start with:

import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....

and remove the explicit decoding.
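The difference between the two codecs can be seen without a file. In this Python 3 sketch the BOM-prefixed bytes are a made-up example, not taken from the question's file. The utf-16 codec consumes the BOM and takes the byte order from it, while utf-16-le keeps the BOM as a U+FEFF character at the front of the text:

```python
import codecs

payload = u'Hello World\x00\x00123'.encode('utf-16-le')
with_bom = codecs.BOM_UTF16_LE + payload

auto = with_bom.decode('utf-16')          # BOM is detected and removed
explicit = with_bom.decode('utf-16-le')   # BOM survives as u'\ufeff'

print(auto.split(u'\x00' * 2))
print(repr(explicit[0]))
```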

Python 2 demo:

>>> from io import BytesIO
>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> fobj = BytesIO(data)
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 0)
u'Hello World'
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 1)
u'123'
