Programmers -- Can you help me identify this data type, how to unscramble it? | Off-Topic Discussion forum

procainestart Dork
1/19/22 11:27 a.m.

I use a MS Word add-in with a customizable lookup table that's only viewable in the app. You can export/share the table, an XML file with just one tag -- < data > -- then 700,000 gibberish characters and no spaces (see sample, below) -- it's not structured XML, just a massive string.

I need the table in a readable format -- the vendor wants $$ to provide it. My Google-fu is weak, apparently -- I couldn't figure this out.

Possible clues:

I know the add-in was written in C#.
In Notepad++ and Word, I can't see line breaks, but I can see individual lines in Notepad; each starts with a "+" character. (I do this by stretching the window across three monitors.)

Can I unravel this, and if so, how? I have access to Visual Studio (dunno much about it, tho).

Seems like the kind of noob question that'd get me pummeled on StackOverflow... :-)

Data Sample

(Notepad displays individual lines, starting with a "+", except the very first one, shown here)

<?xml version="1.0" encoding="utf-8"?>
<style name="Style_2022-01-07" edition="0.0">
<data>hQ6KMj4UUTYTQakO820QOmmLNS3K+nte/adJpAa1lEaP52C6KDDbXvDXvH/rfV54sdDNy6iiBHpo3h0u8IAr10iMuosdbU6bIWQNV/PBY/eEbccNKZAwujSrfCB6P1zewNYUKUaaT2YSNkxH/vd9hGw6o7Mq35koAKnyYShkpuik565fy8TvnZoa0FifUDAqKZjaZFFfyZWazeMwRdrrYfqIUdsimm6xzguuDL61mTKFogDN1SIwdiQ1b43/1q9Fa8rBAJ7BTODzT54R2N1GxzpnRBU74ebMJbNIteaauwDzwN1oc/eQUIDLkovUDL789IQJPxrOAIsLwq2sC8NjGBeucc73fLtJYMXj48uwY3CQMGfknapZ7eyoMKzItm916cJRw6B5/DdLjfbvHsNs4pM0v8DA5BU14jx2532uFOTzf36M/94uJabs2Sw6PaqhySoWMjpa9/kQxIFiZeGxZXx9sCyySsR0yAbDids+lxkU2j5J8wSR8YVvByLkcKnK0VVhBgt74jfnLgIP8/Li4Bhwl6w1JLW5e2ZqndBzBALOxLtZwHGAYCbzMthq4qAOQ0fPbkAdw/NyMsTt9zdgOqHq5/cNRFkKDvPai+LmpgVdGAnLxGEsHqDYCz69KsNbjY98Da1w3amTZsNv307lD28nYoFYXia/ZFmeC7HS8Gfn1HFzHEJA2XI45fxI3yTOOP3tdA7e8/Zh3Arl70TQ9qWNanRIXLcTFuo5p91n2EM+wTOjihbSyfrH3pMJZvn3BSmqMwjo3bZeYfZVVCrfXTbznrOORUy3FwA36fYmXvWhnt8zrVJne5+FVWl904CF8esCVq1cIBJfYxQRKYh3vPwRZxMwQYL8QWAmIXjWLsN8ZuNFzuNUTwUpcHjGoPnlkwBtF+1VEACl/s0W7OKQnQQNLzSvdMzhGNWnribWiOi0qUdKVoOjQmxkMQgpE1mh+Fw2f/FIoY

[followed by hundreds of thousands more characters that all look like this]

obsolete HalfDork
1/19/22 11:42 a.m.

procainestart said:
I need the table in a readable format -- the vendor wants $$ to provide it.

Is paying the vendor not an option?

wae PowerDork
1/19/22 11:46 a.m.

In reply to procainestart :

All I see is blonde, brunette, redhead...

It's obviously some sort of encoding... I'd copy/paste that <data> data into maybe a Base64 decoder and see what you get?

procainestart Dork
1/19/22 12:08 p.m.

In reply to obsolete :

We need it asap; they can't do it soon enough.

californiamilleghia UltraDork
1/19/22 12:12 p.m.

How about asking Reddit ?

might work !

codrus (Forum Supporter) PowerDork
1/19/22 12:21 p.m.

I suspect that's a proprietary encoding written by the folks who developed the extension, given that there's nothing in the xml header to identify it. You'll probably need to use the name of the extension or the vendor in your search.

If you want to reverse engineer it, you could start with the actual data in your table and see if you can find those values encoded in the data stream.
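A minimal sketch of that search in Python, assuming the Base64 text has already been decoded to bytes. The function name is made up for illustration. Since the add-in is written in C#, strings may be stored as UTF-16, so the sketch tries that as well as UTF-8. If the blob is compressed or encrypted, no hits are expected.

```python
def find_known_text(blob: bytes, needle: str) -> dict[str, list[int]]:
    """Map encoding name -> byte offsets where `needle` occurs in `blob`."""
    hits: dict[str, list[int]] = {}
    for enc in ("utf-8", "utf-16-le", "utf-16-be"):
        pattern = needle.encode(enc)
        offsets = []
        start = blob.find(pattern)
        while start != -1:
            offsets.append(start)
            start = blob.find(pattern, start + 1)
        if offsets:
            hits[enc] = offsets
    return hits
```

Any hit on a description or pattern you know is in the table would suggest the data is stored plainly and the layout can be worked out from there.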

GameboyRMH MegaDork
1/19/22 12:28 p.m.

XML is a very generic multi-purpose file format. It looks like the developer is using an XML file with a custom layout, containing base64-encoded data in the XML tags. What's in the base64 data could be anything really - what I would do to identify it is decode the base64 data into a binary file, and then use the Linux "file" tool to try to identify that.
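A minimal sketch of that decode step in Python, assuming the export looks like the posted sample (a single <data> element somewhere in the file). The function name and the file names in the note below are placeholders, not anything from the add-in.

```python
import base64
import xml.etree.ElementTree as ET


def decode_data_element(xml_bytes: bytes) -> bytes:
    """Return the raw bytes stored as Base64 text in the first <data> element."""
    root = ET.fromstring(xml_bytes)
    node = root if root.tag == "data" else root.find(".//data")
    if node is None or node.text is None:
        raise ValueError("no <data> element with text found")
    # Drop line breaks and other whitespace the exporter may have inserted.
    compact = "".join(node.text.split())
    # validate=True rejects characters outside the Base64 alphabet.
    return base64.b64decode(compact, validate=True)
```

Read the export in binary mode (for example a hypothetical export.xml), write the returned bytes to something like decoded.bin, and then run the file tool on that.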

The gibberish that looks like base64 data could even be encrypted if you're really unlucky. If that's the case, it would be possible, if difficult, to extract the encryption key from the app.

BoxheadTim MegaDork
1/19/22 12:43 p.m.

Came here to say what GameboyRMH said - at first glance this very much looks like base64 encoded binary data, which is exactly what one would use if one wanted to store a binary blob in an XML document (which is essentially text).

So you'd have to decode it from base64 into binary and then look at the data with suitable tools to see if you can reverse engineer the data structure from there. All of which might be possible if the data is not encrypted, but you'll need a fair amount of low level .NET knowledge to figure out how the data is stored after looking at it in a hex editor.

I also hate to be that guy, but there is a good chance that the license for the plugin has some provisions against reverse engineering so I'd proceed with care. And possibly legal advice. Or just give the vendor the cash.

mslevin New Reader
1/19/22 1:09 p.m.

I assume you can't share the whole document with us?

Edit: Or, a much longer sample and some example known data from the table?

Edit 2: ah just noticed the sample you provided is longer than I thought

Driven5 UberDork
1/19/22 1:57 p.m.

So they have an export function that customers have to pay to use, or the intended export function is broken and they want you to pay them to fix it? Basically are you trying to defraud them or are they trying to defraud you?

codrus (Forum Supporter) PowerDork
1/19/22 2:07 p.m.

Driven5 said:
So they have an export function that customers have to pay to use, or the intended export function is broken and they want you to pay them to fix it? Basically are you trying to defraud them or are they trying to defraud you?

Trying to get back your own data isn't defrauding the vendor.

GameboyRMH MegaDork
1/19/22 2:11 p.m.

I base64-decoded the bit you posted and so far it's not identifiable, may be encrypted. Maybe a larger piece would be identifiable, but since we're expecting a lookup table, I'd guess not.

APEowner SuperDork
1/19/22 2:40 p.m.

I assume you tried copying from the filled out lookup table and pasting into Excel.

procainestart Dork
1/19/22 3:15 p.m.

In reply to Driven5 :

No, I'm not trying to defraud them, and while they're not trying to defraud us, it's kinda bullE36 M3 that I can't access my own data table without paying them -- which is moot anyway: our deadline is before they can help us. (BTW, the export feature is for sharing your custom tables with colleagues.)

I'm trying to take a large table of data that I created -- it's not proprietary (well, it is -- it belongs to my employer) -- and export it so I can create a test document. The app uses a list of what are essentially RegEx strings I wrote -- there are hundreds. Each one has a plain-language description I need in order to test them. My strings and descriptions are presented in a table in the app, but I can't even select multiple cells across a single row. The vendor used to use standard XML, which I could easily work with. I'm an editor, obviously not a coder, and I like cars, hence a post about this on a car forum.

GameboyRMH MegaDork
1/19/22 3:22 p.m.

Why your software should be FLOSS: The Thread

procainestart Dork
1/19/22 3:30 p.m.

In reply to GameboyRMH :

Ha! It's a pretty esoteric product that we're generally happy with and don't mind paying for.

Thanks, all, for taking a little time to respond to this thread.

Driven5 UberDork
1/19/22 3:36 p.m.

In reply to codrus (Forum Supporter) :

Once you enter your data into somebody else's program, you do not necessarily own any formatting or functionality that the program is capable of performing on/with/to your data.

In reply to procainestart :

Thanks for entertaining my curiosity.

GameboyRMH MegaDork
1/19/22 3:51 p.m.

Just for fun I also analyzed the sample data with ent and it appears to have rather high randomness which again suggests that it's encrypted:

Entropy = 7.704500 bits per byte.

Optimum compression would reduce the size
of this 727 byte file by 3 percent.

Chi square distribution for 727 samples is 285.38, and randomly
would exceed this value 9.27 percent of the times.

Arithmetic mean value of data bytes is 130.0674 (127.5 = random).
Monte Carlo value for Pi is 3.008264463 (error 4.24 percent).
Serial correlation coefficient is 0.048273 (totally uncorrelated = 0.0).
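For anyone without ent installed, here is a rough Python sketch of two of those statistics: entropy in bits per byte and the chi square value against a uniform byte distribution. The function names are made up, and the sketch does not reproduce ent's percentage figure or its other tests.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def chi_square(data: bytes) -> float:
    """Chi square statistic of the byte counts against a uniform distribution."""
    if not data:
        raise ValueError("need at least one byte")
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))
```

Keep in mind that a sample of only a few hundred bytes scores somewhat below 8 bits per byte even when it is truly random, so a short sample can only hint at what the full blob contains.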

procainestart Dork
1/19/22 4:35 p.m.

In reply to GameboyRMH :

Good to know, and thanks for taking the time to do that.

procainestart Dork
1/19/22 4:43 p.m.

...and I just figured out a kludgy work-around: a narrow, empty column appears on the left when I increase the window; clicking on the empty cell selects the row and I can copy it. So all I gotta do now is repeat that over and over and over and I'll have my strings and descriptions.

codrus (Forum Supporter) PowerDork
1/19/22 8:03 p.m.

GameboyRMH said:
Just for fun I also analyzed the sample data with ent and it appears to have rather high randomness which again suggests that it's encrypted:

Or just already compressed. It's pretty common to compress binary data before you inflate the size by encoding it in a text-safe transferrable form.

WonkoTheSane UltraDork
1/19/22 8:56 p.m.

codrus (Forum Supporter) said:
GameboyRMH said:
Just for fun I also analyzed the sample data with ent and it appears to have rather high randomness which again suggests that it's encrypted:

Or just already compressed. It's pretty common to compress binary data before you inflate the size by encoding it in a text-safe transferrable form.

Try using WinRAR to uncompress it. The developer may be the first person to pay for a license of WinRAR..

GameboyRMH MegaDork
1/20/22 1:30 p.m.

codrus (Forum Supporter) said:
GameboyRMH said:
Just for fun I also analyzed the sample data with ent and it appears to have rather high randomness which again suggests that it's encrypted:

Or just already compressed. It's pretty common to compress binary data before you inflate the size by encoding it in a text-safe transferrable form.

Decent theory but if that was simply the first piece of a compressed file, the file tool should've picked that up. Confirmed by testing on the first bit of a .tar.gz and also a .rar file.

Edit: ent analysis of my .rar file head shows similar randomness except in the chi square test:

Entropy = 7.924718 bits per byte.

Optimum compression would reduce the size
of this 3119 byte file by 0 percent.

Chi square distribution for 3119 samples is 326.86, and randomly
would exceed this value 0.16 percent of the times.

Arithmetic mean value of data bytes is 131.2029 (127.5 = random).
Monte Carlo value for Pi is 3.098265896 (error 1.38 percent).
Serial correlation coefficient is 0.022859 (totally uncorrelated = 0.0).

codrus (Forum Supporter) PowerDork
1/20/22 5:52 p.m.

GameboyRMH said:
Decent theory but if that was simply the first piece of a compressed file, the file tool should've picked that up. Confirmed by testing on the first bit of a .tar.gz and also a .rar file.

That test would probably work for gzip because it's just a simple compression utility, but archive programs do more than that. They need tables of contents with file names, permissions, etc. You don't need that if you're just compressing a blob of data before base64 encoding it and so if it were using a compression library from one of those utilities without doing the whole archive file creation then it probably would not be recognized by 'file'.
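To illustrate the point about headers, here is a small Python sketch using the standard gzip and zlib modules. It compresses the same bytes three ways and returns the first two bytes of each stream. The function name is invented for the example, and nothing here is known about what the add-in actually uses.

```python
import gzip
import zlib


def leading_bytes(payload: bytes) -> dict[str, bytes]:
    """Compress `payload` three ways and return the first two bytes of each."""
    gz = gzip.compress(payload)            # gzip container: magic bytes 1f 8b
    zl = zlib.compress(payload)            # zlib container: short header, 78 xx
    raw_obj = zlib.compressobj(wbits=-15)  # negative wbits = raw deflate, no header
    raw = raw_obj.compress(payload) + raw_obj.flush()
    return {"gzip": gz[:2], "zlib": zl[:2], "raw deflate": raw[:2]}
```

Only the gzip stream starts with a fixed signature. A raw deflate stream has no identifying header at all, which is why a signature-based tool may not recognize it.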

GameboyRMH MegaDork
1/21/22 10:34 a.m.

codrus (Forum Supporter) said:
GameboyRMH said:
Decent theory but if that was simply the first piece of a compressed file, the file tool should've picked that up. Confirmed by testing on the first bit of a .tar.gz and also a .rar file.

That test would probably work for gzip because it's just a simple compression utility, but archive programs do more than that. They need tables of contents with file names, permissions, etc. You don't need that if you're just compressing a blob of data before base64 encoding it and so if it were using a compression library from one of those utilities without doing the whole archive file creation then it probably would not be recognized by 'file'.

Possible... the two easiest compression libraries to use in C# are gzip and brotli. If it were gzip, then file should identify it or zcat should be able to open it (it can't). It also gives an error on brotli decompression. The ent analysis difference makes it look more like the mystery data here is encrypted: it's slightly compressible and does much better on the chi square test than a piece of an unencrypted compressed file.

Decompressing the complete blob like this may be worth a try.
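One way to try that, sketched in Python with only the standard zlib module. It attempts the gzip, zlib, and raw deflate framings on the decoded bytes. The function name is hypothetical, brotli is left out because it needs a third-party package, and a result from the raw deflate attempt on arbitrary data should be treated with suspicion.

```python
import zlib


def try_decompress(blob: bytes) -> dict[str, tuple[bytes, bool]]:
    """Map framing name -> (output, stream_complete) for each attempt that worked."""
    attempts = {"gzip": 31, "zlib": 15, "raw deflate": -15}  # wbits per framing
    results: dict[str, tuple[bytes, bool]] = {}
    for name, wbits in attempts.items():
        decomp = zlib.decompressobj(wbits=wbits)
        try:
            out = decomp.decompress(blob)
        except zlib.error:
            continue  # not valid data for this framing
        if out or decomp.eof:
            # eof is False when the input ended before the stream did.
            results[name] = (out, decomp.eof)
    return results
```

An empty result on the full decoded blob would be one more hint that the data is encrypted or uses some other format.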