Understanding Pickle & JSON Serialization in Python

  • Thread starter fog37
In summary: Pickling converts Python data into a binary format. This binary format is not human readable, but the pickle module can easily convert it back into the original Python objects.
  • #1
fog37
TL;DR Summary
Difference between pickle and json with Python
Hello,

I understand that pickle and json are two different modules to serialize Python data. Serialization means to convert an object into a string.

When using pickle, we convert the Python data into a binary file. With json, the data is converted into a json file which is essentially a text-file, like a huge string. Is that correct?

The json module can only serialize certain types ( int, str, dict, list) while pickle is more flexible and can serialize other objects. What kind of other objects would we pickle? Would anyone have an example?

In regards to pickle, a binary file always seems a better option as far as storage space even if it is not human readable. But I read that, with the pickle module, once the object is serialized, it is not possible to deserialize it using another language. Why not?

Thanks!
 
  • #2
fog37 said:
Serialization means to convert an object into a string.

"String" in the sense of "a sequence of bytes", yes. But a "string" in this sense might not be a "string" in the sense of "text".

fog37 said:
When using pickle, we convert the Python data into a binary file. With json, the data is converted into a json file which is essentially a text-file, like a huge string. Is that correct?

If you interpret "string" as "text" in the second part, yes. What you are calling a "binary" file is still a sequence of bytes; it's just that the sequence of bytes doesn't have any interpretation as "text".

fog37 said:
The json module can only serialize certain types ( int, str, dict, list)

More precisely, serialization of only certain types is built into the json module. That's because the JSON specification only allows certain kinds of things to be in a JSON text stream.

You can, however, serialize other Python types to JSON by writing custom code to do the serialization. The custom code needs to output a serialization that is compatible with the JSON spec. Usually this means finding a way to serialize the object's state to a JSON string. You then have to write custom code to also load the object back from the JSON file.
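A minimal sketch of what such custom code could look like, using the default and object_hook parameters of the json module. The Point class, the "__point__" marker key and the two helper functions are made up for illustration; they are one possible convention, not something the JSON spec or the json module prescribes.

```python
import json

class Point:
    """Hypothetical user-defined class, used only for illustration."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

def encode_point(obj):
    # json.dumps calls this for any object it cannot serialize natively.
    if isinstance(obj, Point):
        return {"__point__": True, "x": obj.x, "y": obj.y}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def decode_point(d):
    # json.loads calls this for every JSON object (dict) it decodes.
    if d.get("__point__"):
        return Point(d["x"], d["y"])
    return d

text = json.dumps({"origin": Point(0, 0), "tip": Point(3, 4)}, default=encode_point)
restored = json.loads(text, object_hook=decode_point)

print(text)
print(type(restored["tip"]).__name__, restored["tip"].x, restored["tip"].y)
```

The marker key is only there so the decoder can tell a serialized Point apart from an ordinary dict; any other unambiguous convention would work as well.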

fog37 said:
while pickle is more flexible and can serialize other objects. What kind of other objects would we pickle?

The biggest advantage of pickle is that it automatically handles instances of user-defined classes, without having to write any custom code. You can only serialize instances of user-defined classes to JSON by writing custom code, as above.
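As a rough illustration, here is a sketch with a hypothetical Account class (the class and its attributes are invented for this example). Note that no serialization code is written for the class at all; the only assumption is that the class definition is importable wherever the data is loaded back.

```python
import pickle

class Account:
    """Hypothetical user-defined class, used only for illustration."""
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance
        self.history = [("open", balance)]

acct = Account("Paul", 100)

data = pickle.dumps(acct)     # bytes, not text
clone = pickle.loads(data)    # a new Account object with the same state

print(type(data).__name__)
print(clone.owner, clone.balance, clone.history)
```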

fog37 said:
with the pickle module, once the object is serialized, it is not possible to deserialize it using another language. Why not?

Because the binary format that Python's pickle uses was invented by the Python developers and is, for practical purposes, only supported in Python. A pickle also refers to Python classes by module and name, which means nothing to a program written in another language. Technically, I suppose you could write a custom library in some other language that would handle Python pickle files, but why would you bother?
 
  • #3
fog37 said:
a binary file always seems a better option as far as storage space even if it is not human readable.

If storage space is really an issue, you can always use lossless file compression, and text files tend to compress much better than binary files. So with compression, a binary file format might not actually save any storage space.
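One way to check this for your own data is to measure it. The sketch below uses made-up sample records and zlib from the standard library; the actual sizes depend entirely on the data, so no particular outcome is claimed here.

```python
import json
import pickle
import zlib

# Made-up sample data: a list of small records.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(1000)]

as_json = json.dumps(records).encode("utf-8")   # text, encoded to bytes
as_pickle = pickle.dumps(records)               # pickle's binary format

sizes = {}
for label, blob in (("json", as_json), ("pickle", as_pickle)):
    packed = zlib.compress(blob)                # lossless compression
    sizes[label] = (len(blob), len(packed))
    print(f"{label:7s} raw={len(blob)} bytes  compressed={len(packed)} bytes")
```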

Also, storage space is not always the primary issue. It is often very, very useful to have your file format be human readable and even human editable. There are good reasons why text file formats like JSON and YAML exist.
 
  • #4
PeterDonis said:
"String" in the sense of "a sequence of bytes", yes. But a "string" in this sense might not be a "string" in the sense of "text".

If you interpret "string" as "text" in the second part, yes. What you are calling a "binary" file is still a sequence of bytes; it's just that the sequence of bytes doesn't have any interpretation as "text".

Thanks. It is all clear. Just still a little foggy on the string/text/binary concept.

JSON serialization is the process of converting an object into JSON format, which is a text-based format.
Based on my general understanding, a string is a sequence of alphanumeric characters (ASCII, Unicode, etc.), i.e. a string is text-type information. So text (a single character, a word, a page, an entire essay written in Word) would essentially be a string, unless I am missing something. When I write a file in notepad, that text file is still a binary file (a sequence of bytes) under the hood.
Every file is binary data deep down but a text-file, I guess, is a particular type of binary file that can be opened with programs like Word, Notepad, etc., that can handle binary data in that particular form (text).

Converting Python data to JSON format creates a json file that looks like a dictionary. For example,
1615689897615.png


On the other hand, when we pickle Python data instead, we create a binary file that cannot be opened by programs designed to create/read/open text...
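The difference is easy to see at the interactive prompt. A small sketch, reusing a dictionary like the one discussed in this thread: json.dumps returns a str, pickle.dumps returns bytes, and those bytes are not valid UTF-8 text.

```python
import json
import pickle

x = {"name": "Paul", "age": 20, "city": "Houston"}

as_text = json.dumps(x)      # str: readable characters
as_bytes = pickle.dumps(x)   # bytes: pickle's own binary format

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes[:20])

# Try to interpret the pickle bytes as UTF-8 text.
try:
    as_bytes.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False
print("pickle output is valid UTF-8 text:", decodes_as_text)
```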

Corrections are welcome.
 
  • #5
If you’re doing this to transmit data in a program-agnostic manner between programs or for a microservice, then perhaps you should check out Google Protocol Buffers. It supports multi-language access including Java, Python, Go and others.

Protocol buffers are message oriented, hence the .proto file where you describe the message and its components. The .proto is used to create code that you call in your program. They have some examples on the Google developers site for each of the supported languages.

https://developers.google.com/protocol-buffers

Here's a brief usage example for storing and retrieving Protobuf messages from a file.

https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/

Building a custom binary scheme for a given language isn't that hard, but there’s a lot to keep in mind. The hardest part is serializing and deserializing multi-dimensional arrays, OO instances and structs, where you have to write more code to get at the atomic elements (i.e. floats, ints and strings) and serialize and deserialize them in the proper order.

Depending on the CPU architecture you also need to consider big-endian vs little-endian for the floats and ints. The usual convention is that the reader/deserializer determines the byte order and converts if needed.
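To make the byte-order point concrete, here is a small sketch using Python's struct module. The record layout (one 32-bit int followed by one 64-bit float) is invented for illustration and is not any particular file format.

```python
import struct

# Hypothetical record: one 32-bit int and one 64-bit float.
value_id, value = 1, 2.5

little = struct.pack("<id", value_id, value)   # "<" = little-endian, no padding
big = struct.pack(">id", value_id, value)      # ">" = big-endian, no padding

print(little.hex())
print(big.hex())

# The reader has to unpack with the same byte order the writer used.
from_little = struct.unpack("<id", little)
from_big = struct.unpack(">id", big)
print(from_little, from_big)
```

Fixing the byte order explicitly in the format string, instead of relying on the machine's native order, is what makes the bytes portable between architectures.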
 
  • #6
fog37 said:
Just still a little foggy on the string/text/binary concept.

"Text" just means "a sequence of bytes that conforms to a set of constraints that people have agreed on over the years for text data".

fog37 said:
a string is a sequence of alphanumeric characters (ASCII, Unicode, etc.)

The term "string" is ambiguous. It can mean just "a sequence of bytes" (which is what the built-in "string" object was in Python 2--most people used such objects for text data but you could in principle put any bytes you like in a Python 2 string), or it can mean "a sequence of bytes that encodes a sequence of alphanumeric characters". (Or it can even mean just "a sequence of alphanumeric characters", without making any statement about how the sequence is represented in bytes--that's what a Python 3 "string" object is, a sequence of Unicode code points, whose underlying representation is not visible to the Python program.)
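In Python 3 the two meanings are separate types, which a short sketch can show (the sample word is arbitrary): str holds code points, bytes holds one particular encoding of them.

```python
text = "h\u00e9llo"                 # str: five Unicode code points ("héllo")
utf8 = text.encode("utf-8")         # bytes: one possible encoding of that text
latin1 = text.encode("latin-1")     # bytes: a different encoding of the same text

print(type(text).__name__, len(text))
print(type(utf8).__name__, len(utf8), utf8)
print(type(latin1).__name__, len(latin1), latin1)

# Decoding with the matching encoding recovers the original text.
round_trip = utf8.decode("utf-8")
print(round_trip == text)
```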

fog37 said:
When I write a file in notepad, that text file is still a binary file (a sequence of bytes) under the hood.

Yes, but, as above, not all sequences of bytes are "text". Try opening, say, an .exe file in notepad; if it opens at all, it will look like gibberish. That's because an .exe file is not a "text" file; the sequence of bytes that makes up the file does not conform to the constraints that people have agreed on over the years for text data. The same would happen if you tried to open a Python pickle file in notepad.
 
  • #7
PeterDonis said:
"Text" just means "a sequence of bytes that conforms to a set of constraints that people have agreed on over the years for text data".

Reviewing/reflecting on this topic of string vs text...

All files (text, applications, HDF5, etc.) are binary under the hood, i.e. they are made of 1s and 0s stored in memory.
A text file is surely any file with extension .txt that can be opened and created by specific programs (Notepad, etc.). Text files are generally human-readable and can be "encoded" in different ways (ASCII, UTF, etc.) which means that the same text information can be encoded and stored in main memory using a different binary representation (shorter or longer) based on the selected encoding scheme.

But HTML files (.html) and JSON files (.json) are also text files in their own right, since they can be created and opened with a text editor like Notepad. The only difference between a JSON file or an HTML file and a regular text file created in Notepad is the presence of special formatting in the JSON or HTML file that a machine can interpret.

In Python, data can be converted to a JSON file using the json.dumps() method. The created JSON file is a string object in Python (dict, list, tuple, string, int, float, True, False, None are the data types that can be converted to JSON objects).

Python:
import json
# Dictionary:
x = {
  "name": "Paul",
  "age": 20,
  "city": "Houston"
}

# Dictionary x is converted into JSON:
y = json.dumps(x)

# the result is a JSON string
print(y)
print(type(y))

So a JSON file is just a common text file but it is a string object in Python. In Python, a string is a sequence of alphanumeric characters (symbols). We can open and manipulate a text-file in Python using the open() function which converts the entire text-file to a long string object. So text becomes a string in Python.

If we have a .json file, we can read it in Python using the json.load() method. That converts the JSON file into a long Python string object. So, we can convert Python data to a JSON string object via json.dumps(). How does that JSON string object become a text-file then? Does it happen when we save it to a folder?
 
  • #8
fog37 said:
A text file is surely any file with extension .txt that can be opened and created by specific programs (Notepad, etc.).

Text files don't have to have the .txt extension. Also, what the precise definition of "text" is (i.e., what precise sequences of bytes are valid "text" vs. not) depends on multiple things, of which the program you are using to open the file is only one.

fog37 said:
html files (.html) and JSON files (.json) are text files

Yes, because the specifications for both HTML and JSON restrict the bytes that can be used in a valid file of those types, and those restrictions qualify as "text" by pretty much any definition that anyone has ever used.

fog37 said:
a JSON file is just a common text file but it is a string object in Python

No. There is no single Python object that corresponds to "a JSON file", but if you use the load function from the json module to read a JSON file, you won't get the file's text as a string; you will normally get either a dict or a list, depending on whether the top-level thing in the JSON file is a JSON object or a JSON array. (A file whose top-level value is a bare number, string, true, false or null is also valid JSON and loads as the corresponding Python value, but that is rare in practice.) You can, of course, read the file into a Python string, but then you have to parse it as JSON using the loads function from the json module, and again, what you'll normally get will be either a dict or a list.
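This can be verified with a short sketch. The file name and its contents are made up, and the file is written to a temporary directory just so the example is self-contained.

```python
import json
import os
import tempfile

# Write a small JSON file to an arbitrary temporary location.
path = os.path.join(tempfile.mkdtemp(), "person.json")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"name": "Paul", "age": 20, "city": "Houston"}')

with open(path, encoding="utf-8") as f:
    raw = f.read()            # the file's text: a str
parsed = json.loads(raw)      # parse that str as JSON: a dict

with open(path, encoding="utf-8") as f:
    direct = json.load(f)     # parse straight from the file object: also a dict

print(type(raw).__name__, type(parsed).__name__, type(direct).__name__)

as_list = json.loads("[1, 2, 3]")   # a top-level JSON array becomes a list
print(type(as_list).__name__)
```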

fog37 said:
We can open and manipulate a text-file in Python using the open() function which converts the entire text-file to a long string object.

Strictly speaking, open just opens the file; the read method of the open file object reads the file's data into a string (assuming you opened the file in text mode, which is the default).

Also, as above, this string has nothing whatever to do with JSON, even if the text file was a JSON file. It has to be parsed as JSON, and what results from the parsing is not a string.

fog37 said:
If we have a .json file, we can read it in Python using the json.load() method

Which requires an open file object to be passed to it--this is what you would call instead of the file object's read method if you wanted to directly parse the file into JSON, without going through the intermediate stage of reading it into a string.

fog37 said:
That converts the JSON file into a long Python string object.

No, it doesn't. See above. You can easily verify what I am saying by trying it out at the Python interactive prompt.

fog37 said:
How does that JSON string object become text-file then?

As above, there is no such thing as a "JSON string object" in the sense you mean. Of course a JSON object or array can contain strings, but that's not what you mean.

I strongly suggest spending some time with the Python documentation for the json module in the standard library. There you will find out about the counterpart functions to the load and loads functions that serialize Python objects in JSON format.
 
  • #9
Thank you.
The reason I talk about "JSON string object" is because of the example below where a dictionary named x (class 'dict') is converted to JSON, which is y (class 'str'). Both x and y look the same but one is a dictionary and the other is a string...

1616820341856.png
 
  • #10
fog37 said:
The reason I talk about "JSON string object" is because of the example below where a dictionary named x (class 'dict') is converted to JSON, which is y (class 'str').

Yes, but such an object, as your code explicitly shows, is not what you get when you call the load function, as your earlier post claimed. It's what you get when you call the dumps function. But if you're going to save to a JSON file, you could just call the dump function and pass it an open file object. There is no need to go through the intermediate stage of serializing to a Python string at all. And to load from a JSON file, the load function just gives you the corresponding Python dict (or list if the file stores a JSON array), without going through an intermediate stage as a Python string.
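A minimal sketch of that round trip, using the dictionary from earlier in the thread and an arbitrary temporary file path (the path is only there to keep the example self-contained):

```python
import json
import os
import tempfile

x = {"name": "Paul", "age": 20, "city": "Houston"}
path = os.path.join(tempfile.mkdtemp(), "x.json")   # arbitrary temporary path

with open(path, "w", encoding="utf-8") as f:
    json.dump(x, f)           # serialize directly to the open file object

with open(path, encoding="utf-8") as f:
    y = json.load(f)          # parse directly from the open file object

print(type(y).__name__, y == x)
```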
 
  • #11
JSON is designed to be human readable and widely compatible.

Pickle is designed to be more efficient.

However pickle is brittle - loading a pickle requires the same class definitions (module and class names) to be importable, and data written with a newer pickle protocol cannot be read by older Python versions - making compatibility or upgrades difficult. It's also hideously insecure and you should never unpickle data from an untrusted source unless you like it when other people take control of your computer.
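To see why, here is a deliberately harmless sketch. The Exploit class is hypothetical; it uses the __reduce__ hook to tell pickle which callable to invoke when the data is loaded. Here the callable only evaluates 1 + 1, but an attacker could name something far more dangerous.

```python
import pickle

class Exploit:
    """Hypothetical malicious object; harmless here, it only evaluates 1 + 1."""
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('1 + 1')".
        return (eval, ("1 + 1",))

payload = pickle.dumps(Exploit())

# Loading the bytes runs eval; the result is whatever eval returned,
# not an Exploit instance.
result = pickle.loads(payload)
print(result)
```

Nothing in the loading code asked for eval to run; the instruction came from the pickled bytes themselves, which is why the source of those bytes has to be trusted.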

I agree with the poster above - use something like protobuf instead of pickle if you want an efficient binary format, and if you just want no hassle and don't care about efficiency, use JSON.

To store a custom type in JSON you need to write an encoder, but it's very simple code, and it's well documented in the JSON module docs.
 

FAQ: Understanding Pickle & JSON Serialization in Python

What is pickle and JSON serialization in Python?

Pickle and JSON serialization are methods used in Python to convert Python objects into a format that can be stored or transmitted. Pickle is used to convert objects into a stream of bytes, while JSON is used to convert objects into a string representation in JavaScript Object Notation format.

Why is serialization important in Python?

Serialization is important in Python because it allows for the transfer and storage of data in a structured format that can be easily understood by other programs and languages. This makes it easier to share and access data between different systems.

What is the difference between pickle and JSON serialization?

The main difference between pickle and JSON serialization is the format in which the data is converted. Pickle converts data into a binary format, while JSON converts data into a human-readable string format. Additionally, pickle can only be used in Python, while JSON is a widely used format that is compatible with many languages.

Can any Python object be pickled or serialized into JSON?

Not every Python object can be pickled or serialized into JSON. Pickle cannot handle objects such as open file handles or database connections. Out of the box, the json module only handles dicts, lists, tuples, strings, numbers, booleans and None, and it raises an error on data that contains circular references. It is also important to note that neither format encrypts or signs the data, so stored data can be read and modified by anyone who has access to it.
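These limits can be probed directly. A small sketch, assuming a self-referencing list as the circular structure and an open handle on os.devnull as the example file object:

```python
import json
import os
import pickle

loop = []
loop.append(loop)                 # a list that contains itself

# json refuses circular structures.
try:
    json.dumps(loop)
    json_ok = True
except ValueError:
    json_ok = False

# pickle preserves the cycle.
clone = pickle.loads(pickle.dumps(loop))
pickle_keeps_cycle = clone[0] is clone

# pickle refuses an open file handle.
with open(os.devnull) as handle:
    try:
        pickle.dumps(handle)
        file_ok = True
    except (TypeError, pickle.PicklingError):
        file_ok = False

print(json_ok, pickle_keeps_cycle, file_ok)
```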

How can I use pickle and JSON serialization in my Python code?

To use pickle and JSON serialization in your Python code, you can import the pickle and json modules and use their respective functions (dump/dumps to serialize, load/loads to deserialize) to convert your objects into the desired format. It is important to note that pickle data should only be loaded from trusted sources, because unpickling can execute arbitrary code; parsing JSON with the json module does not have this problem.
