JSON wasn't designed for big files. The default parsers in most languages load the entire document into memory before they give you anything. For a 5 GB file, that's a problem. Here's how to handle large JSON without crashing.
The root issue
JSON.parse in JavaScript and json.load in Python work the same way: read the whole file, build the whole object tree, then return. Memory usage is often 2-5x the file size on disk, and can be higher for documents made of many small objects, because every parsed value carries more overhead than its text. A 1 GB JSON file can need 5 GB of RAM to parse.
The fix is to stream: process pieces of the document as they're read, never holding more than you need.
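To see the overhead on your own data, here is a minimal sketch that uses the standard library's tracemalloc to compare the size of a JSON text with the peak memory allocated while parsing it. The function name parse_overhead and the sample records are made up for illustration, and the ratio you get will depend on the shape of your records and your Python version.

```python
import json
import tracemalloc

def parse_overhead(text):
    """Return (peak bytes allocated while parsing, size of the text in bytes)."""
    tracemalloc.start()
    try:
        data = json.loads(text)  # keep a reference so the tree stays alive
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, len(text.encode("utf-8"))

# Hypothetical sample: many small objects, the worst case for overhead.
sample = json.dumps(
    [{"id": i, "name": f"user{i}", "active": i % 2 == 0} for i in range(20_000)]
)
peak, size = parse_overhead(sample)
print(f"text: {size / 1e6:.2f} MB, peak while parsing: {peak / 1e6:.2f} MB")
```

Running this against a representative slice of a real file is a quick way to estimate whether the full file will fit in RAM before you try it.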
Strategy 1: Switch to JSON Lines (NDJSON)
This is the best solution if you control the format. Instead of one giant array:
[
{ "id": 1, ... },
{ "id": 2, ... },
...
]
...write one object per line:
{"id": 1, ...}
{"id": 2, ...}
...
Each line is independently valid JSON. You can stream-process them line-by-line with constant memory in any language:
# Python
import json

with open('data.ndjson', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        process(record)
// Node.js
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

const rl = createInterface({
  input: createReadStream('data.ndjson'),
  crlfDelay: Infinity // treat \r\n as a single line break
});

for await (const line of rl) {
  if (!line) continue; // skip blank lines
  const record = JSON.parse(line);
  handleRecord(record); // your own function; `process` is a Node.js global, so avoid that name
}
NDJSON is now standard for logs (Cloud logging, Elasticsearch bulk import), data exports, and event streams. If you have a choice in the format, choose this.
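Real NDJSON files tend to contain the occasional blank or corrupt line, and with millions of lines you want to know which one failed. A small sketch of a reader that reports line numbers follows; iter_ndjson is a hypothetical helper name, not part of any library.

```python
import json

def iter_ndjson(lines):
    """Yield (line_number, record) pairs from an iterable of lines, skipping blanks."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            yield lineno, json.loads(line)
        except json.JSONDecodeError as exc:
            # Re-raise with the line number so the bad record is easy to find.
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc

sample = ['{"id": 1}\n', '\n', '{"id": 2}\n']
for lineno, record in iter_ndjson(sample):
    print(lineno, record)
```

Because it accepts any iterable of strings, the same function works on an open file object, so memory use stays at one line at a time.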
Strategy 2: Stream-parse the existing array
When the source is a giant array and you can't change it, use a streaming parser. These libraries parse the document incrementally and emit events for each object encountered.
Python — ijson:
pip install ijson
import ijson

with open('huge.json', 'rb') as f:
    for record in ijson.items(f, 'item'):
        process(record)
        # only one record in memory at a time
The selector 'item' means "each item in the top-level array." For nested arrays, use a dotted path: 'data.users.item'.
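To make "parse incrementally" concrete, here is a stdlib-only sketch of the idea for a top-level array. It reads fixed-size chunks and uses json.JSONDecoder.raw_decode to pull one element at a time out of a small buffer. This is an illustration of the technique, not how ijson is implemented; iter_array_items is a hypothetical name, the sketch is lenient about stray commas, and a dedicated library will be faster and stricter.

```python
import io
import json

def iter_array_items(fp, chunk_size=65536):
    """Yield elements of a top-level JSON array read from a text file object."""
    decoder = json.JSONDecoder()
    buf, pos, eof, opened = "", 0, False, False
    while True:
        # Skip whitespace and the commas that separate elements.
        while pos < len(buf) and buf[pos] in " \t\r\n,":
            pos += 1
        if pos < len(buf):
            if not opened:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                opened, pos = True, pos + 1
                continue
            if buf[pos] == "]":
                return
            try:
                value, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more input, so this is a real syntax error
            else:
                # Accept the value only if a delimiter follows it: a number at
                # the edge of the buffer might be cut off mid-digits.
                if end < len(buf) and buf[end] in " \t\r\n,]":
                    yield value
                    pos = end
                    continue
                if eof:
                    raise ValueError("malformed or truncated array")
        elif eof:
            raise ValueError("unexpected end of input")
        # Need more text: drop what was consumed and append the next chunk.
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf, pos = buf[pos:] + chunk, 0

doc = io.StringIO('[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]')
for record in iter_array_items(doc, chunk_size=8):
    print(record)
```

The buffer never holds more than one chunk plus the element currently being decoded, which is the property that keeps memory flat regardless of file size.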
Node.js — stream-json:
npm install stream-json
import { createReadStream } from 'node:fs';
import StreamArray from 'stream-json/streamers/StreamArray.js';

const pipeline = createReadStream('huge.json').pipe(StreamArray.withParser());

for await (const { value } of pipeline) {
  handleRecord(value); // your own function; `process` is a Node.js global
}
Go — jsonstream or json.Decoder:
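// needs "encoding/json" and "log"
dec := json.NewDecoder(file)
if _, err := dec.Token(); err != nil { // consume the opening "["
    log.Fatal(err)
}
for dec.More() {
    var record MyType
    if err := dec.Decode(&record); err != nil {
        log.Fatal(err) // stop on malformed input instead of looping on it
    }
    process(record)
}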
Strategy 3: Use jq for one-off processing
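jq is the quickest tool for one-off jobs, with one caveat: it does not stream by default. It parses each top-level JSON value completely before running the filter, so a single giant array is loaded in full and memory use can be several times the file size. For an export that fits in RAM:
# Extract just the names, one per line
jq -r '.[].name' < huge.json > names.txt
# Filter records, output as NDJSON
jq -c '.[] | select(.active == true)' < huge.json > active.ndjson
The -c flag means "compact output, one JSON value per line." Combined with array iteration, it converts a giant array to NDJSON.
For files too big to load, use jq --stream, which parses incrementally and emits one path/value pair at a time, but the query syntax becomes awkward. A common idiom for splitting a top-level array into NDJSON with bounded memory:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' < huge.json > records.ndjson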
Strategy 4: Convert to a database or columnar format
If you'll query the data more than once, parse it into something queryable instead of streaming it repeatedly:
SQLite: import JSON into a table once, then SQL forever. SQLite has JSON functions for keeping nested fields intact.
DuckDB: reads JSON files directly with SQL — SELECT * FROM 'huge.json'. Especially good for analytical queries on multi-GB files.
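Parquet: columnar binary format. Convert once (pandas.read_json + to_parquet works when the data fits in memory; for larger files, DuckDB can read the JSON and write Parquet itself), then analytical queries are often an order of magnitude or more faster than rescanning the JSON.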
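As a sketch of the SQLite route, the snippet below loads NDJSON into a table in batches using only the standard library. The table layout (an id column, an active flag, and the raw JSON in body) and the function names are assumptions chosen for illustration; adapt the extracted columns to whatever you actually query on.

```python
import json
import sqlite3
from itertools import islice

def parsed_rows(lines):
    """Turn NDJSON lines into (id, active, raw_json) tuples, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            rec = json.loads(line)
            yield rec["id"], int(bool(rec.get("active"))), line

def load_into_sqlite(lines, conn, batch_size=1000):
    """Insert records batch by batch; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id INTEGER PRIMARY KEY, active INTEGER, body TEXT)"
    )
    rows = parsed_rows(lines)
    total = 0
    while batch := list(islice(rows, batch_size)):
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")  # pass a file path to keep the database
sample = ['{"id": 1, "active": true, "name": "a"}\n', '{"id": 2, "active": false}\n']
print(load_into_sqlite(sample, conn), "rows loaded")
```

Keeping the original JSON text in body means nested fields stay available later, for example through json_extract(body, '$.name') on SQLite builds that include the JSON functions.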
Memory-saving tricks within standard parsing
If you must use a non-streaming parser (small enough file, just tight memory budget), a few tricks:
Discard fields you don't need. If your records have a 100 KB raw_html field you never use, drop it as early as possible. A standard parser still has to read the field, but you can throw it away while the tree is being built (in Python, with the object_hook argument of json.load) so it never accumulates across records.
Use orjson or simdjson. Python's stdlib JSON is correct but slow. orjson is typically several times faster and often uses less memory while parsing. simdjson uses SIMD instructions and is among the fastest parsers available.
Process and forget. After you've extracted what you need from a record, set the reference to None / null so the GC can reclaim it. Especially important inside long loops.
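For the first trick, here is a minimal sketch using the standard library's object_hook, which json calls on every object as soon as it has been decoded. The helper name without_fields is made up for this example. Note that the unwanted string is still parsed; the saving is that it is released immediately instead of living on in every record.

```python
import json

def without_fields(*names):
    """Build an object_hook that drops the named keys from every parsed object."""
    unwanted = frozenset(names)

    def hook(obj):
        return {k: v for k, v in obj.items() if k not in unwanted}

    return hook

text = '[{"id": 1, "raw_html": "<html>...</html>"}, {"id": 2, "raw_html": "<p>x</p>"}]'
records = json.loads(text, object_hook=without_fields("raw_html"))
print(records)  # [{'id': 1}, {'id': 2}]
```

The hook applies to nested objects too, so a field name that is legitimately used at another level of the document would also be removed.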
The streaming output side
Streaming reads only solve half the problem. If you're writing a 10 GB array, you can't hold the whole thing in memory either. Write NDJSON:
import json

with open('out.ndjson', 'w') as f:
    for record in source:
        f.write(json.dumps(record))
        f.write('\n')
Or stream a JSON array manually:
with open('out.json', 'w') as f:
    f.write('[')
    first = True
    for record in source:
        if not first:
            f.write(',')
        f.write(json.dumps(record))
        first = False
    f.write(']')
When to give up on JSON
If you're dealing with terabyte-scale data, JSON is the wrong format. The text encoding and the lack of a schema make every operation slower than it needs to be: every number is re-parsed from text and every key is repeated in every record. Convert once to Parquet, Avro, or ORC, then never look back. JSON is for human-debuggable structured data; it's not a data warehouse format.
Wrap-up
The summary: NDJSON when you control the format, streaming parsers when you don't, columnar when you'll query the data more than once. Big JSON files are a solved problem — just not by the default JSON.parse path.