I created a nationwide dataset of 155M land parcels using two GPUs and a 30TB hard drive.
Because I don't have $100K+ to buy the US parcel dataset from Regrid or ReportAll, I bought a pair of L40s and a 30TB NVMe drive and used them to collect and harmonize 155M parcels from over 3,100 US counties into a single dataset.
And because I don't have a couple dozen employees to feed like ReportAll, Regrid, and CoreLogic do, my goal is to resell this dataset at much lower prices than the current incumbents and make the data accessible to smaller projects and smaller budgets.
I ended up with close to 99% coverage of the United States.
The backend stack is a single server running Postgres, gemma3 on Ollama, and a big pile of Python and PL/pgSQL. The website runs on Firebase with PMTiles as the mapping layer, and parcel file exports are served from Google Cloud Storage.
My plan is to open-source a big portion of this system once I can clean it up, but my first priority was getting a product on the market and trying to make this self-sustaining.
If anyone is interested in any of the technical details or if you want to try to do this yourself, I'm happy to share anything you want to know.
I would like to know more. For example, how did you get the county records?
One at a time. The county is the sole unit of authority for land records in the US (with a few exceptions). Luckily, these days most counties publish this data via web services or APIs.
I was able to automate a big chunk of this work by crawling county websites and looking for these web services that I could download from.
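As a rough sketch of what one of those downloads looks like, assuming the county exposes an ArcGIS-style FeatureServer layer (a very common case) -- the URL and page size below are illustrative, not lifted from my actual crawler:

    import requests

    # Hypothetical endpoint -- in practice these are discovered by crawling a
    # county's GIS site for /FeatureServer or /MapServer layer URLs.
    LAYER_URL = "https://gis.example-county.gov/arcgis/rest/services/Parcels/FeatureServer/0"

    def fetch_parcels(layer_url, page_size=1000):
        """Page through an ArcGIS feature layer and yield GeoJSON features."""
        offset = 0
        while True:
            resp = requests.get(
                f"{layer_url}/query",
                params={
                    "where": "1=1",           # no filter: pull every parcel
                    "outFields": "*",         # keep all attribute columns
                    "outSR": 4326,            # WGS84, so every county lands in one CRS
                    "f": "geojson",
                    "resultOffset": offset,
                    "resultRecordCount": page_size,
                },
                timeout=120,
            )
            resp.raise_for_status()
            features = resp.json().get("features", [])
            if not features:
                break
            yield from features
            offset += len(features)

    if __name__ == "__main__":
        print(sum(1 for _ in fetch_parcels(LAYER_URL)), "features downloaded")

Most ArcGIS services cap the number of records per request, which is why the loop pages with resultOffset instead of asking for everything at once.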
But there is no agreed-upon schema standard -- they all store the data in different formats, schemas, etc. About 50% of the effort in maintaining a dataset like this is maintaining the mappings from the source data to the target schema. That's where I am making heavy use of LLMs. This turns out to be something they are very good at. I found gemma3 to have the best balance of reliability, ease of use, and speed for my use case.
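To give a flavor of the mapping step, here is a stripped-down sketch of how the LLM call can look with the ollama Python client -- the target columns, source fields, and prompt are simplified stand-ins, not my production schema or prompt:

    import json
    import ollama

    # Illustrative target schema -- the real one has many more columns.
    TARGET_COLUMNS = ["apn", "owner_name", "situs_address", "land_use", "acreage"]

    def propose_mapping(source_fields, sample_rows):
        """Ask gemma3 to map a county's raw field names onto the target schema."""
        prompt = (
            "You map county parcel data to a standard schema.\n"
            f"Target columns: {TARGET_COLUMNS}\n"
            f"Source fields: {source_fields}\n"
            f"Sample rows: {json.dumps(sample_rows)}\n"
            "Return a JSON object mapping each target column to the best source "
            "field, or null if there is no reasonable match."
        )
        resp = ollama.chat(
            model="gemma3",
            messages=[{"role": "user", "content": prompt}],
            format="json",  # constrain the model to emit valid JSON
        )
        return json.loads(resp["message"]["content"])

    # Hypothetical county export -- field names vary wildly from county to county.
    mapping = propose_mapping(
        ["PARCEL_ID", "OWN_NAME", "SITUS_ADDR", "USE_CODE", "GIS_ACRES"],
        [{"PARCEL_ID": "123-45-678", "OWN_NAME": "SMITH JOHN", "GIS_ACRES": 0.21}],
    )
    print(mapping)  # e.g. {"apn": "PARCEL_ID", "owner_name": "OWN_NAME", ...}

The key point is that the model only has to produce one mapping per county source rather than touch every row, which is why a local model like gemma3 is fast enough for the job.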
I'm very interested to learn more.