Seems like a nice result but wouldn’t have hurt for them to give a few performance benchmarks. I understand that the point of the paper was a quality improvement, but it’s always nice to reference a baseline for practicality.
The learned frequency banks reminded me of a notion I had: Instead of learning upscaling or image generation in pixel space, why not reuse the decades of effort that has gone into lossy image compression by generating output in a psychovisually optimal space?
Perhaps frequency space (discrete cosine transform) with a perceptually uniform color space like UCS. This would allow models to be optimised to spend more of their compute budget on detail that's relevant to human vision. Color spaces that split brightness from chroma would allow increased contrast detail and reduced color detail. This is basically what JPEG does.
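A minimal pure-Python sketch of the two ingredients above, the luma/chroma split and the frequency transform. The constants are the standard ITU-R BT.601 full-range conversion that JPEG uses; the function names are just illustrative, not from any particular model:

```python
import math

def rgb_to_ycbcr(r, g, b):
    """Split brightness (Y) from chroma (Cb, Cr) using ITU-R BT.601
    full-range coefficients -- the same split JPEG performs before
    it subsamples chroma."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr

def dct_1d(x):
    """Orthonormal DCT-II. For natural images most of the energy lands
    in the low-frequency coefficients, which is where a model emitting
    output in this space could concentrate its budget."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs
```

A grey pixel maps to neutral chroma (Cb = Cr = 128) and a flat signal puts all its DCT energy in the k = 0 coefficient, so a model outputting in this space could be trained with finer loss weighting on luma and low frequencies, coarser on chroma and high frequencies.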
Instead of training on vast amounts of arbitrary data that may lead to hallucinations, wouldn't it be better to train on high-resolution images of the specific subject we want to upscale? For example, using high-resolution modern photos of a building to enhance an old photo of the same building, or using a family album of a person to upscale an old image of that person. Does such an approach exist?
Why don't those algorithms include prompting to guide the upscaling?
Very good work!
Sadly this model really does not like noisy images that have codec compression artifacts, at least with my few test images.
I'd like to see the results in something like Wing Commander Privateer.
@0x12A what’s the difference between this version and v1 of the paper from November 2023?
DLSS will benefit greatly from research in this area. DLSS 4 uses transformers.
DLSS 3 vs DLSS 4 (Transformer)
https://www.youtube.com/watch?v=CMBpGbUCgm4
I tried photos of animals, and it was okayish except the eyes were completely off.
Could such a method be adapted to audio? For instance, to upscale 8-bit samples to 16-bit in Amiga mods?
Hrm. On a 600x600 upscale of nature portrait photography, it has a LOT of artifacts. Perhaps too far out of distribution?
That said, your examples are promising, and thank you for posting a HF space to try it out!
I would love to see this kind of work applied to old movies from the 30s and 40s like the Marx Brothers.
Where are the ground truth images?
Was anyone else expecting an infinitely zoomable picture from that title? I am disappoint.