Perhaps oddly, I found this product most interesting when voyaging far afield from its fashion focus.
A picture leads to books, books lead to toys, toys lead to ... a television that emits heat like a fireplace. I actually kind of want a television that emits heat or cold in response to what's on screen (the product was just a TV fireplace that gave off heat).
Notably, as others have pointed out and the author has already addressed, the model definitely starts giving some rather "strange" results when you travel far away from the central training theme. Clicking a kid's hoverboard got me vacuum cleaners and leaf blowers; a dinosaur toy turned up not a single dinosaur in the results.
They're still interesting, though, and the main idea of a "variety" slider controlling nearness of similarity is a cool feature for image browsing. It would be nice if more image search engines offered a "variety" or "nearness" slider when you're just looking at similar images.
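Faking such a slider on top of a ranked neighbor list should be cheap, too. A hypothetical sketch (not vibewall's actual code), where higher variety strides further through the ranking:

    import numpy as np

    def neighbors_with_variety(query_vec, catalog_vecs, k=12, variety=0.0):
        # Rank the whole catalog by cosine similarity (vectors assumed
        # unit-normalized, so a dot product suffices).
        ranked = np.argsort(-(catalog_vecs @ query_vec))
        # variety=0 keeps the k closest items; variety=1 spreads the k
        # picks evenly across a pool roughly 10x larger.
        pool = ranked[: int(k * (1 + 9 * variety))]
        picks = np.linspace(0, len(pool) - 1, k).astype(int)
        return pool[picks]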
The "shop for random products" direction was actually fun for me too. Reminds me of amazon.com/stream a bit.
I love this kind of “reaction decision” process. I have a hard time styling things until I see them, and, importantly, until I see examples of what I don’t like.
Also, this is what I imagine Stitch Fix uses for their stylists. I wish there were a polished stylist service that didn’t also have me buying clothes from them. I don’t need a $60 white T-shirt or a $120 basic jean jacket, but I do want styles that look good specifically on me.
This is interesting, but it seems price independent. I get that since nothing is final, every result is just another node, and artificial restrictions based on price might cut off pathways to a match that would fit the user's price parameters. But if a user has a budget, some items will only ever be intermediary nodes, while others are also potential purchases. I wonder if it would be computationally trivial to highlight this distinction visually, so a user can easily distinguish between items to consider for purchase and items they will only ever see as refinement steps.
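It really should be trivial; something like this hypothetical post-processing step (made-up 'price' and 'budget' fields, not vibewall's API) would be enough to drive the visual distinction:

    def annotate_for_budget(neighbors, budget):
        # Every item stays browsable, but items over budget are flagged
        # so the UI can render them as refinement-only stepping stones.
        for item in neighbors:
            item["role"] = "purchasable" if item["price"] <= budget else "refinement-only"
        return neighbors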
It feels like there's a bit of gamification in just clicking one more time: "I like this, but if I click once more, maybe I'll like something in the next set even more." And repeat forever, like a great (window-)shopping tool that doesn't result in much actual buying. But I'm not much of a shopper/consumer, so maybe my impression isn't representative.
It's interesting to see how some ideas keep looping around. When I see this, I'm reminded of Modista, a startup from ~2008 that did visual-similarity browsing for fashion products. Here's a writeup from when they only did shoes, though I know they expanded beyond shoes at some point: https://thenoisychannel.com/2008/11/05/modista-similarity-br...
I think they died over legal issues with rights to product images.
But in the vibewall demo, I wonder if the embedding is capturing the right similarity concept for this application. E.g. in the results most similar to this men's polo, I see a bunch of henleys, a women's quarter-zip pull-over, a women's full-zip fleece, a men's tank, a women's top with a plunging neckline, even a baby wrap! These are appropriate to be worn by different people in different social contexts and in different seasons. The main visual similarity seems to be that they include human upper bodies on white backgrounds? https://vibewall.shop/?id=c43bc222-e68b-11ef-8208-0242ac1100...
You definitely highlighted a shortcoming of the feature vector model in this case. Indeed it's quite a small model trained on a single Mac for about a week, so it's not very "smart".
I'd expect this is a problem that could be solved by using larger off-the-shelf models for image similarity. For this project, I thought it would be cooler to train the model end-to-end myself, but doing so certainly has its downsides.
This is really cool. You might be interested in this talk, which shows how you can incrementally add preference vectors to improve the recommendations.
https://haystackconf.com/us2023/talk-20/
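The gist, as I remember it (names here are mine, not the talk's): keep a running preference vector built from items the user reacted well to, and blend it into each query. A rough sketch:

    import numpy as np

    def update_preference(pref, liked_vec, alpha=0.1):
        # Exponential moving average over embeddings the user liked.
        pref = (1 - alpha) * pref + alpha * liked_vec
        return pref / np.linalg.norm(pref)

    def personalized_query(item_vec, pref, weight=0.3):
        # Nudge the clicked item's vector toward the accumulated taste.
        q = (1 - weight) * item_vec + weight * pref
        return q / np.linalg.norm(q)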
Out of curiosity, what's the size of vectors you're using (# of dimensions) and what distance metric are you using? Euclidean?
To optimize for fast nearest neighbors, I chose 256 dims. Notably, this actually hurt some of the pre-training classification losses pretty severely compared to 2k dims, so it definitely has a quality cost.
The site uses cosine distance. The code itself implements Euclidean distance, but I decided to normalize the vectors last minute out of FUD that some unusually small vectors would appear as neighbors for an abnormal number of examples.
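For the record, the normalization makes the two metrics agree: on unit vectors, ||a - b||^2 = 2 - 2(a . b), so Euclidean and cosine produce the same ranking. A quick sanity check:

    import numpy as np

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(1000, 256))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    q = vecs[0]

    squared_euclidean = np.linalg.norm(vecs - q, axis=1) ** 2
    cosine_sim = vecs @ q
    assert np.allclose(squared_euclidean, 2 - 2 * cosine_sim)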
If you choose a photo of socks pointing left, the nearest neighbors are socks also pointing left.
I'd think the model should focus on the pattern and cut, not which way the socks are lying for the marketing photo.
Probably the same complaint as
https://news.ycombinator.com/item?id=43375415
This is great! I've forwarded the site to my wife.
Would you mind sharing how you trained the model to produce the vectors? Are you using a vision transformer under the hood with contrastive training against price, product category, etc.?
EDIT: I see that the training script is included in the repo and you are using a CNN. Inspiring work!
Yup, it's a small model I trained on my Mac mini! The model itself just classifies product attributes like keywords, price, retailer, etc. The features it learns are then used as the embeddings.
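For the curious, the general pattern looks roughly like this (a sketch in PyTorch, not the repo's exact architecture): a shared backbone with one classification head per attribute, with the backbone's output doubling as the embedding.

    import torch.nn as nn

    class ProductNet(nn.Module):
        def __init__(self, n_keywords=1000, n_retailers=50, n_price_buckets=16, dim=256):
            super().__init__()
            self.backbone = nn.Sequential(  # stand-in for the real CNN
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, dim),
            )
            # One head per product attribute the model classifies.
            self.heads = nn.ModuleDict({
                "keyword": nn.Linear(dim, n_keywords),
                "retailer": nn.Linear(dim, n_retailers),
                "price": nn.Linear(dim, n_price_buckets),
            })

        def forward(self, x):
            feats = self.backbone(x)  # these become the stored embeddings
            return feats, {name: head(feats) for name, head in self.heads.items()}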
Visiting the site, I was at first quite annoyed that it always pushed me towards women's fashion, which of course makes sense after reading your statement.
If anyone reimplements this for men's fashion, let me know! I think this tool is great for anyone who isn't well educated in fashion, and I'd guess it's safe to say that applies to men more often than to women.
Nice! I was going to ask whether the nearest-neighbour algorithm assigns a smaller distance when the model is in the same pose, but then I realised that similar products (like t-shirts) are shown in the same pose, so it shouldn't be an issue.
"hat" gives a range of poses
Ideally, pose and lighting wouldn't matter as much as they currently do.
I think using a better model to produce feature vectors could achieve this, or perhaps even finetuning the feature model to match human preferences.
Very cool. Have you considered adding text-based search using CLIP-like embeddings?
I think it would be a useful feature. For the sake of being a fun project, I didn't use CLIP because I only wanted to use models that I trained myself on a single Mac. However, to make this more useful, text search would be quite helpful.
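If I (or anyone) were to bolt it on, pretrained weights via the open_clip package would probably be the shortest path; an untested sketch (model/pretrained names are just common defaults, not anything the site uses):

    import torch
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def embed_text(query):
        with torch.no_grad():
            vec = model.encode_text(tokenizer([query]))
        return vec / vec.norm(dim=-1, keepdim=True)

    # The catalog images would be embedded once with
    # model.encode_image(preprocess(img).unsqueeze(0)) and compared to the
    # query vector with a dot product, just like the existing neighbors.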
So it just automatically uses location data fed by the user and doesn’t prompt? What are the terms of service on data collection?
Not your physical nearest neighbors; rather, product neighbors in similarity space.
I would love a shopping site that tells me what my neighbours are buying. Bonus points if it uses my camera to peek through the windows.