PhD Researcher in Artificial Intelligence and Computer Vision at the University of Würzburg, advised by Prof. Radu Timofte. I am a Computer Vision Scientist at Sony SIE, working on problems in computer vision related to my thesis. I am also a Senior Data Scientist (Kaggle Grandmaster) at H2O.ai, ranked 42nd out of 205,000 data scientists worldwide on Kaggle, Google's largest data science platform for problem-solving.
My current research interests include neural networks, machine learning, low-level computer vision, inverse problems, computational photography and photorealism.
During 2020-21, I worked on image processing at Huawei Noah’s Ark Lab (London) supervised by Dr. Eduardo Pérez-Pellitero and Prof. Aleš Leonardis.
I obtained my M.Sc. in Computer Vision from the Autonomous University of Barcelona (UAB) with honours for my work Real-time Image Enhancement for Smartphones.
Outside of academia, my hobbies include hiking, photography, history, economics, and art.
Mail
LinkedIn
Lab
Scholar
“Only one who devotes himself to a cause with his whole strength and soul can be a true master. For this reason mastery demands all of a person.” Albert Einstein
Research
I'm interested in using neural networks and deep learning to solve computer vision problems in image/video processing, computer graphics, and computational photography.
Please see the complete list of publications at Google Scholar.
-
NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
Conde, Marcos V.,
Vazquez-Corral, Javier,
Brown, Michael S.,
and Timofte, Radu
arXiv
2023
3D lookup tables (3D LUTs) are a key component for image enhancement. Modern image signal processors (ISPs) have dedicated support for these as part of the camera rendering pipeline. Cameras typically provide multiple options for picture styles, where each style is usually obtained by applying a unique handcrafted 3D LUT. Current approaches for learning and applying 3D LUTs are notably fast, yet not so memory-efficient, as storing multiple 3D LUTs is required. For this reason and other implementation limitations, their use on mobile devices is less popular. In this work, we propose a Neural Implicit LUT (NILUT), an implicitly defined continuous 3D color transformation parameterized by a neural network. We show that NILUTs are capable of accurately emulating real 3D LUTs. Moreover, a NILUT can be extended to incorporate multiple styles into a single network with the ability to blend styles implicitly. Our novel approach is memory-efficient, controllable and can complement previous methods, including learned ISPs.
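The core idea of an implicit LUT, a continuous color transform parameterized by a small MLP instead of a discrete 3D grid, can be sketched as follows. The architecture, layer sizes, and weights here are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit LUT: a 2-layer MLP mapping an input RGB coordinate in [0, 1]^3
# to an output RGB value, replacing an explicit (e.g. 33x33x33) lookup table.
W1 = rng.normal(0, 0.5, (3, 32))
b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 3))
b2 = np.zeros(3)

def nilut(rgb):
    """Query the implicit LUT at continuous RGB coordinates, shape (N, 3)."""
    h = np.tanh(rgb @ W1 + b1)   # continuous, differentiable representation
    return rgb + h @ W2 + b2     # residual form: identity plus learned shift

pixels = rng.uniform(0, 1, (5, 3))  # any continuous coordinate, no grid needed
out = nilut(pixels)
print(out.shape)  # (5, 3)
```

Training would fit the MLP weights to (input color, LUT output) pairs; extending the input with a style code is one way multiple styles and implicit blending could live in a single network.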
-
Efficient Multi-Lens Bokeh Effect Rendering and Transformation
Seizinger, Tim,
Conde, Marcos V.,
Kolmet, Manuel,
Bishop, Tom E.,
and Timofte, Radu
In CVPR Workshops
2023
Many advancements of mobile cameras aim to reach the visual quality of professional DSLR cameras.
Great progress has been made over recent years in optimizing the sharp regions of an image and in creating virtual portrait effects with artificially blurred backgrounds.
Bokeh is the aesthetic quality of the blur in out-of-focus areas of an image.
This is a popular technique among professional photographers, and for this reason, a new goal in computational photography is to optimize the Bokeh effect itself.
This paper introduces EBokehNet, an efficient state-of-the-art solution for Bokeh effect transformation and rendering.
Our method can render Bokeh from an all-in-focus image, or transform the Bokeh of one lens to the effect of another lens without harming the sharp foreground regions in the image.
Moreover, we can control the shape and strength of the effect by feeding the lens properties, i.e., type (Sony or Canon) and aperture, into the neural network as an additional input.
Our method is a winning solution at the NTIRE 2023 Lens-to-Lens Bokeh Effect Transformation Challenge, and state-of-the-art at the EBB benchmark.
-
Towards Real-Time 4K Image Super-Resolution
Zamfir, Eduard,
Conde, Marcos V.,
and Timofte, Radu
In CVPR Workshops
2023
Over the past few years, high-definition videos and images in 720p (HD), 1080p (FHD), and 4K (UHD) resolution have become standard.
While higher resolutions offer improved visual quality for users, they pose a significant challenge for super-resolution networks to achieve real-time performance on commercial GPUs.
This paper presents a comprehensive analysis of super-resolution model designs and techniques aimed at efficiently upscaling images from 720p and 1080p resolutions to 4K.
We begin with a simple, effective baseline architecture and gradually modify its design by focusing on extracting important high-frequency details efficiently.
This allows us to subsequently downscale the resolution of deep feature maps, reducing the overall computational footprint, while maintaining high reconstruction fidelity.
We enhance our method by incorporating pixel-unshuffling, a simplified and sped-up reinterpretation of the basic block proposed by NAFNet, along with structural re-parameterization.
We assess the performance of the fastest version of our method in the new NTIRE 2023 Real-Time 4K Super-Resolution challenge and demonstrate its potential in comparison with state-of-the-art efficient super-resolution models when scaled up.
Our method was tested successfully on high-quality content from photography, digital art, and gaming.
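Pixel-unshuffling (space-to-depth), mentioned above, trades spatial resolution for channels so that subsequent layers operate on smaller feature maps; a minimal NumPy sketch of the rearrangement:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange (C, H, W) -> (C*r*r, H/r, W/r): each r x r spatial block
    becomes r*r extra channels, shrinking the map later layers must process."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

img = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
out = pixel_unshuffle(img, 2)
print(out.shape)  # (12, 4, 4)
```

The operation is lossless (it only permutes values), which is why it can reduce the computational footprint without discarding high-frequency detail.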
-
Perceptual image enhancement for smartphone real-time applications
Conde, Marcos V.,
Vasluianu, Florin,
Vazquez-Corral, Javier,
and Timofte, Radu
In WACV (Oral)
2023
Recent advances in camera designs and imaging pipelines allow us to capture high-quality images using smartphones. However, due to the small size and lens limitations of smartphone cameras, we commonly find artifacts or degradation in the processed images. The most common unpleasant effects are noise artifacts, diffraction artifacts, blur, and HDR overexposure. Deep learning methods for image restoration can successfully remove these artifacts. However, most approaches are not suitable for real-time applications on mobile devices due to their heavy computation and memory requirements. In this paper, we propose LPIENet, a lightweight network for perceptual image enhancement, with a focus on deploying it on smartphones. Our experiments show that, with far fewer parameters and operations, our model can deal with the mentioned artifacts and achieve competitive performance compared with state-of-the-art methods on standard benchmarks. Moreover, to prove the efficiency and reliability of our approach, we deployed the model directly on commercial smartphones and evaluated its performance. Our model can process 2K resolution images in under 1 second on mid-level commercial smartphones.
-
Model-Based Image Signal Processors via Learnable Dictionaries
Conde, Marcos V.,
McDonagh, Steven,
Maggioni, Matteo,
Leonardis, Aleš,
and Pérez-Pellitero, Eduardo
AAAI (Spotlight 15%)
2022
Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP). Computational photography tasks such as image denoising and colour constancy are commonly performed in the RAW domain, in part due to the inherent hardware design, but also due to the appealing simplicity of noise statistics that result from the direct sensor readings. Despite this, the availability of RAW images is limited in comparison with the abundance and diversity of available RGB data. Recent approaches have attempted to bridge this gap by estimating the RGB to RAW mapping: handcrafted model-based methods that are interpretable and controllable usually require manual parameter fine-tuning, while end-to-end learnable neural networks require large amounts of training data, at times with complex training procedures, and generally lack interpretability and parametric control. Towards addressing these existing limitations, we present a novel hybrid model-based and data-driven ISP that builds on canonical ISP operations and is both learnable and interpretable. Our proposed invertible model, capable of bidirectional mapping between RAW and RGB domains, employs end-to-end learning of rich parameter representations, i.e. dictionaries, that are free from direct parametric supervision and additionally enable simple and plausible data augmentation. We evidence the value of our data generation process by extensive experiments under both RAW image reconstruction and RAW image denoising tasks, obtaining state-of-the-art performance in both. Additionally, we show that our ISP can learn meaningful mappings from few data samples, and that denoising models trained with our dictionary-based data augmentation are competitive despite having only few or zero ground-truth labels.
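A drastically simplified illustration of a model-based, invertible ISP: canonical camera operations (here just white-balance gains and a gamma curve) whose parameters could be learned, each with an exact inverse so the pipeline maps bidirectionally between RAW and RGB. The parameter values below are made up for illustration:

```python
import numpy as np

# Stand-ins for learnable ISP parameters (fixed here for illustration)
gains = np.array([2.0, 1.0, 1.5])  # per-channel white-balance gains
gamma = 2.2                        # tone-curve exponent

def raw_to_rgb(raw):
    """Forward ISP: white balance, then gamma encoding."""
    return (raw * gains) ** (1 / gamma)

def rgb_to_raw(rgb):
    """Inverse ISP: undo the gamma curve, then undo the white balance."""
    return (rgb ** gamma) / gains

raw = np.array([[0.1, 0.4, 0.2],
                [0.3, 0.2, 0.5]])
rec = rgb_to_raw(raw_to_rgb(raw))
print(np.allclose(raw, rec))  # True: bidirectional RAW <-> RGB mapping
```

Because each stage is parametric and interpretable, fitting such parameters from data (as the paper does with dictionaries over canonical ISP operations) keeps the controllability that end-to-end black-box networks lack.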
-
Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration
Conde, Marcos V.,
Choi, Ui-Jin,
Burchi, Maxime,
and Timofte, Radu
In ECCV Workshops
2022
Compression plays an important role in the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality, or video games. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a popular research topic. While most state-of-the-art image restoration methods are based on convolutional neural networks, transformer-based methods such as SwinIR show impressive performance on these tasks. In this paper, we explore the novel Swin Transformer V2 to improve SwinIR for image super-resolution, and in particular the compressed-input scenario. Using this method we can tackle the major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger. We conduct experiments on three representative tasks: JPEG compression artifact removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR, and is a top-5 solution at the AIM 2022 Challenge on Super-Resolution of Compressed Image and Video.
-
CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification
Conde, Marcos V.,
and Turgutlu, Kerem
In CVPR Workshops
2021
Existing computer vision research on artwork struggles with fine-grained attribute recognition and with the lack of curated annotated datasets, which are costly to create. In this work, we use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of art image and text pairs, learning directly from raw descriptions of images or, if available, curated labels. The model's zero-shot capability allows predicting the most relevant natural language description for a given image, without directly optimizing for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset.
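Zero-shot prediction with contrastive embeddings reduces to cosine similarity between an image embedding and the embeddings of candidate text descriptions. The sketch below uses random vectors as stand-ins for the learned encoders (in CLIP, a vision backbone and a text transformer trained so that matching pairs align):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """L2-normalize so that dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for encoder outputs; real embeddings come from trained networks.
image_embedding = normalize(rng.normal(size=64))
text_embeddings = normalize(rng.normal(size=(3, 64)))  # one per description

descriptions = ["an oil painting", "a bronze sculpture", "a woodblock print"]

# Zero-shot classification: pick the description whose embedding has the
# highest cosine similarity to the image embedding.
scores = text_embeddings @ image_embedding
best = descriptions[int(np.argmax(scores))]
print(best)
```

The same similarity scores, sorted, give instance retrieval: ranking a gallery of image embeddings against one query embedding.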
-
Multi-attention Networks for Temporal Localization of Video-level Labels
Zhang, Lijun,
Nizampatnam, Srinath,
Gangopadhyay, Ahana,
and Conde, Marcos V.
In ICCV Workshops
2019
Temporal localization remains an important challenge in video understanding. In this work, we present our solution to the 3rd YouTube-8M Video Understanding Challenge organized by Google Research. Participants were required to build a segment-level classifier using a large-scale training data set with noisy video-level labels and a relatively small-scale validation data set with accurate segment-level labels. We formulated the problem as multiple-instance multi-label learning and developed an attention-based mechanism to selectively emphasize the important frames via attention weights. The model performance is further improved by constructing multiple sets of attention networks. We further fine-tuned the model using the segment-level data set. Our final model consists of an ensemble of attention/multi-attention networks, deep bag-of-frames models, recurrent neural networks, and convolutional neural networks. It ranked 13th on the private leaderboard and stands out for its efficient usage of resources.
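The attention mechanism described above, weighting frames before pooling so informative frames dominate the video-level representation, can be sketched as a softmax over per-frame scores. Dimensions and the projection vector here are illustrative, not the challenge solution's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

frames = rng.normal(size=(30, 128))  # 30 frame-level feature vectors
w = rng.normal(size=128)             # learned attention projection (toy)

# Attention weights: frames scoring higher contribute more to the pooled
# representation than a plain average over frames would allow.
alpha = softmax(frames @ w)          # shape (30,), sums to 1
video_repr = alpha @ frames          # attention-weighted pooling over frames

print(video_repr.shape)  # (128,)
```

Under a multiple-instance view, the video is a bag of frame "instances"; the attention weights let a video-level label be explained by a few salient frames, which is what enables temporal localization from weak labels.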
-
h2oGPT: Democratizing Large Language Models
H2O.ai,
arXiv
2023
We introduce h2oGPT, a suite of open-source code repositories for the creation and use of LLMs based on Generative Pretrained Transformers (GPTs).
The goal of this project is to boost truly open-source alternatives to closed-source approaches.
We open-source several fine-tuned h2oGPT models from 7 to 70 billion parameters, under fully permissive Apache 2.0 licenses.