Imagedl APIs

imagedl.imagedl.ImageClient

ImageClient is a high-level interface for searching and downloading images using different backends (e.g., BaiduImageClient, BingImageClient and GoogleImageClient) registered in ImageClientBuilder.REGISTERED_MODULES. Arguments supported when initializing this class include:

  • image_source (str, default: BaiduImageClient): Name of the image client backend to use. Must be one of the registered modules in ImageClientBuilder.REGISTERED_MODULES.

  • init_image_client_cfg (dict or None, default: None): Extra configuration passed to the underlying image client on initialization. It is merged into a default config:

    default_image_client_cfg = {
        "work_dir": "imagedl_outputs",
        "logger_handle": ImageClient.logger_handle,
        "type": image_source,
        "auto_set_proxies": False,
        "random_update_ua": False,
        "enable_search_curl_cffi": False,
        "enable_download_curl_cffi": False,
        "max_retries": 5,
        "maintain_session": False,
        "disable_print": False,
        "freeproxy_settings": None,
        "default_search_cookies": None,
        "default_download_cookies": None,
    }
    
  • search_limits (int, default: 1000): Default maximum number of images to retrieve per search. Can be overridden per call in ImageClient.search().

  • num_threadings (int, default: 5): Default number of threads to use for network requests and downloads. Can be overridden per call in ImageClient.search() and ImageClient.download().

  • request_overrides (dict or None, default: None): Extra keyword arguments forwarded to requests.get in the underlying image client, e.g., proxies and timeout. These are stored and passed to both ImageClient.search() and ImageClient.download() unless overridden inside the backend.
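The way init_image_client_cfg is merged into the default config can be pictured with a minimal sketch. It assumes a shallow dict update in which user-supplied keys win; the helper name build_image_client_cfg is hypothetical, and only a few of the default keys are reproduced to keep the example short.

```python
from typing import Optional


def build_image_client_cfg(image_source: str, init_image_client_cfg: Optional[dict] = None) -> dict:
    """Hypothetical helper: merge user overrides into a default config.

    Assumption: the merge is a shallow dict update where user keys win.
    """
    default_image_client_cfg = {
        "work_dir": "imagedl_outputs",
        "type": image_source,
        "max_retries": 5,
        "maintain_session": False,
    }
    merged = dict(default_image_client_cfg)  # copy so the defaults stay untouched
    merged.update(init_image_client_cfg or {})  # None is treated as "no overrides"
    return merged


cfg = build_image_client_cfg("BaiduImageClient", {"work_dir": "my_images", "max_retries": 3})
```

Under this assumption, keys that are not overridden (here "type" and "maintain_session") keep their default values.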

ImageClient.startcmdui

Start an interactive command-line interface (CLI) for searching and downloading images. Intended mainly for end users running imagedl from the terminal.

Behavior:

  • Repeatedly:
    • Prints a banner with basic information (version, work dir, usage help).
    • Prompts the user for a search keyword: "Please enter keywords for the image search:"
  • Special inputs:
    • q / Q: exit the program.
    • r / R: restart and return to the main menu.
  • Any other input is treated as a search keyword:
    • Calls the underlying backend’s search():
      • keyword = user input
      • search_limits = ImageClient.search_limits
      • num_threadings = ImageClient.num_threadings
      • request_overrides = ImageClient.request_overrides
    • Immediately calls the backend’s download() on the search results.
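The control flow described above can be sketched without any terminal I/O. This is an illustration under stated assumptions, not the actual startcmdui implementation: run_prompt_loop is a hypothetical name, the user inputs are supplied as an iterable instead of being read from the prompt, and search and download are injected callables standing in for the backend methods.

```python
from typing import Callable, Iterable, List


def run_prompt_loop(
    inputs: Iterable[str],
    search: Callable[[str], List[dict]],
    download: Callable[[List[dict]], List[dict]],
) -> List[str]:
    """Process inputs as the behavior list describes; return the searched keywords."""
    searched: List[str] = []
    for raw in inputs:
        user_input = raw.strip()
        if user_input.lower() == "q":
            break  # q / Q: exit the program
        if user_input.lower() == "r":
            continue  # r / R: go back to the prompt without searching
        image_infos = search(user_input)  # any other input is a search keyword
        download(image_infos)  # download immediately follows the search
        searched.append(user_input)
    return searched


downloaded_batches: List[List[dict]] = []
searched_keywords = run_prompt_loop(
    ["cat", "R", "dog", "q", "bird"],
    search=lambda keyword: [{"identifier": keyword}],
    download=lambda infos: downloaded_batches.append(infos) or infos,
)
```

In this run only "cat" and "dog" are searched: "R" returns to the prompt, and "q" ends the loop before "bird" is read.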

Example (CLI usage):

python -m imagedl.imagedl

ImageClient.search

Perform an image search programmatically using the configured backend. This method only retrieves metadata; it does NOT download any images.

Arguments:

  • keyword (str): The search query string, e.g. "Eiffel Tower", "golden retriever".

  • search_limits_overrides (int | None, default: None): Per-call maximum number of images to retrieve. If None, falls back to ImageClient.search_limits.

  • num_threadings_overrides (int | None, default: None): Per-call override for the number of threads. If None, falls back to ImageClient.num_threadings.

  • filters (dict | None, default: None): Optional filter configuration passed directly to the backend (e.g., image size, color, type), if supported by the chosen image_source.

Returns:

  • list: A list of image metadata objects (backend-defined structure) that can be passed directly to ImageClient.download().

Example:

from imagedl.imagedl import ImageClient

client = ImageClient(
    image_source="BaiduImageClient", search_limits=200, num_threadings=10,
)

image_infos = client.search(
    keyword="cute cat", search_limits_overrides=50,
)
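The fallback rule for the *_overrides arguments can be written as a one-line helper. This is a sketch of the documented rule only; resolve_override is a hypothetical name, and the explicit `is None` check is an assumption about how the fallback is tested.

```python
from typing import Optional


def resolve_override(override: Optional[int], default: int) -> int:
    """Return the per-call override, or the client-level default when it is None."""
    return default if override is None else override


# Mirrors the example above: the client defaults are 200 images and 10 threads.
effective_search_limits = resolve_override(50, 200)
effective_num_threadings = resolve_override(None, 10)
```

Testing against None, instead of relying on truthiness, keeps an explicit override such as 0 from being silently replaced by the default.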

ImageClient.download

Download images from a list of image metadata entries, typically returned by ImageClient.search().

Arguments:

  • image_infos (list): A list of image metadata objects returned by ImageClient.search(). Each entry must contain enough information (e.g. URL) for the backend to download the corresponding image.

  • num_threadings_overrides (int | None, default: None): Per-call override for the number of threads used for downloading. If None, falls back to ImageClient.num_threadings.

Returns:

  • list: A list of image metadata objects (backend-defined structure) for the images that were downloaded successfully.

Example:

from imagedl.imagedl import ImageClient

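# work_dir is not an ImageClient argument; it is passed through init_image_client_cfg
client = ImageClient(init_image_client_cfg={"work_dir": "my_images"})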

# 1. Search
infos = client.search("Eiffel Tower", search_limits_overrides=30)

# 2. Download
client.download(infos, num_threadings_overrides=8)

imagedl.imagedl.modules.sources.BaseImageClient

BaseImageClient is the abstract base class for all image search & download clients in this project. Concrete clients inherit from it and reuse its common logic for:

  • Session management (headers, cookies, user-agent, retries)
  • Optional proxy auto-configuration
  • Multithreaded search and download
  • Progress bars and logging
  • Result saving (search_results.pkl, download_results.pkl)

Current implementations built on top of BaseImageClient include:

  • imagedl.imagedl.modules.sources.BaiduImageClient
  • imagedl.imagedl.modules.sources.BingImageClient
  • imagedl.imagedl.modules.sources.DuckduckgoImageClient
  • imagedl.imagedl.modules.sources.DanbooruImageClient
  • imagedl.imagedl.modules.sources.DimTownImageClient
  • imagedl.imagedl.modules.sources.EverypixelImageClient
  • imagedl.imagedl.modules.sources.FoodiesfeedImageClient
  • imagedl.imagedl.modules.sources.FreeNatureStockImageClient
  • imagedl.imagedl.modules.sources.GoogleImageClient
  • imagedl.imagedl.modules.sources.GelbooruImageClient
  • imagedl.imagedl.modules.sources.HuabanImageClient
  • imagedl.imagedl.modules.sources.I360ImageClient
  • imagedl.imagedl.modules.sources.PixabayImageClient
  • imagedl.imagedl.modules.sources.PexelsImageClient
  • imagedl.imagedl.modules.sources.SogouImageClient
  • imagedl.imagedl.modules.sources.SafebooruImageClient
  • imagedl.imagedl.modules.sources.UnsplashImageClient
  • imagedl.imagedl.modules.sources.WeiboImageClient
  • imagedl.imagedl.modules.sources.YandexImageClient
  • imagedl.imagedl.modules.sources.YahooImageClient

In most cases, users do not instantiate BaseImageClient directly. Instead, they use a concrete client such as BaiduImageClient. The external API surface of every client is the same as that of BaseImageClient (search + download). Arguments supported when initializing this class include:

  • auto_set_proxies (bool, default: False): If True, assigns a randomly chosen free proxy, fetched by freeproxy.ProxiedSessionClient, to each request (see FreeProxy for details).

  • random_update_ua (bool, default: False): If True, randomly updates the User-Agent header before each request (using fake_useragent.UserAgent().random), providing additional variability.

  • enable_search_curl_cffi (bool, default: False): If True, curl_cffi.requests.Session is used for each search request.

  • enable_download_curl_cffi (bool, default: False): If True, curl_cffi.requests.Session is used for each download request.

  • max_retries (int, default: 5): Maximum number of retry attempts in BaseImageClient.get() / BaseImageClient.post() when requests fail or return non-200 HTTP status codes.

  • maintain_session (bool, default: False): If False, a new requests.Session is created before each request; if True, the same session is reused across requests. Combined with random_update_ua, this controls how “sticky” your session is.

  • logger_handle (LoggerHandle or None, default: None): Logger used for informational messages and error reporting. If None, a default LoggerHandle instance is created.

  • disable_print (bool, default: False): If True, suppresses console printing in LoggerHandle (logging still happens internally).

  • work_dir (str, default: "imagedl_outputs"): Root directory for all outputs produced by this client. Under this directory, the client will create per-source and per-search subfolders, for example:

    • imagedl_outputs/BaiduImageClient/2025-11-19-18-30-00 cat/
    • Inside each search folder:
      • search_results.pkl
      • download_results.pkl
      • image files: 00000001.jpg, 00000002.png, ...

  • freeproxy_settings (dict or None, default: None): Arguments passed when instantiating freeproxy.ProxiedSessionClient. If None, defaults to dict(disable_print=True, proxy_sources=['ProxiflyProxiedSession'], max_tries=20, init_proxied_session_cfg={}) when auto_set_proxies=True.

  • default_search_cookies (dict or None, default: None): Default cookies used for each search request.

  • default_download_cookies (dict or None, default: None): Default cookies used for each download request.
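The role of max_retries can be sketched as a bounded retry loop. This is an illustration, not the code behind BaseImageClient.get() / BaseImageClient.post(): FakeResponse and request_with_retries are hypothetical, the request is an injected callable so nothing touches the network, and returning None once all attempts are used up is an assumption.

```python
from typing import Callable, Optional


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


def request_with_retries(send: Callable[[], FakeResponse], max_retries: int = 5) -> Optional[FakeResponse]:
    """Call send() up to max_retries times until it yields a 200 response."""
    for _ in range(max_retries):
        try:
            resp = send()
        except Exception:
            continue  # a failed request uses up one attempt
        if resp.status_code == 200:
            return resp  # success: stop retrying
        # any non-200 status also uses up one attempt
    return None  # assumption: None signals that every attempt failed


attempt_log = []
outcomes = iter([ConnectionError("boom"), FakeResponse(503), FakeResponse(200)])


def flaky_send() -> FakeResponse:
    attempt_log.append(1)
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


retry_result = request_with_retries(flaky_send, max_retries=5)
```

Here the first attempt raises, the second returns 503, and the third succeeds, so three of the five allowed attempts are used.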

BaseImageClient.search

Arguments:

  • keyword (str): Search keyword / query sent to the image provider (e.g., "Eiffel Tower", "golden retriever").

  • search_limits (int, default: 1000): Target maximum number of image records to retrieve. Exact behavior depends on how BaseImageClient._constructsearchurls is implemented in the subclass.

  • num_threadings (int, default: 5): Number of worker threads used to fetch search pages in parallel. Each thread runs BaseImageClient._search, pulling URLs from the shared search_urls list.

  • filters (dict or None, default: None): Optional filter configuration that subclasses may use to refine search results (e.g., image size, color, type). The structure is client-specific.

  • request_overrides (dict or None, default: None): Extra keyword arguments forwarded to requests.get for search requests (e.g., timeout, headers, proxies). These are merged on top of the session’s default headers and proxy settings.
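The num_threadings description (worker threads pulling URLs from a shared search_urls list) can be sketched with the standard threading module. This is a simplified model under stated assumptions, not BaseImageClient._search itself: search_in_threads is a hypothetical name, fetch_page is an injected callable that replaces the real HTTP request, and a single lock guards both the pending list and the results.

```python
import threading
from typing import Callable, List


def search_in_threads(
    search_urls: List[str],
    fetch_page: Callable[[str], List[dict]],
    num_threadings: int = 5,
) -> List[dict]:
    """Fetch every URL once, using num_threadings workers that share one list."""
    pending = list(search_urls)  # shared work list, copied so the input is not mutated
    results: List[dict] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            with lock:
                if not pending:
                    return  # nothing left to fetch
                url = pending.pop(0)
            page_infos = fetch_page(url)  # done outside the lock so workers overlap
            with lock:
                results.extend(page_infos)

    threads = [threading.Thread(target=worker) for _ in range(max(1, num_threadings))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


demo_urls = [f"page-{index}" for index in range(20)]
demo_results = search_in_threads(demo_urls, lambda url: [{"identifier": url}], num_threadings=4)
```

Because workers finish in an unpredictable order, the order of the collected results is not guaranteed to match the order of search_urls.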

Returns:

  • list of image_info dicts. The exact keys are determined by the subclass, but BaseImageClient expects at least:
    • identifier: a unique ID used for deduplication.
    • candidate_urls: list of candidate image URLs for downloading.
  • After the search pipeline, it also fills:
    • work_dir: per-search directory.
    • file_path: base file path (without extension) reserved for downloading.
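How identifier, work_dir and file_path fit together can be shown with a small post-processing sketch. It assumes that duplicates are dropped by identifier (first occurrence wins) and that file_path is a sequential, zero-padded 8-digit name inside work_dir, matching the 00000001.jpg naming shown earlier; finalize_search_results is a hypothetical helper, not part of BaseImageClient.

```python
import os
from typing import List


def finalize_search_results(image_infos: List[dict], work_dir: str) -> List[dict]:
    """Deduplicate by identifier, then fill work_dir and an extension-less file_path."""
    seen = set()
    finalized: List[dict] = []
    for info in image_infos:
        if info["identifier"] in seen:
            continue  # same identifier already kept: skip the duplicate
        seen.add(info["identifier"])
        entry = dict(info)  # copy so the caller's dicts are left unchanged
        entry["work_dir"] = work_dir
        # Assumption: names are sequential and zero-padded to 8 digits.
        entry["file_path"] = os.path.join(work_dir, f"{len(finalized) + 1:08d}")
        finalized.append(entry)
    return finalized


raw_infos = [
    {"identifier": "a", "candidate_urls": ["https://example.invalid/a.jpg"]},
    {"identifier": "b", "candidate_urls": ["https://example.invalid/b.png"]},
    {"identifier": "a", "candidate_urls": ["https://example.invalid/a-mirror.jpg"]},
]
finalized_infos = finalize_search_results(raw_infos, "imagedl_outputs/demo")
```

The third entry repeats identifier "a", so two entries remain, with base names 00000001 and 00000002.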

BaseImageClient.download

Arguments:

  • image_infos (list): List of image metadata entries produced by BaseImageClient.search(), or loaded from search_results.pkl. Each entry should contain at least:

    • work_dir: directory where the image should be saved.
    • file_path: base file path (without extension).
    • candidate_urls: list of URLs to try when downloading the image.

  • num_threadings (int, default: 5): Number of worker threads to use for downloading images in parallel.

  • request_overrides (dict or None, default: None): Extra keyword arguments forwarded to requests.get for download requests (e.g., timeout, per-request headers or proxies). These options override or extend the session-level defaults.
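Since image_infos may also be loaded from search_results.pkl, a round trip through the standard pickle module is sketched here. It assumes the file is an ordinary pickle of the image_info list, which the .pkl extension suggests but the text above does not confirm; save_image_infos, load_image_infos and roundtrip_demo are hypothetical helpers.

```python
import os
import pickle
import tempfile
from typing import List


def save_image_infos(image_infos: List[dict], path: str) -> None:
    """Write the list with pickle (assumed on-disk format)."""
    with open(path, "wb") as fp:
        pickle.dump(image_infos, fp)


def load_image_infos(path: str) -> List[dict]:
    """Read the list back; only unpickle files from a trusted source."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


def roundtrip_demo() -> List[dict]:
    """Save a one-entry list to a temporary search_results.pkl and load it again."""
    image_infos = [{
        "identifier": "a",
        "candidate_urls": ["https://example.invalid/a.jpg"],
        "work_dir": "imagedl_outputs/demo",
        "file_path": "imagedl_outputs/demo/00000001",
    }]
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "search_results.pkl")
        save_image_infos(image_infos, path)
        return load_image_infos(path)


reloaded_infos = roundtrip_demo()
```

A list loaded this way carries the work_dir, file_path and candidate_urls keys that download() expects, so a download can be resumed without repeating the search.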

Returns:

  • list of downloaded_image_info dicts. For each successfully downloaded image:

    • file_path is updated to include the actual file extension (e.g. .../00000001.jpg).
    • Other fields are copied from the original image_info.
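The return value can be pictured with a sketch that tries each candidate URL in turn and appends the extension on success. This is a simplified, single-threaded model, not the actual download code: download_one and download_all are hypothetical names, and fetch is an injected callable that returns a file extension on success or None on failure instead of performing a real request.

```python
from typing import Callable, List, Optional


def download_one(image_info: dict, fetch: Callable[[str], Optional[str]]) -> Optional[dict]:
    """Try candidate_urls in order; return an updated copy on the first success."""
    for url in image_info["candidate_urls"]:
        extension = fetch(url)  # e.g. "jpg", or None if this URL failed
        if extension is None:
            continue
        downloaded = dict(image_info)  # other fields are copied unchanged
        downloaded["file_path"] = f"{image_info['file_path']}.{extension}"
        return downloaded
    return None  # every candidate failed


def download_all(image_infos: List[dict], fetch: Callable[[str], Optional[str]]) -> List[dict]:
    """Keep only the entries whose download succeeded."""
    attempted = [download_one(info, fetch) for info in image_infos]
    return [info for info in attempted if info is not None]


def fake_fetch(url: str) -> Optional[str]:
    """Pretend that only URLs ending in .jpg can be downloaded."""
    return "jpg" if url.endswith(".jpg") else None


pending_infos = [
    {"identifier": "a", "file_path": "out/00000001", "candidate_urls": ["https://example.invalid/a.bad", "https://example.invalid/a.jpg"]},
    {"identifier": "b", "file_path": "out/00000002", "candidate_urls": ["https://example.invalid/b.bad"]},
]
downloaded_infos = download_all(pending_infos, fake_fetch)
```

Only the first entry appears in the result, with file_path "out/00000001.jpg"; the second entry is dropped because none of its candidate URLs succeed.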