trf.py 20 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587
  1. """Trust Region Reflective algorithm for least-squares optimization.
  2. The algorithm is based on ideas from paper [STIR]_. The main idea is to
  3. account for the presence of the bounds by appropriate scaling of the variables (or,
  4. equivalently, changing a trust-region shape). Let's introduce a vector v:
  5. | ub[i] - x[i], if g[i] < 0 and ub[i] < np.inf
  6. v[i] = | x[i] - lb[i], if g[i] > 0 and lb[i] > -np.inf
  7. | 1, otherwise
  8. where g is the gradient of a cost function and lb, ub are the bounds. Its
  9. components are distances to the bounds at which the anti-gradient points (if
  10. this distance is finite). Define a scaling matrix D = diag(v**0.5).
  11. First-order optimality conditions can be stated as
  12. D^2 g(x) = 0.
  13. Meaning that components of the gradient should be zero for strictly interior
  14. variables, and components must point inside the feasible region for variables
  15. on the bound.
  16. Now consider this system of equations as a new optimization problem. If the
  17. point x is strictly interior (not on the bound), then the left-hand side is
  18. differentiable and the Newton step for it satisfies
  19. (D^2 H + diag(g) Jv) p = -D^2 g
  20. where H is the Hessian matrix (or its J^T J approximation in least squares),
  21. Jv is the Jacobian matrix of v with components -1, 1 or 0, such that all
  22. elements of matrix C = diag(g) Jv are non-negative. Introduce the change
  23. of the variables x = D x_h (_h would be "hat" in LaTeX). In the new variables,
  24. we have a Newton step satisfying
  25. B_h p_h = -g_h,
  26. where B_h = D H D + C, g_h = D g. In least squares B_h = J_h^T J_h, where
  27. J_h = J D. Note that J_h and g_h are proper Jacobian and gradient with respect
  28. to "hat" variables. To guarantee global convergence we formulate a
  29. trust-region problem based on the Newton step in the new variables:
  30. 0.5 * p_h^T B_h p + g_h^T p_h -> min, ||p_h|| <= Delta
  31. In the original space B = H + D^{-1} C D^{-1}, and the equivalent trust-region
  32. problem is
  33. 0.5 * p^T B p + g^T p -> min, ||D^{-1} p|| <= Delta
  34. Here, the meaning of the matrix D becomes more clear: it alters the shape
  35. of a trust-region, such that large steps towards the bounds are not allowed.
  36. In the implementation, the trust-region problem is solved in "hat" space,
  37. but handling of the bounds is done in the original space (see below and read
  38. the code).
  39. The introduction of the matrix D doesn't allow to ignore bounds, the algorithm
  40. must keep iterates strictly feasible (to satisfy aforementioned
  41. differentiability), the parameter theta controls step back from the boundary
  42. (see the code for details).
  43. The algorithm does another important trick. If the trust-region solution
  44. doesn't fit into the bounds, then a reflected (from a firstly encountered
  45. bound) search direction is considered. For motivation and analysis refer to
  46. [STIR]_ paper (and other papers of the authors). In practice, it doesn't need
  47. a lot of justifications, the algorithm simply chooses the best step among
  48. three: a constrained trust-region step, a reflected step and a constrained
  49. Cauchy step (a minimizer along -g_h in "hat" space, or -D^2 g in the original
  50. space).
  51. Another feature is that a trust-region radius control strategy is modified to
  52. account for appearance of the diagonal C matrix (called diag_h in the code).
  53. Note that all described peculiarities are completely gone as we consider
  54. problems without bounds (the algorithm becomes a standard trust-region type
  55. algorithm very similar to ones implemented in MINPACK).
  56. The implementation supports two methods of solving the trust-region problem.
  57. The first, called 'exact', applies SVD on Jacobian and then solves the problem
  58. very accurately using the algorithm described in [JJMore]_. It is not
  59. applicable to large problem. The second, called 'lsmr', uses the 2-D subspace
  60. approach (sometimes called "indefinite dogleg"), where the problem is solved
  61. in a subspace spanned by the gradient and the approximate Gauss-Newton step
  62. found by ``scipy.sparse.linalg.lsmr``. A 2-D trust-region problem is
  63. reformulated as a 4th order algebraic equation and solved very accurately by
  64. ``numpy.roots``. The subspace approach allows to solve very large problems
  65. (up to couple of millions of residuals on a regular PC), provided the Jacobian
  66. matrix is sufficiently sparse.
  67. References
  68. ----------
  69. .. [STIR] Branch, M.A., T.F. Coleman, and Y. Li, "A Subspace, Interior,
  70. and Conjugate Gradient Method for Large-Scale Bound-Constrained
  71. Minimization Problems," SIAM Journal on Scientific Computing,
  72. Vol. 21, Number 1, pp 1-23, 1999.
  73. .. [JJMore] More, J. J., "The Levenberg-Marquardt Algorithm: Implementation
  74. and Theory," Numerical Analysis, ed. G. A. Watson, Lecture
  75. """
  76. import numpy as np
  77. from numpy.linalg import norm
  78. from scipy.linalg import svd, qr
  79. from scipy.sparse.linalg import lsmr
  80. from scipy.optimize import OptimizeResult
  81. from .common import (
  82. step_size_to_bound, find_active_constraints, in_bounds,
  83. make_strictly_feasible, intersect_trust_region, solve_lsq_trust_region,
  84. solve_trust_region_2d, minimize_quadratic_1d, build_quadratic_1d,
  85. evaluate_quadratic, right_multiplied_operator, regularized_lsq_operator,
  86. CL_scaling_vector, compute_grad, compute_jac_scale, check_termination,
  87. update_tr_radius, scale_for_robust_loss_function, print_header_nonlinear,
  88. print_iteration_nonlinear)
  89. from scipy._lib._util import _call_callback_maybe_halt
  90. def trf(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
  91. loss_function, tr_solver, tr_options, verbose, callback=None):
  92. # For efficiency, it makes sense to run the simplified version of the
  93. # algorithm when no bounds are imposed. We decided to write the two
  94. # separate functions. It violates the DRY principle, but the individual
  95. # functions are kept the most readable.
  96. if np.all(lb == -np.inf) and np.all(ub == np.inf):
  97. return trf_no_bounds(
  98. fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev, x_scale,
  99. loss_function, tr_solver, tr_options, verbose, callback=callback)
  100. else:
  101. return trf_bounds(
  102. fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
  103. loss_function, tr_solver, tr_options, verbose, callback=callback)
  104. def select_step(x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta):
  105. """Select the best step according to Trust Region Reflective algorithm."""
  106. if in_bounds(x + p, lb, ub):
  107. p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
  108. return p, p_h, -p_value
  109. p_stride, hits = step_size_to_bound(x, p, lb, ub)
  110. # Compute the reflected direction.
  111. r_h = np.copy(p_h)
  112. r_h[hits.astype(bool)] *= -1
  113. r = d * r_h
  114. # Restrict trust-region step, such that it hits the bound.
  115. p *= p_stride
  116. p_h *= p_stride
  117. x_on_bound = x + p
  118. # Reflected direction will cross first either feasible region or trust
  119. # region boundary.
  120. _, to_tr = intersect_trust_region(p_h, r_h, Delta)
  121. to_bound, _ = step_size_to_bound(x_on_bound, r, lb, ub)
  122. # Find lower and upper bounds on a step size along the reflected
  123. # direction, considering the strict feasibility requirement. There is no
  124. # single correct way to do that, the chosen approach seems to work best
  125. # on test problems.
  126. r_stride = min(to_bound, to_tr)
  127. if r_stride > 0:
  128. r_stride_l = (1 - theta) * p_stride / r_stride
  129. if r_stride == to_bound:
  130. r_stride_u = theta * to_bound
  131. else:
  132. r_stride_u = to_tr
  133. else:
  134. r_stride_l = 0
  135. r_stride_u = -1
  136. # Check if reflection step is available.
  137. if r_stride_l <= r_stride_u:
  138. a, b, c = build_quadratic_1d(J_h, g_h, r_h, s0=p_h, diag=diag_h)
  139. r_stride, r_value = minimize_quadratic_1d(
  140. a, b, r_stride_l, r_stride_u, c=c)
  141. r_h *= r_stride
  142. r_h += p_h
  143. r = r_h * d
  144. else:
  145. r_value = np.inf
  146. # Now correct p_h to make it strictly interior.
  147. p *= theta
  148. p_h *= theta
  149. p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
  150. ag_h = -g_h
  151. ag = d * ag_h
  152. to_tr = Delta / norm(ag_h)
  153. to_bound, _ = step_size_to_bound(x, ag, lb, ub)
  154. if to_bound < to_tr:
  155. ag_stride = theta * to_bound
  156. else:
  157. ag_stride = to_tr
  158. a, b = build_quadratic_1d(J_h, g_h, ag_h, diag=diag_h)
  159. ag_stride, ag_value = minimize_quadratic_1d(a, b, 0, ag_stride)
  160. ag_h *= ag_stride
  161. ag *= ag_stride
  162. if p_value < r_value and p_value < ag_value:
  163. return p, p_h, -p_value
  164. elif r_value < p_value and r_value < ag_value:
  165. return r, r_h, -r_value
  166. else:
  167. return ag, ag_h, -ag_value
  168. def trf_bounds(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev,
  169. x_scale, loss_function, tr_solver, tr_options, verbose,
  170. callback=None):
  171. x = x0.copy()
  172. f = f0
  173. f_true = f.copy()
  174. nfev = 1
  175. J = J0
  176. njev = 1
  177. m, n = J.shape
  178. if loss_function is not None:
  179. rho = loss_function(f)
  180. cost = 0.5 * np.sum(rho[0])
  181. J, f = scale_for_robust_loss_function(J, f, rho)
  182. else:
  183. cost = 0.5 * np.dot(f, f)
  184. g = compute_grad(J, f)
  185. jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
  186. if jac_scale:
  187. scale, scale_inv = compute_jac_scale(J)
  188. else:
  189. scale, scale_inv = x_scale, 1 / x_scale
  190. v, dv = CL_scaling_vector(x, g, lb, ub)
  191. v[dv != 0] *= scale_inv[dv != 0]
  192. Delta = norm(x0 * scale_inv / v**0.5)
  193. if Delta == 0:
  194. Delta = 1.0
  195. g_norm = norm(g * v, ord=np.inf)
  196. f_augmented = np.zeros(m + n)
  197. if tr_solver == 'exact':
  198. J_augmented = np.empty((m + n, n))
  199. elif tr_solver == 'lsmr':
  200. reg_term = 0.0
  201. regularize = tr_options.pop('regularize', True)
  202. if max_nfev is None:
  203. max_nfev = x0.size * 100
  204. alpha = 0.0 # "Levenberg-Marquardt" parameter
  205. termination_status = None
  206. iteration = 0
  207. step_norm = None
  208. actual_reduction = None
  209. if verbose == 2:
  210. print_header_nonlinear()
  211. while True:
  212. v, dv = CL_scaling_vector(x, g, lb, ub)
  213. g_norm = norm(g * v, ord=np.inf)
  214. if g_norm < gtol:
  215. termination_status = 1
  216. if verbose == 2:
  217. print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
  218. step_norm, g_norm)
  219. if termination_status is not None or nfev == max_nfev:
  220. break
  221. # Now compute variables in "hat" space. Here, we also account for
  222. # scaling introduced by `x_scale` parameter. This part is a bit tricky,
  223. # you have to write down the formulas and see how the trust-region
  224. # problem is formulated when the two types of scaling are applied.
  225. # The idea is that first we apply `x_scale` and then apply Coleman-Li
  226. # approach in the new variables.
  227. # v is recomputed in the variables after applying `x_scale`, note that
  228. # components which were identically 1 not affected.
  229. v[dv != 0] *= scale_inv[dv != 0]
  230. # Here, we apply two types of scaling.
  231. d = v**0.5 * scale
  232. # C = diag(g * scale) Jv
  233. diag_h = g * dv * scale
  234. # After all this has been done, we continue normally.
  235. # "hat" gradient.
  236. g_h = d * g
  237. f_augmented[:m] = f
  238. if tr_solver == 'exact':
  239. J_augmented[:m] = J * d
  240. J_h = J_augmented[:m] # Memory view.
  241. J_augmented[m:] = np.diag(diag_h**0.5)
  242. U, s, V = svd(J_augmented, full_matrices=False)
  243. V = V.T
  244. uf = U.T.dot(f_augmented)
  245. elif tr_solver == 'lsmr':
  246. J_h = right_multiplied_operator(J, d)
  247. if regularize:
  248. a, b = build_quadratic_1d(J_h, g_h, -g_h, diag=diag_h)
  249. to_tr = Delta / norm(g_h)
  250. ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
  251. reg_term = -ag_value / Delta**2
  252. lsmr_op = regularized_lsq_operator(J_h, (diag_h + reg_term)**0.5)
  253. gn_h = lsmr(lsmr_op, f_augmented, **tr_options)[0]
  254. S = np.vstack((g_h, gn_h)).T
  255. S, _ = qr(S, mode='economic')
  256. JS = J_h.dot(S) # LinearOperator does dot too.
  257. B_S = np.dot(JS.T, JS) + np.dot(S.T * diag_h, S)
  258. g_S = S.T.dot(g_h)
  259. # theta controls step back step ratio from the bounds.
  260. theta = max(0.995, 1 - g_norm)
  261. actual_reduction = -1
  262. while actual_reduction <= 0 and nfev < max_nfev:
  263. if tr_solver == 'exact':
  264. p_h, alpha, n_iter = solve_lsq_trust_region(
  265. n, m, uf, s, V, Delta, initial_alpha=alpha)
  266. elif tr_solver == 'lsmr':
  267. p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
  268. p_h = S.dot(p_S)
  269. p = d * p_h # Trust-region solution in the original space.
  270. step, step_h, predicted_reduction = select_step(
  271. x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta)
  272. x_new = make_strictly_feasible(x + step, lb, ub, rstep=0)
  273. f_new = fun(x_new)
  274. nfev += 1
  275. step_h_norm = norm(step_h)
  276. if not np.all(np.isfinite(f_new)):
  277. Delta = 0.25 * step_h_norm
  278. continue
  279. # Usual trust-region step quality estimation.
  280. if loss_function is not None:
  281. cost_new = loss_function(f_new, cost_only=True)
  282. else:
  283. cost_new = 0.5 * np.dot(f_new, f_new)
  284. actual_reduction = cost - cost_new
  285. Delta_new, ratio = update_tr_radius(
  286. Delta, actual_reduction, predicted_reduction,
  287. step_h_norm, step_h_norm > 0.95 * Delta)
  288. step_norm = norm(step)
  289. termination_status = check_termination(
  290. actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
  291. if termination_status is not None:
  292. break
  293. alpha *= Delta / Delta_new
  294. Delta = Delta_new
  295. if actual_reduction > 0:
  296. x = x_new
  297. f = f_new
  298. f_true = f.copy()
  299. cost = cost_new
  300. J = jac(x)
  301. njev += 1
  302. if loss_function is not None:
  303. rho = loss_function(f)
  304. J, f = scale_for_robust_loss_function(J, f, rho)
  305. g = compute_grad(J, f)
  306. if jac_scale:
  307. scale, scale_inv = compute_jac_scale(J, scale_inv)
  308. else:
  309. step_norm = 0
  310. actual_reduction = 0
  311. iteration += 1
  312. # Call callback function and possibly stop optimization
  313. if callback is not None:
  314. intermediate_result = OptimizeResult(
  315. x=x, fun=f_true, nit=iteration, nfev=nfev)
  316. intermediate_result["cost"] = cost
  317. if _call_callback_maybe_halt(
  318. callback, intermediate_result
  319. ):
  320. termination_status = -2
  321. break
  322. if termination_status is None:
  323. termination_status = 0
  324. active_mask = find_active_constraints(x, lb, ub, rtol=xtol)
  325. return OptimizeResult(
  326. x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
  327. active_mask=active_mask, nfev=nfev, njev=njev,
  328. status=termination_status)
  329. def trf_no_bounds(fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev,
  330. x_scale, loss_function, tr_solver, tr_options, verbose,
  331. callback=None):
  332. x = x0.copy()
  333. f = f0
  334. f_true = f.copy()
  335. nfev = 1
  336. J = J0
  337. njev = 1
  338. m, n = J.shape
  339. if loss_function is not None:
  340. rho = loss_function(f)
  341. cost = 0.5 * np.sum(rho[0])
  342. J, f = scale_for_robust_loss_function(J, f, rho)
  343. else:
  344. cost = 0.5 * np.dot(f, f)
  345. g = compute_grad(J, f)
  346. jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
  347. if jac_scale:
  348. scale, scale_inv = compute_jac_scale(J)
  349. else:
  350. scale, scale_inv = x_scale, 1 / x_scale
  351. Delta = norm(x0 * scale_inv)
  352. if Delta == 0:
  353. Delta = 1.0
  354. if tr_solver == 'lsmr':
  355. reg_term = 0
  356. damp = tr_options.pop('damp', 0.0)
  357. regularize = tr_options.pop('regularize', True)
  358. if max_nfev is None:
  359. max_nfev = x0.size * 100
  360. alpha = 0.0 # "Levenberg-Marquardt" parameter
  361. termination_status = None
  362. iteration = 0
  363. step_norm = None
  364. actual_reduction = None
  365. if verbose == 2:
  366. print_header_nonlinear()
  367. while True:
  368. g_norm = norm(g, ord=np.inf)
  369. if g_norm < gtol:
  370. termination_status = 1
  371. if verbose == 2:
  372. print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
  373. step_norm, g_norm)
  374. if termination_status is not None or nfev == max_nfev:
  375. break
  376. d = scale
  377. g_h = d * g
  378. if tr_solver == 'exact':
  379. J_h = J * d
  380. U, s, V = svd(J_h, full_matrices=False)
  381. V = V.T
  382. uf = U.T.dot(f)
  383. elif tr_solver == 'lsmr':
  384. J_h = right_multiplied_operator(J, d)
  385. if regularize:
  386. a, b = build_quadratic_1d(J_h, g_h, -g_h)
  387. to_tr = Delta / norm(g_h)
  388. ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
  389. reg_term = -ag_value / Delta**2
  390. damp_full = (damp**2 + reg_term)**0.5
  391. gn_h = lsmr(J_h, f, damp=damp_full, **tr_options)[0]
  392. S = np.vstack((g_h, gn_h)).T
  393. S, _ = qr(S, mode='economic')
  394. JS = J_h.dot(S)
  395. B_S = np.dot(JS.T, JS)
  396. g_S = S.T.dot(g_h)
  397. actual_reduction = -1
  398. while actual_reduction <= 0 and nfev < max_nfev:
  399. if tr_solver == 'exact':
  400. step_h, alpha, n_iter = solve_lsq_trust_region(
  401. n, m, uf, s, V, Delta, initial_alpha=alpha)
  402. elif tr_solver == 'lsmr':
  403. p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
  404. step_h = S.dot(p_S)
  405. predicted_reduction = -evaluate_quadratic(J_h, g_h, step_h)
  406. step = d * step_h
  407. x_new = x + step
  408. f_new = fun(x_new)
  409. nfev += 1
  410. step_h_norm = norm(step_h)
  411. if not np.all(np.isfinite(f_new)):
  412. Delta = 0.25 * step_h_norm
  413. continue
  414. # Usual trust-region step quality estimation.
  415. if loss_function is not None:
  416. cost_new = loss_function(f_new, cost_only=True)
  417. else:
  418. cost_new = 0.5 * np.dot(f_new, f_new)
  419. actual_reduction = cost - cost_new
  420. Delta_new, ratio = update_tr_radius(
  421. Delta, actual_reduction, predicted_reduction,
  422. step_h_norm, step_h_norm > 0.95 * Delta)
  423. step_norm = norm(step)
  424. termination_status = check_termination(
  425. actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
  426. if termination_status is not None:
  427. break
  428. alpha *= Delta / Delta_new
  429. Delta = Delta_new
  430. if actual_reduction > 0:
  431. x = x_new
  432. f = f_new
  433. f_true = f.copy()
  434. cost = cost_new
  435. J = jac(x)
  436. njev += 1
  437. if loss_function is not None:
  438. rho = loss_function(f)
  439. J, f = scale_for_robust_loss_function(J, f, rho)
  440. g = compute_grad(J, f)
  441. if jac_scale:
  442. scale, scale_inv = compute_jac_scale(J, scale_inv)
  443. else:
  444. step_norm = 0
  445. actual_reduction = 0
  446. iteration += 1
  447. # Call callback function and possibly stop optimization
  448. if callback is not None:
  449. intermediate_result = OptimizeResult(
  450. x=x, fun=f_true, nit=iteration, nfev=nfev)
  451. intermediate_result["cost"] = cost
  452. if _call_callback_maybe_halt(
  453. callback, intermediate_result
  454. ):
  455. termination_status = -2
  456. break
  457. if termination_status is None:
  458. termination_status = 0
  459. active_mask = np.zeros_like(x)
  460. return OptimizeResult(
  461. x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
  462. active_mask=active_mask, nfev=nfev, njev=njev,
  463. status=termination_status)