@bertsky
bertsky / tei2txt.sh
Last active April 3, 2024 18:46
wrapper around dta-tools tei2txt.pl covering dehyphenation
#!/bin/bash
nontext_opts=(
xmlstarlet ed -N tei=http://www.tei-c.org/ns/1.0
-d //tei:note
-d //tei:fw
-d //tei:table
-d //tei:figure
-d //tei:formula
-d //tei:titlePage
@bertsky
bertsky / charfreq.py
Created April 2, 2024 11:39
Aggregate character histogram for the given text files
#!/usr/bin/env python3
import argparse
import os
import sys
import io
from functools import reduce
import json
import unicodedata
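The preview above stops after the imports. As a rough illustration of what "aggregate character histogram" can mean, here is a minimal sketch; it is not the gist's actual code. The function names `char_histogram` and `describe`, and the choice to normalize to NFC before counting, are assumptions made for this example.

```python
import unicodedata
from collections import Counter


def char_histogram(texts, normalization="NFC"):
    """Count characters across all given strings after Unicode normalization.

    `normalization` is an assumption of this sketch, not taken from the gist.
    """
    counts = Counter()
    for text in texts:
        # Normalizing first makes "a" + combining diaeresis count as "ä".
        counts.update(unicodedata.normalize(normalization, text))
    return counts


def describe(counts):
    """Return (character, Unicode name, count) triples, most frequent first."""
    return [(char, unicodedata.name(char, "UNKNOWN"), freq)
            for char, freq in counts.most_common()]
```

To aggregate over files, one could pass the result of reading each file (for example `open(path, encoding="utf-8").read()`) as the `texts` iterable and serialize the counter with `json.dumps(dict(counts))`.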
@bertsky
bertsky / mlmodel.py
Last active February 25, 2024 16:23
dump user metadata of a kraken model file or fix it
#!/usr/bin/env python3
# Dump user metadata of a kraken model file or fix it.
import click
import json
import os
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # error
@bertsky
bertsky / metha-dump.py
Last active January 10, 2024 16:20
dump METS files from an OAI harvest (metha-cat output after running metha-sync), with recursive METS downloads for multipart works
#!/usr/bin/env python3
import sys
from lxml import etree as ET
from ocrd_models.constants import NAMESPACES
NAMESPACES['oai'] = "http://www.openarchives.org/OAI/2.0/"
for curie in NAMESPACES:
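The preview is truncated at the loop header. The core idea in the description, pulling METS documents out of harvested OAI records, might be sketched as follows. This is a hedged illustration only: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, the helper name `iter_mets` is hypothetical, and it assumes the harvest is available as a single XML document whose descendants are OAI `record` elements. The recursive download of multipart works is not covered.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for OAI-PMH and METS.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "mets": "http://www.loc.gov/METS/",
}


def iter_mets(xml_text):
    """Yield (identifier, mets_element) for each OAI record containing METS."""
    root = ET.fromstring(xml_text)
    for record in root.iter("{%s}record" % NS["oai"]):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        mets = record.find("oai:metadata/mets:mets", NS)
        # Deleted records or records in other formats carry no METS payload.
        if mets is not None:
            yield identifier, mets
```

Each yielded element could then be written to its own file with `ET.ElementTree(mets).write(path, encoding="utf-8", xml_declaration=True)`.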
@bertsky
bertsky / cudatest.sh
Last active June 10, 2023 23:51
OCR-D workflow for coverage tests, esp. CUDA support
set -e
# select first CUDA device (in case there are multiple, which may fail due to [a recent TensorFlow problem](https://github.com/qurator-spk/eynollah/issues/99))
export CUDA_VISIBLE_DEVICES=0
# check we are not running into [this bug](https://github.com/shapely/shapely/issues/1598)
python3 -c "from shapely.geometry import Polygon; import torch; torch.randn(10).cuda()"
# validate CUDA support is working in TF and Torch (not an exhaustive test)
python3 -c "import torch; print(torch.cuda.is_available())"
@bertsky
bertsky / workflow.py
Last active December 1, 2020 00:41
proof of concept for an OCR-D workflow engine that uses strong (API instead of CLI) integration of processors and acts as a server
import os
import sys
import re
import click
import json
from time import time
import flask
from shutil import which  # distutils was removed in Python 3.12
import ocrd
@bertsky
bertsky / EVAL-CLIP-TESS-vs-OCRO_0005.html
Created December 5, 2019 15:33
dinglehopper output
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<style type="text/css">
.gt .diff {
color: green;
@bertsky
bertsky / preprocess-ocrd-gt.sh
Last active October 18, 2019 12:45
Commands to prepare pixel classifier training data from OCR-D GT
# Needs OCR-D/core#327 OCR-D/ocrd_olena#10 OCR-D/ocrd_segment#11 bertsky/ocrd_cis
# Runs a preprocessing and resegmentation workflow for GT annotation,
# then extracts page images along with JSON descriptions of region polygons and classes;
# finally, creates a flattened directory under $TARGET.
# Run: preprocess-ocrd-gt.sh [TARGET-DIRECTORY [METS-FILE]]
# (default is all METS files anywhere under CWD)
TARGET=${1:-../1000pages-crop-sauvola-denoise-deskew-repair}
WORKSPACES=${2:-$(find . -name mets.xml)}