Konstantin Baierer kba
@bertsky
bertsky / cudatest.sh
Last active June 10, 2023 23:51
OCR-D workflow for coverage tests, esp. CUDA support
set -e
# select first CUDA device (in case there are multiple, which may fail due to [a recent Tensorflow problem](https://github.com/qurator-spk/eynollah/issues/99))
export CUDA_VISIBLE_DEVICES=0
# check we are not running into [this bug](https://github.com/shapely/shapely/issues/1598)
python3 -c "from shapely.geometry import Polygon; import torch; torch.randn(10).cuda()"
# validate CUDA support is working in TF and Torch (not an exhaustive test)
python3 -c "import torch; print(torch.cuda.is_available())"
@aelkiss
aelkiss / mets1to2.xsl
Last active September 11, 2023 20:37
Transformation from METS1 to METS2
moved to https://github.com/mets/METS1to2
@jbarth-ubhd
jbarth-ubhd / abbyy2page.pl
Created January 14, 2022 12:46
minimalistic ABBYY XML to PAGE XML
#!/usr/bin/perl
use strict;
use utf8;
use XML::LibXML;
use XML::Quote;
binmode STDOUT, ":utf8";
my $dom=XML::LibXML->load_xml(location=>$ARGV[0]);
my $root=$dom->documentElement;
import {default as fetch} from 'node-fetch';
const { pdf } = require("pdf-to-img");
import {tmpdir} from "os";
import {createWriteStream, createReadStream} from 'fs';
import * as fsp from 'fs/promises'
import * as archiver from 'archiver';
import {ArchiverError} from "archiver";
import * as path from "path";
import {Parser, Builder} from "xml2js";
@cboulanger
cboulanger / create-corpus-from-best-ocr-result.sh
Last active May 19, 2021 17:26
Selects from different UTF-8 documents that are the result of OCR processing of the same source document, choosing the one with the highest quality (i.e. highest language recognition confidence)
#! /usr/bin/env bash
# see https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html
# using https://github.com/saffsd/langid.py
# install with pip install langid and add the scorelines.sh & ocrquality.rb scripts from the blog entry in the same directory
# Glob for the PDF source files, whose names start with a DOI; adapt this to your case
FILE_SELECTOR=/path/to/source/dir/*.pdf
# The path to the directory to which the selected documents should be copied
TARGET=/path/to/target/dir
@b2m
b2m / Performance-dinglehopper.md
Last active November 18, 2020 10:17
Performance benchmarking for dinglehopper using hyperfine

Performance analysis for dinglehopper using hyperfine

To estimate the impact of some changes to dinglehopper, I used hyperfine to benchmark its behaviour.

The commands to run a docker container to execute the benchmarks are listed in performance-docker.sh.

The commands needed to prepare, execute and analyse the benchmark are listed in performance.sh.
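
As a rough illustration of the analysis step (not the contents of performance.sh), the sketch below assumes the benchmark was run with hyperfine's `--export-json` option and ranks the benchmarked commands by mean runtime. The command labels and timings in the sample are made-up placeholders, not measurements, and `summarize` is a hypothetical helper name.

```python
import json

# Placeholder data in the shape hyperfine writes with --export-json.
# The command labels and timings are invented for illustration only.
SAMPLE_EXPORT = """
{
  "results": [
    {"command": "dinglehopper (before)", "mean": 4.0, "stddev": 0.2},
    {"command": "dinglehopper (after)", "mean": 2.0, "stddev": 0.1}
  ]
}
"""


def summarize(export_json):
    """Return (command, mean, ratio-to-fastest) tuples, fastest first."""
    results = json.loads(export_json)["results"]
    fastest = min(r["mean"] for r in results)
    rows = [(r["command"], r["mean"], r["mean"] / fastest) for r in results]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for command, mean, ratio in summarize(SAMPLE_EXPORT):
        print(f"{command}: {mean:.3f} s ({ratio:.2f}x the fastest)")
```

With a real export, the string would be replaced by the contents of the JSON file that hyperfine wrote; the ratio column then gives a quick view of how much a change sped up or slowed down a run.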

@b2m
b2m / Dockerfile
Last active March 25, 2022 10:27
Put browse-ocrd into a docker container
FROM python:3.7
RUN apt-get update \
&& apt-get install -y --no-install-recommends libcairo2-dev libgtk-3-bin libgtk-3-dev libglib2.0-dev libgtksourceview-3.0-dev libgirepository1.0-dev gir1.2-webkit2-4.0 pkg-config cmake \
&& pip3 install -U setuptools --use-feature=2020-resolver \
&& pip3 install browse-ocrd --use-feature=2020-resolver
ENV GDK_BACKEND=broadway
ENV BROADWAY_DISPLAY=:5
@Witiko
Witiko / evaluate-speed-pie-chart.py
Created October 16, 2020 21:53
Creates a pie chart from a GNU Parallel joblog after running OCR-D
# -*- coding:utf-8 -*-
from itertools import dropwhile
import json
import re
import sys
import matplotlib.pyplot as plt
@mikegerber
mikegerber / jpageviewer-profile.sh
Last active July 13, 2021 15:18
jpageviewer alias/function that looks for a mets.xml, i.e. in an OCR-D workspace. For use in `~/.zshrc` or similar.
_jpageviewer_jar=~/opt/jpageviewer/JPageViewer.jar
if [ -e "$_jpageviewer_jar" ]; then
jpageviewer() {
# --resolve-dir defaults to the file's directory
_jpageviewer_resolve_dir=$(dirname "$1")
# ... unless a mets.xml file exists one directory up (OCR-D workspace)
if [ -e "$_jpageviewer_resolve_dir"/../mets.xml ]; then
_jpageviewer_resolve_dir="$_jpageviewer_resolve_dir"/..
fi
@PonteIneptique
PonteIneptique / hocr_to_kraken_transcribe.xsl
Last active March 21, 2020 11:25
XSL (requires Saxon-EE > 9.8) for transforming hOCR from Tesseract into a transcription file for Kraken (à la ketos prefill)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:my="foo.bar"
exclude-result-prefixes="xs my saxon uuid"
xpath-default-namespace="http://www.w3.org/1999/xhtml"
version="2.0"
xmlns:uuid="java:java.util.UUID">