diff --git a/olmocr/bench/README.md b/olmocr/bench/README.md
index ea3fea1..eb72739 100644
--- a/olmocr/bench/README.md
+++ b/olmocr/bench/README.md
@@ -52,5 +52,12 @@ Step 2. Run your extraction on it, point output to folder, ex. olmocr-v2_1/ wher
 Step 3. Run the evaluation script
 Step 4. Get results, and use tinyhost to view all failing examples
-## TODO
+### Running existing scripts
+```bash
+pip install marker-pdf==1.5.4
+python olmocr/bench/runners/run_marker.py olmocr/bench/sample_data/pdfs
+
+pip install verovio torchvision
+python olmocr/bench/runners/run_gotocr.py olmocr/bench/sample_data/pdfs
+```
diff --git a/olmocr/bench/runners/run_gotocr.py b/olmocr/bench/runners/run_gotocr.py
index a891010..22efcbd 100644
--- a/olmocr/bench/runners/run_gotocr.py
+++ b/olmocr/bench/runners/run_gotocr.py
@@ -27,14 +27,14 @@ def run(pdf_folder):
     """
     Convert all PDF files in the specified folder to markdown using GOT-OCR.
     Each page of a PDF is converted to an image and processed with OCR.
-    The markdown files are saved in a folder called "marker" located alongside the pdf_folder.
+    The markdown files are saved in a folder called "got_ocr" located alongside the pdf_folder.
     :param pdf_folder: Path to the folder containing PDF files.
     """
     # Resolve absolute paths and prepare destination folder
     pdf_folder = os.path.abspath(pdf_folder)
     parent_dir = os.path.dirname(pdf_folder)
-    destination_folder = os.path.join(parent_dir, "marker")
+    destination_folder = os.path.join(parent_dir, "got_ocr")
     os.makedirs(destination_folder, exist_ok=True)
     # List all PDF files in the folder
diff --git a/olmocr/bench/sample_data/got_ocr/multi_column_miss.md b/olmocr/bench/sample_data/got_ocr/multi_column_miss.md
new file mode 100644
index 0000000..7ed8f1e
--- /dev/null
+++ b/olmocr/bench/sample_data/got_ocr/multi_column_miss.md
@@ -0,0 +1,105 @@
+4.47
+www.tobaccocontrol.com
+Advocacy in Action
+stakeholders has occurred in other nations, with groups and
+individuals refusing to risk being appropriated into the
+industry’s public relations ambitions. It now looks like that
+with vigilance, tobacco control advocates can easily foment
+similar distaste in many areas of the business community.
+Our actions sought to demolise the tobacco industry by
+disrupting its efforts to take its place alongside other
+publications, including the public and social credit—in the
+hope that it might gain by association.
+Tobacco industry posturing about its corporate responsi-
+bility can never hide the ugly consequences of its ongoing
+efforts to “work with all relevant stakeholders for the
+environment” and the government’s “environmental” and
+tobacco products”1 (translation: “we will build alliances with
+others who want to profit from tobacco use, to do all we can
+to counteract effective tobacco control”). BAT has 15.4% and
+Philip Morris 16.4% of the global cigarette market: “With 4.9
+million smokers currently dying from tobacco use each year,
+and the industry unblinkingly concurring that its products
+are addictive, this leaves BAT to argue why it should not
+be held to be largely accountable for the annual deaths of
+some 754 600 smokers, and Philip Morris some 803 600
+smokers.
+REFERENCES
+1
+Bash Arrington Tobacco. Social Report. http://www.bash.com/20400g.
+2
+Tree B Tobacco.com copyright angers. MPs. The Age (Methanone) 2004, May
+17 http://www.bango.com.au/articles/2004/05/16/
+3
+Hirschhorn. A Report. http://www.bango.com.au/articles/2004/05/16
+4
+Rishidson N. Corporate social responsibility and the tobacco industry. hope
+or hypo? Tobacco Control 2004;13.447–53.
+5
+Buch, Michael, and Michael Scherer. “Healthcare website. http://
+www.ethalicorg.com/asia2004/.
+6
+Chopman S, Shatenstein S. Eterne corporate makeover tobacco companies,
+and the industry’s public relations ambitions. http://www.bishair.com/2004/
+http://petition globalink.org/view.php/roades-entree-entree/.
+7
+6
+Mockay J, Erikson M. The Tobacco Effects. Green: World Health
+Organization, 2002.
+INDUSTRY WATCH
+Corporate social responsibility and the tobacco industry:
+hope or hype?
+N Hirschhorn
+Corporate social responsibility (CSR) emerged from a
+realisation among transnational corporations of the need
+to account for and redress their adverse impact on society:
+specifically, on human rights, labour practices, and the
+environment. Two transnational tobacco companies have
+recently adopted CSR: Philip Morris, and British American
+Tobacco. This report explains the origins and theory
+behind CSR; examines internal company documents from
+Philip Morris showing the company’s deliberations on the
+matter, and the company’s perspective on its own
+behaviour; and reflects on whether marketing tobacco is
+antithetical to social responsibility.
+Correspondence to:
+Dr Norbert Hirschhorn,
+Nostalonte 6, A3 00600
+Helsinki, Finland, 000000.
+Received
+13 November 2003
+Accepted 15 July 2004
+Tobacco Control 2004;13.447–453. doi: 10.1136/rlc.2003.006676
+tobacco company espousing CSR should be
+judged simply as a corporate entity along
+standards of business ethics, or as an irretrie-
+vably negative force in the realm of public health,
+thereby rendering CSR an oxymoron.
+CORPORATE SOCIAL RESPONSIBILITY:
+THE CONTEXT
+The term “corporate social responsibility” is in
+vogue at the moment bounds a concept it is vague
+and means different groups to different people.4
+The report from CSR trace its American roots
+to the 19th century when large industries
+engaged in philanthropy and established great
+public institutions, a form of “noblesse object”.
+But the notion that corporations should be
+noted to the extent to the two society because of
+their impact on society, and the environment
+from the civil rights, peace, and environmental
+movements of the last half century.2 The
+unprecedented expansion of power and influ-
+ence of TNCs over the past three decades has
+accelerated global trade and development, but
+also environmental damage and abuses of
+the world.
+Abbreviations: ASH, Action on Smoking and Health,
+and Health, and Health, and Health, and Health
+Environmentally Responsible Economies; CSR, corporate
+social responsibility, DJSI, Dow Jones Sustainability Index;
+CACA, Global Corporate Affairs Council; GRI, Global
+Health, and Health, and Health, and Health, and
+NGOs, non-governmental organisations; PM, Philip
+Morris; TNCs, transnational corporations; UNEP, United
+Nations Environment Program
\ No newline at end of file
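
Note on the `run_gotocr.py` hunk: the fix matters because each runner writes to a folder that sits next to the input `pdf_folder`, so reusing the `"marker"` name would have mixed GOT-OCR output into the marker results. A minimal sketch of that path resolution is below. The helper name `sibling_output_folder` and its `runner_name` parameter are hypothetical, introduced only for illustration; the actual script inlines these three `os.path` calls and also calls `os.makedirs`.

```python
import os


def sibling_output_folder(pdf_folder: str, runner_name: str) -> str:
    """Hypothetical helper: return <parent of pdf_folder>/<runner_name>.

    Mirrors the path logic shown in the diff, without creating the folder.
    """
    # abspath also normalizes the path, so a trailing slash is dropped
    pdf_folder = os.path.abspath(pdf_folder)
    parent_dir = os.path.dirname(pdf_folder)
    return os.path.join(parent_dir, runner_name)


# With the README's example input, output lands in sample_data/got_ocr
dest = sibling_output_folder("olmocr/bench/sample_data/pdfs", "got_ocr")
print(dest)
```

If more runners are added, sharing one helper like this (an assumption about a possible refactor, not something this change does) would keep each runner's folder name in a single argument rather than a copied string literal.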