Page 1 of 1

[Solved] CSV : counting number of lines

PostPosted: Sun Aug 30, 2020 3:53 pm
by Mr.Dandy
Hello forum,

I have a huge CSV file with more 1 million of lines.
If I use a loop to read it, this take a long time.

Do you know a way to count total of lines instantanely?

Re: CSV : counting number of lines

PostPosted: Sun Aug 30, 2020 5:22 pm
by JeJe
Store a header at the start of the file that records the number of lines - whenever you modify the file you update it.

Re: CSV : counting number of lines

PostPosted: Sun Aug 30, 2020 5:45 pm
by RoryOF
In this posting
https://www.codespeedy.com/count-the-number-of-lines-in-a-text-file-in-python/
you will find code for line-counting in files.

Re: CSV : counting number of lines

PostPosted: Sun Aug 30, 2020 5:55 pm
by JeJe
You could get an instant answer if you used fixed width fields instead of CSV - simple calculation from the size of the file.

Re: CSV : counting number of lines

PostPosted: Sun Aug 30, 2020 6:25 pm
by Villeroy

Re: CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 11:01 am
by Mr.Dandy
Thanks folks
I try python solution.
Powershell is only Windows-compatible.

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 11:47 am
by Villeroy
On Mac, Linux and WIndows with Linux tools:
Code: Select all   Expand viewCollapse view
wc -l file.csv

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 5:12 pm
by Villeroy
Just in case you access database data with the database component:
Code: Select all   Expand viewCollapse view
Sub test_RecordCount()
   print getRecordCount(sDB:="Bibliography", sTBL:="biblio", sUSR:="", sPWD:="")
End Sub

Function getRecordCount(sDB, sTBL, sUSR, sPWD) AS Long
on error goto exitErr
   dbc = CreateUnoService("com.sun.star.sdb.DatabaseContext")
   db = dbc.getByName(sDB)
   s = "SELECT COUNT(*) AS ""C"" FROM """& sTBL &""""
   con = db.getConnection(sUSR, sPWD)
   stm = con.prepareStatement(s)
   q = stm.executeQuery()
   q.next()
   n = q.getLong(1)
exitErr:
   con.close()
   getRecordCount = n
End Function

I believe this works with any table in any type of database connection. Tested with Bibliography of type dBase and HSQL2.

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 6:41 pm
by Villeroy
OK, the above method is certainly the most complicated method to count lines of a csv file. However, it would work with any table that belongs to any type of database connection.

This is how to read lines with any descendant of (Visual) Basic since the early 90ies:
Code: Select all   Expand viewCollapse view
Sub test_RecordCount()
   Msgbox getLineCount("file:///tmp/Untitled.csv")
End Sub

Function getLineCount(sURL As String) As Long
Dim n As Long, s As String
   FN = Freefile()
   Open sURL For Input As #FN
   while not eof(#FN)
      n = n+1
      line input #FN, s
   wend
   Close #FN
   getLineCount = n
End Function

Why does nobody read any books before writing programs?

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 6:58 pm
by JeJe
Do you know a way to count total of lines instantanely?


I took the issue to be the last word in the question. The answer is, for a million line CSV file, there isn't one. Get a supercomputer (maybe...)

Edit:
My other thought was a hybrid of a CSV file/fixed width record file... where you make all the rows of equal size by adding a last record that pads the row size to a fixed width. You could then find the count instantly from the file size and load it as a CSV... but you'd need to adjust the last record to any other record changes and so on and that would be some work...

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 8:56 pm
by Villeroy
On my notebook, the Basic loop takes 7 seconds to count a million lines. wc -l takes a fraction of a second.

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 9:46 pm
by JeJe
Villeroy wrote:On my notebook, the Basic loop takes 7 seconds to count a million lines. wc -l takes a fraction of a second.


That's meaningless without the line/file size (?)

(I do suspect I'm going to have to concede here... I ran some tests with my laptop and got fed up with the wait... but mine's a Pentium 2020M)

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 10:04 pm
by JeJe
MrProgrammer wrote:
$time perl -nE '}{say " $. $ARGV"' Million.txt


Don't know perl - I'm guessing that creates a file? With a million unrealistically small 1 record ("say ") size lines?

Edit, no say looks to be a command? Anyway - a million lines is meaningless without knowing the file size.

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 10:35 pm
by Villeroy
Villeroy wrote:On my notebook, the Basic loop takes 7 seconds to count a million lines. wc -l takes a fraction of a second.

My file is 97MB in size. About 100 chars per line.
wc -l takes 0.14 seconds

Re: [Solved] CSV : counting number of lines

PostPosted: Wed Sep 02, 2020 10:46 pm
by MrProgrammer
Mr.Dandy wrote:Do you know a way to count total of lines instantanely?
One cannot accomplish a task in zero time. How quickly is "instantly"? Asking in an OpenOffice forum about how to count the lines in a file suggests to me that this is a case of XY Problem.

JeJe wrote:Don't know perl
Perl is a general-purpose Unix scripting language originally developed to make report processing easier which is also available on Linux, MacOS, and Windows. Many tasks can be accomplished quickly and easily with short programs. Perl is included with MacOS, and is included with many Linux distributions.

JeJe wrote:Anyway - a million lines is meaningless without knowing the file size.
Yes, I should have given the file size. Million0004.txt has 4-byte lines. Million1024.txt has 1K lines. On my machine, the larger file takes about a tenth of a second with wc and less than a second with perl.

$ ls -ln Million*.txt
-rw-r--r-- 1 503 20 4000000 Sep 2 15:25 Million0004.txt
-rw-r--r-- 1 503 20 1024000000 Sep 2 15:22 Million1024.txt
$ for f in Million*.txt; do time wc -l $f; done
1000000 Million0004.txt

real 0m0.006s
user 0m0.003s
sys 0m0.002s
1000000 Million1024.txt

real 0m0.713s
user 0m0.598s
sys 0m0.114s
$ for f in Million*.txt; do time perl -nE '}{say " $. $ARGV"' $f; done
1000000 Million0004.txt

real 0m0.086s
user 0m0.080s
sys 0m0.004s
1000000 Million1024.txt

real 0m0.902s
user 0m0.727s
sys 0m0.171s

JeJe wrote:Store a header at the start of the file that records the number of lines - whenever you modify the file you update it.
Or, if speed is of the essence, keep the count in a separate file which is created (using one of the methods above) by the process which generates the CSV file. Then the lines only need to be counted once and any other times the count can be accessed quickly.
wc -l <Million1024.txt >Million1024_Count.txt

Re: [Solved] CSV : counting number of lines

PostPosted: Thu Sep 03, 2020 12:12 am
by Villeroy
Code: Select all   Expand viewCollapse view
$ time python countlines.py
1051178

real   0m0,286s
user   0m0,216s
sys   0m0,069s

countlines.py:
Code: Select all   Expand viewCollapse view
with open('/tmp/test/test.csv') as my_file:
    print(sum(1 for _ in my_file))